Web Searching and Mining (eBook)

eBook Download: PDF

2018 | 1st ed. 2019
XVI, 166 pages
Springer Singapore (publisher)
978-981-13-3053-7 (ISBN)

This book presents the basics of search engines and their components. It introduces, for the first time, the concept of Cellular Automata in Web technology and discusses the prerequisites of Cellular Automata. In today's world, searching data from the World Wide Web is a common phenomenon for virtually everyone. It is also a fact that searching the tremendous amount of data from the Internet is a mammoth task - and handling the data after retrieval is even more challenging. In this context, it is important to understand the need for space efficiency in data storage. Though Cellular Automata has been utilized earlier in many fields, in this book the authors experiment with employing its strong mathematical model to address some critical issues in the field of Web Mining.

Dr. Debajyoti Mukhopadhyay is currently the Dean (R&D) and Professor & Head of Computer Engineering at NHITM, affiliated with Mumbai University (India). He previously worked in the IT industry for nineteen years, including at the well-known Bell Communications Research, USA, and in academia for sixteen years, including as the Dean (R&D) of Maharashtra Institute of Technology, Pune, India. He has published over 190 research papers and holds three patents. Dr. Mukhopadhyay previously worked in the corporate sector, holding top-level positions such as President & CEO, Director, and General Manager, and oversaw a large number of professionals managing multiple off-shore projects from India. Dr. Mukhopadhyay has been elected as the Distinguished Speaker of the Computer Society of India. He has held visiting positions at Chonbuk National University (South Korea), George Mason University (USA), and Thapar University (India). He holds a PhD (Engineering) from Jadavpur University (India), an MS in Computer Science from Stevens Institute of Technology (USA), a Postgraduate Diploma in Computer Science from The Queen's University of Belfast (UK), and a BE (Electronics & Telecommunications Engineering) from Bengal Engineering College under the University of Calcutta. Dr. Mukhopadhyay is an FIE, FIETE, SMIEEE (USA), SMACM (USA), CEngg., MIMA (India), and Elected Member of Eta-Kappa-Nu (the EE Honor Society of the USA).

Preface 6
Contents 7
About the Editor 8
List of Figures 9
List of Tables 12
1 Introduction 14
1 Why Web Search Engine? 14
2 Web Search Engine: Some Basic Facts 15
3 Domain-Specific Web Search Engine Concepts 18
4 Survey of Existing Methodologies 19
4.1 Web Crawling 19
4.1.1 Domain-Specific Web Crawling 21
4.1.2 Ontology Basics 22
4.1.3 WordNet 24
4.1.4 Resource Structuring 24
4.2 Predicting Web-Pages at Runtime 25
4.3 Lucky Searching 26
4.3.1 Domain-Specific Lucky Searching 26
4.4 Indexing Web-Pages at Runtime 27
4.4.1 Back-of-the-Book-Style 28
4.4.2 Human-Produced Web-Page Index 28
4.4.3 Meta Search Web-Page Indexing 28
4.4.4 Cache-Based Web-Page Indexing 28
4.5 Product Searching 29
4.6 Image Searching 30
4.6.1 Existing Text to Image Search 31
4.6.2 Existing Image to Image Search 31
References 33
2 Preliminaries on Cellular Automata 41
1 What is Cellular Automata 41
2 Conceptualization of Cellular Automata 43
3 Applications of Cellular Automata 45
4 Conclusion 46
References 46
3 Design of SMACA 48
1 Introduction 48
2 Generation of SMACA 49
3 Synthesis of SMACA 51
4 Analysis of SMACA Through RVG 54
5 SLA Detection in RVG 59
6 Conclusion 60
References 61
4 SMACA Usage in Indexing Storage of a Search Engine 62
1 Introduction 62
2 Background of Search Engine 63
3 Existing Mechanism to Store Web-Data 63
4 Formation of Indexing Storage Using SMACA 64
5 Generation of SMACA for Each Website 65
6 Generation of Inverted Indexed File 66
7 Replacing Inverted Indexed File by SMACA 67
8 Searching Mechanism 68
9 Experimental Results 69
10 Conclusion 73
References 74
5 Cellular Automata in Web-Page Ranking 75
1 Introduction 75
2 Page Ranking Concept 76
3 Concept of Galois Field: GF(2) & GF(2^p) Using CA
4 Mapping Link Structure of Web-Pages with Cellular Automata 79
5 Indexing in Ranking 81
6 Conclusion 83
References 84
6 Web-Page Indexing Based on the Prioritized Ontology Terms 85
1 Introduction 85
2 Rules and Definitions 86
3 Proposed Approach 86
3.1 Extraction of Dominating and Sub-dominating Ontology Terms 87
3.2 Proposed Algorithm of Web-Page Indexing 88
3.3 Complexity of Indexing Web-Pages 88
3.4 User Interface 89
3.5 Web-Page Retrieval Mechanism Based on the User Input 90
4 Experimental Analysis 91
4.1 Experiment Procedure 91
4.2 Time Complexity to Produce Resultant Web-Page List 91
4.3 Experimental Result 92
5 Conclusions 93
References 93
7 Domain-Specific Crawler Design 95
1 Introduction 95
2 Proposed Approach 96
2.1 Single Domain-Specific Web Search Crawler 97
2.1.1 Proposed Web-Page Content Relevance Calculation Algorithm for Single Domain 97
2.1.2 Domain-Specific Web-Page Repository Building 98
2.1.3 Challenges Faced While Crawling 98
2.1.4 Relevance Page Tree 100
2.1.5 Searching a Web-Page from RPaT Model 100
2.1.6 Generation of RPaT 100
2.2 Multiple Domains Specific Web Search Crawler 101
2.2.1 Proposed Web-Page Content Relevance Calculation Algorithm for Multiple Domains 102
2.2.2 Multiple Domains Specific Web-Page Repository Building 104
2.2.3 Relevance Page Graph 105
2.2.4 Searching a Web-Page from RPaG Model 106
2.3 Multilevel Domains Specific Web Search Crawler 107
2.3.1 Classifier 1: Web-Page Content Classifier 107
2.3.2 Classifier 2: Web-Page URL Classifier 108
2.3.3 User Interface 108
2.3.4 Proposed Multilevel Domain Specific Web Search Crawler Design Algorithm 110
2.3.5 Web-Page Retrieval Mechanism Based on the User Input 111
3 Experimental Analyzes 111
3.1 Single Domain-Specific Web Search Crawler 111
3.1.1 Test Settings 112
Seed URLs 112
Weight Table 112
3.1.2 Test Results 113
Harvest Rate for Unfocused Crawling 113
Harvest Rate for Single Domain-Specific Web-Page Crawling 113
3.2 Multiple Domains Specific Web Search Crawler 114
3.2.1 Test Settings 115
Seed URLs 115
Syntable 115
Weight Table 116
3.2.2 Test Results 116
Page Distribution in Different Domains 117
Multiple Domains Crawler Performance Over Single Domain Crawler 117
3.3 Multilevel Domains Specific Web Search Crawler 118
3.3.1 Experiment Procedure 118
3.3.2 Complexity Analysis 118
3.3.3 Experimental Result 118
Accuracy Testing of Our Prototype 118
Parallel Crawling Performance Report 119
4 Conclusions 120
References 121
8 Structural Change of Domain-Specific Web-Page Repository for Efficient Searching 123
1 Introduction 123
2 Proposed Approach 124
2.1 HERT Model 124
2.1.1 Searching a Web-Page from HERT Model 125
2.1.2 Challenges Faced While Constructing HERT 126
2.1.3 Algorithm for Construction of HERT from RPaT 126
2.2 IBAG Model 132
2.2.1 Searching a Web-Page from IBAG Model 134
2.2.2 Construction of IBAG from RPaG 134
2.2.3 User Interface 139
2.2.4 Procedure for Web-Page Selection and Its Related Dynamic Ranking 140
2.2.5 Reason of Introducing Multilevel Indexing Concept 141
2.3 M-IBAG Model 142
2.3.1 Construction of M-IBAG Model from IBAG Model 144
3 Experimental Analysis 145
3.1 Sample HERT Construction 145
3.1.1 RPaT Web-Pages 145
3.1.2 HERT Web-Pages 146
3.2 Performance of HERT Searching Over RPaT Searching 147
3.3 Comparative Study of Time Complexity for Different Models 148
3.3.1 RPaG Model Complexity 148
Best-case Time Complexity 148
Worst-Case Time Complexity 149
Average-case Time Complexity 149
IBAG Model Complexity: Ideal Case 149
Best-case Time Complexity 149
Worst-case Time Complexity 149
Average-case Time Complexity 150
3.3.2 IBAG Model Complexity: While All the Web-pages Belong to Same Level 150
Best-case Time Complexity 150
Worst-Case Time Complexity 150
Average-Case Time Complexity 151
3.3.3 M-IBAG Model Complexity 151
Best-case Time Complexity 151
Worst-Case Time Complexity 151
Average-Case Time Complexity 152
3.4 Comparative Study of Time Complexity for the Above Given Models 152
4 Conclusions 153
References 154
9 Domain-Specific Web-Page Prediction 155
1 Introduction 155
2 Web-Page Prediction 156
3 Proposed Approach 156
3.1 Bit Pattern Generation Algorithm 156
3.2 Find Predicted Web-Page List 157
4 Performance Analysis 159
4.1 Testing Procedure 159
4.2 Test Results 159
4.2.1 Average Number of Predicted Web-Page List for a Set of Search String 160
4.2.2 Accuracy Measure 160
4.2.3 Discussion of Average-Case Time Complexity for Generating Search Results from Both IBAG Model 162
4.2.4 Average Time Taken for a Set of Search String 163
5 Conclusions 163
References 163
10 Domain-Specific Lucky Searching 165
1 Introduction 165
2 Proposed Approach 166
2.1 DSLSDB Construction 166
2.1.1 Ontology Terms 166
2.1.2 DSLSDB Construction Algorithm 166
2.2 Lucky URL Search from DSLSDB 168
2.3 User Interface 170
3 Experimental Results 170
3.1 Test Settings 171
3.1.1 Seed URLs 171
3.1.2 Ontology Terms 171
3.1.3 Weight Value 171
3.1.4 Syntable 172
3.1.5 Web-Page Content 172
3.2 Test Results 173
3.2.1 DSLSDB Records 173
3.2.2 Testing Procedure 173
3.2.3 Lucky Searching for Invalid Search String 173
3.2.4 Lucky Search for Valid Search String 174
3.2.5 Comparative Study Between Regular Search Engine and Domain-Specific Search Engine 174
4 Conclusion 175
References 175

Publication date (per publisher)	12.12.2018
Series	Cognitive Intelligence and Robotics
Additional info	XVI, 166 p. 82 illus., 5 illus. in color.
Place of publication	Singapore
Language	English
Subject area	Computer Science ► Databases ► Data Warehouse / Data Mining
Keywords	Cellular Automata • Domain Specific Search • Ontology • Search Engine • Web Page Prediction
ISBN-10	981-13-3053-0 / 9811330530
ISBN-13	978-981-13-3053-7 / 9789811330537

PDF (watermark)
Size: 6.9 MB

DRM: Digital watermark
This eBook contains a digital watermark and is thus personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suitable for technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

Print edition

Book | Hardcover

160,49 €