Intelligent Document Retrieval (eBook)
XVI, 198 pages
Springer Netherlands (publisher)
978-1-4020-3768-9 (ISBN)
Collections of digital documents can nowadays be found everywhere in institutions, universities or companies. Examples are Web sites or intranets. But searching them for information can still be painful. Searches often return either large numbers of matches or no suitable matches at all.
Such document collections can vary a lot in size and how much structure they carry. What they have in common is that they typically do have some structure and that they cover a limited range of topics. The second point is significantly different from the Web in general.
The type of search system that we propose in this book can suggest ways of refining or relaxing the query to assist a user in the search process. In order to suggest sensible query modifications we would need to know what the documents are about. Explicit knowledge about the document collection encoded in some electronic form is what we need. However, typically such knowledge is not available. So we construct it automatically.
Contents 6
Foreword 10
Preface 12
List of Figures 14
List of Tables 16
1 Introduction 18
1.1 Introductory Examples 21
1.2 Using Markup to Extract Knowledge 25
1.3 Applying the Extracted Knowledge 32
1.4 Structure of the Book 34
Part I The Model 38
2 Related Work 40
2.1 Information Retrieval 41
2.2 Information Extraction 43
2.3 Clustering 44
2.4 Classification 46
2.5 Web Search Techniques 48
2.6 Ontologies 51
2.7 Layout Analysis 53
2.8 Web Search Studies 53
2.9 Navigating Concept Hierarchies 55
2.10 Dialogue Systems 58
2.11 Usability Issues 59
2.12 Concluding Remarks on Related Work 60
3 Data Analysis and Domain Model Construction 62
3.1 Documents 62
3.2 Concepts 64
3.3 A Domain Model Based on Concepts 68
3.4 Model Structure 70
3.5 Model Construction 71
3.6 Using the Model for Query Modification 75
3.7 Implementational Issues 77
4 Incorporating Additional Knowledge 80
4.1 Internal Knowledge 80
4.2 External Knowledge 84
5 A Dialogue System for Partially Structured Data 86
5.1 Dialogue as Movement in Space 87
5.2 Dialogue Example 88
5.3 Static / Dynamic Clusters 90
5.4 Real User Queries 90
5.5 Properties 92
5.6 Dialogue 95
Part II Practical Applications 109
6 UKSearch - Intelligent Web Search 110
6.1 Indexing Web Pages 111
6.2 The UKSearch System 115
6.3 Sample Domain 1: Essex University 124
6.4 Sample Domain 2: BBC News 129
6.5 Implementational Issues 134
7 UKSearch - Evaluation and Discussion 138
7.1 Log Analysis 138
7.2 Investigating Domain Model Relations 142
7.3 Task-Based Evaluation: Essex University 146
7.4 Task-Based Evaluation: BBC News 158
8 YPA - Searching Classified Directories 174
8.1 System Overview 175
8.2 Indexing Classified Advertisements 176
8.3 Dialogue Strategy in the YPA 179
8.4 Implementational Issues 188
9 Future Directions and Conclusions 190
9.1 Towards Evolving Domain Models 190
9.2 Dialogue Management 193
9.3 An Outlook on Future Evaluations 194
9.4 Conclusions 195
References 198
Index 210
6 UKSearch - Intelligent Web Search (p.93-94)
Finding information on the Web is normally a straightforward task. For most user requests the information can be located by applying a standard search engine using simple pattern matching techniques. However, by restricting the search to some smaller document collection (one that is still too large to be searched without appropriate tools) this can become a tedious task. Examples of such collections are corporate intranets or university Web sites. Typically a search will return large numbers of matching documents even in smaller document collections. If no matching document can be found, the user is usually either left alone with a great number of partially matching documents or with no results at all.
These are well known problems and approaches for more sophisticated search systems exist to overcome them (see Chap. 2). But those approaches tend to rely very much on a given document structure or expensively created concept hierarchies. While this is appropriate for fairly well structured domains such as product catalogues and other applications where the information is stored in database formats, it is no help if the document collection is heterogeneous.
Surprisingly perhaps, the problem of not finding any document in the collection for a user query (a form of "data sparsity") is not necessarily a major problem in small domains. The log files of the search engine installed at the University of Essex Web site prove that the majority of queries that users submit result in a large number of matching documents despite the fairly small size of the collection. But unlike in general Web search where scalability issues prevent the application of more sophisticated indexing steps, we can build domain-specific concept hierarchies easily and rapidly in such well-defined document collections using the techniques introduced in the earlier chapters. These automatically created knowledge sources reflect the relations between documents or terms within those documents simply based on the available data.
Apart from that, collections of Web pages are well suited to verify the techniques introduced in this book, as these documents are typically marked up using HTML tags. This type of markup mixes visual markup and semantic representation (as found in the meta tags for example). We turn this implicit knowledge into explicit relations.
The earlier chapters presented the conceptual framework. Here we discuss the practical steps that lead to an explicitly structured representation of a Web document collection. Frequently used HTML tags are used to define markup contexts (the fundamental units to extract concepts which are then arranged in a domain model). The structure imposed on the data collection is employed in a dialogue system which assists the user with handling those queries that do not retrieve documents or result in large numbers of matches.
We will see how the general dialogue manager introduced earlier is set up to work with the data collections discussed in this chapter. We will however not focus on the links between concepts and individual documents or directories. The more interesting aspect is the construction of domain models that are not closely tied to the individual documents, mainly because a separable domain model is more flexible. The reason is that despite the ever-changing nature of a collection of Web documents we will not need to constantly update the model. A domain model that is not linked to the individual documents will still be usable once the document collection has been updated. It can simply be plugged into a search system.
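The idea of markup contexts described in the excerpt can be illustrated with a minimal sketch. It is not the book's implementation: the set of tags in `CONTEXT_TAGS`, the `meta_keywords` context name, the function `concept_candidates`, and the rule that a term must occur in at least two different contexts (`min_contexts=2`) are all assumptions made for illustration. The sketch uses only Python's standard `html.parser` module.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: an illustrative set of "frequently used" tags. The excerpt
# does not say which tags are actually used to define markup contexts.
CONTEXT_TAGS = {"title", "h1", "h2", "b", "a"}


class MarkupContextParser(HTMLParser):
    """Record, for every term, the markup contexts in which it occurs."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                 # context tags currently open
        self.contexts = defaultdict(set)    # term -> set of context names

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            if (attr_map.get("name") or "").lower() == "keywords":
                # Each comma-separated keyword is kept as one term.
                for term in (attr_map.get("content") or "").split(","):
                    term = term.strip().lower()
                    if term:
                        self.contexts[term].add("meta_keywords")
        elif tag in CONTEXT_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened tag of this name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        for word in data.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word:
                for tag in self.open_tags:
                    self.contexts[word].add(tag)


def concept_candidates(html, min_contexts=2):
    """Return terms seen in at least `min_contexts` different contexts.

    The threshold is a hypothetical selection rule, not taken from the
    excerpt.
    """
    parser = MarkupContextParser()
    parser.feed(html)
    parser.close()
    return {
        term: sorted(found)
        for term, found in parser.contexts.items()
        if len(found) >= min_contexts
    }
```

Calling `concept_candidates(page_source)` on the HTML of a single page returns a dictionary that maps each candidate term to the contexts it was found in. Arranging such candidates into a domain model, as the excerpt describes, is a separate step that this sketch does not attempt.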
Publication date (per publisher) | 9 January 2006 |
---|---|
Series | The Information Retrieval Series |
Additional info | XVI, 198 p. |
Place of publication | Dordrecht |
Language | English |
Subject areas | Computer Science ► Databases ► Data Warehouse / Data Mining |
Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics | |
Social Sciences ► Communication / Media ► Book Trade / Library Science | |
Keywords | classification • Data Analysis • Dom • Information Retrieval • Management • Ontology • System |
ISBN-10 | 1-4020-3768-6 / 1402037686 |
ISBN-13 | 978-1-4020-3768-9 / 9781402037689 |
Size: 4.6 MB
DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suitable for technical books with columns, tables and figures. A PDF can be displayed on almost all devices, but is only of limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer for this, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer for this, e.g. the free Adobe Digital Editions app.
Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in your web browser.
Buying eBooks from abroad
For tax law reasons we can sell eBooks only within Germany and Switzerland. Regrettably we cannot fulfill eBook orders from other countries.