Apache Solr (eBook)

A Practical Approach to Enterprise Search
eBook Download: PDF
2015 | 1st ed.
XXVI, 299 pages
Apress (publisher)
978-1-4842-1070-3 (ISBN)


Apache Solr - Dikshant Shahi
56.99 incl. VAT
  • Available for immediate download

Apache Solr: A Practical Approach to Enterprise Search teaches you how to build an enterprise search engine using Apache Solr. You'll soon learn how to index and search your documents; ingest data from varied sources; pre-process, transform and enrich your data; and build the processing pipeline.

You will understand the concepts and internals of Apache Solr and tune the results for your client's search needs. The book explains each essential concept, backed by practical and industry examples, to help you attain expert-level knowledge.

The book, which assumes a basic knowledge of Java, starts with an introduction to Solr, followed by the steps for setting it up, indexing your first set of documents, and searching them. It then covers the end-to-end process of data ingestion from varied sources, pre-processing the data, transformation and enrichment of data, building the processing pipeline, query parsing, and scoring the document. It also teaches you how to make your system intelligent and able to learn through feedback loops.

After covering out-of-the-box features, Solr expert Dikshant Shahi dives into ways you can customize Solr for your business and its specific requirements, along with ways to plug in your own components. Most important, you will learn to handle user queries and retrieve meaningful results. The book explains how each user query is different and how to address them differently to get the best result. And because document ranking doesn't work the same for all applications, the book shows you how to tune Solr for the application at hand and re-rank the results.

You'll see how to influence user experience by providing suggestions and recommendations, and leveraging other interesting features of Solr. You'll also see how to integrate Solr with important related technologies like OpenNLP, Apache Tika, and Apache UIMA, among others, to take your search capabilities to the next level.

This book concludes with case studies and industry examples, the knowledge of which will be helpful in designing components and putting the bits together. By the end of Apache Solr, you will be proficient in designing, architecting, and developing your search engine and be able to integrate it with other systems.



Dikshant Shahi manages the search and platforms team at OnMobile Global Limited. He has been responsible for developing several vertical search engines for categories including music metadata, voice, audio fingerprinting, channel intelligence, log file processing, building analytics, and finding deals, like Groupon. He has also been responsible for handling multilingual content, natural language processing, and recommendation. Shahi specializes in search engines, information retrieval, data extraction and analysis, application development, web services, and mobile applications.


Build an enterprise search engine using Apache Solr: index and search documents; ingest data from varied sources; apply various text processing techniques; utilize different search capabilities; and customize Solr to retrieve the desired results. Apache Solr: A Practical Approach to Enterprise Search explains each essential concept, backed by practical and industry examples, to help you attain expert-level knowledge.

The book, which assumes a basic knowledge of Java, starts with an introduction to Solr, followed by steps to setting it up, indexing your first set of documents, and searching them. It then introduces you to information retrieval and its implementation in Apache Solr; this will help you understand your search problem, decide the approach to build an effective solution, and use various metrics to evaluate the results.

The book next covers the schema design and techniques to build a text analysis chain for cleansing, normalizing and enriching your documents and addressing different types of search queries. It describes various popular matching techniques which are generally applied to improve the precision and recall of searches. You will learn the end-to-end process of data ingestion from varied sources, metadata extraction, pre-processing and transformation of content, various search components, query parsers and other advanced search capabilities.

After covering out-of-the-box features, Solr expert Dikshant Shahi dives into ways you can customize Solr for your business and its specific requirements, along with ways to plug in your own components. Most important, you will learn about implementations for Solr scoring, factors affecting the document score, and tuning the score for the application at hand. The book explains why textual scoring is not sufficient for practical ranking of documents and ways to integrate real-world factors for contributing to the document ranking.

You'll see how to influence user experience by providing suggestions and recommendations. You'll also see integration of Solr with important related technologies such as OpenNLP and Tika. Additionally, you will learn about scaling Solr using SolrCloud. This book concludes with coverage of semantic search capabilities, which is crucial for taking the search experience to the next level. By the end of Apache Solr, you will be proficient in designing and developing your search engine.

Contents at a Glance 5
Contents 6
About the Author 17
About the Technical Reviewer 18
Acknowledgments 19
Introduction 20
Chapter 1: Apache Solr: An Introduction 22
Overview 22
What Makes Apache Solr So Popular 24
Major Building Blocks 25
History 25
What’s New in Solr 5.x 26
Beyond Search 26
Solr vs. Other Options 27
Relational Databases 27
Elasticsearch 28
Related Technologies 29
Summary 29
Resources 30
Chapter 2: Solr Setup and Administration 31
Stand-Alone Server 31
Prerequisites 32
Download 33
Terminology 33
General Terminology 33
SolrCloud Terminology 34
Important Configuration Files 35
Directory Structure 35
Solr Installation 36
Solr Home 39
Hands-On Exercise 40
Start Solr 40
Create a Core 42
Index Some Data 42
Search for Results 43
Solr Script 45
Starting Solr 45
Using Solr Help 46
Stopping Solr 46
Restarting Solr 47
Determining Solr Status 47
Configuring Solr Start 47
Admin Web Interface 48
Core Management 49
Config Sets 49
Create Configset 49
Create Core 50
bin/solr Script 51
Core Admin REST API 51
Admin Interface 52
Manually 52
Core Status 52
Unload Core 53
Delete Core 53
Core Rename 53
Core Swap 53
Core Split 54
Index Backup 54
Index Restore 54
Instance Management 55
Setting Solr Home 55
Memory Management 55
Log Management 56
Log Location 56
Log Level 56
Log Implementation 57
Common Exceptions 57
OutOfMemoryError—Java Heap Space 57
OutOfMemoryError—PermGen Space 57
TooManyOpenFiles 58
UnSupportedClassVersionException 58
Summary 58
Chapter 3: Information Retrieval 59
Introduction to Information Retrieval 59
Search Engines 60
Data and Its Categorization 61
Structured 61
Unstructured 61
Semistructured 62
Content Extraction 62
Text Processing 63
Cleansing and Normalization 64
Enrichment 64
Metadata Generation 65
Inverted Index 67
Retrieval Models 68
Boolean Model 68
Vector Space Model 69
Probabilistic Model 70
Language Model 70
Information Retrieval Process 70
Plan 71
Know the Vertical 71
Know the End User 71
Know the Content 72
Know the Medium 72
Execute 73
Evaluate 73
True Positive 73
False Positive 74
True Negative 74
False Negative 74
Evaluation Metrics 74
Accuracy 74
Precision and Recall 75
F-Measure 76
Summary 76
Chapter 4: Schema Design and Text Analysis 77
Schema Design 77
Documents 78
schema.xml File 78
Fields 79
Field Attributes 79
name 79
type 80
default 80
Reserved Field Names 80
fieldType 80
Implementation Class 81
fieldType Attributes 82
indexed 82
stored 83
required 83
multiValued 84
docValues 84
sortMissingFirst/sortMissingLast 84
positionIncrementGap 84
precisionStep 85
omitNorms 85
omitTermFreqAndPositions 85
omitPosition 85
termVectors 85
termPositions 85
termOffsets 86
termPayloads 86
Text Analysis, If Applicable 86
copyField 86
copyField Attributes 87
source 87
dest 87
maxChars 87
Define the Unique Key 87
Dynamic Fields 87
defaultSearchField 88
solrQueryParser 88
Similarity 89
Text Analysis 89
Tokens 92
Terms 92
Analyzers 93
Simple Analysis 93
Analysis Chain 93
Analysis Phases 93
Indexing 94
Querying 95
Analysis Tools 95
Solr Admin Console 95
Luke 96
Analyzer Components 96
CharFilters 97
Tokenizers 97
TokenFilters 97
Common Text Analysis Techniques 98
Synonym Matching 98
Parameters 98
Phonetic Matching 99
N-Grams 100
Shingling 101
Parameters 102
Stemming 102
KeywordMarkerFilter 104
StemmerOverrideFilter 104
Blacklist (Stop Words) 105
Whitelist (Keep Words) 106
Other Normalization 106
Lowercasing 106
Convert to Closest ASCII Character 107
Remove Duplicate Tokens 107
Multilingual Support 107
Going Schemaless 109
What Makes Solr Schemaless 109
Automatic Field Type Identification 109
Automatic Field Addition 110
Managed Schema and REST API 110
Dynamic Fields 110
Configuration 110
Limitations 112
REST API for Managing Schema 112
Configuration 113
REST Endpoints 113
Other Managed Resources 115
Usage Steps 115
solrconfig.xml File 115
Frequently Asked Questions 116
This section provides answers to some of the questions that developers often ask while defining schema.xml. How do I ha... 116
Why Is My Schema Change Not Reflected in Solr? 116
I Have Created a Core in Solr 5.0, but schema.xml Is Missing. Where Can I Find It? 116
Summary 116
Chapter 5: Indexing Data 118
Indexing Tools 119
Post Script 119
SimplePostTool 119
curl 119
SolrJ Java Library 120
Other Libraries 120
Indexing Process 120
UpdateRequestHandler 122
UpdateRequestProcessorChain 122
UpdateRequestProcessor vs. Analyzer/Tokenizer 124
Indexing Operations 124
XML Documents 124
Add 125
Update 126
Delete 127
Commit 127
Hard Commit 128
Soft Commit 128
Optimize 128
Rollback 129
JSON Documents 129
CSV Documents 130
Index Rich Documents 130
DataImportHandler 132
Import from RDBMS 132
Document Preprocessing 134
Language Detection 134
Generate Unique ID 135
Deduplication 136
Document Expiration 136
Indexing Performance 138
Custom Components 139
Custom UpdateRequestProcessor 139
Frequently Occurring Problems 141
Copying Multiple Fields to a Single-Valued Field 141
Document Is Missing Mandatory uniqueKey Field 142
Data Not Indexed 142
Indexing Is Slow 142
OutOfMemoryError—Java Heap Space 143
Summary 143
Chapter 6: Searching Data 144
Search Basics 144
Prerequisites 145
Solr Search Process 145
SearchHandler 146
Registered Components 146
Declare Parameters 148
defaults 148
appends 148
invariants 149
SearchComponent 149
QueryParser 150
QueryResponseWriter 151
Solr Query 151
Default Query 152
Query a Default Field 152
Query a Specified Field 152
Match Tokens in Multiple Fields 153
Query Operators 153
Phrase Query 153
Proximity Query 154
Fuzzy Query 154
Wildcard Query 155
Range Query 155
Function Query 156
Filter Query 156
Query Boosting 157
Global Query Parameters 157
q 157
fq 157
rows 157
start 157
defType 158
sort 158
fl 158
wt 158
debugQuery 158
explainOther 158
timeAllowed 158
omitHeader 159
cache 159
Query Parsers 159
Standard Query Parser 159
DisMax Query Parser 159
Using the DisMax Query Parser 160
Query Parameters 160
q 160
qf 160
q.alt 160
mm 161
qs 161
pf 161
ps 162
tie 162
bq 162
bf 162
Sample DisMax Query 162
eDisMax Query Parser 163
lowercaseOperators 163
boost 164
pf2/pf3 164
ps2/ps3 164
stopwords 164
uf 164
alias 165
JSON Request API 165
Customizing Solr 167
Custom SearchComponent 168
Extend SearchComponent 168
Override the Abstract Methods 168
Get the Request and Response Objects 169
Add the JAR to the Library 169
Register to the Handler 169
Sample Component 169
Java Source Code 170
solrconfig.xml 171
Query 171
Response 171
Frequently Asked Questions 172
I have used KeywordTokenizerFactory in fieldType definition but why is my query string getting tokenized on whitespace? 172
How can I find all the documents that contain no value? 173
How can I apply negative boost on terms? 173
Which are the special characters in query string? How should they be handled? 173
Summary 173
Chapter 7: Searching Data: Part 2 174
Local Parameters 175
Syntax 175
Specifying the Query Parser 175
Specifying the Query Inside the LocalParams Section 175
Using Parameter Dereferencing 175
Example 175
Result Grouping 176
Prerequisites 176
Request Parameters 176
Example 178
Statistics 180
Request Parameters 180
Supported Methods 181
LocalParams 182
Example 182
Faceting 184
Prerequisites 185
Tokenization 185
Lowercasing 185
Syntax 186
JSON API 186
Example 186
Faceting Types 187
General Parameter 187
facet 187
Field Faceting 187
Specific Parameters 187
facet.field 187
facet.prefix 187
facet.contains 187
facet.contains.ignoreCase 187
facet.sort 187
facet.offset 188
facet.limit 188
facet.mincount 188
facet.missing 188
facet.method 188
facet.enum.cache.minDF 188
facet.threads 189
facet.overrequest.count 189
facet.overrequest.ratio 189
Query Faceting 189
Specific Parameters 189
facet.query 189
Range Faceting 189
Specific Parameters 190
facet.range 190
facet.range.start 190
facet.range.end 190
facet.range.gap 190
facet.range.hardend 190
facet.range.include 190
facet.range.other 191
facet.mincount 191
Example 191
Interval Faceting 191
Specific Parameters 191
facet.interval 191
facet.interval.set 192
Example 192
Pivot Faceting: Decision Tree 192
Specific Parameters 192
facet.pivot 193
facet.pivot.mincount 193
Example 193
Reranking Query 194
Request Parameters 194
reRankQuery 194
reRankDocs 194
reRankWeight 194
Example 194
Join Query 195
Limitations 195
Example 195
Block Join 195
Prerequisites 195
Example 196
Function Query 197
Prerequisites 198
Usage 199
Function Categories 200
Example 201
Caution 201
Custom Function Query 201
Java Source Code 204
Referencing an External File 206
Usage 206
Summary 207
Chapter 8: Solr Scoring 208
Introduction to Solr Scoring 208
Default Scoring 210
Implementation 211
Scoring Factors 211
Scoring Formula 212
tf(t in d) 213
idf(t) 213
coord(q,d) 213
queryNorm(q) 214
t.getBoost() 214
norm(t,d) 214
Limitations 215
Explain Query 215
Alternative Scoring Models 218
BM25Similarity 218
DFRSimilarity 220
Basic Model 220
After-Effect Model 221
Normalization 221
Usage 222
Other Similarity Measures 222
Per Field Similarity 223
Custom Similarity 224
Summary 226
Chapter 9: Additional Features 227
Sponsored Search 227
Usage 228
Spell-Checking 230
Generic Parameters 231
Implementations 231
IndexBasedSpellChecker 231
DirectSolrSpellChecker 232
FileBasedSpellChecker 233
WordBreakSolrSpellChecker 233
How It Works 234
Usage 234
Autocomplete 237
Traditional Approach 238
TermsComponent 238
Benefits 238
Limitations 239
Usage 239
Facets 241
Benefits 241
Limitations 241
Usage 241
EdgeNGram 242
Benefits 242
Limitations 243
Usage 243
SuggestComponent 244
Dictionary 244
DocumentDictionaryFactory 244
DocumentExpressionDictionaryFactory 244
HighFrequencyDictionary 244
FileDictionaryFactory 245
Algorithm 245
TSTLookupFactory 246
FSTLookupFactory 246
WFSTLookupFactory 246
JaspellLookupFactory 246
AnalyzingLookupFactory 246
FuzzyLookupFactory 247
AnalyzingInfixLookupFactory 247
BlendedInfixLookupFactory 248
FreeTextLookupFactory 248
How It Works 249
Usage 249
Document Similarity 252
Prerequisites 252
Implementations 253
Generic Parameters 253
MoreLikeThisComponent 254
How It Works 254
Usage 254
MoreLikeThisHandler 255
How It Works 255
Usage 256
MLTQParserPlugin 256
How It Works 256
Usage 257
Summary 257
Chapter 10: Traditional Scaling and SolrCloud 258
Stand-Alone Mode 258
Sharding 259
Master-Slave Architecture 261
Master 263
Slave 264
Shards with Master-Slave 264
SolrCloud 266
Understanding the Terminology 266
Node 266
Cluster 267
Core 268
Collection 268
Shards 268
Replica 269
Leader 269
Starting SolrCloud 270
Interactive Mode 270
Standard Mode 271
Restarting a Node 271
Creating a Collection 271
Uploading to ZooKeeper 272
Deleting a Collection 273
Indexing a Document 273
Load Balancing 274
Document Routing 274
Working with a Transaction Log 275
Performing a Shard Health Check 275
Querying Results 275
Performing a Recovery 276
Shard Splitting 277
Adding a Replica 277
ZooKeeper 277
Frequently Asked Questions 278
Why is the size of my data/tlog directory growing drastically? How can I handle that? 278
Can I totally disable transaction logs? What would be the impact? 279
I have recently migrated from traditional architecture to SolrCloud. Is there anything that I should be careful of and no... 279
I am migrating to SolrCloud, but it fails to upload the configurations to ZooKeeper. What could be the reason? 279
Summary 279
Chapter 11: Semantic Search 280
Limitations of Keyword Systems 281
Semantic Search 282
Tools 284
OpenNLP 284
Apache UIMA 284
Apache Stanbol 285
Techniques Applied 285
Part-of-Speech Tagging 287
Solr Plug-in for POS Tagging 288
Named-Entity Extraction 291
Using Rules and Regex 292
Using a Dictionary or Gazetteer 293
Using a Trained Model 294
Solr Plug-in for Entity Extraction 295
Semantic Enrichment 298
Synonym Expansion 299
WordNet 300
Solr Plug-in for Synonym Expansion 300
Synonym Expansion Using WordNet 300
Custom Token Filter for Synonym Expansion 302
Summary 307
Index 308

Publication date (per publisher) 26.12.2015
Additional info XXVI, 299 p. 56 illus.
Place of publication Berkeley
Language English
Subject area Mathematics / Computer Science > Computer Science > Databases
Mathematics / Computer Science > Computer Science > Theory / Studies
Mathematics / Computer Science > Computer Science > Web / Internet
Keywords Advanced querying • Enterprise search server • Indexing data • online searching • SolrCloud
ISBN-10 1-4842-1070-0 / 1484210700
ISBN-13 978-1-4842-1070-3 / 9781484210703
PDF (watermark)
Size: 4.5 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suited to technical books with columns, tables, and figures. A PDF can be displayed on almost any device, but is only of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Read online
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.
