Should I reindex documents in Elasticsearch when I change the stemmer?

I am using Elasticsearch to index my documents (although I believe my question applies to any other search engine, such as Lucene or Solr, as well).
I am using the Porter stemmer and a list of stop words at index time. I know that I should apply the same stemmer and stop word removal at search time to get correct results.
My question is: what if I decide to change my stemmer, or add/remove a couple of words to/from the list of stop words? Should I reindex all the documents (or all the text fields) to apply the changes? Or is there another approach to deal with this situation?

Yes, if you change your analyzer significantly you must reindex your documents. If you don't, the changes will only affect query analysis. You might be able to get away with that for a change to a StopFilter, but not when changing a stemmer. Reindexing is the only way to apply new analysis rules to indexed data, whether you reindex by dumping the whole index and rebuilding it from scratch, or by updating the documents.
As far as other approaches go, if you don't want to reindex, you are stuck limiting your analysis changes to query time, which drastically limits what you can do (you could make a SynonymFilter work, but again, changes to the stemmer are definitely out).
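As a concrete illustration, here is a minimal sketch of the usual procedure in Elasticsearch: create a new index whose settings carry the changed analyzer, then copy the documents over with the _reindex API so each one is re-analyzed on the way in. The index names, analyzer definition, and stop word list are hypothetical; this assumes Elasticsearch 5+ and the 8.x Python client.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Create a new index whose analysis settings reflect the new stemmer
#    and stop word list (all names here are hypothetical).
es.indices.create(
    index="docs_v2",
    settings={
        "analysis": {
            "filter": {
                "my_stop": {"type": "stop", "stopwords": ["a", "an", "the"]},
                # e.g. switching from "porter" to "porter2"
                "my_stemmer": {"type": "stemmer", "language": "porter2"},
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_stop", "my_stemmer"],
                }
            },
        }
    },
    mappings={"properties": {"body": {"type": "text", "analyzer": "my_analyzer"}}},
)

# 2. Copy every document from the old index; each is re-analyzed with the
#    new analyzer as it is written into docs_v2.
es.reindex(source={"index": "docs_v1"}, dest={"index": "docs_v2"})
```

If clients read through an index alias, you can point the alias at docs_v2 once the copy finishes, so the swap is invisible to them.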

Related

Azure Cognitive Search - When would you use different search and index analyzers?

I'm trying to understand the purpose of configuring a different analyzer for searching and indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-
According to my understanding, the job of the indexing analyzer is to break up the input document into individual tokens. Through this process, it might apply multiple transformations, like lower-casing the content, removing punctuation and whitespace, and even removing entire words.
If the tokens are already processed, what is the use of the search analyzer?
Initially, I thought it would apply a similar process to the search query itself, but wouldn't setting a different analyzer from the one used to index the documents completely break the search results? If the indexing analyzer lower-cased everything but the search analyzer doesn't lower-case the query, wouldn't that mean you'd never get matches for queries with upper-case characters? And what if the search analyzer doesn't split tokens on whitespace? Wouldn't matching fail the moment the query includes a space?
Assuming that this is indeed how the two analyzers work together, why would you ever want to set two different ones?
Your understanding of the difference between index and search analyzers is correct. An example scenario where this is valuable is using n-grams for indexing but not for the search terms. Indexing a document containing "cat" would then produce "c", "ca", and "cat", but you wouldn't necessarily want to apply n-grams to the search term, as that would make the query less performant and isn't necessary, since the documents already produced the n-grams. Hopefully that makes sense!
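To make that concrete, here is a hedged sketch of an Azure Cognitive Search index definition that applies an edge n-gram analyzer at index time but the predefined standard.lucene analyzer at search time; indexAnalyzer and searchAnalyzer are the relevant field properties. The service name, admin key, index name, and field names below are placeholders.

```python
# A sketch, not a definitive setup: edge n-grams at index time ("cat" ->
# "c", "ca", "cat"), plain tokenization at query time. Placeholders:
# <service>, <admin-key>, and the index/field names.
import requests

index_definition = {
    "name": "products",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {
            "name": "title",
            "type": "Edm.String",
            "searchable": True,
            "indexAnalyzer": "prefix_analyzer",   # applied when indexing
            "searchAnalyzer": "standard.lucene",  # applied to the query
        },
    ],
    "analyzers": [
        {
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "name": "prefix_analyzer",
            "tokenizer": "standard_v2",
            "tokenFilters": ["lowercase", "edge_ngrams"],
        }
    ],
    "tokenFilters": [
        {
            "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
            "name": "edge_ngrams",
            "minGram": 1,
            "maxGram": 20,
            "side": "front",
        }
    ],
}

resp = requests.put(
    "https://<service>.search.windows.net/indexes/products?api-version=2020-06-30",
    headers={"api-key": "<admin-key>"},
    json=index_definition,
)
resp.raise_for_status()
```

A query for "ca" then matches the stored "ca" gram directly, without n-gramming the query itself.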

If I drop the same document into ElasticSearch again, is it going to reindex it?

This is obviously a question about ES internals.
What I have is a custom search engine built on top of ES, fed with data from multiple vendors. To find out whether a particular document has changed since it was last indexed (during, e.g., a periodic re-pull of documents from vendors; there's no way to ask some vendors "give me only documents changed since that date"), I'd have to check it somehow for modification and push it into ES for indexing only if the document changed.
Question: does ES keep track of document checksums internally to see whether it actually needs to re-index? (Of course, I'm presuming it's not some HTML where fields, timestamps, etc. are updated dynamically on each GET.)
If it did (that is, if re-indexing identical documents had negligible amortized cost), that would obviously simplify updates for me.
If you use the Update API, you can detect no-ops: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#_detecting_noop_updates. You can see the source code for the no-op handling here: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/update/UpdateRequestBuilder. Note the "extra work" comment; that's definitely something to consider.
Keep in mind that the update API tends to be a lot slower than plain vanilla bulk inserts. Regular inserts, in which you let ES increment the _version number when you index a document into the same index with the same id, will be faster... but they'll also create GC and indexing pressure.
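For completeness, here is a minimal sketch of a no-op-detecting update with the 8.x Python client; the index and document values are hypothetical. With detect_noop (the default), an update whose merged source is identical to the stored document is skipped and reported as "noop":

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.update(
    index="vendor-docs",                 # hypothetical index name
    id="doc-42",                         # hypothetical document id
    doc={"title": "Spring catalogue", "body": "Unchanged body text."},
    detect_noop=True,                    # the default, shown for clarity
)

# "noop" if the merged source equals the stored document, else "updated"
print(resp["result"])
```

Note this still ships the candidate document to ES and performs the merge server-side, which is the "extra work" the linked comment warns about; a client-side checksum comparison before sending avoids even that.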

Reduce the query time in Solr

I am using Solr for searching. My index size is growing hour by hour, so query time is also getting higher. Many people have suggested sharding. Is that the last resort? What should I do now?
Before rushing into sharding, which will definitely make your search faster, have a look at your schema and see if you can do any optimisations there.
Use stop words: stop words are very common words that can inflate the index size unnecessarily. Apply stop words wherever you need them.
Avoid synonyms with the 'expand' option if you can; those also expand the index enormously.
Avoid using n-grams with a large range; they generate too many combinations if your field values are large.
Use query filters (the fq parameter) when you just need a filter. Filter queries are faster than normal queries and don't apply any scoring; they just filter. So if you need to AND queries together, put the pure filters in the fq parameter (see the sketch after this list).
Run "Optimise Index" from time to time to get rid of deleted docs in the index and to reduce index size.
Use debugQuery=on and see if you can spot anything that is taking a long time.
Try the documentCache if you have large documents.
Try the filterCache if you have repeated filter queries.
Try the queryResultCache if you have repeated queries.
If none of the above results in any performance gains, then you might consider sharding/distributed search.
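As an illustration of the fq advice above, here is a minimal sketch of a query that keeps the scored text in q and moves the pure filters into fq, with debugQuery enabled to inspect per-component timings (the core name and fields are hypothetical):

```python
import requests

params = {
    "q": "laptop",                       # scored full-text part
    "fq": ["category:electronics",       # non-scoring filters; each fq is
           "in_stock:true"],             # cached in the filterCache
    "debugQuery": "on",                  # per-component timing in the response
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/products/select", params=params)
print(resp.json()["debug"]["timing"])
```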

Showing search results more efficiently?

I want to implement the auto-complete feature provided by various e-commerce stores. The functionality is pretty simple: when you type some characters, it starts showing relevant suggestions.
I implemented it using Solr (django-haystack), using the autocomplete method provided in haystack.query.SearchQuerySet. Basically, I get a list of results sorted by score and show the top n results as suggestions.
The Solr document contains $product_name, $category_name, and other fields, so the generated results look like a list of "$product_name in $category_name" entries.
The problem arises when I change a category name: I then have to update all the products belonging to that category to reflect the change in the auto-complete (update all Solr documents for products of that category).
Another way to do this is to store just the category id with the product in the Solr document. In that case, I have to look up the category name each time, and that is not efficient.
Is there any other efficient way to do this?
Since you are changing the underlying data, the change has to be propagated to Solr.
There are different approaches to do this:
Update the database and reindex - Pros: simple enough. Cons: indexing time can be large.
Update the database and Solr in tandem - Pros: quick, almost instantaneous updates. Cons: can lead to data inconsistency (if one update fails).
Update the database and schedule a delta-import in Solr. This is a middle ground between the two above.
I would recommend the third approach, but it does require some upfront schema design. Read more about delta import here, in the context of DataImportHandler.
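As a sketch of the third approach, a delta-import can be triggered against the DataImportHandler endpoint (the core name here is hypothetical); it applies only the rows changed since the last import, as determined by the deltaQuery in your DIH configuration:

```python
import requests

# Trigger a delta-import on a hypothetical "products" core; only rows
# matched by the DIH deltaQuery since the last run are re-imported.
resp = requests.get(
    "http://localhost:8983/solr/products/dataimport",
    params={"command": "delta-import", "clean": "false", "commit": "true"},
)
print(resp.json().get("status"))  # "idle" or "busy"
```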

Can Solr index sentences instead of web pages?

I've just set up Solr, indexed some pages (crawled using Nutch), and I can now search.
I now need to change it to index sentences instead of web pages. The result I need is, for example, to search for "one word" and get a list of all sentences that contain "one" and/or "word".
I'm new to Solr so any pointers to where I should start from to achieve this would be extremely helpful. Is it at all possible? Or is there an easy way of doing this I've missed?
Yes. Solr indexes 'documents'. You define what a document is by what you post to it via the RESTful endpoint. If you push one sentence at a time, it indexes one sentence at a time.
If you meant 'can I push a whole page, have Solr split it into sentences, and index each one individually', then the answer is, I think, not very easily inside Solr. If you are using Nutch, I'd recommend putting the splitting into Nutch so that it presents Solr with one sentence at a time.
Neither the analysis chain nor update request processors provide for splitting a document into smaller documents (a sketch of client-side splitting follows). You might also consider the Elasticsearch alternative, though I have no concrete knowledge that there's a greased pole to slide down that leads to your solution there.
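As a sketch of the client-side splitting, here is one way to break a crawled page into sentences and post each one as its own Solr document. The core name, field names, and the naive regex splitter are all hypothetical; a real pipeline would use a proper sentence tokenizer (e.g. NLTK or spaCy):

```python
import re
import requests

page_url = "http://example.com/page"
page_text = "First sentence. Second sentence! Third sentence?"

# Naive split on sentence-ending punctuation followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", page_text.strip())

# One document per sentence, keyed by page URL plus sentence position.
docs = [
    {"id": f"{page_url}#s{i}", "url": page_url, "sentence": s}
    for i, s in enumerate(sentences)
]

# Post the sentences as individual documents to a hypothetical 'sentences' core.
resp = requests.post(
    "http://localhost:8983/solr/sentences/update?commit=true",
    json=docs,
)
resp.raise_for_status()
```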
