I am using Solr for searching. My index size is growing hour by hour, so query times are getting longer as well. Many people have suggested sharding. Is that the last resort? What should I do now?
Before rushing into sharding, which will definitely make your search faster, you might first look at your schema and see whether you can make any optimisations there.
Use stop words: stop words are very common words that can inflate the index size unnecessarily. Filter them out wherever that makes sense for your content.
Avoid synonyms with the 'expand' option if you can; they also grow the index enormously.
Avoid n-grams with a large size range; they generate far too many token combinations.
Use filter queries (the fq parameter) when you just need a filter. Filter queries are faster than normal queries because they do not apply any scoring; they just filter. So if you need to AND queries together, put the restricting queries in the fq parameter (see the sketch after this list).
Run "Optimise Index" from time to time to get rid of deleted docs in the index, and to reduce index size.
use debugQuery=on and see if you can spot any thing that is taking long time.
Try the documentCache if your documents are large.
Try the filterCache if you run repeated filter queries.
Try the queryResultCache if you run repeated queries.
If none of the above yields any performance gains, then you might consider sharding / distributed search.
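As a rough illustration of the fq and debugQuery points above, here is a minimal SolrJ sketch; the core URL and field names are placeholders, not anything from your setup:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class FilteredSearchExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrQuery query = new SolrQuery("title:solr");   // the scored part of the search
            query.addFilterQuery("category:books");          // unscored filter, served from the filterCache
            query.set("debugQuery", "on");                   // ask Solr for parse/timing details
            System.out.println(solr.query(query).getDebugMap());
        }
    }
}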
I am using MemoryIndex from the Lucene Java API to index text content in memory and run queries over it. There can be hundreds of such queries running against a single document to find matches, and I would like to know the most efficient way to do this.
Currently I am creating multiple Query objects and looping over them to see which ones match the text in memory.
The text can be a few KB in size.
The queries are complex, combining boolean clauses and phrases.
A single query might be around 1 KB at most.
This question has been up for quite some time and I will try to answer it myself.
I implemented this by storing all my parsed Query objects in a list: I build each query once with the QueryParser and keep the parsed object in an in-memory list.
This improves performance because I do not have to rebuild the queries every time a new text comes in.
In my case we had hundreds of complex queries, but they were static and did not change, so it made sense to keep the parsed queries in memory rather than build them every time.
I implemented this more than a year ago at my previous company, using Apache Lucene and Java.
Note: one major problem I faced was Lucene's default stop-word filter, which trimmed out parts of the text; that was not the behaviour I needed.
I no longer have access to the code, so I am sorry if the answer seems vague.
Useful classes:
https://lucene.apache.org/core/6_6_2/memory/org/apache/lucene/index/memory/MemoryIndex.html
http://lucene.apache.org/core/6_6_2/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#parse-java.lang.String-
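For what it's worth, here is a minimal sketch of the parse-once, match-many approach described above; the "content" field name and the use of StandardAnalyzer are illustrative assumptions, not from the original code:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Parse the static queries once, then reuse them for every incoming text.
public class OneDocMatcher {
    private final Analyzer analyzer = new StandardAnalyzer();
    private final List<Query> parsedQueries = new ArrayList<>();

    public OneDocMatcher(List<String> rawQueries) throws Exception {
        QueryParser parser = new QueryParser("content", analyzer);
        for (String raw : rawQueries) {
            parsedQueries.add(parser.parse(raw)); // built once, kept in memory
        }
    }

    // Build a throwaway MemoryIndex for each text and run all cached queries against it.
    public List<Query> match(String text) {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, analyzer);
        List<Query> matches = new ArrayList<>();
        for (Query q : parsedQueries) {
            if (index.search(q) > 0.0f) { // a score above zero means the query matched
                matches.add(q);
            }
        }
        return matches;
    }
}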
Suppose there's a table with columns (UserID, FieldID, Value) holding half a million records. I want to see whether some search term T(N) occurs anywhere in each Value (i.e. Value.Contains(T(N))).
I think I'm just hitting a wall volume-wise: there are simply too many values to sift through. I don't think a full-text index will help, because it is only useful for "starts with" queries that match individual words, not occurrences anywhere within the string.
Is there a good approach to indexing this kind of data for such a search in SQL Server?
Half a million records is not terribly large, although I don't know the size of the field contents. A couple of ideas follow; this was too long for a comment, or I would have posted it as one.
You could stand up a full-text search engine like Elasticsearch, Solr, etc. and use it as a sidecar. If your text searches don't otherwise make much use of the other data, this might be easy enough. You could also push other data into Elasticsearch or Solr for searching, but I'm not sure you'd want to duplicate all your data, and those tools aren't really great as a transactional data store.
Another option for volumes this small, assuming you only need basic "contains" searching: create two more tables, keywords and keyword_index (or whatever). When saving, tokenize your text content, write any new keywords to the keywords table, and then add the corresponding rows to the join table. Index everything, and run your search against the keywords table, joining back to the master via the intermediate keyword_index table.
This is fairly hackish, and getting your keyword handling really dialed in (stemming, etc.) may be a pain, but it is a reasonable quick-and-dirty solution for smaller-scale needs; a minimal sketch follows.
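A minimal sketch of the save-side tokenization in plain JDBC, assuming hypothetical tables keywords(KeywordID, Keyword) and keyword_index(KeywordID, UserID, FieldID); adapt the SQL and the naive tokenizer to taste:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class KeywordIndexer {
    // Tokenize a Value and record its keywords for later "contains"-style lookups.
    public static void indexValue(Connection conn, int userId, int fieldId, String value)
            throws SQLException {
        Set<String> tokens = new HashSet<>(Arrays.asList(value.toLowerCase().split("\\W+")));
        for (String token : tokens) {
            if (token.isEmpty()) continue;
            // insert the keyword only if it is new (MERGE would also work in SQL Server)
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO keywords (Keyword) SELECT ? WHERE NOT EXISTS"
                  + " (SELECT 1 FROM keywords WHERE Keyword = ?)")) {
                ins.setString(1, token);
                ins.setString(2, token);
                ins.executeUpdate();
            }
            // link the keyword back to the owning row via the join table
            try (PreparedStatement link = conn.prepareStatement(
                    "INSERT INTO keyword_index (KeywordID, UserID, FieldID)"
                  + " SELECT KeywordID, ?, ? FROM keywords WHERE Keyword = ?")) {
                link.setInt(1, userId);
                link.setInt(2, fieldId);
                link.setString(3, token);
                link.executeUpdate();
            }
        }
    }
}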
I am using Elasticsearch to index my documents (although I believe my question can apply to any other search engine such as Lucene or Solr as well).
I am using the Porter stemmer and a list of stop words at index time. I know that I should apply the same stemming and stop-word removal at search time to get correct results.
My question is: what if I decide to change my stemmer, or add or remove a couple of words from the stop-word list? Should I reindex all the documents (or all the text fields) to apply the changes, or is there another way to deal with this situation?
Yes, if you need to change your analyzer significantly you must reindex your documents. If you don't, changes will only affect query analysis. You might be able to get away with that on a change to a StopFilter, but not when changing a stemmer. Reindexing is the only way to apply new analysis rules to indexed data, whether you reindex by dumping the whole thing and rebuilding it from scratch, or by updating the documents.
As far as other approaches go, if you don't want to reindex, you are stuck limiting your analysis changes to query time, which drastically limits what you can do (you could make a SynonymFilter work, but again, changes to the stemmer are definitely out).
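If your fields are stored, a Lucene-level rebuild can look roughly like the following sketch; the "content" field name and EnglishAnalyzer are assumptions for illustration, and each document has to be reconstructed with your own field types, since stored-field retrieval alone does not preserve how a field was indexed:

import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

public class Reindexer {
    public static void main(String[] args) throws Exception {
        Analyzer newAnalyzer = new EnglishAnalyzer(); // your changed analysis chain
        try (Directory oldDir = FSDirectory.open(Paths.get("old-index"));
             Directory newDir = FSDirectory.open(Paths.get("new-index"));
             IndexReader reader = DirectoryReader.open(oldDir);
             IndexWriter writer = new IndexWriter(newDir, new IndexWriterConfig(newAnalyzer))) {
            Bits liveDocs = MultiFields.getLiveDocs(reader); // null when there are no deletions
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (liveDocs != null && !liveDocs.get(i)) continue; // skip deleted docs
                String content = reader.document(i).get("content"); // stored value of the text field
                Document doc = new Document();
                doc.add(new TextField("content", content, Field.Store.YES)); // re-analyzed on add
                writer.addDocument(doc);
            }
        }
    }
}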
My problem is that I have n fields (say around 10) in Solr that are searchable; they are all indexed and stored. I would like to first run a query on my whole index of, say, 5000 docs, which will hit around 500 docs on average. Next I would like to run a query with a different set of keywords over those 500 docs only, NOT over the whole index.
So the first query generates a score; when I run the second query, the new scores should be computed over only the 500 documents matched by the first query. In other words, Solr should treat those 500 docs as if they were the whole index.
To summarise: an index of 5000 is filtered down to 500 and then to 50 (5000 > 500 > 50). It is basically filtering, but I would like to do it inside Solr.
I have reasonable basic knowledge and am still learning.
Update: If represented mathematically it would look like this:
results1 = f(query1)
results2 = f(query2, results1)
final_results = f(query3, results2)
I would like this to be accomplished programmatically, and the end user will only see the final 50 results, so faceting is not an option.
Two likely implementations occur to me. The simplest approach would be to just AND the first query with the second query:
+(first query) +(new query)
This is a good approach if the first query, which you want to filter on, changes often. If the first query is something like a category of documents, or something similar where you can benefit from reuse of the same filter, then a filter query is the better approach, using the fq parameter, something like:
q=field:query2&fq=categoryField:query1
Filter queries cache a set of document IDs to filter against, so for commonly used restrictions (categories, common date ranges, etc.) they can yield a significant performance benefit. For uncommon searches or user-entered search strings, though, caching the results may just incur needless overhead and pollute the cache with a result set that will never be reused.
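In SolrJ terms, the 5000 > 500 > 50 flow from the question could look roughly like this sketch; the URL and field names are placeholders:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ChainedFilterSearch {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrQuery query = new SolrQuery("title:query3"); // final, score-generating query
            query.addFilterQuery("body:query1");             // first narrowing step, cached
            query.addFilterQuery("body:query2");             // second narrowing step
            query.setRows(50);                               // the end user only sees 50 results
            QueryResponse rsp = solr.query(query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}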
Filter queries (fq) are specifically designed to restrict the result set quickly, without doing any score calculation.
So if you put your first query into the fq parameter and your second, score-generating query into the normal q parameter, it should do what you are asking for.
See also a question discussing this issue from the opposite direction.
I believe you want to use a nested query like this:
text:"roses are red" AND _query_:"type:poems"
You can read more about nested queries here:
http://searchhub.org/2009/03/31/nested-queries-in-solr/
You should take a look at Solr's faceted search (http://wiki.apache.org/solr/SolrFacetingOverview); it can help with this kind of "iterative" search.
I'm using Solr to search for a long list of IDs like so:
ID:("4d0dbdd9-d6e1-b3a4-490a-6a9d98e276be"
"4954d037-f2ee-8c54-c14e-fa705af9a316"
"0795e3d5-1676-a3d4-2103-45ce37a4fb2c"
"3e4c790f-5924-37b4-9d41-bca2781892ec"
"ae30e57e-1012-d354-15fb-5f77834f23a9"
"7bdf6790-de0c-ae04-3539-4cce5c3fa1ff"
"b350840f-6e53-9da4-f5c2-dc5029fa4b64"
"fd01eb56-bc4c-a444-89aa-dc92fdfd3242"
"4afb2c66-cec9-8b84-8988-dc52964795c2"
"73882c65-1c5b-b3c4-0ded-cf561be07021"
"5712422c-12f8-ece4-0510-8f9d25055dd9"...etc
This works up to a point, but above a certain size fails with the message: too many boolean clauses. You can increase the limit in solrconfig.xml, but this will only take it so far - and I expect the limit is there for a reason:
<maxBooleanClauses>1024</maxBooleanClauses>
I could split the query into several smaller ones, but that would then prevent me from sorting across the combined results. Surely there is a more appropriate way of doing this?
You should use a Lucene filter instead of building up a huge boolean query. Try FieldCacheTermsFilter and pass that filter to your Searcher. FieldCacheTermsFilter translates your UIDs into a Lucene DocIdSet, and it does so quickly because it works through the FieldCache. A sketch follows.
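A rough sketch against the older Lucene 4.x API (where FieldCacheTermsFilter still exists; it was removed in later versions), assuming the ID field is indexed as a single untokenized term and a hypothetical "name" field for sorting:

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.FieldCacheTermsFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class IdFilterSearch {
    public static void main(String[] args) throws Exception {
        String[] ids = {
            "4d0dbdd9-d6e1-b3a4-490a-6a9d98e276be",
            "4954d037-f2ee-8c54-c14e-fa705af9a316"
            // ...the rest of the list; no boolean-clause limit applies here
        };
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // match everything, restrict to the given IDs via the filter,
            // and sort across all hits, so splitting the query is not needed
            TopDocs hits = searcher.search(new MatchAllDocsQuery(),
                                           new FieldCacheTermsFilter("ID", ids),
                                           100,
                                           new Sort(new SortField("name", SortField.Type.STRING)));
            System.out.println("matched: " + hits.totalHits);
        }
    }
}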