I'm using Solr to search for a long list of IDs like so:
ID:("4d0dbdd9-d6e1-b3a4-490a-6a9d98e276be"
"4954d037-f2ee-8c54-c14e-fa705af9a316"
"0795e3d5-1676-a3d4-2103-45ce37a4fb2c"
"3e4c790f-5924-37b4-9d41-bca2781892ec"
"ae30e57e-1012-d354-15fb-5f77834f23a9"
"7bdf6790-de0c-ae04-3539-4cce5c3fa1ff"
"b350840f-6e53-9da4-f5c2-dc5029fa4b64"
"fd01eb56-bc4c-a444-89aa-dc92fdfd3242"
"4afb2c66-cec9-8b84-8988-dc52964795c2"
"73882c65-1c5b-b3c4-0ded-cf561be07021"
"5712422c-12f8-ece4-0510-8f9d25055dd9"...etc
This works up to a point, but above a certain size it fails with the error "too many boolean clauses". You can increase the limit in solrconfig.xml, but that will only take it so far, and I expect the limit is there for a reason:
<maxBooleanClauses>1024</maxBooleanClauses>
I could split the query into several smaller ones, but that would prevent me from sorting the results. Is there a more appropriate way of doing this?
You should be using a Lucene filter instead of building up a huge boolean query. Try using FieldCacheTermsFilter and pass that filter into your Searcher. FieldCacheTermsFilter will translate your UIDs to a Lucene DocIdSet, and it will do it fast since it works via the FieldCache.
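If you're on Solr 4.x or later, the terms query parser offers similar single-clause behavior at the query level: it matches a long list of values without counting against maxBooleanClauses. A minimal sketch of building such a query string (field name ID taken from the question):

```python
def build_terms_query(field, ids):
    """Build a single-clause Solr query using the {!terms} query parser,
    which matches any of the given IDs without counting against
    maxBooleanClauses (available in Solr 4.x and later)."""
    return "{!terms f=%s}%s" % (field, ",".join(ids))

fq = build_terms_query("ID", [
    "4d0dbdd9-d6e1-b3a4-490a-6a9d98e276be",
    "4954d037-f2ee-8c54-c14e-fa705af9a316",
])
```

Passing the result as an fq leaves the main q free for scoring, so the full result set can still be sorted in one request.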
We have several Azure Search indexes that use a Cosmos DB collection of 25K documents as a source and each index has a large number of document properties that can be used for sorting and filtering.
We have a requirement to allow users to sort and filter the documents, then search for and jump to a specific document's page in the paginated result set.
Is it possible to query an Azure Search index with sorting and filtering and get the position/rank of a specific document ID in the result set? Or would I need to look at an alternative option? I believe there could be a way of doing this with a SQL back-end, but obviously that would be a major undertaking to implement.
I've yet to find a way of doing this other than writing a query that paginates through the results until it finds the required document, which would be a relatively expensive and possibly slow task in terms of processing on the server.
There is no mechanism in Azure Search for filtering within the resultset of another query. You'd have to page through results, looking for the document ID on the client side. If your queries aren't very selective and produce many pages of results, this can be slow as $skip actually re-evaluates all results up to the page you specify.
You could use caching to make this faster. At least one Azure Search customer is using Redis to cache search results. If your queries are selective enough, you could even cache the results in memory so you'd only pay the cost of paging once.
Trying this at the moment. I'm using a two step process:
Generate your query, but set $count=true and $top=0. The query result should contain a field named @odata.count.
You can then pick an index and use $top=1 and $skip=<index> to return a single entry. There is one caveat: $skip will only accept numbers less than 100000.
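The two steps above can be sketched as parameter builders (the base_params dict is a hypothetical stand-in for whatever shared search/filter/orderby parameters your query already uses):

```python
def count_params(base_params):
    """Step 1: same query, but ask only for the total count."""
    p = dict(base_params)
    p.update({"$count": "true", "$top": 0})
    return p

def entry_at_params(base_params, index):
    """Step 2: fetch the single entry at a given position.
    Caveat from above: $skip only accepts values below 100000."""
    if index >= 100000:
        raise ValueError("$skip must be less than 100000")
    p = dict(base_params)
    p.update({"$top": 1, "$skip": index})
    return p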
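The two steps above can be sketched as parameter builders (the base_params dict is a hypothetical stand-in for whatever shared search/filter/orderby parameters your query already uses):

```python
def count_params(base_params):
    """Step 1: same query, but ask only for the total count."""
    p = dict(base_params)
    p.update({"$count": "true", "$top": 0})
    return p

def entry_at_params(base_params, index):
    """Step 2: fetch the single entry at a given position.
    Caveat from above: $skip only accepts values below 100000."""
    if index >= 100000:
        raise ValueError("$skip must be less than 100000")
    p = dict(base_params)
    p.update({"$top": 1, "$skip": index})
    return p
```

Both parameter sets would then be sent to the index's search endpoint; only the second returns a document body.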
I have an Elasticsearch index that contains around 2.5 billion documents with around 18 million different terms in an analyzed field. Is it possible to quickly get a count of the number of documents that contain a term without searching the index?
It seems like ES would store that information while analyzing the field, or perhaps be able to count the length of an inverted index. If there is a way to search for multiple terms and get the document frequency for each of the terms, that would be even better. I want to do this thousands of times on a regular basis, and I can't tell if there is an efficient way to do that.
You can use the Count API to just return the count from a query, instead of a full document listing.
As far as whether Elasticsearch gives you a way to do this without a query: I'm reasonably confident Elasticsearch doesn't have a store of that information outside the index, because that is exactly what a lucene index already does. That's what an inverted index is, a map of documents indexed by term. Lucene is designed around making looking up documents by term efficient.
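What the inverted index buys you can be shown in a few lines of toy Python; this is an illustration of the data structure, not of Lucene's actual on-disk format:

```python
from collections import defaultdict

# Toy illustration: an inverted index maps each term to the set of
# documents containing it, so document frequency is just the length
# of a term's posting list -- no scan of the documents is required.
docs = {1: "red rose", 2: "red wine", 3: "white wine"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

doc_freq = {term: len(postings) for term, postings in index.items()}
# doc_freq["red"] == 2, doc_freq["wine"] == 2, doc_freq["rose"] == 1
```

This is why a count query against the index is already about as direct as it gets: the index is the per-term document map.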
Let's say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'},
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like MongoDB, MapReduce, Hadoop...? All I know are the names of other technologies, and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will, by default, return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list), you can use the mm eDisMax parameter; cf. the Solr Wiki.
Having said that, I must add that I've never searched for 200 query terms (do I understand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
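What the mm cutoff does can be sketched in plain Python; this is an in-memory stand-in for the ranking behavior described above, not how Solr computes scores internally:

```python
def match_score(query_items, candidate):
    """Fraction of the query list's items found in a candidate list --
    roughly what a multiValued field queried with many OR'd terms rewards."""
    q = set(query_items)
    return len(q & set(candidate)) / len(q)

query = ["potato", "rice", "carrot", "corn"]
lists = [
    ["beans", "potato", "oranges", "lettuce"],
    ["carrot", "rice", "corn", "apple"],
    ["onion", "garlic", "radish", "eggs"],
]
ranked = sorted(lists, key=lambda l: match_score(query, l), reverse=True)
# ranked[0] == ['carrot', 'rice', 'corn', 'apple']  (3 of 4 terms match)

# An mm cutoff of 40% corresponds to keeping only candidates scoring >= 0.4:
matches = [l for l in lists if match_score(query, l) >= 0.4]
```

The difference at scale is that Solr evaluates this via the inverted index rather than looping over a million lists.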
I am using Solr for searching. My index size is growing hour by hour, so query time is also increasing. Many people have suggested sharding. Is this the only option? What should I do now?
Before rushing into sharding, which will definitely make your search faster, you might have a look at your schema and see if you can do any optimisations there.
Use stop words: stop words are very common words that might inflate the index size unnecessarily. Add them to your stop word list wherever appropriate.
Avoid Synonyms with 'Expand' option if you can. Those also expand the index enormously.
Avoid using N-grams with a large range. These generate too many combinations if your fields are long.
Use query filters (fq parameter) when you just need a filter. Filter queries are faster than normal queries, and they don't apply any scoring. It is just a filter. So if you need to AND queries together, put the filter queries in the fq parameter.
Run "Optimise Index" from time to time to get rid of deleted docs in the index, and to reduce index size.
Use debugQuery=on and see if you can spot anything that is taking a long time.
Try documentCache if you have large documents.
Try filterCache if you have repeated filter queries.
Try queryResultCache if you have repeated queries.
If none of the above results in any performance gains, then you might consider sharding/distributed search.
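Points 4 and 6 above combine naturally; a minimal sketch of a request with cached filters and debug timing enabled (the field names here are invented for illustration):

```python
from urllib.parse import urlencode

# Sketch: keep the scored part in q and move pure restrictions into
# (repeatable, cacheable) fq parameters; debugQuery exposes timings.
params = {
    "q": "title:performance",                    # scored, relevance-ranked part
    "fq": ["type:article", "year:[2020 TO *]"],  # cached filters, no scoring
    "debugQuery": "on",                          # inspect per-component timings
}
query_string = urlencode(params, doseq=True)
```

Each fq clause gets its own filterCache entry, so repeated filters skip both scoring and re-evaluation on subsequent requests.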
My problem is I have n fields (say around 10) in Solr that are searchable, they all are indexed and stored. I would like to run a query first on my whole index of say 5000 docs which will hit around an average of 500 docs. Next I would like to query using a different set of keywords on these 500 docs and NOT on the whole index.
So the first time I send a query a score will be generated, the second time I run a query the new score generated should be based on the 500 documents of the previous query, or in other words Solr should consider only these 500 docs as the whole index.
To summarise this, an index of 5000 will be filtered to 500 and then 50 (5000>500>50). It's basically filtering, but I would like to do this in Solr.
I have reasonable basic knowledge and still learning.
Update: If represented mathematically it would look like this:
results1=f(query1)
results2=f(query2, results1)
final_results=f(query3, results2)
I would like this to be accomplished programmatically, and the end user will only see 50 results, so faceting is not an option.
Two likely implementations occur to me. The simplest approach would be to just add the first query to the second query:
+(first query) +(new query)
This is a good approach if the first query, which you want to filter on, changes often. If the first query is something like a category of documents, or something similar where you can benefit from reuse of the same filter, then a filter query is the better approach, using the fq parameter, something like:
q=field:query2&fq=categoryField:query1
Filter queries cache a set of document IDs to filter against, so for commonly used searches (categories, common date ranges, etc.) a significant performance benefit can be gained. For uncommon searches or user-entered search strings, caching the results may just incur needless overhead and pollute the cache with a useless result set.
Filter queries (fq) are specifically designed to do quick restriction of the result set by not doing any score calculation.
So, if you put your first query into fq parameter and your second score-generating query in the normal 'q' parameter, it should do what you ask for.
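The chaining from the question can be simulated over an in-memory list to show the division of labor: each fq step only narrows the set, and the scoring query then runs over the survivors. In Solr this would be q=query3 with fq=query1 and fq=query2; the documents and predicates below are invented for illustration:

```python
docs = [
    {"id": 1, "color": "red", "type": "poem"},
    {"id": 2, "color": "red", "type": "essay"},
    {"id": 3, "color": "blue", "type": "poem"},
]

def apply_filter(results, predicate):
    """One fq step: restricts the result set, no score involved."""
    return [d for d in results if predicate(d)]

results1 = apply_filter(docs, lambda d: d["color"] == "red")      # query1
results2 = apply_filter(results1, lambda d: d["type"] == "poem")  # query2
# results2 == [{'id': 1, 'color': 'red', 'type': 'poem'}]
```

Only the final q contributes to the score, which matches the requirement that the second query's scoring consider only the filtered subset.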
See also a question discussing this issue from the opposite direction.
I believe you want to use a nested query like this:
text:"roses are red" AND _query_:"type:poems"
You can read more about nested queries here:
http://searchhub.org/2009/03/31/nested-queries-in-solr/
You should take a look at faceted search in Solr: http://wiki.apache.org/solr/SolrFacetingOverview This will help you with this kind of "iterative" search.