Autocomplete and fuzzy search across multiple indices in Elasticsearch - search

I have multiple indices populated in my Elasticsearch engine, and one text search box that is supposed to query all of them for possible hits. I am planning to query these indices with both fuzzy matching and autocomplete. Any suggestions on what the implementation should look like?

Use either the GET /_all/_search endpoint, or create an alias that groups all the indices you want and use GET /[alias_name]/_search.
As for which field to search, I think the _all field could be a good match, depending on how you have your mappings configured (whether _all is disabled or not).
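As a rough sketch of the alias approach, the request bodies could be built as shown here. The index names (products, articles), the alias name site_search, and the field names title and name are hypothetical. The fields are listed explicitly instead of relying on _all, because recent Elasticsearch versions no longer provide that field. The multi_match query with fuzziness covers only the fuzzy part; autocomplete would need its own mapping (for example an edge n-gram analyzer or a completion suggester).

```python
import json


def alias_actions(alias, indices):
    """Body for POST /_aliases that puts every given index under one alias."""
    return {"actions": [{"add": {"index": name, "alias": alias}} for name in indices]}


def fuzzy_search_body(text, fields):
    """Body for GET /<alias>/_search using a fuzzy multi_match query."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,       # hypothetical field names
                "fuzziness": "AUTO",    # edit distance chosen from term length
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical index and alias names, for illustration only.
    print(json.dumps(alias_actions("site_search", ["products", "articles"]), indent=2))
    print(json.dumps(fuzzy_search_body("elasticsaerch", ["title", "name"]), indent=2))
```

The first body is sent once to set up the alias; the second is sent with each search to GET /site_search/_search.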

Related

Azure Cognitive Search - When would you use different search and index analyzers?

I'm trying to understand the purpose of configuring different analyzers for searching and indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-
According to my understanding, the job of the indexing analyzer is to break up the input document into individual tokens. Through this process, it might apply multiple transformations like lower-casing the content, removing punctuation and whitespace, and even removing entire words.
If the tokens are already processed, what is the use of the search analyzer?
Initially, I thought it would apply a similar process to the search query itself, but wouldn't setting a different analyzer than the one used to index the document completely break the search results? If the indexing analyzer lower-cased everything but the search analyzer doesn't lower-case the query, wouldn't that mean you'll never get matches for queries with upper-case characters? What if the search analyzer doesn't split tokens on whitespace? Wouldn't you never get a match the moment the query includes a space?
Assuming that this is indeed how the two analyzers work together, why would you ever want to set two different ones?
Your understanding of the difference between index and search analyzers is correct. An example scenario where using different ones is valuable is applying n-grams at indexing time but not to search terms. A document containing "cat" would be indexed as "c", "ca", "cat", but you wouldn't necessarily want to apply n-grams to the search term: that would make the query less performant, and it isn't necessary since the n-grams were already produced when the document was indexed. Hopefully that makes sense!
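To make the n-gram scenario concrete, here is a toy in-memory sketch. It is not Azure Search code: the analyzers, the inverted index, and all function names are simplified stand-ins. The index analyzer expands each token into its prefixes ("c", "ca", "cat"), while the search analyzer only lower-cases and splits, so a partial query like "ca" matches without being expanded itself.

```python
from collections import defaultdict


def edge_ngrams(token):
    """All prefixes of a token: 'cat' -> ['c', 'ca', 'cat']."""
    return [token[:i] for i in range(1, len(token) + 1)]


def index_analyzer(text):
    """Lowercase, split on whitespace, then expand each token into prefixes."""
    return [gram for token in text.lower().split() for gram in edge_ngrams(token)]


def search_analyzer(text):
    """Lowercase and split only; no n-grams at query time."""
    return text.lower().split()


def build_index(docs):
    """Map each indexed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_analyzer(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every analyzed query term."""
    result = None
    for term in search_analyzer(query):
        ids = index.get(term, set())
        result = set(ids) if result is None else result & ids
    return result if result is not None else set()


docs = {1: "Cat food", 2: "Dog food"}
index = build_index(docs)
print(search(index, "CA"))      # partial, upper-case query
print(search(index, "fo"))      # prefix shared by both documents
print(search(index, "cats"))    # not a prefix of any indexed token
```

Both analyzers still agree on lower-casing and tokenization, which is why the results do not break; they differ only in the n-gram expansion step.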

Azure Cognitive Search: search query with two search profiles

Our search service uses Azure Cognitive Search in the following way:
Search non-fuzzy (i.e. with full match of query string).
Search fuzzy (i.e. it's allowed to change 1-2 letters in a query string)
Join results by certain rule.
This way we ensure that full-match results are always at the top.
But now we want to introduce pagination, and doing that with two separate queries is difficult and inefficient.
An alternative would be to somehow create a single query that combines both fuzzy and non-fuzzy search but with different scoring profiles, one with higher weights for full-match search and another with lower weights for fuzzy search.
Like
search=rabbit&scoringProfile=highWeightsProfile | search=rabbit~&scoringProfile=lowWeightsProfile
Is there any way to do this, either in API or in SDK?
Are there any alternative solutions to the problem of fuzzy search with higher scores for full matches?
Boosting individual subqueries with Lucene query syntax worked well for me. It may not be as flexible as separate profiles for the fuzzy and non-fuzzy parts, but it is still a good solution.
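A minimal sketch of what such a boosted query might look like, assuming the full Lucene syntax (queryType=full), where term^N boosts a subquery and term~N is a fuzzy match with an edit distance of 0 to 2. The boost value, the helper names, and the paging values are illustrative assumptions, and the term is assumed to be a single plain word with no special characters that would need escaping.

```python
def boosted_fuzzy_query(term, exact_boost=4, max_edits=1):
    """Combine an exact subquery (boosted) with a fuzzy subquery in one search string."""
    if max_edits not in (0, 1, 2):
        raise ValueError("fuzzy edit distance must be 0, 1 or 2")
    return f"{term}^{exact_boost} OR {term}~{max_edits}"


def search_params(term, skip=0, top=10):
    """Query parameters for a single paginated GET request (values are illustrative)."""
    return {
        "queryType": "full",                  # required for ^ and ~ operators
        "search": boosted_fuzzy_query(term),
        "$skip": skip,
        "$top": top,
    }


print(boosted_fuzzy_query("rabbit"))
print(search_params("rabbit", skip=20))
```

Because exact and fuzzy matching happen in one query, the service can page through a single ranked result set instead of two that have to be merged.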

How to search only inside one string of a Collection in Azure Search?

I have a collection field with values like:
["city of god"]
["god of war", "city of war"]
I want to perform a search on the field with 'city' AND 'god' and I want only 'city of god' to be returned.
Yet the second one is also returned, even though the terms are in two different strings within the collection.
Is there any way to restrict the search to individual strings rather than the entire collection?
Each searchable field in the index is treated as a bag of terms, so for “city AND god” you’re matching on all terms of that field in the whole document, not only the terms within sub-documents (in this case individual strings in the collection).
One way to get around this would be to estimate a reasonable distance between these terms within a single string of a collection and use that in a proximity search query to get the desired result. For your specific example, assuming that the terms would be within 5 words of each other, the following query should work:
&queryType=full&search=fieldName:"city god"~5
Proximity search is also useful because, with a large enough proximity value, the words don’t have to appear in the order given in the phrase query; e.g., “city god”~5 would also match “god bless the city”.
Make sure to include queryType=full in your query string as proximity search is part of the full query syntax and would not work otherwise. You can check some other examples here.
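When the query above is sent as part of a GET URL, the quotes, colon and space need to be URL-encoded. A small sketch using only the Python standard library; the helper name is hypothetical and fieldName is the placeholder from the answer:

```python
from urllib.parse import parse_qs, urlencode


def proximity_query_string(field, words, distance):
    """Build an encoded query string for a fielded proximity search."""
    phrase = " ".join(words)
    search = f'{field}:"{phrase}"~{distance}'
    # queryType=full is needed because proximity search is full Lucene syntax.
    return urlencode({"queryType": "full", "search": search})


qs = proximity_query_string("fieldName", ["city", "god"], 5)
print(qs)
# Decoding it again recovers the original search expression.
print(parse_qs(qs)["search"][0])
```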

What indexer do I use to find the list in the collection that is most similar to my list?

Let's say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'}
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like mongo, map reduce, hadoop... All I know are the names of other technologies and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will, by default, return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list) you can use the mm EDisMax parameter, cf. Solr Wiki
Having said that, I must add that I’ve never searched for 200 query terms (do I understand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
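To illustrate the idea of ranking by shared terms with a minimum-match threshold, here is a toy sketch in plain Python using the ingredient lists from the question. It is not Solr's relevance scoring and it scans every list, which a real index avoids; the function name and the min_fraction parameter are assumptions that loosely mirror what mm expresses.

```python
def rank_by_overlap(query_items, candidate_lists, min_fraction=0.0):
    """Order candidate lists by how many query items they share, best first.

    Lists sharing fewer than min_fraction of the query items are dropped.
    """
    query = set(query_items)
    scored = []
    for candidate in candidate_lists:
        shared = len(query & set(candidate))
        if shared > 0 and shared / len(query) >= min_fraction:
            scored.append((shared, candidate))
    # sort is stable, so ties keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _shared, candidate in scored]


mine = ["potato", "rice", "carrot", "corn"]
stored = [
    ["beans", "potato", "oranges", "lettuce"],
    ["carrot", "rice", "corn", "apple"],
    ["onion", "garlic", "radish", "eggs"],
]
print(rank_by_overlap(mine, stored))
print(rank_by_overlap(mine, stored, min_fraction=0.4))
```

With the threshold at 40%, only the list sharing three of the four ingredients remains.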

Best way to support wildcard search on a large dictionary?

I am working on a project to search in a large dictionary (100k~1m words). The dictionary items look like {key,value,freq}. My task is to develop an incremental search algorithm that supports exact match, prefix match and wildcard match. The results should be ordered by freq.
For example:
the dictionary looks like
key1=a,value1=v1,freq1=4
key2=ab,value2=v2,freq2=2
key3=abc,value3=v3,freq3=1
key4=abcd,value4=v4,freq4=3
when a user types 'a', return v1,v4,v2,v3
when a user types 'a?c', return v4,v3
Right now my best option is a suffix tree represented as a DAWG data structure, but this method does not support wildcard matches effectively.
Any suggestions?
You need to look at n-grams for indexing your content. If you want something out of the box, you might want to look at Apache Solr, which does a lot of the hard work for you. It also supports prefix queries, wildcard queries, etc.
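One simple way to picture an n-gram style index for the example in the question is a positional character index, that is, a 1-gram index keyed by (position, character). This is an illustrative sketch, not what Solr does internally. The class name is hypothetical, and it assumes every pattern is a prefix pattern in which ? matches exactly one character, which is what the expected results in the question imply.

```python
from collections import defaultdict


class PrefixWildcardIndex:
    """Toy index over (key, value, freq) entries keyed by (position, character)."""

    def __init__(self, entries):
        self.entries = list(entries)
        self.postings = defaultdict(set)
        for entry_id, (key, _value, _freq) in enumerate(self.entries):
            for pos, ch in enumerate(key):
                self.postings[(pos, ch)].add(entry_id)

    def search(self, pattern, exact=False):
        """Return values whose key starts with pattern ('?' = any one character),
        ordered by descending freq. With exact=True the key length must match too."""
        candidates = None
        for pos, ch in enumerate(pattern):
            if ch == "?":
                continue  # wildcard position: no constraint on the character
            ids = self.postings.get((pos, ch), set())
            candidates = ids if candidates is None else candidates & ids
        if candidates is None:  # pattern was empty or all wildcards
            candidates = set(range(len(self.entries)))
        hits = []
        for entry_id in candidates:
            key, value, freq = self.entries[entry_id]
            long_enough = len(key) == len(pattern) if exact else len(key) >= len(pattern)
            if long_enough:
                hits.append((freq, value))
        hits.sort(key=lambda hit: hit[0], reverse=True)
        return [value for _freq, value in hits]


dictionary = [("a", "v1", 4), ("ab", "v2", 2), ("abc", "v3", 1), ("abcd", "v4", 3)]
idx = PrefixWildcardIndex(dictionary)
print(idx.search("a"))
print(idx.search("a?c"))
print(idx.search("ab", exact=True))
```

Each query intersects a few posting sets instead of scanning all keys; a pattern consisting only of wildcards still falls back to checking every entry.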

Resources