Fuzziness settings in ElasticSearch - search

Need a way for my search engine to handle small typos in search strings and still return the right results.
According to the ElasticSearch docs, there are three values that are relevant to fuzzy matching in text queries: fuzziness, max_expansions, and prefix_length.
Unfortunately, there is not a lot of detail available on exactly what these parameters do, and what sane values for them are. I do know that fuzziness is supposed to be a float between 0 and 1.0, and the other two are integers.
Can anyone recommend reasonable "starting point" values for these parameters? I'm sure I will have to tune by trial and error, but I'm just looking for ballpark values to correctly handle typos and misspellings.
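For what it's worth, in current Elasticsearch releases the fuzziness of match-style queries is expressed as a maximum Levenshtein edit distance (0, 1, 2, or "AUTO") rather than the legacy 0–1.0 similarity float. A minimal sketch of the distance being bounded (plain dynamic programming, ignoring the transposition handling the engine also applies):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb (free if equal)
            ))
        prev = curr
    return prev[-1]
```

So a query term like "serch" with fuzziness 1 can still match "search", since a single insertion fixes it.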

I found it helpful, when using fuzzy matching, to combine a plain match query and a fuzzy match query (with the same term) in order to retrieve results for typos while also ensuring that documents containing the exact search word appear highest in the results.
For example:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "_all": search_term
          }
        },
        {
          "match": {
            "_all": {
              "query": search_term,
              "fuzziness": "1",
              "prefix_length": 2
            }
          }
        }
      ]
    }
  }
}
A few more details are listed here: https://medium.com/@wampum/fuzzy-queries-ae47b66b325c
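In case it helps to see the shape programmatically, here is a sketch that builds the body above from a search term (the function name is mine; note also that the _all field was removed in Elasticsearch 7, so on recent versions you would target a real field instead):

```python
def exact_plus_fuzzy(search_term: str) -> dict:
    """Bool/should body: an exact clause plus a fuzzy clause, so exact
    hits score higher while typos still match."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"_all": search_term}},
                    {"match": {"_all": {
                        "query": search_term,
                        "fuzziness": "1",
                        "prefix_length": 2,
                    }}},
                ]
            }
        }
    }
```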

According to the Fuzzy Query doc, default values are 0.5 for min_similarity (which looks like your fuzziness option), "unbounded" for max_expansions and 0 for prefix_length.
This answer should help you understand the min_similarity option. 0.5 seems to be a good start.
prefix_length and max_expansions will affect performance: you can develop with the default values, but be aware that they will not scale (the Lucene developers were even considering setting a default of 2 for prefix_length). I would recommend running benchmarks to find the right values for your specific case.

Related

Is there a common pattern for handling pagination where results from search index may be expanded into multiple rows?

This is a contrived / made up example that may not make sense practically, but I'm trying to paint a picture:
There is a web service / search API that supports Relay style pagination for searching for products
A user makes a search and requests 3 documents back (e.g. first: 3...)
The service takes the request and passes it to a search index
The search index returns:
[
  {
    "name": "Shirt",
    "sizes": ["S"]
  },
  {
    "name": "Jacket",
    "sizes": ["XL"]
  },
  {
    "name": "Hat",
    "sizes": ["S", "M"]
  }
]
The result of this should be expanded so that each product shows up as an individual record in the result set, with one size per result record. In the above example the Hat product would be split into two results, so the final result would be:
[
  {
    "name": "Shirt",
    "sizes": ["S"]
  },
  {
    "name": "Jacket",
    "sizes": ["XL"]
  },
  {
    "name": "Hat",
    "sizes": ["S"]
  }
]
If the SECOND page were requested, it would actually start with the second Hat size (M):
[
  ...
  {
    "name": "Hat",
    "sizes": ["M"]
  },
  ...
]
I'm wondering if there is a common strategy for handling this, or common libraries that I might use to handle some of this logic.
I'm using [OpenSearch](https://opensearch.org), and Elasticsearch has a "collapse" and "expand" feature that sounds like it almost does what I'd want at the search backend level, but unfortunately I don't think this is actually the case.
In reality, what I want to do is likely not even 100% possible: if the search results change in between queries, you might not see the correct thing on a subsequent page, for example. But I still feel like this might be a common enough issue to have some discussion or solution around it.
I'm thinking that one fairly reliable way of handling this is to denormalize the data in the search index a bit and, for my example, just stick a separate document in the index for both the S and M Hat products (even though the rest of the data would be the same). I'd just need to make sure to remove all of a product's documents when it changes, and I'd need to come up with unique identifiers for the documents in the index (so somehow encode the size in the indexed document's ID).
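To make that denormalization concrete, here is a small sketch (function and field names are mine): each (product, size) pair becomes its own record with a composite ID, after which plain offset pagination over the expanded stream behaves exactly like the example pages above:

```python
from itertools import islice

def expand_by_size(products):
    """One record per (product, size); the composite id makes each
    expanded row individually addressable in the index."""
    for p in products:
        for size in p["sizes"]:
            yield {"id": f"{p['name']}:{size}", "name": p["name"], "sizes": [size]}

def page(products, first, offset=0):
    """Offset pagination over the expanded records."""
    return list(islice(expand_by_size(products), offset, offset + first))
```

With the Shirt/Jacket/Hat example, the first page of 3 ends with Hat:S and the second page starts with Hat:M.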

How to fuzzy query against multiple fields in elasticsearch?

Here's my query as it stands:
"query": {
  "fuzzy": {
    "author": {
      "value": query,
      "fuzziness": 2
    },
    "career_title": {
      "value": query,
      "fuzziness": 2
    }
  }
}
This is part of a callback in Node.js. Query (which is being plugged in as a value to compare against) is set earlier in the function.
What I need it to be able to do is to check both the author and the career_title of a document, fuzzily, and return any documents that match in either field. The above statement never returns anything, and whenever I try to access the object it should create, it says it's undefined. I understand that I could write two queries, one to check each field, then sort the results by score, but I feel like searching every object for one field twice will be slower than searching every object for two fields once.
https://www.elastic.co/guide/en/elasticsearch/guide/current/fuzzy-match-query.html
As you can see here, in a multi_match query you can specify the fuzziness:
{
  "query": {
    "multi_match": {
      "fields": [ "text", "title" ],
      "query": "SURPRIZE ME!",
      "fuzziness": "AUTO"
    }
  }
}
Something like this. Hope this helps.
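For the author/career_title case from the question, a small sketch wiring that into a reusable helper (the helper name is mine):

```python
def fuzzy_multi_match(query: str, fields: list[str]) -> dict:
    """multi_match body that applies the same fuzziness across several fields."""
    return {
        "query": {
            "multi_match": {
                "fields": fields,
                "query": query,
                "fuzziness": "AUTO",  # edit distance chosen from the term length
            }
        }
    }
```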

In a MongoDB will an index help when a field is just being tested on its length?

I am creating a routine to check for interrupted processing and to carry on. During startup I'm performing the following search:
.find({"DocumentsPath": {$exists: true, $not: {$size: 0}}})
I want it to be as fast as possible; however, the documentation suggests that an index is for scanning within the data. I never need to search within "DocumentsPath", just use it if it's there. Creating an index seems like overhead I don't want. However, having the index might speed up the size test.
My question is whether this field should be indexed within the DB?
I thought of commenting, but this does deserve an answer. Should this be indexed? Well, probably, but for other purposes. Does this make a difference here? No, it does not.
The big point to make is that your query terms are redundant (or could be better) in this case. Let's look at the example:
{ "DocumentsPath": { "$exists": true } }
That will tell you if there is actually an element in a document that matches the property specified. No, it does not and cannot use an index. You can use a "sparse" index, though, and not even need that clause.
{ "DocumentsPath": { "$not": { "$size" : 0 } } }
This is a cute one. Yes, it tests the length of an array, but what you are really asking here is "I don't want the array to be empty".
So for the better solution.
Use a "sparse" index:
db.collection.ensureIndex({ "DocumentsPath": 1 }, { "sparse": true })
Query for the zeroth element of the array:
{ "DocumentsPath.0": { "$exists": true } }
Still no index for "matching" really, but at least the "sparse" index sorted out some of that by excluding documents, and the "dot notation" form here is actually more efficient than evaluating via $size.
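To illustrate the semantics outside the database, a rough Python equivalent of the "DocumentsPath.0": {"$exists": true} test (covering the common case where the field, when present, is an array):

```python
def has_nonempty_documents_path(doc: dict) -> bool:
    """True only when DocumentsPath exists, is an array, and has at
    least one element -- what the dot-notation query selects."""
    value = doc.get("DocumentsPath")
    return isinstance(value, list) and len(value) > 0
```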

Elastic Search size to unlimited

I'm new to Elasticsearch. I'm facing a problem writing a search query that returns all matched records in my collection. Following is my query to search records:
{
  "size": "total no of record", // Here I need to get the total no of records in the collection
  "query": {
    "match": {
      "first_name": "vineeth"
    }
  }
}
By running this query I am only getting a maximum of 10 records, but I'm sure there are more than 10 matching records in my collection. I searched a lot and finally found the size parameter in the query. But in my case I don't know the total count of records. I think giving an unlimited number to the size variable is not good practice, so how do I manage this situation? Please help me solve this issue. Thanks
It's not very common to display all results, but rather to use from and size to specify a range of results to fetch. So your query (for fetching the first 10 results) should look something like this:
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "first_name": "vineeth"
    }
  }
}
This should work better than setting size to a ridiculously large value. To check how many documents matched your query you can get the hits.total (total number of hits) from the response.
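The from/size arithmetic is just an offset calculation; a small sketch (helper name is mine) translating a 1-based page number into the request parameters:

```python
def page_params(page: int, page_size: int = 10) -> dict:
    """from/size pair for a 1-based page number."""
    return {"from": (page - 1) * page_size, "size": page_size}
```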
To fetch all the records you can also use the scroll concept. It's like a cursor in databases.
If you use scroll, you can get the docs batch by batch, which reduces high CPU and memory usage.
For more info refer
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
To get all records, per the doc, you should use scroll.
Here is the doc:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
But the idea is to specify your search and indicate that you want to scroll it:
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  }
}'
In the scroll param you specify how long you want the search context to stay available.
Then you can retrieve them with the returned scroll_id and the scroll api.
In new versions of Elasticsearch (e.g. 7.x), it is better to use pagination (e.g. search_after) than scroll (deprecated):
https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html
deprecated in 7.0.0:
GET /_search/scroll/<scroll_id>
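The client-side shape of a scroll loop is simple: keep asking for the next batch until an empty one comes back. Sketched here over an in-memory list rather than a live cluster (a real loop would re-request the scroll API with the returned scroll_id on each iteration):

```python
def scroll_batches(hits, batch_size):
    """Yield results batch by batch until the source is exhausted,
    mirroring how a scroll consumer drains its search context."""
    for start in range(0, len(hits), batch_size):
        yield hits[start:start + batch_size]
```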

elasticsearch prefix query for multiple words to solve the autocomplete use case

How do I get elastic search to work to solve a simple autocomplete use case that has multiple words?
Let's say I have a document with the following title - Elastic search is a great search tool built on top of lucene.
So if I use the prefix query and construct it with the form -
{
  "prefix": { "title": "Elas" }
}
It will return that document in the result set.
However if I do a prefix search for
{
  "prefix": { "title": "Elastic sea" }
}
I get no results.
What sort of query do I need to construct to present that result to the user for a simple autocomplete use case?
A prefix query made on Elastic sea would match a term like Elastic search in the index, but that doesn't appear in your index if you tokenize on whitespaces. What you have is elastic and search as two different tokens. Have a look at the analyze api to find out how you are actually indexing your text.
Using a boolean query like in your answer you wouldn't take into account the position of the terms. You would get as a result the following document for example:
Elastic model is a framework to store your Moose object and search
through them.
For auto-complete purposes you might want to make a phrase query and use the last term as a prefix. That's available out of the box using the match_phrase_prefix type in a match query, which was made available exactly for your usecase:
{
  "match": {
    "message": {
      "query": "elastic sea",
      "type": "phrase_prefix"
    }
  }
}
With this query your example document would match but mine wouldn't since elastic is not close to search there.
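To see why the phrase-prefix approach distinguishes the two documents, here is a rough simulation of match_phrase_prefix over lowercased whitespace tokens (the real analyzer and position handling are more involved; this only captures the intuition):

```python
def phrase_prefix_match(doc: str, query: str) -> bool:
    """True when the query terms appear in order and adjacent in the
    document, with the final term matched as a prefix."""
    doc_terms = doc.lower().split()
    q = query.lower().split()
    if not q:
        return False
    for i in range(len(doc_terms) - len(q) + 1):
        window = doc_terms[i:i + len(q)]
        # all but the last term must match exactly; the last is a prefix
        if window[:-1] == q[:-1] and window[-1].startswith(q[-1]):
            return True
    return False
```

"elastic sea" matches the title document (elastic is directly followed by a term starting with sea) but not the Moose document, where elastic and search are far apart.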
To achieve that result, you will need to use a Boolean query. The partial word needs to be in a prefix query and the complete word or phrase needs to be in a match clause. There are other tweaks available to the query, like must, should, etc., that can be applied as needed.
{
  "query": {
    "bool": {
      "must": [
        {
          "prefix": {
            "name": "sea"
          }
        },
        {
          "match": {
            "name": "elastic"
          }
        }
      ]
    }
  }
}
