Elasticsearch: boost the absence of certain terms - search

How do I positive-boost the absence of certain terms? I've asked this question before here but the response was not satisfactory because it wasn't generalizable enough.
Let's try again, with more nuance.
I want to be able to distinguish laptops from their accessories. In human language this is done by the absence of terms. That is, when you say lenovo thinkpad, omitting the word battery signals that you want the actual laptop. Compare this with a person saying lenovo thinkpad battery, where they mean the battery.
So suppose we have the index:
PUT test_index
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 1
}
}
}
with mapping:
PUT test_index/_mapping/merchant
{
"properties": {
"title": {
"type": "string"
},
"category": {
"type": "string",
"index": "not_analyzed"
}
}
}
Put three items into it:
PUT test_index/merchant/3
{
"title": "macbook battery",
"category": "laptops accessories"
}
PUT test_index/merchant/2
{
"title": "lenovo thinkpad battery",
"category": "laptops accessories"
}
PUT test_index/merchant/1
{
"title": "lenovo thinkpad white/black",
"category": "laptops"
}
Now search lenovo thinkpad:
POST test_index/_search
{
"query":{
"match": { "title": "lenovo thinkpad" }
}
}
The result is:
"hits": [
{
"_index": "test_index",
"_type": "merchant",
"_id": "2",
"_score": 0.70710677,
"_source": {
"title": "lenovo thinkpad battery",
"category": "laptops accessories"
}
},
{
"_index": "test_index",
"_type": "merchant",
"_id": "1",
"_score": 0.70710677,
"_source": {
"title": "lenovo thinkpad white/black",
"category": "laptops"
}
}
]
Notice that lenovo thinkpad battery is listed above lenovo thinkpad white/black, even though both have the same score.
Now, I can see at least two reasonable ways to do this.
A) Use term frequency on a per-category basis to influence the relevance of the title match. For example, if for each category you extract the 95th-percentile terms, you find that battery is a high-frequency term in laptops accessories, so the word battery should be negative-boosted on all title queries.
B) Use term frequency on a per-category basis to influence the relevance of the category match. For example, in addition to the title match, you automatically negative-boost results whose categories have 95th-percentile terms that aren't contained in your title match.
A and B aren't quite the same, but they both rely on the idea that certain absent words should be taken into account for relevance.
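To make idea A a bit more concrete, here is a rough sketch in plain Python, outside Elasticsearch. It simplifies the "95th-percentile terms" to "terms that appear in at least 95% of a category's titles", which is an assumption, and the helper names frequent_terms_by_category and terms_to_penalize are made up for illustration.

```python
from collections import Counter, defaultdict

# Sample documents taken from the question.
DOCS = [
    {"title": "macbook battery", "category": "laptops accessories"},
    {"title": "lenovo thinkpad battery", "category": "laptops accessories"},
    {"title": "lenovo thinkpad white/black", "category": "laptops"},
]


def frequent_terms_by_category(docs, min_doc_ratio=0.95):
    """Per category, return the title terms found in at least
    min_doc_ratio of that category's documents (a simplification)."""
    doc_counts = Counter()
    term_counts = defaultdict(Counter)
    for doc in docs:
        doc_counts[doc["category"]] += 1
        # Count each term once per document (document frequency).
        for term in set(doc["title"].lower().split()):
            term_counts[doc["category"]][term] += 1
    return {
        category: {
            term
            for term, count in counts.items()
            if count / doc_counts[category] >= min_doc_ratio
        }
        for category, counts in term_counts.items()
    }


def terms_to_penalize(query, category, frequent):
    """Frequent terms of a category that the query does not mention."""
    return frequent[category] - set(query.lower().split())


frequent = frequent_terms_by_category(DOCS)
print(terms_to_penalize("lenovo thinkpad", "laptops accessories", frequent))
```

With only three documents the numbers are not meaningful (every term of the single laptops title counts as frequent), so this only shows the mechanics of picking the absent terms to negative-boost.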
So...... thoughts?

My vote would be
C)
Fix the categories so that a battery doesn't have 'laptops' as a category (it's a 'laptopAccessory' or just 'accessory'). Alternatively, create an additional category (not called 'laptops') to indicate the actual machines themselves.
In your search, instead of trying to down-rank the accessories, you apply a boost to the 'laptops' category (no longer ambiguous). This will cause initial searches such as your 'lenovo thinkpad' example to bring the actual machines up above the accessories. A more precise search ('lenovo thinkpad battery') will still work as you'd expect.
Another nice UI/UX experience is to take the total set of categories returned in your results, and provide easy filter links. So if your initial search returns 'laptops' 'accessories' 'payment plans', then you'd have each of those as a link to a re-query that uses the original search plus a filter on that category.
Good luck!
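The filter links mentioned above could re-issue the original search with a category filter. A minimal sketch of building such a request body in Python, assuming Elasticsearch 2.0 or later (where the bool query accepts a filter clause) and a not_analyzed category field as in the question; build_category_filter_query is a hypothetical helper name:

```python
def build_category_filter_query(search_text, category):
    """Build a search body: the original title match, restricted to one category.

    The term filter compares against the exact stored value, so it relies on
    the category field being not_analyzed.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": search_text}}],
                "filter": [{"term": {"category": category}}],
            }
        }
    }


body = build_category_filter_query("lenovo thinkpad", "laptops")
```

The resulting dict can be sent as the body of POST test_index/_search. Since filters do not affect scoring, the ranking within the selected category stays whatever the title match produces.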

Boost "that" category.
GET /test_index/merchant/_search
{
"from": 0,
"query": {
"bool": {
"must": [
{"match": {"title": "lenovo thinkpad"}}
],
"should": [
{
"match": {
"category": {
"boost": "2",
"query": "laptops"
}
}
}
]
}
},
"size": "10"
}
Result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.573319,
"hits": [
{
"_index": "index",
"_type": "merchant",
"_id": "1",
"_score": 1.573319,
"_source": {
"title": "lenovo thinkpad white/black",
"category": "laptops"
}
},
{
"_index": "index",
"_type": "merchant",
"_id": "2",
"_score": 0.15889977,
"_source": {
"title": "lenovo thinkpad battery",
"category": "laptops accessories"
}
}
]
}
}
More on boosting can be found here.

You can make up for the absence of certain terms with the boost property, which is supplied in the query for the term you want to favor.
See the query below, with the boost property set to 10.
GET /test_index/students/_search
{
"from": 0,
"query": {
"bool": {
"must": [
{"match": {"age": "20"}}
],
"should": [
{
"match": {
"category": {
"boost": "10",
"query": "students"
}
}
}
]
}
},
"size": "10"
}

Related

python Elastic search BadRequestError while making insensitive match analyzer

I'm trying to build an index that can be searched for a case-insensitive exact match. The Elasticsearch version is 8.6.2 and the Lucene version is 9.4.2. The code is run in Python with Python's elasticsearch library.
settings = {"settings": {
"analysis": {
"analyzer": {"lower_analizer": {"tokenizer": "whitespace", "filter": [ "lowercase" ]} }
}
}
}
mappings = {"properties": {
"title": {"type": "text", "analyzer": "standard"},
"article": {"type": "text", "analyzer": "lower_analizer"},
"sentence_id": {"type": "integer"},
}
}
I copied the settings from Elasticsearch's tutorial. However, it returned the following error:
BadRequestError: BadRequestError(400, 'illegal_argument_exception',
'unknown setting [index.settings.analysis.analyzer.lower_analizer.filter]
please check that any required plugins are installed, or check the
breaking changes documentation for removed settings')
I'm not sure how to proceed. Does this imply that the lowercase filter does not exist?
The standard analyzer includes the lowercase filter by default.
Text fields use the standard analyzer unless another analyzer is specified.
PUT test_stackoverflow/_doc/1
{
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
GET test_stackoverflow/_search
{
"query": {
"match": {
"text": "quick"
}
}
}
Response:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "test_stackoverflow",
"_id": "1",
"_score": 0.2876821,
"_source": {
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
}
]
}
}
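On the BadRequestError itself: the rejected key in the message is index.settings.analysis..., which suggests that the dict passed as the index settings carried an extra "settings" level. Here is a minimal sketch of the same definitions without that wrapper, assuming the index is created with the 8.x Python client's indices.create(index=..., settings=..., mappings=...) (the question does not show that call). create_index is a hypothetical helper.

```python
# Analysis settings WITHOUT an outer "settings" key; the client call below
# already places this dict under "settings" in the request.
settings = {
    "analysis": {
        "analyzer": {
            "lower_analizer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase"],
            }
        }
    }
}

mappings = {
    "properties": {
        "title": {"type": "text", "analyzer": "standard"},
        "article": {"type": "text", "analyzer": "lower_analizer"},
        "sentence_id": {"type": "integer"},
    }
}


def create_index(client, index_name):
    """Create the index; client is assumed to be an
    elasticsearch.Elasticsearch instance from the 8.x Python library."""
    return client.indices.create(
        index=index_name, settings=settings, mappings=mappings
    )
```

If the original code passed the wrapped dict as settings=..., the server would see index.settings.analysis... instead of index.analysis..., which would match the error text.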

Elasticsearch - Query based on text length

I'm using the official Elasticsearch NodeJS client library, to query the following index structure:
{
"_index": "articles",
"_type": "context",
"_id": "1",
"_version": 1,
"found": true,
"_source": {
"article": "this is a paragraph",
"topic": "topic A"
}
}
{
"_index": "articles",
"_type": "context",
"_id": "2",
"_version": 1,
"found": true,
"_source": {
"article": "this is a paragraph this is a paragraph this is a paragraph",
"topic": "topic B"
}
}
I would like to query my index using the term "this is a paragraph" and boost the result with the most similar text length, i.e. document _id:1.
Can I do this without re-indexing and adding a field to my index (as described here)?
The query below uses Groovy to compare the length of the text indexed into ES (via _source.article.length()) with the length of the text to be searched. As a very simple base query, I used match_phrase and then rescored the documents by the difference between the two lengths.
GET /articles/context/_search
{
"query": {
"function_score": {
"query": {
"match_phrase": {
"article": "this is a paragraph"
}
},
"functions": [
{
"script_score": {
"script": {
"inline": "text_to_search_length=text_to_search.length(); compared_length=_source.article.length();return (compared_length-text_to_search_length).abs()",
"params": {
"text_to_search": "this is a paragraph"
}
}
}
}
]
}
},
"sort": [
{
"_score": {
"order": "asc"
}
}
]
}
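To see what the script value looks like for the two sample documents, here is the same length comparison in plain Python. This is only an illustration of the arithmetic. It ignores how function_score combines the script value with the match_phrase score, and length_distance is a made-up name.

```python
def length_distance(text_to_search, article):
    """Absolute difference in character length, as in the Groovy script."""
    return abs(len(article) - len(text_to_search))


# The two sample documents from the question, keyed by _id.
articles = {
    "1": "this is a paragraph",
    "2": "this is a paragraph this is a paragraph this is a paragraph",
}
query = "this is a paragraph"

# Ascending order: the smallest length difference comes first.
ranked = sorted(articles, key=lambda doc_id: length_distance(query, articles[doc_id]))
print(ranked)
```

Document 1 has a difference of 0 and document 2 a difference of 40, so an ascending sort puts document 1 first.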

nGram partial matching & limiting nGram results in multiple field query

Background: I've implemented a partial search on a name field by indexing the tokenized name (name field) as well as a trigram analyzed name (ngram field).
I've boosted the name field to have exact token matches bubble up to the top of the results.
Problem: I am trying to implement a query that limits the nGram matches to ones that only match some threshold (say 80%) of the query string. I understand that minimum_should_match seems to be what I am looking for, but my problem is forming the query to actually produce those results.
My exact token matches are boosted to the top but I still get every document that has a single matching trigram in the ngram field.
GIST: Index settings and mapping
Index Settings
{
"my_index": {
"settings": {
"index": {
"number_of_shards": "5",
"max_result_window": "30000",
"creation_date": "1475853851937",
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": "3",
"max_gram": "3"
}
},
"analyzer": {
"ngram_analyzer": {
"filter": [
"lowercase",
"ngram_filter"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "AuCjcP5sSb-m59bYrprFcw",
"version": {
"created": "2030599"
}
}
}
}
}
Index Mappings
{
"my_index": {
"mappings": {
"my_type": {
"properties": {
"acw": {
"type": "integer"
},
"pcg": {
"type": "integer"
},
"date": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"dob": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"id": {
"type": "string"
},
"name": {
"type": "string",
"boost": 10
},
"ngram": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"bdk": {
"type": "integer"
},
"mmw": {
"type": "integer"
},
"mpi": {
"type": "integer"
},
"sex": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Solution Attempts
[GIST: Query Attempts] unlinkifying due to 2 link limit :(
(https://gist.github.com/jordancardwell/2e690013666e7e1da6ef1acee314b4e6)
I tried a multi-match query, which gives me correct search results, but I haven't had luck omitting results for names that only match a single trigram (say "odo" trigram inside "theodophilus")
//this matches 'frodo' and sends results to the top, since `name` field is boosted
// but also matches 'theodore' and 'rodolpho'
{
"size":100,
"from":0,
"query":{
"multi_match":{
"query":"frodo",
"fields":[
"name",
"ngram"
],
"type":"best_fields"
}
}
}
//I then tried adding the `minimum_should_match` option,
// hoping it would filter out long strings that had only one matching trigram, for instance
{
"size":100,
"from":0,
"query":{
"multi_match":{
"query":"frodo",
"fields":[
"name",
"ngram"
],
"type":"best_fields",
"minimum_should_match": "90%"
}
}
}
I've tried playing around in Sense to manually write the match queries that this produces, so that I can apply minimum_should_match only to the ngram field, but I can't seem to get the syntax right.
// I then tried to construct a custom query to return only the `minimum_should_match`ed results on the ngram field
// I started with a query produced by using bodybuilder to `and` and `or` my other search criteria together
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
//each separate field's criteria `must`/`and`ed together
{
"query": {
"bool": {
"filter": {
"bool": {
"should": [
//each criterion for a specific field `should`/`or`ed together
{
//my attempt at getting `ngram` field results..
// should theoretically only return when field
// contains nothing but matching ngrams
// (i.e. exact matches and other fluke matches)
"query": {
"match": {
"ngram": {
"query": "frodo",
"minimum_should_match": "100%"
}
}
}
}
//... other criteria to be `should`/`or`ed together
]
}
}
}
}
}
//... other criteria to be `must`/`and`ed together
]
}
}
}
}
}
Can anyone see what I'm doing wrong?
It seems like this should be fairly straightforward to accomplish, but I must be missing something obvious.
UPDATE
I ran a query with _explain=true (using sense UI) to try to understand my results.
I queried for a match on the ngram field for "frod" with minimum_should_match = 100%, yet I still get every record that matches at least one ngram.
(e.g. rodolpho even though it doesn't contain fro)
GIST: test query and results
note: cross-posted from [discuss.elastic.co]
will make a link later, can't post more than 2 yet : /
(https://discuss.elastic.co/t/ngram-partial-match-limiting-ngram-results-in-multiple-field-query/62526)
I used your settings and mappings to create an index, and your queries seem to work fine for me. I would suggest running explain on one of the "unexpected" documents being returned to see why it matches.
Here is what I did:
Run the analyze api on your analyzer to see how the query will be split into tokens.
curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
"analyzer" : "ngram_analyzer",
"text" : "frodo"
}'
frodo will be split into 3 tokens with your analyzer.
{
"tokens": [
{
"token": "fro",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "rod",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "odo",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
}
]
}
I indexed 3 documents for testing (using only the ngram field). Here are the docs:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"_score": 1,
"_source": {
"ngram": "theodore"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"ngram": "frodo"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "3",
"_score": 1,
"_source": {
"ngram": "rudolpho"
}
}
]
}
}
Your first query matches frodo and theodore, but not rudolpho. That makes sense, since rudolpho does not produce any trigrams that match trigrams from frodo:
frodo -> fro, rod, odo
rudolpho -> rud, udo, dol, olp, lph, pho
Using your second query, I get back only frodo (neither of the other two).
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.53148466,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 0.53148466,
"_source": {
"ngram": "frodo"
}
}
]
}
}
I then ran an explain (localhost:9200/my_index/my_type/2/_explain) on the other two docs (theodore and rudolpho) and I see this (I have clipped the response):
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0,
"description": "no match on required clause ((ngram:fro ngram:rod ngram:odo)~2)",
"details": [
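The above is expected, since at least two of the three tokens from frodo must match: with three query tokens, a minimum_should_match of 90% rounds down to 2, which is the ~2 in the explanation.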
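The trigram arithmetic behind this can be checked outside Elasticsearch. A rough sketch in plain Python, assuming single-word, lowercase input (so the standard tokenizer and lowercase filter change nothing) and a percentage that is rounded down. The function names are made up for illustration.

```python
def trigrams(word):
    """All 3-character substrings, like an ngram filter with min_gram = max_gram = 3."""
    word = word.lower()
    return [word[i:i + 3] for i in range(len(word) - 2)]


def required_matches(num_terms, percent):
    """Number of query terms that must match for a percentage, rounded down."""
    return (num_terms * percent) // 100


def ngram_match(query, indexed_value, percent):
    """True if enough of the query's trigrams occur in the indexed value."""
    query_grams = trigrams(query)
    indexed_grams = set(trigrams(indexed_value))
    shared = sum(1 for gram in query_grams if gram in indexed_grams)
    return shared >= required_matches(len(query_grams), percent)


for name in ("frodo", "theodore", "rudolpho"):
    print(name, ngram_match("frodo", name, 90))
```

For the query frodo, theodore shares only odo and rudolpho shares nothing, so neither reaches the two required trigrams, while frodo itself shares all three.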

Modifying elasticsearch score based on nested field value

I want to modify scoring in ElasticSearch (v2+) based on the weight of a field in a nested object within an array.
For instance, using this data:
PUT index/test/0
{
"name": "red bell pepper",
"words": [
{"text": "pepper", "weight": 20},
{"text": "bell","weight": 10},
{"text": "red","weight": 5}
]
}
PUT index/test/1
{
"name": "hot red pepper",
"words": [
{"text": "pepper", "weight": 15},
{"text": "hot","weight": 11},
{"text": "red","weight": 5}
]
}
I want a query like {"words.text": "red pepper"} which would rank "red bell pepper" above "hot red pepper".
The way I am thinking about this problem is "first match the 'text' field, then modify scoring based on the 'weight' field". Unfortunately I don't know how to achieve this, if it's even possible, or if I have the right approach for something like this.
If proposing an alternative approach, please keep it general: there are tons of similar cases, so simply raising the score of the "red bell pepper" document isn't really a suitable alternative.
The approach you have in mind is feasible. It can be achieved with a function_score query inside a nested query.
An example implementation is shown below:
PUT test
PUT test/test/_mapping
{
"properties": {
"name": {
"type": "string"
},
"words": {
"type": "nested",
"properties": {
"text": {
"type": "string"
},
"weight": {
"type": "long"
}
}
}
}
}
PUT test/test/0
{
"name": "red bell pepper",
"words": [
{"text": "pepper", "weight": 20},
{"text": "bell","weight": 10},
{"text": "red","weight": 5}
]
}
PUT test/test/1
{
"name": "hot red pepper",
"words": [
{"text": "pepper", "weight": 15},
{"text": "hot","weight": 11},
{"text": "red","weight": 5}
]
}
POST test/_search
{
"query": {
"bool": {
"disable_coord": true,
"must": [
{
"match": {
"name": "red pepper"
}
}
],
"should": [
{
"nested": {
"path": "words",
"query": {
"function_score": {
"functions": [
{
"field_value_factor": {
"field" : "words.weight",
"missing": 0
}
}
],
"query": {
"match": {
"words.text": "red pepper"
}
},
"score_mode": "sum",
"boost_mode": "replace"
}
},
"score_mode": "total"
}
}
]
}
}
}
Result :
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "0",
"_score": 26.030865,
"_source": {
"name": "red bell pepper",
"words": [
{
"text": "pepper",
"weight": 20
},
{
"text": "bell",
"weight": 10
},
{
"text": "red",
"weight": 5
}
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 21.030865,
"_source": {
"name": "hot red pepper",
"words": [
{
"text": "pepper",
"weight": 15
},
{
"text": "hot",
"weight": 11
},
{
"text": "red",
"weight": 5
}
]
}
}
]
}
In a nutshell, the query scores a document that satisfies the must clause as follows: it adds the sum of the weights of the matched nested documents to the score of the must clause.
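The nested part of those scores can be reproduced by hand. A small sketch in plain Python, assuming the query is split on whitespace and compared to words.text exactly. matched_weight_sum is a made-up name, not part of any Elasticsearch API.

```python
def matched_weight_sum(words, query):
    """Sum the weights of the nested word objects whose text appears in the query."""
    query_terms = set(query.lower().split())
    return sum(word["weight"] for word in words if word["text"] in query_terms)


red_bell_pepper = [
    {"text": "pepper", "weight": 20},
    {"text": "bell", "weight": 10},
    {"text": "red", "weight": 5},
]
hot_red_pepper = [
    {"text": "pepper", "weight": 15},
    {"text": "hot", "weight": 11},
    {"text": "red", "weight": 5},
]

print(matched_weight_sum(red_bell_pepper, "red pepper"))
print(matched_weight_sum(hot_red_pepper, "red pepper"))
```

The sums are 25 and 20. Compared with the reported scores of 26.030865 and 21.030865, that leaves the same remainder for both documents, which would be the contribution of the must clause on name.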

How to force Elasticsearch "terms" query to be not_analyzed

I want to match ids exactly in a doc field. I have mapped the fields as not_analyzed, but it seems that each term in the query is tokenized, or at least lowercased. How do I make the query not_analyzed as well? Using ES 1.4.4, 1.5.1, and 2.0.0.
Here is a doc:
{
"_index": "index_1446662629384",
"_type": "docs",
"_id": "Cat-129700",
"_score": 1,
"_source": {
"similarids": [
"Cat-129695",
"Cat-129699",
"Cat-129696"
],
"id": "Cat-129700"
}
}
Here is a query:
{
"size": 10,
"query": {
"bool": {
"should": [{
"terms": {
"similarids": ["Cat-129695","Cat-129699","Cat-129696"]
}
}]
}
}
}
The query above does not work. If I remove caps and dashes from the doc ids it works. I can't do that for many reasons. Is there a way to make the similarids not_analyzed like the doc fields?
If I'm understanding you correctly, all you need to do is set "index":"not_analyzed" on the "similarids" in your mapping. If you have that setting correct already, then there is something else going on that isn't apparent from what you posted (the "terms" query doesn't do any analysis on your search terms). You may want to check your mapping to make sure it is set up the way you think.
To test it, I set up a simple index like this:
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"id": {
"type": "string",
"index": "not_analyzed"
},
"similarids": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Then added your document:
PUT /test_index/doc/1
{
"similarids": [
"Cat-129695",
"Cat-129699",
"Cat-129696"
],
"id": "Cat-129700"
}
And your query works just fine.
POST /test_index/_search
{
"size": 10,
"query": {
"bool": {
"should": [
{
"terms": {
"similarids": [
"Cat-129695",
"Cat-129699",
"Cat-129696"
]
}
}
]
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.53148466,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.53148466,
"_source": {
"similarids": [
"Cat-129695",
"Cat-129699",
"Cat-129696"
],
"id": "Cat-129700"
}
}
]
}
}
I used ES 2.0 here, but it shouldn't matter which version you use. Here is the code I used to test:
http://sense.qbox.io/gist/562ccda28dfaed2717b43739696b88ea861ad690
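If the not_analyzed mapping did not actually take effect, the symptom described in the question is what one would expect. The sketch below is a rough approximation in plain Python of what an analyzed string field would store for these ids (lowercased tokens, split on the dash), compared with a terms lookup that uses the search terms as given. rough_standard_analyze only imitates the standard analyzer for simple ASCII values like these and is not the real implementation.

```python
import re


def rough_standard_analyze(value):
    """Approximate the standard analyzer for simple ASCII ids:
    split on non-alphanumeric characters and lowercase each token."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", value)]


def terms_query_matches(indexed_tokens, search_terms):
    """A terms query does no analysis: it looks for the exact terms given."""
    return any(term in indexed_tokens for term in search_terms)


ids = ["Cat-129695", "Cat-129699", "Cat-129696"]

# What an analyzed field would hold versus a not_analyzed one.
analyzed_tokens = [token for value in ids for token in rough_standard_analyze(value)]
not_analyzed_tokens = list(ids)

print(terms_query_matches(analyzed_tokens, ["Cat-129695"]))
print(terms_query_matches(not_analyzed_tokens, ["Cat-129695"]))
```

Under this approximation the analyzed field holds cat and 129695 rather than Cat-129695, so the exact term is only found in the not_analyzed case. That would also be consistent with the query working once caps and dashes are removed.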
