I'm connecting to an Elasticsearch server via Node.js and the npm package @elastic/elasticsearch v8.1.0.
To create the index:
const response = await client.indices.create({
  index: 'foods',
  body: {
    mappings: {
      properties: {
        id: { type: 'integer' },
        color: { type: 'text' },
        name: { type: 'text' }
      }
    }
  }
});
My search query:
const response = await client.search({
  index: 'foods',
  body: {
    query: {
      multi_match: {
        fields: ["color", "name"],
        query: 'apple:na',
        type: "phrase_prefix"
      }
    }
  }
});
This won't return anything, as Elasticsearch won't match on the colon (or on underscores or hyphens). If the query is simply the letter a, I get the following results:
[
  {
    "_index": "foods",
    "_type": "_doc",
    "_id": "MN8Hs38B5UePBFS0feQD",
    "_source": {
      "id": 12,
      "name": "apple:na",
      "color": "red"
    }
  },
  {
    "_index": "foods",
    "_type": "_doc",
    "_id": "euAHs38B5UePBFS0fQEj",
    "_source": {
      "id": 13,
      "name": "apple:euro",
      "color": "red"
    }
  }
]
As you know, text fields are analyzed by the standard analyzer if you don't specify one. In your case, if you don't want to configure an advanced custom analyzer that handles special characters (hyphens, underscores, etc.), you can simply run the same query against the .keyword sub-fields of your text fields. A keyword field stores the entire value as a single un-analyzed token, so a prefix query on it matches the full string apple:na, colon included.
Search query
{
  "query": {
    "multi_match": {
      "fields": [
        "color.keyword", // Note the `.keyword` sub-fields
        "name.keyword"
      ],
      "query": "apple",
      "type": "bool_prefix"
    }
  }
}
Search results
"hits": [
{
"_index": "71994544",
"_id": "1",
"_score": 1.0,
"_source": {
"id": 12,
"name": "apple:na",
"color": "red"
}
},
{
"_index": "71994544",
"_id": "2",
"_score": 1.0,
"_source": {
"id": 13,
"name": "apple:euro",
"color": "red"
}
}
]
If you provide the entire string apple:na, it produces a single search result:
{
  "query": {
    "multi_match": {
      "fields": [
        "color.keyword",
        "name.keyword"
      ],
      "query": "apple:na",
      "type": "bool_prefix"
    }
  }
}
Search result
"hits": [
{
"_index": "71994544",
"_id": "1",
"_score": 1.0,
"_source": {
"id": 12,
"name": "apple:na",
"color": "red"
}
}
]
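One caveat: since the index was created with an explicit mapping that defines color and name as plain text fields, the .keyword sub-fields only exist if the mapping declares them (dynamic mapping adds them automatically, but an explicit text mapping does not). If the queries above complain about missing fields, recreate the index with keyword multi-fields, along these lines (a sketch reusing your field names):

PUT /foods
{
  "mappings": {
    "properties": {
      "id": { "type": "integer" },
      "color": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      }
    }
  }
}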
Hope this helps.
Related
I was implementing fuzzy search in my existing Elasticsearch setup, where I can't change the mappings. I was hoping there is a way to convert the following query into a fuzzy one, i.e. add fuzzy search on the fields lower_name and album:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "user": "userId"
          }
        },
        {
          "bool": {
            "should": [
              {
                "terms": {
                  "lower_name": ["search", "Text"]
                }
              },
              {
                "terms": {
                  "album": ["search", "Text"]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
I tried this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "user": "userId"
          }
        },
        {
          "bool": {
            "should": [
              {
                "fuzzy": {
                  "lower_name": ["search", "Text"]
                }
              },
              {
                "fuzzy": {
                  "album": ["search", "Text"]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
But this gives the error: [fuzzy] query doesn't support multiple fields
Please help!
Using Elasticsearch 6.3
You can use a multi_match query with fuzziness. Try the query below.
Index Data:
{
  "user": "ben",
  "lower_name": "def",
  "album": "Brenda"
}
{
  "user": "ben",
  "lower_name": "abc",
  "album": "Brenda"
}
{
  "user": "ben",
  "lower_name": "fgh",
  "album": "honda"
}
Search Query:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "user": "ben"
          }
        },
        {
          "bool": {
            "should": [
              {
                "multi_match": {
                  "query": "abc dey",
                  "fields": [
                    "lower_name"
                  ],
                  "fuzziness": "auto"
                }
              },
              {
                "multi_match": {
                  "query": "brenda",
                  "fields": [
                    "album"
                  ],
                  "fuzziness": "auto"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
Search Result:
"hits": [
{
"_index": "66311552",
"_type": "_doc",
"_id": "2",
"_score": 0.7497801,
"_source": {
"user": "ben",
"lower_name": "def",
"album": "Brenda"
}
},
{
"_index": "66311552",
"_type": "_doc",
"_id": "1",
"_score": 0.7497801,
"_source": {
"user": "ben",
"lower_name": "abc",
"album": "Brenda"
}
}
]
You can also simply use the "fuzziness": "AUTO" param in a match query. Refer to the official fuzziness in match query example.
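For completeness: the term-level fuzzy query that failed in your attempt accepts exactly one field and a single value, so to keep using it you would need one clause per field/term pair, along these lines (a sketch with the field names from the question):

{
  "query": {
    "bool": {
      "should": [
        { "fuzzy": { "lower_name": { "value": "search", "fuzziness": "AUTO" } } },
        { "fuzzy": { "lower_name": { "value": "Text", "fuzziness": "AUTO" } } },
        { "fuzzy": { "album": { "value": "search", "fuzziness": "AUTO" } } },
        { "fuzzy": { "album": { "value": "Text", "fuzziness": "AUTO" } } }
      ]
    }
  }
}

Keep in mind that fuzzy is a term-level query, so the values are not analyzed (not even lowercased) before matching.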
Background: I've implemented a partial search on a name field by indexing the tokenized name (name field) as well as a trigram analyzed name (ngram field).
I've boosted the name field to have exact token matches bubble up to the top of the results.
Problem: I am trying to implement a query that limits the nGram matches to ones that only match some threshold (say 80%) of the query string. I understand that minimum_should_match seems to be what I am looking for, but my problem is forming the query to actually produce those results.
My exact token matches are boosted to the top but I still get every document that has a single matching trigram in the ngram field.
GIST: Index settings and mapping
Index Settings
{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "5",
        "max_result_window": "30000",
        "creation_date": "1475853851937",
        "analysis": {
          "filter": {
            "ngram_filter": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "3"
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "filter": [
                "lowercase",
                "ngram_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "AuCjcP5sSb-m59bYrprFcw",
        "version": {
          "created": "2030599"
        }
      }
    }
  }
}
Index Mappings
{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "acw": {
            "type": "integer"
          },
          "pcg": {
            "type": "integer"
          },
          "date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "dob": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "id": {
            "type": "string"
          },
          "name": {
            "type": "string",
            "boost": 10
          },
          "ngram": {
            "type": "string",
            "analyzer": "ngram_analyzer"
          },
          "bdk": {
            "type": "integer"
          },
          "mmw": {
            "type": "integer"
          },
          "mpi": {
            "type": "integer"
          },
          "sex": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
Solution Attempts
GIST: Query Attempts: https://gist.github.com/jordancardwell/2e690013666e7e1da6ef1acee314b4e6
I tried a multi-match query, which gives me correct search results, but I haven't had luck omitting results for names that only match a single trigram (say "odo" trigram inside "theodophilus")
// This matches 'frodo' and sends results to the top, since the `name` field is boosted,
// but it also matches 'theodore' and 'rodolpho'.
{
  "size": 100,
  "from": 0,
  "query": {
    "multi_match": {
      "query": "frodo",
      "fields": [
        "name",
        "ngram"
      ],
      "type": "best_fields"
    }
  }
}
// I then tried to add the `minimum_should_match` option,
// hoping it would filter out long strings that only had one matching trigram.
{
  "size": 100,
  "from": 0,
  "query": {
    "multi_match": {
      "query": "frodo",
      "fields": [
        "name",
        "ngram"
      ],
      "type": "best_fields",
      "minimum_should_match": "90%"
    }
  }
}
I've tried playing around in Sense, manually producing the match queries this generates so I can apply minimum_should_match to the ngram field only, but I can't seem to get the syntax right.
// I then tried to construct a custom query that returns only the
// `minimum_should_match`-filtered results on the ngram field.
// I started with a query produced by using bodybuilder to `and` and `or` my other search criteria together.
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            // each separate field's criteria `must`/`and`ed together
            {
              "query": {
                "bool": {
                  "filter": {
                    "bool": {
                      "should": [
                        // each criterion for a specific field `should`/`or`ed together
                        {
                          // my attempt at getting `ngram` field results...
                          // should theoretically only return when the field
                          // contains nothing but matching ngrams
                          // (i.e. exact matches and other fluke matches)
                          "query": {
                            "match": {
                              "ngram": {
                                "query": "frodo",
                                "minimum_should_match": "100%"
                              }
                            }
                          }
                        }
                        // ... other criteria for this field, `should`/`or`ed together
                      ]
                    }
                  }
                }
              }
            }
            // ... other criteria to be `must`/`and`ed together
          ]
        }
      }
    }
  }
}
Can anyone see what I'm doing wrong?
It seems like this should be fairly straightforward to accomplish, but I must be missing something obvious.
UPDATE
I ran a query with _explain=true (using sense UI) to try to understand my results.
I queried for a match on the ngram field for "frod" with minimum_should_match = 100%, yet I still get every record that matches at least one ngram.
(e.g. rodolpho even though it doesn't contain fro)
GIST: test query and results
Note: cross-posted from discuss.elastic.co: https://discuss.elastic.co/t/ngram-partial-match-limiting-ngram-results-in-multiple-field-query/62526
I used your settings and mappings to create an index, and your queries seem to be working fine for me. I would suggest running an explain on one of the "unexpected" documents being returned to see why it matches alongside the other results.
Here is what I did:
Run the _analyze API with your analyzer to see how the query will be split into tokens.
curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
  "analyzer": "ngram_analyzer",
  "text": "frodo"
}'
frodo will be split into 3 tokens with your analyzer.
{
  "tokens": [
    {
      "token": "fro",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "rod",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "odo",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
I indexed 3 documents for testing (using only the ngram field). Here are the docs:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "ngram": "theodore"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "ngram": "frodo"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "ngram": "rudolpho"
        }
      }
    ]
  }
}
The first query you mentioned matches frodo and theodore, but not rudolpho, just as you described. That makes sense, since rudolpho does not produce any trigrams that match the trigrams from frodo:
frodo -> fro, rod, odo
rudolpho -> rud, udo, dol, olp, lph, pho
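You can verify this yourself by running rudolpho through the same _analyze call shown above (assuming the same index and analyzer names):

curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
  "analyzer": "ngram_analyzer",
  "text": "rudolpho"
}'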
Using your second query, I get back only frodo (neither of the other two):
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.53148466,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.53148466,
        "_source": {
          "ngram": "frodo"
        }
      }
    ]
  }
}
I then ran an explain (localhost:9200/my_index/my_type/2/_explain) on the other two docs (theodore and rudolpho), and I see this (I have clipped the response):
{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "2",
  "matched": false,
  "explanation": {
    "value": 0,
    "description": "Failure to meet condition(s) of required/prohibited clause(s)",
    "details": [
      {
        "value": 0,
        "description": "no match on required clause ((ngram:fro ngram:rod ngram:odo)~2)",
        "details": [
The above is expected, since at least two out of the three tokens from frodo must match.
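In other words, the minimal form of the query that produces this filtering is a plain match on the ngram field with minimum_should_match (a sketch; adjust the percentage to whatever threshold you need, e.g. your 80%):

{
  "query": {
    "match": {
      "ngram": {
        "query": "frodo",
        "minimum_should_match": "80%"
      }
    }
  }
}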
The story: given the example documents below (or extensions of them), is it possible to get the following rankings?
A search on "Cereals" results in the following ranking
Cornflakes
Rice Krispies
A search on "Rice" results in the following ranking
Basmati
Rice Krispies
The documents against which the search is performed:
[{
  name: "Cornflakes"
},
{
  name: "Basmati"
},
{
  name: "Rice Krispies"
}]
Of course, some of them do not even contain the search term, so one option is to add an array of synonyms, each with a text value and a weight, which would help in computing the ranking:
[{
  name: "Cornflakes",
  synonyms: [
    {t: 'Cereals', weight: 100},
    {t: 'Sugar', weight: 100}]
},
{
  name: "Basmati",
  synonyms: [
    {t: 'Cereals', weight: 1},
    {t: 'Rice', weight: 1000}]
},
{
  name: "Rice Krispies",
  synonyms: [
    {t: 'Cereals', weight: 10},
    {t: 'Rice', weight: 1}]
}]
Is this the right approach?
What is the Elasticsearch query for taking weighted synonyms into account?
I think "tags" would be a more appropriate name for the field than "synonyms".
You could use a nested type to store the tags, and use function_score to combine the value of the tags.weight field (of the best-matching tag, if any) with the match score on the name field.
One such implementation could look as follows:
PUT test
PUT test/tag_doc/_mapping
{
  "properties": {
    "tags": {
      "type": "nested",
      "properties": {
        "t": { "type": "string" },
        "weight": { "type": "double" }
      }
    }
  }
}
PUT test/tag_doc/_bulk
{ "index" : { "_index" : "test", "_type" : "tag_doc", "_id":1} }
{"name": "Cornflakes","tags": [{"t": "Cereals", "weight":100},{"t": "Sugar", "weight": 100}]}
{ "index" : { "_index" : "test", "_type" : "tag_doc","_id":2} }
{ "name": "Basmati","tags": [{"t": "Cereals", "weight": 1},{"t": "Rice", "weight": 1000}]}
{ "index" : { "_index" : "test", "_type" : "tag_doc","_id":3} }
{ "name": "Rice Krispies", "tags": [{"t": "Cereals", "weight": 10},{"t": "Rice", "weight": 1}]}
POST test/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "name": {
              "query": "cereals",
              "boost": 100
            }
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "function_score": {
                "functions": [
                  {
                    "field_value_factor": {
                      "field": "tags.weight"
                    }
                  }
                ],
                "query": {
                  "match": {
                    "tags.t": "cereals"
                  }
                },
                "boost_mode": "replace",
                "score_mode": "max"
              }
            },
            "score_mode": "max"
          }
        }
      ]
    }
  }
}
Result:
"hits": {
  "total": 3,
  "max_score": 100,
  "hits": [
    {
      "_index": "test",
      "_type": "tag_doc",
      "_id": "1",
      "_score": 100,
      "_source": {
        "name": "Cornflakes",
        "tags": [
          {
            "t": "Cereals",
            "weight": 100
          },
          {
            "t": "Sugar",
            "weight": 100
          }
        ]
      }
    },
    {
      "_index": "test",
      "_type": "tag_doc",
      "_id": "3",
      "_score": 10,
      "_source": {
        "name": "Rice Krispies",
        "tags": [
          {
            "t": "Cereals",
            "weight": 10
          },
          {
            "t": "Rice",
            "weight": 1
          }
        ]
      }
    },
    {
      "_index": "test",
      "_type": "tag_doc",
      "_id": "2",
      "_score": 1,
      "_source": {
        "name": "Basmati",
        "tags": [
          {
            "t": "Cereals",
            "weight": 1
          },
          {
            "t": "Rice",
            "weight": 1000
          }
        ]
      }
    }
  ]
}
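To check the second ranking from the question, you can run the same dis_max query with rice in place of cereals; Basmati's tag weight of 1000 should then outrank Rice Krispies' boosted name match (a sketch, unchanged except for the query terms):

POST test/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "name": {
              "query": "rice",
              "boost": 100
            }
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "function_score": {
                "functions": [
                  { "field_value_factor": { "field": "tags.weight" } }
                ],
                "query": { "match": { "tags.t": "rice" } },
                "boost_mode": "replace",
                "score_mode": "max"
              }
            },
            "score_mode": "max"
          }
        }
      ]
    }
  }
}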
I want to do exact matching on IDs in a doc field. I have mapped the fields to be indexed not_analyzed, but it seems like each term in the query is tokenized, or at least lowercased. How do I make the query not_analyzed as well? I'm using ES 1.4.4, 1.5.1, and 2.0.0.
Here is a doc:
{
  "_index": "index_1446662629384",
  "_type": "docs",
  "_id": "Cat-129700",
  "_score": 1,
  "_source": {
    "similarids": [
      "Cat-129695",
      "Cat-129699",
      "Cat-129696"
    ],
    "id": "Cat-129700"
  }
}
Here is a query:
{
  "size": 10,
  "query": {
    "bool": {
      "should": [{
        "terms": {
          "similarids": ["Cat-129695", "Cat-129699", "Cat-129696"]
        }
      }]
    }
  }
}
The query above does not work. If I remove the caps and dashes from the doc IDs, it works, but I can't do that for many reasons. Is there a way to make similarids not_analyzed like the doc fields?
If I'm understanding you correctly, all you need to do is set "index": "not_analyzed" on the similarids field in your mapping. If you already have that setting, then something else is going on that isn't apparent from what you posted (the terms query doesn't do any analysis on your search terms). You may want to check your mapping to make sure it is set up the way you think.
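As a first check, you can fetch the live mapping and confirm the field really is not_analyzed (substitute your own index name):

GET /your_index/_mapping

If similarids shows up there as an analyzed string, you will need to fix the mapping and reindex, since analysis happens at index time.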
To test it, I set up a simple index like this:
PUT /test_index
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "doc": {
      "properties": {
        "id": {
          "type": "string",
          "index": "not_analyzed"
        },
        "similarids": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
Then added your document:
PUT /test_index/doc/1
{
  "similarids": [
    "Cat-129695",
    "Cat-129699",
    "Cat-129696"
  ],
  "id": "Cat-129700"
}
And your query works just fine.
POST /test_index/_search
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "similarids": [
              "Cat-129695",
              "Cat-129699",
              "Cat-129696"
            ]
          }
        }
      ]
    }
  }
}
...
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.53148466,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.53148466,
        "_source": {
          "similarids": [
            "Cat-129695",
            "Cat-129699",
            "Cat-129696"
          ],
          "id": "Cat-129700"
        }
      }
    ]
  }
}
I used ES 2.0 here, but it shouldn't matter which version you use. Here is the code I used to test:
http://sense.qbox.io/gist/562ccda28dfaed2717b43739696b88ea861ad690
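As a side note: on Elasticsearch 5.x and later, the string type was split into text and keyword, and the equivalent of "index": "not_analyzed" is the keyword type. A sketch of the same mapping in that style (not needed for the 1.x/2.x versions you listed):

PUT /test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "id": { "type": "keyword" },
        "similarids": { "type": "keyword" }
      }
    }
  }
}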
How do I do a search for a stemmed match?
I.e. at the moment I have many documents that contain the word "skateboard" in the item_title field, but only 3 documents that contain the word "skateboards". Because of this, when I do the following search:
POST /my_index/my_type/_search
{
  "size": 100,
  "query": {
    "multi_match": {
      "query": "skateboards",
      "fields": [ "item_title^3" ]
    }
  }
}
I only get 3 results. However, I would like documents with the word "skateboard" to be returned as well.
From what I understand of Elasticsearch, I would expect this to be done by specifying a mapping on the item_title field with an analyzer that indexes the stemmed version of each word, but I can't find the documentation on how to do this, which suggests it's done in a different way.
Suggestions?
Here's one example:
PUT /stem
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "tags_analyzer": {
          "type": "custom",
          "filter": [
            "standard",
            "lowercase",
            "filter_stemmer"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "item_title": {
          "analyzer": "tags_analyzer",
          "type": "text"
        }
      }
    }
  }
}
Index some sample docs:
POST /stem/test/1
{
  "item_title": "skateboards"
}
POST /stem/test/2
{
  "item_title": "skateboard"
}
POST /stem/test/3
{
  "item_title": "skate"
}
Perform the query:
GET /stem/test/_search
{
  "query": {
    "multi_match": {
      "query": "skateboards",
      "fields": [
        "item_title^3"
      ]
    }
  },
  "fielddata_fields": [
    "item_title"
  ]
}
And see the results:
"hits": [
{
"_index": "stem",
"_type": "test",
"_id": "1",
"_score": 1,
"_source": {
"item_title": "skateboards"
},
"fields": {
"item_title": [
"skateboard"
]
}
},
{
"_index": "stem",
"_type": "test",
"_id": "2",
"_score": 1,
"_source": {
"item_title": "skateboard"
},
"fields": {
"item_title": [
"skateboard"
]
}
}
]
I have also added the fielddata_fields element so that you can see how the content of the field has been indexed. As you can see, in both cases the indexed term is skateboard.
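You can also confirm the stemming directly with the _analyze API against the analyzer defined above; both skateboards and skateboard reduce to the same term:

GET /stem/_analyze
{
  "analyzer": "tags_analyzer",
  "text": "skateboards"
}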