Python Elasticsearch BadRequestError while making a case-insensitive match analyzer (python-3.x)

I'm trying to build an index that is searchable for a case-insensitive exact match. The Elasticsearch version is 8.6.2, with Lucene 9.4.2. The code runs in Python using the elasticsearch library.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "lower_analizer": {"tokenizer": "whitespace", "filter": ["lowercase"]}
            }
        }
    }
}
mappings = {
    "properties": {
        "title": {"type": "text", "analyzer": "standard"},
        "article": {"type": "text", "analyzer": "lower_analizer"},
        "sentence_id": {"type": "integer"},
    }
}
I copied the settings from Elasticsearch's tutorial. However, it returned the following error:
BadRequestError: BadRequestError(400, 'illegal_argument_exception',
'unknown setting [index.settings.analysis.analyzer.lower_analizer.filter]
please check that any required plugins are installed, or check the
breaking changes documentation for removed settings')
I'm not sure how to proceed, as the error seems to imply that the lowercase filter does not exist?
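For reference, the unknown setting path in the error (index.settings.analysis...) suggests the body may be nested one level too deep: the 8.x Python client's settings= parameter already denotes the settings object. A minimal sketch under that assumption (the client connection and index name are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical connection

# Assumption: pass the analysis object directly, without an extra {"settings": ...} wrapper
es.indices.create(
    index="articles",  # hypothetical index name
    settings={
        "analysis": {
            "analyzer": {
                "lower_analizer": {"tokenizer": "whitespace", "filter": ["lowercase"]}
            }
        }
    },
    mappings={
        "properties": {
            "title": {"type": "text", "analyzer": "standard"},
            "article": {"type": "text", "analyzer": "lower_analizer"},
            "sentence_id": {"type": "integer"},
        }
    },
)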

The standard analyzer includes a lowercase token filter by default, and text field types use the standard analyzer unless configured otherwise:
PUT test_stackoverflow/_doc/1
{
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
GET test_stackoverflow/_search
{
  "query": {
    "match": {
      "text": "quick"
    }
  }
}
Response:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "test_stackoverflow",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
        }
      }
    ]
  }
}
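For completeness, the same round trip through the Python client might look like this (a sketch, assuming an 8.x client connected as es):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

es.index(
    index="test_stackoverflow",
    id="1",
    document={"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."},
)
es.indices.refresh(index="test_stackoverflow")  # make the doc visible to search

resp = es.search(index="test_stackoverflow", query={"match": {"text": "quick"}})
print(resp["hits"]["hits"][0]["_source"]["text"])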

Related

Unable to search a query with symbols in elasticsearch

I have been trying to match a query using the elasticsearch Python client, but I am unable to match it even after using escape characters and setting up some custom analyzers and mappings. I want to search using & and it's not returning any response.
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

doc1 = {
    'name': 'numb',
    'band': 'linkin_park',
    'year': '2006'
}
doc2 = {
    'name': 'Powerless &',
    'band': 'linkin_park',
    'year': '2006'
}
doc3 = {
    'name': 'Crawling !',
    'band': 'linkin_park',
    'year': '2006'
}
doc = [doc1, doc2, doc3]

'''
create_index = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "filter": ["lowercase"],
                    "tokenizer": "whitespace"
                }
            }
        }
    }
}
es.indices.create(index="idx_temp", body=create_index)
'''

for i in range(3):
    es.index(index="idx_temp", doc_type='_doc', id=i, body=doc[i])
my_mapping = {
    "properties": {
        "name": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword", "ignore_above": 256}
            },
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer"
        },
        "band": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword", "ignore_above": 256}
            },
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer"
        },
        "year": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword", "ignore_above": 256}
            },
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer"
        }
    }
}
es.indices.put_mapping(index='idx_temp', body=my_mapping, doc_type='_doc', include_type_name=True)
res = es.search(index='idx_temp', body={
    "query": {
        "match": {
            "name": {
                "query": "powerless &",
                "fuzziness": 3
            }
        }
    }
})
for hit in res['hits']['hits']:
    print(hit['_source'])
The expected output was 'name': 'Powerless &', but I got 0 hits and no value returned.
So I have fixed the problem by adding one more setting,
"search_quote_analyzer": "my_analyzer"
to each field in the mapping, after
"analyzer": "my_analyzer",
"search_analyzer": "my_analyzer"
And then, searching with & in the query, I get the expected output:
'name': 'Powerless &'
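Spelled out as a mapping dict, the per-field block that fix describes would look roughly like this (only the name field shown; a sketch of the self-reported fix, not independently verified):

my_mapping = {
    "properties": {
        "name": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer",
            "search_quote_analyzer": "my_analyzer",  # the setting added by the fix
        },
        # "band" and "year" would be configured the same way
    }
}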
I just tried it using your index settings, mapping, and query, and was able to get the results. Below are two things I did differently.
First, escape the special char &. When I tried to index the doc using the ES REST API directly, with the body below in Postman:
{
  "content": "Powerless \&"
}
ES gave me an Unrecognized character escape '&' exception, and even Postman, a popular REST client, warned that it was not a proper JSON string.
I then changed the above payload to the below and was able to index the doc:
{
  "content": "Powerless \\&"
}
Notice I added another `\` to escape the `&`.
Second, I changed the query to use the same field that holds the & value (in your case the name field, not the content field), since the match query is analyzed and uses the same analyzer that was used at index time. With that, I was able to get the result.
PS: I also verified your analyzer using the _analyze API, and it generates the tokens below for the text Powerless \\&:
{
  "tokens": [
    { "token": "powerless", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
    { "token": "\\&", "start_offset": 10, "end_offset": 12, "type": "word", "position": 1 }
  ]
}
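The same check can be run from the Python client via the analyze API (a sketch, matching the 7.x-style client used in the question):

# Assumes the `es` client and the `idx_temp` index from the question
resp = es.indices.analyze(
    index="idx_temp",
    body={"analyzer": "my_analyzer", "text": "Powerless \\&"},
)
print([t["token"] for t in resp["tokens"]])  # ['powerless', '\\&']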

Can search Elasticsearch 5.0 (Windows) using a simple query, but prefixing with a field name fails (search by example)

I am trying to get "search by example" functionality out of ElasticSearch.
I have a number of objects which have fields, e.g. name, description, objectID, etc.
I want to perform a search where, for example, "name=123" and "description=ABC"
Mapping:
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 3,
    "refresh_interval": "5s",
    "index.mapping.total_fields.limit": "500"
  },
  "mappings": {
    "CFS": {
      "_routing": {
        "required": true
      },
      "properties": {
        "objectId": {
          "store": true,
          "type": "keyword",
          "index": "not_analyzed"
        },
        "name": {
          "type": "text",
          "analyzer": "standard"
        },
        "numberOfUpdates": {
          "type": "long"
        },
        "dateCreated": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        },
        "lastModified": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis",
          "index": "not_analyzed"
        }
      }
    }
  }
}
Trying a very simple search, without a field name, gives the correct result:
Request: GET http://localhost:9200/repository/CFS/_search?routing=CFS&q=CFS3
Returns:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.7831944,
    "hits": [
      {
        "_index": "repository",
        "_type": "CFS",
        "_id": "589a9a62-1e4d-4545-baf9-9cc7bf4d582a",
        "_score": 0.7831944,
        "_routing": "CFS",
        "_source": {
          "doc": {
            "name": "CFS3",
            "description": "CFS3Desc",
            "objectId": "589a9a62-1e4d-4545-baf9-9cc7bf4d582a",
            "lastModified": 1480524291530,
            "dateCreated": 1480524291530
          }
        }
      }
    ]
  }
}
But trying to prefix with a field name fails (and this happens on all fields, e.g. objectId):
Request: GET http://localhost:9200/repository/CFS/_search?routing=CFS&q=name:CFS3
Returns:
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
Eventually I want to do something like:
{
  "bool": {
    "must": [
      {
        "wildcard": {
          "name": {
            "wildcard": "*CFS3*",
            "boost": 1.0
          }
        }
      },
      {
        "wildcard": {
          "description": {
            "wildcard": "*CFS3Desc*",
            "boost": 1.0
          }
        }
      }
    ]
  }
}
Maybe related? When I try to use a "multi_match" to do this, I have to prefix my field name with a wildcard, e.g.
POST http://localhost:9200/repository/CFS/_search?routing=CFS
{
  "query": {
    "multi_match": {
      "query": "CFS3",
      "fields": ["*name"]
    }
  }
}
If I don't prefix it, it doesn't find anything. I've spent 2 days searching StackOverflow and the ElasticSearch documentation. But these issues don't seem to be mentioned.
There's lots about wildcards for search terms, and even mention of wildcards AFTER the field name, but nothing about BEFORE the field name.
What piece of information am I missing from the field name, that I need to deal with by specifying a wildcard?
I think the types of my fields in the mapping are correct. I'm specifying an analyzer.
I found out the answer to this :(
I had been keen to utilise "upserts", to avoid having to check if the object already existed, and to therefore keep performance high.
As shown at https://www.elastic.co/guide/en/elasticsearch/guide/current/partial-updates.html and https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html, when calling the Update REST endpoint you specify your payload as:
{
  "doc": {
    "tags": [ "testing" ],
    "views": 0
  }
}
When implementing the equivalent using the Java client, I didn't follow the examples exactly. Instead of what was suggested:
UpdateRequest updateRequest = new UpdateRequest();
updateRequest.index("index");
updateRequest.type("type");
updateRequest.id("1");
updateRequest.doc(jsonBuilder()
        .startObject()
        .field("gender", "male")
        .endObject());
client.update(updateRequest).get();
I had implemented:
JsonObject state = extrapolateStateFromEvent( event );
JsonObject doc = new JsonObject();
doc.add( "doc", state );
UpdateRequest updateRequest = new UpdateRequest( indexName, event.getEntity().getType(), event.getEntity().getObjectId() );
updateRequest.routing( event.getEntity().getType() );
updateRequest.doc( doc.toString() );
updateRequest.upsert( doc.toString() );
UpdateResponse response = client.update( updateRequest ).get();
I had wrapped my payload/"state" in a "doc" object, thinking it was needed; but the Java client's updateRequest.doc(...) adds that wrapper itself, so my documents ended up nested one level deeper, under doc.name, doc.description, and so on. That is why the unqualified q=CFS3 still matched (it searches across all fields) while q=name:CFS3 found nothing.
This had a large impact on how I interacted with my data, and at no point was I warned about it.
I had accidentally created a nested object, although I wonder why it affects the search APIs so much.
How could this be improved? Maybe the mapping could default to disallowing nested objects? Or there could be some kind of validation that a programmer could perform?
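For comparison, here is the same upsert through an ES 5-era Python client, where the body is built by hand and takes exactly one "doc" wrapper (a sketch using names from the question; the call shape is an assumption):

# `state` is the partial document itself -- no extra {"doc": ...} around it
state = {"name": "CFS3", "description": "CFS3Desc"}
es.update(
    index="repository",
    doc_type="CFS",                              # ES 5.x mapping type
    id="589a9a62-1e4d-4545-baf9-9cc7bf4d582a",
    routing="CFS",
    body={"doc": state, "doc_as_upsert": True},  # the single expected "doc" wrapper
)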

nGram partial matching & limiting nGram results in multiple field query

Background: I've implemented a partial search on a name field by indexing the tokenized name (name field) as well as a trigram analyzed name (ngram field).
I've boosted the name field to have exact token matches bubble up to the top of the results.
Problem: I am trying to implement a query that limits the nGram matches to ones that only match some threshold (say 80%) of the query string. I understand that minimum_should_match seems to be what I am looking for, but my problem is forming the query to actually produce those results.
My exact token matches are boosted to the top but I still get every document that has a single matching trigram in the ngram field.
GIST: Index settings and mapping
Index Settings
{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "5",
        "max_result_window": "30000",
        "creation_date": "1475853851937",
        "analysis": {
          "filter": {
            "ngram_filter": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "3"
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "filter": ["lowercase", "ngram_filter"],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "AuCjcP5sSb-m59bYrprFcw",
        "version": {
          "created": "2030599"
        }
      }
    }
  }
}
Index Mappings
{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "acw": { "type": "integer" },
          "pcg": { "type": "integer" },
          "date": { "type": "date", "format": "strict_date_optional_time||epoch_millis" },
          "dob": { "type": "date", "format": "strict_date_optional_time||epoch_millis" },
          "id": { "type": "string" },
          "name": { "type": "string", "boost": 10 },
          "ngram": { "type": "string", "analyzer": "ngram_analyzer" },
          "bdk": { "type": "integer" },
          "mmw": { "type": "integer" },
          "mpi": { "type": "integer" },
          "sex": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}
Solution Attempts
GIST: Query Attempts (unlinkified due to the 2-link limit): https://gist.github.com/jordancardwell/2e690013666e7e1da6ef1acee314b4e6
I tried a multi-match query, which gives me correct search results, but I haven't had luck omitting results for names that only match a single trigram (say "odo" trigram inside "theodophilus")
//this matches 'frodo' and sends results to the top, since `name` field is boosted
// but also matches 'theodore' and 'rodolpho'
{
  "size": 100,
  "from": 0,
  "query": {
    "multi_match": {
      "query": "frodo",
      "fields": ["name", "ngram"],
      "type": "best_fields"
    }
  }
}
// I then tried to throw in the `minimum_should_match` option,
// hoping it would filter out large strings that only had one matching trigram, for instance
{
  "size": 100,
  "from": 0,
  "query": {
    "multi_match": {
      "query": "frodo",
      "fields": ["name", "ngram"],
      "type": "best_fields",
      "minimum_should_match": "90%"
    }
  }
}
I've tried playing around in Sense, manually building the match queries this produces, so that I can apply minimum_should_match to the ngram field only, but I can't seem to get the syntax right.
// I then tried to construct a custom query to return just the `minimum_should_match`ed results on the ngram field.
// I started with a query produced by using bodybuilder to `and` and `or` my other search criteria together.
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            // each separate field's criteria `must`/`and`ed together
            {
              "query": {
                "bool": {
                  "filter": {
                    "bool": {
                      "should": [
                        // each criterion for a specific field `should`/`or`ed together
                        {
                          // my attempt at getting `ngram` field results..
                          // should theoretically only return when the field
                          // contains nothing but matching ngrams
                          // (i.e. exact matches and other fluke matches)
                          "query": {
                            "match": {
                              "ngram": {
                                "query": "frodo",
                                "minimum_should_match": "100%"
                              }
                            }
                          }
                        }
                        // ... other criteria to be `should`/`or`ed together
                      ]
                    }
                  }
                }
              }
            }
            // ... other criteria to be `must`/`and`ed together
          ]
        }
      }
    }
  }
}
Can anyone see what I'm doing wrong?
It seems like this should be fairly straightforward to accomplish, but I must be missing something obvious.
UPDATE
I ran a query with _explain=true (using sense UI) to try to understand my results.
I queried for a match on the ngram field for "frod" with minimum_should_match = 100%, yet I still get every record that matches at least one ngram.
(e.g. rodolpho even though it doesn't contain fro)
GIST: test query and results
note: cross-posted from discuss.elastic.co (couldn't post more than 2 links): https://discuss.elastic.co/t/ngram-partial-match-limiting-ngram-results-in-multiple-field-query/62526
I used your settings and mappings to create an index, and your queries seem to be working fine for me. I would suggest running an explain on one of the "unexpected" documents being returned, to see why it matches and is returned with the other results.
Here is what I did:
First, run the _analyze API on your analyzer to see how the query will be split into tokens.
curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
  "analyzer": "ngram_analyzer",
  "text": "frodo"
}'
frodo will be split into 3 tokens with your analyzer.
{
  "tokens": [
    { "token": "fro", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 },
    { "token": "rod", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 },
    { "token": "odo", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 }
  ]
}
I indexed 3 documents for testing (only used the ngram field). Here are the docs:
{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "2", "_score": 1, "_source": { "ngram": "theodore" } },
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1, "_source": { "ngram": "frodo" } },
      { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 1, "_source": { "ngram": "rudolpho" } }
    ]
  }
}
The first query you mentioned matches frodo and theodore, but not rudolpho, which makes sense: rudolpho does not produce any trigrams that match trigrams from frodo.
frodo -> fro, rod, odo
rudolpho -> rud, udo, dol, olp, lph, pho
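A quick way to sanity-check that overlap outside ES is to generate the trigrams by hand (plain Python, mimicking a min_gram=3/max_gram=3 filter on a single lowercased token; illustration only):

def trigrams(word):
    # all 3-character substrings, like a 3/3 ngram token filter
    return {word[i:i + 3] for i in range(len(word) - 2)}

print(trigrams("frodo") & trigrams("rudolpho"))  # set() -- no shared trigrams
print(trigrams("frodo") & trigrams("theodore"))  # {'odo'} -- one shared trigram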
Using your second query, I get back only frodo (neither of the other two):
{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.53148466,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 0.53148466, "_source": { "ngram": "frodo" } }
    ]
  }
}
I then ran an explain (localhost:9200/my_index/my_type/2/_explain) on the other two docs (theodore and rudolpho), and I see this (I have clipped the response):
{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "2",
  "matched": false,
  "explanation": {
    "value": 0,
    "description": "Failure to meet condition(s) of required/prohibited clause(s)",
    "details": [
      {
        "value": 0,
        "description": "no match on required clause ((ngram:fro ngram:rod ngram:odo)~2)",
        "details": [
The above is expected: the ~2 in the clause means at least two out of the three tokens from frodo must match.

How to force Elasticsearch "terms" query to be not_analyzed

I want to make exact matches on ids in a doc field. I have mapped the fields so they are indexed not_analyzed, but it seems like each term in the query is tokenized, or at least lowercased. How do I make the query not_analyzed as well? Using ES 1.4.4, 1.5.1, and 2.0.0.
Here is a doc:
{
  "_index": "index_1446662629384",
  "_type": "docs",
  "_id": "Cat-129700",
  "_score": 1,
  "_source": {
    "similarids": [
      "Cat-129695",
      "Cat-129699",
      "Cat-129696"
    ],
    "id": "Cat-129700"
  }
}
Here is a query:
{
  "size": 10,
  "query": {
    "bool": {
      "should": [{
        "terms": {
          "similarids": ["Cat-129695", "Cat-129699", "Cat-129696"]
        }
      }]
    }
  }
}
The query above does not work. If I remove caps and dashes from the doc ids it works. I can't do that for many reasons. Is there a way to make the similarids not_analyzed like the doc fields?
If I'm understanding you correctly, all you need to do is set "index": "not_analyzed" on "similarids" in your mapping. If you already have that setting, then something else is going on that isn't apparent from what you posted (the "terms" query doesn't do any analysis on your search terms). You may want to check your mapping to make sure it is set up the way you think.
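One quick way to do that check from the Python client (a sketch, using the index and type names from the question):

# Fetch the live mapping and inspect the similarids field
mapping = es.indices.get_mapping(index="index_1446662629384")
props = mapping["index_1446662629384"]["mappings"]["docs"]["properties"]
print(props["similarids"])  # expect {'type': 'string', 'index': 'not_analyzed'}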
To test it, I set up a simple index like this:
PUT /test_index
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "doc": {
      "properties": {
        "id": { "type": "string", "index": "not_analyzed" },
        "similarids": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
Then added your document:
PUT /test_index/doc/1
{
  "similarids": [
    "Cat-129695",
    "Cat-129699",
    "Cat-129696"
  ],
  "id": "Cat-129700"
}
And your query works just fine.
POST /test_index/_search
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "similarids": ["Cat-129695", "Cat-129699", "Cat-129696"]
          }
        }
      ]
    }
  }
}
...
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.53148466,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.53148466,
        "_source": {
          "similarids": ["Cat-129695", "Cat-129699", "Cat-129696"],
          "id": "Cat-129700"
        }
      }
    ]
  }
}
I used ES 2.0 here, but it shouldn't matter which version you use. Here is the code I used to test:
http://sense.qbox.io/gist/562ccda28dfaed2717b43739696b88ea861ad690

Return field where text was found in ElasticSearch

I need help. I have these documents on Elasticsearch 1.6:
{
  "name": "Sam",
  "age": 25,
  "description": "Something"
},
{
  "name": "Michael",
  "age": 23,
  "description": "Something else"
}
with this query:
GET /MyIndex/MyType/_search?q=Michael
Elasticsearch returns this object:
{
  "name": "Michael",
  "age": 23,
  "description": "Something else"
}
... That's right, but I want to get exactly the key under which the text "Michael" was found. Is that possible? Thanks a lot.
I assume that by key you mean the document ID.
When indexing the following documents:
PUT my_index/my_type/1
{
  "name": "Sam",
  "age": 25,
  "description": "Something"
}
PUT my_index/my_type/2
{
  "name": "Michael",
  "age": 23,
  "description": "Something else"
}
And searching for:
GET /my_index/my_type/_search?q=Michael
You'll get the following response:
{
  "took": 8,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.15342641,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.15342641,
        "_source": {
          "name": "Michael",
          "age": 23,
          "description": "Something else"
        }
      }
    ]
  }
}
As you can see, the hits array contains an object for each search hit.
The key for Michael in this case is "_id": "2", which is its document id.
Hope it helps.
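For completeness, the same lookup through a Python client of that era would be along these lines (a sketch, assuming a connected client `es`):

res = es.search(index="my_index", doc_type="my_type", q="Michael")
for hit in res["hits"]["hits"]:
    print(hit["_id"], hit["_source"])  # -> 2 {'name': 'Michael', ...}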
