I have been trying to match a query using the Elasticsearch Python client, but I am unable to get a match even after using escape characters and setting up custom analyzers and mappings. I want to search using & and it's not returning any hits.
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
doc1 = {
    'name': 'numb',
    'band': 'linkin_park',
    'year': '2006'
}
doc2 = {
    'name': 'Powerless &',
    'band': 'linkin_park',
    'year': '2006'
}
doc3 = {
    'name': 'Crawling !',
    'band': 'linkin_park',
    'year': '2006'
}
doc = [doc1, doc2, doc3]
'''
create_index = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "whitespace"
                }
            }
        }
    }
}
es.indices.create(index="idx_temp", body=create_index)
'''
for i in range(3):
    es.index(index="idx_temp", doc_type='_doc', id=i, body=doc[i])
my_mapping = {
    "properties": {
        "name": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            },
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer"
        },
        "band": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            },
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer"
        },
        "year": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            },
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer"
        }
    }
}
es.indices.put_mapping(index='idx_temp', body=my_mapping, doc_type='_doc', include_type_name=True)
res = es.search(index='idx_temp', body={
    "query": {
        "match": {
            "name": {
                "query": "powerless &",
                "fuzziness": 3
            }
        }
    }
})
for hit in res['hits']['hits']:
    print(hit['_source'])
The expected output was 'name': 'Powerless &', but I got 0 hits and no value returned.
So I fixed the problem by adding another field,
"search_quote_analyzer": "my_analyzer"
to the mapping, right after
"analyzer": "my_analyzer",
"search_analyzer": "my_analyzer"
and then I got my expected output when searching with & in the query:
'name': 'Powerless &'
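For completeness, here is a minimal end-to-end sketch of the corrected flow with the Python client. It assumes a local cluster; the index name idx_temp2 is just a fresh name for testing, and on some ES versions the mapping must be nested under a type name. The key detail is that the whitespace tokenizer keeps & as its own token (the standard analyzer drops it entirely), and that the analyzer has to be in place before the documents are indexed:

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Create the index with the analyzer *and* the mapping in one call,
# before indexing any documents, so 'name' is analyzed with the
# whitespace tokenizer (which keeps '&' as a standalone token).
body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}
es.indices.create(index="idx_temp2", body=body)
es.index(index="idx_temp2", id=1, body={"name": "Powerless &"}, refresh=True)

res = es.search(index="idx_temp2", body={
    "query": {"match": {"name": "powerless &"}}
})
print(res["hits"]["hits"])  # should include the 'Powerless &' doc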
I just tried it using your index settings, mapping, and query, and was able to get the results. Below are two things I did differently.
First, I escaped the special char & when indexing the doc directly through the ES REST API, using the body below in Postman:
{
  "content": "Powerless \&"
}
ES then gave me an Unrecognized character escape '&' exception, and even Postman, a popular REST client, warned that it was not a proper string.
I then changed the payload to the one below and was able to index the doc:
{
  "content": "Powerless \\&"
}
(Notice the second `\`, which escapes the first so the JSON string is valid.)
Second, I changed the query to use the same field that holds the & value (in your case the name field, not the content field), since a match query is analyzed and uses the same analyzer that was used at index time. With that, I was able to get the result.
PS: I also verified your analyzer using the _analyze API, and it generates the tokens below for the text Powerless \\&:
{
  "tokens": [
    {
      "token": "powerless",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "\\&",
      "start_offset": 10,
      "end_offset": 12,
      "type": "word",
      "position": 1
    }
  ]
}
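If you would rather run that _analyze check from the Python client instead of Postman, here is a small sketch (assuming the idx_temp index, the my_analyzer analyzer, and the es client from the question exist; the exact method signature may vary between client versions):

# Ask ES how my_analyzer tokenizes the text.
tokens = es.indices.analyze(index="idx_temp", body={
    "analyzer": "my_analyzer",
    "text": "Powerless &"
})
print([t["token"] for t in tokens["tokens"]])
# With the whitespace tokenizer this should print ['powerless', '&'].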
We have a table with this type of structure:
{_id: 15_0, createdAt: 1/1/1, task_id: [16_0, 17_0, 18_0], table: "details", a: b, c: d, ...}
We created indexes using
{
  "index": {},
  "name": "paginationQueryIndex",
  "type": "text"
}
It auto created
{
  "ddoc": "_design/28e8db44a5a0862xxx",
  "name": "paginationQueryIndex",
  "type": "text",
  "def": {
    "default_analyzer": "keyword",
    "default_field": {},
    "selector": {},
    "fields": [],
    "index_array_lengths": true
  }
}
We are using the following query
{
  "selector": {
    "createdAt": { "$gt": 0 },
    "task_id": { "$in": [ "18_0" ] },
    "table": "details"
  },
  "sort": [ { "createdAt": "desc" } ],
  "limit": 20
}
It takes 700-800 ms the first time; after that it decreases to 500-600 ms.
Why does it take longer the first time?
Any way to speed up the query?
Any way to add indexes to specific fields if type is "text"? (instead of indexing all the fields in these records)
You could try creating the index more explicitly, defining the type of each field you wish to index, e.g.:
{
  "index": {
    "fields": [
      { "name": "createdAt", "type": "string" },
      { "name": "task_id", "type": "string" },
      { "name": "table", "type": "string" }
    ]
  },
  "name": "myindex",
  "type": "text"
}
Then your query becomes:
{
  "selector": {
    "createdAt": { "$gt": "1970/01/01" },
    "task_id": { "$in": [ "18_0" ] },
    "table": "details"
  },
  "sort": [ { "createdAt": "desc" } ],
  "limit": 20
}
Notice that I used strings where the data type is a string.
If you're interested in performance, try removing clauses from your query one at a time to see if one of them is causing the problem. You can also look at the explanation of your query to see if it is using your index correctly.
Documentation on creating an explicit text query index is here
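As a sketch of that last point: Cloudant (and CouchDB 2.x) expose a _explain endpoint that accepts the same body as _find and reports which index the planner would pick, without running the query. The URL and credentials below are placeholders:

import requests

# Hypothetical account and database; replace with your own.
db_url = "https://myaccount.cloudant.com/mydb"

query = {
    "selector": {
        "createdAt": {"$gt": "1970/01/01"},
        "task_id": {"$in": ["18_0"]},
        "table": "details"
    },
    "sort": [{"createdAt": "desc"}],
    "limit": 20
}

# _explain returns the query plan, including the chosen index.
resp = requests.post(db_url + "/_explain", json=query,
                     auth=("user", "password"))
print(resp.json().get("index", {}).get("name"))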
Is dynamic mapping for geo_point still working in Elasticsearch 2.x/5.x?
This is the template:
{
  "template": "*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "geo_point_type": {
            "match_mapping_type": "string",
            "match": "t_gp_*",
            "mapping": {
              "type": "geo_point"
            }
          }
        }
      ]
    }
  }
}
This is the error I get when I query the field:
"reason": "failed to parse [geo_bbox] query. field [t_gp_lat-long#en] is expected to be of type [geo_point], but is of [string] type instead"
I seem to remember seeing somewhere in the documentation that this doesn't work, but I thought that was only the case when there is no dynamic template at all.
Any idea?
Update 1
Here's a sample of the document. The actual document is very big, so I took only the relevant part of it.
{
  "_index": "route",
  "_type": "route",
  "_id": "583a014edd76239997fca5e4",
  "_score": 1,
  "_source": {
    "t_b_highway#en": false,
    "t_n_number-of-floors#en": 33,
    "updatedBy#_id": "58059fe368d0a739916f0888",
    "updatedOn": 1480196430596,
    "t_n_ceiling-height#en": 2.75,
    "t_gp_lat-long#en": "13.736248,100.5604997"
  }
}
The data looks correct to me, since you can also index a geo_point field with a lat/long string.
Update 2
The mapping is definitely wrong. That's why I'm wondering whether you can dynamically map a geo_point field.
"t_gp_lat-long#en": {
"type": "string",
"fields": {
"english": {
"type": "string",
"analyzer": "english"
},
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
},
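For reference, this is how the mapping can be pulled back to check (a sketch with the Python client against the route index from the sample above; the response nesting varies by ES version):

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Fetch the live mapping and inspect the dynamically created field.
mapping = es.indices.get_mapping(index="route")
props = mapping["route"]["mappings"]["route"]["properties"]
print(props.get("t_gp_lat-long#en"))
# A 'string' mapping here instead of 'geo_point' means the dynamic
# template was not applied when the field was first indexed.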
Background: I've implemented a partial search on a name field by indexing the tokenized name (name field) as well as a trigram analyzed name (ngram field).
I've boosted the name field to have exact token matches bubble up to the top of the results.
Problem: I am trying to implement a query that limits the nGram matches to ones that only match some threshold (say 80%) of the query string. I understand that minimum_should_match seems to be what I am looking for, but my problem is forming the query to actually produce those results.
My exact token matches are boosted to the top but I still get every document that has a single matching trigram in the ngram field.
GIST: Index settings and mapping
Index Settings
{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "5",
        "max_result_window": "30000",
        "creation_date": "1475853851937",
        "analysis": {
          "filter": {
            "ngram_filter": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "3"
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "filter": [
                "lowercase",
                "ngram_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "AuCjcP5sSb-m59bYrprFcw",
        "version": {
          "created": "2030599"
        }
      }
    }
  }
}
Index Mappings
{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "acw": { "type": "integer" },
          "pcg": { "type": "integer" },
          "date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "dob": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "id": { "type": "string" },
          "name": {
            "type": "string",
            "boost": 10
          },
          "ngram": {
            "type": "string",
            "analyzer": "ngram_analyzer"
          },
          "bdk": { "type": "integer" },
          "mmw": { "type": "integer" },
          "mpi": { "type": "integer" },
          "sex": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
Solution Attempts
GIST: Query Attempts (unlinkified due to the 2-link limit):
https://gist.github.com/jordancardwell/2e690013666e7e1da6ef1acee314b4e6
I tried a multi-match query, which gives me correct search results, but I haven't had luck omitting results for names that only match a single trigram (say, the "odo" trigram inside "theodophilus").
// this matches 'frodo' and sends results to the top, since the `name` field is boosted
// but it also matches 'theodore' and 'rodolpho'
{
  "size": 100,
  "from": 0,
  "query": {
    "multi_match": {
      "query": "frodo",
      "fields": [
        "name",
        "ngram"
      ],
      "type": "best_fields"
    }
  }
}
// I then tried to add the `minimum_should_match` option,
// hoping it would filter out long strings that only had one matching trigram
{
  "size": 100,
  "from": 0,
  "query": {
    "multi_match": {
      "query": "frodo",
      "fields": [
        "name",
        "ngram"
      ],
      "type": "best_fields",
      "minimum_should_match": "90%"
    }
  }
}
I've tried playing around in Sense, manually producing the match queries that this generates, so that I can apply minimum_should_match to the ngram field only, but I can't seem to get the syntax right.
// I then tried to construct a custom query that only returns the
// `minimum_should_match`-ed results on the ngram field.
// I started with a query produced by using bodybuilder to `and` and `or`
// my other search criteria together.
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            // each separate field's criteria `must`/`and`ed together
            {
              "query": {
                "bool": {
                  "filter": {
                    "bool": {
                      "should": [
                        // each criterion for a specific field `should`/`or`ed together
                        {
                          // my attempt at getting `ngram` field results...
                          // should theoretically only return when the field
                          // contains nothing but matching ngrams
                          // (i.e. exact matches and other fluke matches)
                          "query": {
                            "match": {
                              "ngram": {
                                "query": "frodo",
                                "minimum_should_match": "100%"
                              }
                            }
                          }
                        }
                        // ... other criteria to be `should`/`or`ed together
                      ]
                    }
                  }
                }
              }
            }
            // ... other criteria to be `must`/`and`ed together
          ]
        }
      }
    }
  }
}
Can anyone see what I'm doing wrong?
It seems like this should be fairly straightforward to accomplish, but I must be missing something obvious.
UPDATE
I ran a query with explain=true (using the Sense UI) to try to understand my results.
I queried for a match on the ngram field for "frod" with minimum_should_match = 100%, yet I still get every record that matches at least one ngram.
(e.g. rodolpho even though it doesn't contain fro)
GIST: test query and results
Note: cross-posted from discuss.elastic.co (couldn't include the link inline due to the link limit):
https://discuss.elastic.co/t/ngram-partial-match-limiting-ngram-results-in-multiple-field-query/62526
I used your settings and mappings to create an index, and your queries seem to be working fine for me. I would suggest running an explain on one of the "unexpected" documents being returned to see why it matches and is returned with the other results.
Here is what I did:
Run the _analyze API with your analyzer to see how the query will be split into tokens:
curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
  "analyzer": "ngram_analyzer",
  "text": "frodo"
}'
frodo will be split into 3 tokens with your analyzer.
{
  "tokens": [
    {
      "token": "fro",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "rod",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "odo",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
I indexed 3 documents for testing (using only the ngram field). Here are the docs:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "ngram": "theodore"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "ngram": "frodo"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "ngram": "rudolpho"
        }
      }
    ]
  }
}
The first query you mentioned matches frodo and theodore, but not rudolpho. That makes sense, since rudolpho does not produce any trigrams that match trigrams from frodo:
frodo -> fro, rod, odo
rudolpho -> rud, udo, dol, olp, lph, pho
Using your second query, I get back only frodo (neither of the other two):
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.53148466,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.53148466,
        "_source": {
          "ngram": "frodo"
        }
      }
    ]
  }
}
I then ran an explain (localhost:9200/my_index/my_type/2/_explain) on the other two docs (theodore and rudolpho), and I see this (I have clipped the response):
{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "2",
  "matched": false,
  "explanation": {
    "value": 0,
    "description": "Failure to meet condition(s) of required/prohibited clause(s)",
    "details": [
      {
        "value": 0,
        "description": "no match on required clause ((ngram:fro ngram:rod ngram:odo)~2)",
        "details": [
The above is expected, since at least two out of the three tokens from frodo must match ("90%" of 3 tokens rounds down to 2).
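To fold this back into the original goal (exact name matches boosted to the top, trigram matches thresholded), one possible shape for the combined query is sketched below with the Python client, reusing the my_index/my_type names from above. This is an illustration of the structure, not the poster's exact query, and it assumes a client old enough to accept doc_type:

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

query = {
    "query": {
        "bool": {
            "should": [
                # Exact token matches on the boosted 'name' field rise
                # to the top of the results.
                {"match": {"name": "frodo"}},
                # Trigram matches only qualify when roughly 90% of the
                # query's trigrams are present (90% of 3 rounds down
                # to 2, hence the "~2" in the explain output above).
                {
                    "match": {
                        "ngram": {
                            "query": "frodo",
                            "minimum_should_match": "90%"
                        }
                    }
                }
            ],
            "minimum_should_match": 1
        }
    }
}
res = es.search(index="my_index", doc_type="my_type", body=query)
for hit in res["hits"]["hits"]:
    print(hit["_score"], hit["_source"])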
We have an index of items, and I'm attempting a fuzzy wildcard search on the item's name.
The query:
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "fields": [
            "name.suggest"
          ],
          "query": "avacado*",
          "fuzziness": 0.7
        }
      }
    }
  }
}
The field in the index and the analyzers at play:
"
suggest_analyzer":{
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "shingle", "punctuation"]
}
"punctuation" : {
"type" : "word_delimiter",
"preserve_original": "true"
}
"name": {
"fields": {
"name": {
"type": "string",
"analyzer": "stem"
},
"suggest":{
"type": "string",
"analyzer": "suggest_analyzer"
},
"untouched": {
"include_in_all": false,
"index": "not_analyzed",
"index_options": "docs",
"omit_norms": true,
"type": "string"
},
"untouched_lowercase": {
"type": "string",
"index_analyzer": "lowercase",
"search_analyzer": "lowercase"
}
},
"type": "multi_field"
},
The problem is this: an item with the name "Avocado Test" will match for the following:
avocado*
avo*
avacado
but fails to match for
avacado*
ava*
ava~2
I can't seem to make fuzziness work with wildcards; it seems that either fuzziness works or wildcards work, but not in combination.
ES version is 1.3.1.
Note that my query is simplified and we have other filtering going on, but I boiled it down to just this query to take any ambiguity out of the results. I've attempted to use the suggest features, but they won't allow the level of filtering we need.
Is there any other way to handle suggest/typeahead-style searching with fuzziness to catch misspellings?
You could try an EdgeNGram token filter: use it in an analyzer applied to the desired field, and run a fuzzy search against that field.
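A minimal sketch of what that could look like on ES 1.x (the index, type, and analyzer names here are hypothetical). Edge ngrams index every prefix of each token, so a plain fuzzy match on the typed prefix can stand in for the fuzzy-plus-wildcard combination:

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Apply the edge-ngram analyzer at index time only, and a plain
# analyzer at search time so the query itself is not ngrammed.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "edge_ngram_filter": {
                    "type": "edgeNGram",  # 'edge_ngram' on newer versions
                    "min_gram": 2,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "edge_ngram_filter"]
                }
            }
        }
    },
    "mappings": {
        "item": {
            "properties": {
                "name": {
                    "type": "string",
                    "index_analyzer": "edge_ngram_analyzer",
                    "search_analyzer": "standard"
                }
            }
        }
    }
}
es.indices.create(index="items_test", body=settings)
es.index(index="items_test", doc_type="item", id=1,
         body={"name": "Avocado Test"}, refresh=True)

# 'avacado' is a misspelling of 'avocado'; fuzziness bridges the typo,
# while the indexed prefixes let shorter inputs like 'avaca' match too.
res = es.search(index="items_test", body={
    "query": {
        "match": {
            "name": {
                "query": "avacado",
                "fuzziness": 2
            }
        }
    }
})
print(res["hits"]["total"])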
I am using the river plugin for CouchDB and when I execute the following curl command:
curl -XPUT 'localhost:9200/_river/blog/_meta' -d '{
  "type": "couchdb",
  "couchdb": {
    "host": "localhost",
    "port": 5984,
    "db": "blog",
    "filter": null
  },
  "index": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "whitespace",
          "filter": "lowercase"
        },
        "ox_edgeNGram": {
          "type": "custom",
          "tokenizer": "ox_t_edgeNGram",
          "filter": [
            "lowercase"
          ]
        },
        "ox_NGram": {
          "type": "custom",
          "tokenizer": "ox_t_NGram",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "ox_t_edgeNGram": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25,
          "side": "front"
        },
        "ox_t_NGram": {
          "type": "NGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  }
}'
I receive the response:
{
  "ok": true,
  "_index": "_river",
  "_type": "blog",
  "_id": "_meta",
  "_version": 1
}
The problem I have is that when I want to view the settings in the browser and go to:
http://localhost:9200/blog/_settings?pretty=true
The JSON that is returned is as follows, but I'm expecting information regarding the analyzer etc. that I thought I created:
{
  "blog": {
    "settings": {
      "index.number_of_shards": "5",
      "index.number_of_replicas": "1"
    }
  }
}
It should also be noted that when I create a blog index without using the river and run a curl command to input the analysis information, I do receive a response from the browser indicating the settings that I input.
How can I set the default settings of an index when using the river plugin?
To solve this issue:
1. Create the new Elasticsearch index, mappings, etc.
2. Create the new Elasticsearch river with the name of the index set to that of the index created in step one.
I found the answer here:
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/5ebf1556d139d5ac/f17e71e04cac5889?lnk=gst&q=couchDB+river+settings#f17e71e04cac5889
You can try this URL: http://localhost:9200/blog/_mapping?pretty=true
In the response mapping, if the analyzer is not explicitly mentioned, then the default analyzer is in use.
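A sketch of that two-step fix in Python using requests (the index and river names follow the question; the analysis settings are abbreviated to one analyzer for brevity):

import requests

# Step 1: create the 'blog' index yourself, with the analysis
# settings, before the river can auto-create it with defaults.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "ox_edgeNGram": {
                    "type": "custom",
                    "tokenizer": "ox_t_edgeNGram",
                    "filter": ["lowercase"]
                }
            },
            "tokenizer": {
                "ox_t_edgeNGram": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 25,
                    "side": "front"
                }
            }
        }
    }
}
requests.put("http://localhost:9200/blog", json=index_settings)

# Step 2: register the river and point it at the pre-created index,
# so the river indexes into 'blog' instead of creating it.
river_meta = {
    "type": "couchdb",
    "couchdb": {"host": "localhost", "port": 5984, "db": "blog", "filter": None},
    "index": {"index": "blog"}
}
requests.put("http://localhost:9200/_river/blog/_meta", json=river_meta)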