ElasticSearch - Query the data on a field that matches from first position

I searched a lot on this and tried numerous combinations, but failed in all attempts.
Here is my problem:
I created a jdbc-river in Elasticsearch as below:
{
  "type" : "jdbc",
  "jdbc" : {
    "driver" : "oracle.jdbc.driver.OracleDriver",
    "url" : "jdbc:oracle:thin:@//ip:1521/db",
    "user" : "user",
    "password" : "pwd",
    "sql" : "select f1, f2, f3 from table"
  },
  "index" : {
    "index" : "subject2",
    "type" : "name2",
    "settings": {
      "analysis": {
        "analyzer": {
          "my_analizer": {
            "type": "custom",
            "tokenizer": "my_pattern_tokenizer",
            "filter": []
          }
        },
        "tokenizer": {
          "my_pattern_tokenizer": {
            "type": "pattern",
            "pattern": "$^"
          }
        },
        "filter": []
      }
    }
  },
  "mappings": {
    "subject2": {
      "properties" : {
        "f1" : {"index" : "not_analyzed", "store": "yes", "analyzer": "my_analizer", "search_analyzer": "keyword", "type": "string"},
        "f2" : {"index" : "not_analyzed", "store": "yes", "analyzer": "my_analizer", "search_analyzer": "keyword", "type": "string"},
        "f3" : {"index" : "not_analyzed", "store": "yes", "analyzer": "my_analizer", "search_analyzer": "keyword", "type": "string"}
      }
    }
  }
}
I want to implement an auto-complete feature that matches the user-entered value against the data in the "f1" field (just that field for now), but only from the start of the value.
Data in the f1 field is like
"Hardin County ABC"
"Country of XYZ"
"County of Blah blah"
"County of Blah second"
The requirement is that when the user types "Coun", Elasticsearch should return the 2nd, 3rd, and 4th results, but not the first. I read about the "keyword" analyzer, which turns the complete value into a single token, but I don't know why it is not working in this case.
Also, if the user types "County of B", then the 3rd and 4th options should be returned.
Below are the queries I have tried.
Option 1
{
  "from": 0, "size": 10,
  "query": { "field" : { "f1" : "count*" } }
}
Option 2
{
  "from": 0, "size": 10,
  "query": {
    "span_first" : {
      "match" : {
        "span_term" : { "COMPANY" : "hardin" }
      },
      "end" : 1
    }
  }
}
Please tell me what I am doing wrong here. Thanks in advance.

Before I answer I want to point out that you are defining an analyzer and then setting index: not_analyzed, which means the analyzer is not used. (Using not_analyzed is the same as using the keyword analyzer: the whole string, untouched, becomes one token.)
Also, analyzer: my_analizer is a shortcut for index_analyzer: my_analizer plus search_analyzer: my_analizer, so your mapping is a bit confusing to me...
Also, the fields will be stored in _source unless you turn that off; you don't need to store fields separately unless you disable _source storage and need those fields returned in the result set.
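(For illustration only, a minimal sketch of that last point, assuming you really did want to disable _source and store a field yourself, which is usually unnecessary:)

"mappings": {
  "subject2": {
    "_source": { "enabled": false },
    "properties": {
      "f1": { "type": "string", "store": "yes" }
    }
  }
}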
There are 2 ways I can think of doing this:
1. Use a match_phrase_prefix query - Easier and slow
Don't define any analyzers, you don't need them.
Mapping:
"subject2": {
"properties" : {
"f1" : { "type": "string" },
"f2" : { "type": "string" },
"f3" : { "type": "string" },
}
}
}
Query:
"match_phrase_prefix" : {
"f1" : {
"query" : "Count"
}
}
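For completeness, a full request in this style would be something like the sketch below (it assumes the index is still named subject2 and keeps the from/size paging from the question):

POST subject2/_search
{
  "from": 0, "size": 10,
  "query": {
    "match_phrase_prefix" : {
      "f1" : {
        "query" : "County of B"
      }
    }
  }
}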
2. Use an edge_ngram token filter - Harder and faster
"settings": {
"analysis": {
"analyzer": {
"edge_autocomplete": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["my_edge_ngram"]
}
},
"filter" : {
"my_edge_ngram" : {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
}
Mapping:
"subject2": {
"properties" : {
"f1" : { "type": "string", "index": "edge_autocomplete" },
"f2" : { "type": "string", "index": "edge_autocomplete" },
"f3" : { "type": "string", "index": "edge_autocomplete" },
}
}
}
Query:
"match" : {
"f1" : "Count",
"analyzer": "keyword"
}
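To see why this approach anchors matches at the start of the field: the keyword tokenizer emits the whole value as a single token, and the edge n-gram filter then indexes its prefixes ("Co", "Cou", "Coun", ... up to 15 characters), so only values that begin with the typed text can match. You can inspect the generated terms with the _analyze API (a sketch using the old query-string form; adjust the index name to yours):

curl "localhost:9200/subject2/_analyze?analyzer=edge_autocomplete&pretty" -d 'County of Blah blah'

Also note that with min_gram set to 2, a single typed character will not match anything; lower min_gram to 1 if you need one-character suggestions.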
Good luck!

Have you tried an ngram filter? It tokenizes strings into chunks of between min_gram and max_gram characters. So your mapping could look like:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "kstem", "ngram"]
        }
      },
      "filter" : {
        "ngram" : {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "subject2": {
      "properties" : {
        "f1" : {
          "type": "multi_field",
          "fields": {
            "f1": {
              "type": "string"
            },
            "autocomplete": {
              "analyzer": "autocomplete",
              "type": "string"
            },
            ...
This will return the ngram "count" for the 2nd, 3rd, and 4th results, which should give you the desired outcome.
Note that making "f1" a multi_field is not required. However, when you don't need the "autocomplete" analyzer, such as when returning "f1" in the search results, it is less expensive to use the "f1" sub-field. If you do use a multi_field, you can access "f1" as "f1" (without dot notation), but to access "autocomplete" you need dot notation: "f1.autocomplete".
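A query against the sub-field would then look something like this (a sketch reusing the mapping above):

POST subject2/_search
{
  "query": {
    "match": {
      "f1.autocomplete": "coun"
    }
  }
}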

The solution we finally implemented is a mix of approaches, but the answer by "ramseykhalaf" is still the closest match. +1 to him.
What I did: while the user is typing a single word (before any space), I fire a phrase_prefix match query and show the closest matching results.
{"from":0,"size":10, "query":{ "match" : { "f1" : {"query" : "MICROSOU", "type" : "phrase_prefix", "boost":2} } } }
As soon as the user types any character after a space, I switch to a query_string query with a wildcard; since multiple words then have to match within the field, the result is again very close to what the user is looking for.
{"from":0,"size":10, "query":{ "query_string" : { "default_field":"f1","query" : "micro int*", "boost":2 } } }
In this way we got the closest solution to this requirement. I would be happy to see a more optimized solution that satisfies the use cases mentioned above.
Just to add one more thing: the river I created is now plain vanilla, with the fields "not_analyzed" and the analyzer set to "keyword".

Related

How to define a default value when creating an index in Elasticsearch

I need to create an index in Elasticsearch, assigning a default value to a field. For example, in Python 3:
from elasticsearch import Elasticsearch

request_body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword"
            },
            "school": {
                "type": "keyword"
            },
            "pass": {
                "type": "keyword"
            }
        }
    }
}

es = Elasticsearch(['https://....'])
es.indices.create(index="test-index", ignore=400, body=request_body)
In the above scenario the index will be created with those fields, but I need "pass" to default to True. Can I do that here?
Elasticsearch is schema-less. It allows any number of fields, with any content, without logical constraints.
In a distributed system integrity checking can be expensive, so RDBMS-style checks are not available in Elasticsearch.
The best way is to do validation on the client side.
Another approach is to use an ingest pipeline:
Ingest pipelines let you perform common transformations on your data before indexing. For example, you can use pipelines to remove fields, extract values from text, and enrich your data.
**For testing**
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "lang": "painless",
          "source": "if (ctx.pass == null) { ctx.pass = 'true' }"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "2",
      "_source": {
        "name": "a",
        "school": "aa"
      }
    }
  ]
}
PUT _ingest/pipeline/default-value_pipeline
{
  "description": "Set default value",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.pass == null) { ctx.pass = 'true' }"
      }
    }
  ]
}
**Indexing document**
POST my-index-000001/_doc?pipeline=default-value_pipeline
{
  "name": "sss",
  "school": "sss"
}
**Result**
{
  "_index" : "my-index-000001",
  "_type" : "_doc",
  "_id" : "hlQDGXoB5tcHqHDtaEQb",
  "_score" : 1.0,
  "_source" : {
    "school" : "sss",
    "pass" : "true",
    "name" : "sss"
  }
}
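If you don't want to pass ?pipeline= on every index request, recent Elasticsearch versions also let you attach the pipeline to the index as its default (a sketch reusing the pipeline defined above):

PUT my-index-000001/_settings
{
  "index.default_pipeline": "default-value_pipeline"
}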

Startswith exact word match in elasticsearch?

I have an index containing a title field with data as below.
jam bread
jamun
jamaica country
So if the user searches for jam, I don't want jamun and jamaica country to also come back in the search results. Right now I am using a prefix query in Elasticsearch, but it is not giving me the results I want:
{
  "query": {
    "prefix" : { "title" : "jam" }
  }
}
You will get both results because a prefix query actually runs a regexp query (keyword*) against the inverted index, so both documents match.
You can do something like the following instead, using a term query in place of the prefix query to do an exact match on the tokenized keyword:
PUT exact_index1
{
  "mappings": {
    "document_type" : {
      "properties": {
        "title" : {
          "type": "text"
        }
      }
    }
  }
}
POST exact_index1/document_type
{
  "title" : "jamun"
}
POST exact_index1/_search
{
  "query": {
    "term": {
      "title": {
        "value": "jam"
      }
    }
  }
}
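To see why the term query behaves differently from the prefix query, you can inspect the tokens the title field actually produces (a sketch using the _analyze API):

POST exact_index1/_analyze
{
  "field": "title",
  "text": "jamun"
}

The standard analyzer emits the single token jamun, which is not equal to the term jam, so jamun is not returned; "jam bread", on the other hand, produces the token jam, which the term query matches exactly.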
Hope this helps
The completion suggester provides search-as-you-type functionality
PUT index_name/document_type/_mapping
{
  "document_type": {
    "properties": {
      "title": {
        "type": "text"
      },
      "suggest": {
        "type": "completion",
        "analyzer": "simple",
        "search_analyzer": "simple"
      }
    }
  }
}
POST index_name/document_type
{
  "name": "jamun",
  "suggest": {
    "input": "jamun",
    "output": "jamun"
  }
}
POST index_name/document_type/_suggest?pretty
{
  "type-suggest": {
    "text": "jam",
    "completion": { "field": "suggest" }
  }
}

Elastic Search: Dynamic Template Mapping for Geo Point Field

Is dynamic mapping for geo point still working in Elastic Search 2.x/5.x?
This is the template:
{
  "template": "*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "geo_point_type": {
            "match_mapping_type": "string",
            "match": "t_gp_*",
            "mapping": {
              "type": "geo_point"
            }
          }
        }
      ]
    }
  }
}
This is the error I get when I query the field:
"reason": "failed to parse [geo_bbox] query. field [t_gp_lat-long#en] is expected to be of type [geo_point], but is of [string] type instead"
I seem to remember seeing somewhere in the documentation that this doesn't work, but I thought that was only when there is no dynamic template at all.
Any idea?
Update 1
Here's a sample of the document. The actual document is very big, so I took only the relevant part of it.
{
  "_index": "route",
  "_type": "route",
  "_id": "583a014edd76239997fca5e4",
  "_score": 1,
  "_source": {
    "t_b_highway#en": false,
    "t_n_number-of-floors#en": 33,
    "updatedBy#_id": "58059fe368d0a739916f0888",
    "updatedOn": 1480196430596,
    "t_n_ceiling-height#en": 2.75,
    "t_gp_lat-long#en": "13.736248,100.5604997"
  }
}
The data looks correct to me, since you can also index a geo_point field with a lat/long string.
Update 2
The mapping is definitely wrong. That's why I'm wondering whether you can dynamically map a geo_point field.
"t_gp_lat-long#en": {
"type": "string",
"fields": {
"english": {
"type": "string",
"analyzer": "english"
},
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
},

Elasticsearch wildcard query string with fuzziness

We have an index of items on which I'm attempting a fuzzy wildcard search on the item name.
The query:
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "fields": [
            "name.suggest"
          ],
          "query": "avacado*",
          "fuzziness": 0.7
        }
      }
    }
  }
}
The field in the index and the analyzers at play:
"suggest_analyzer": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": ["standard", "lowercase", "shingle", "punctuation"]
}
"punctuation" : {
  "type" : "word_delimiter",
  "preserve_original": "true"
}
"name": {
  "fields": {
    "name": {
      "type": "string",
      "analyzer": "stem"
    },
    "suggest": {
      "type": "string",
      "analyzer": "suggest_analyzer"
    },
    "untouched": {
      "include_in_all": false,
      "index": "not_analyzed",
      "index_options": "docs",
      "omit_norms": true,
      "type": "string"
    },
    "untouched_lowercase": {
      "type": "string",
      "index_analyzer": "lowercase",
      "search_analyzer": "lowercase"
    }
  },
  "type": "multi_field"
}
The problem is this: an item with the name "Avocado Test" will match for the following:
avocado*
avo*
avacado
but fails to match for:
avacado*
ava*
ava~2
I can't seem to make fuzziness work with wildcards; it seems that either fuzziness works or wildcards work, but not in combination.
ES version is 1.3.1.
Note that my query is simplified and we have other filtering going on, but I boiled it down to just the query to take any ambiguity out of the results. I've attempted to use the suggest features, but they won't allow the level of filtering we need.
Is there any other way to handle doing suggest/typeahead style searching with fuzziness to catch misspellings?
You could try an edge n-gram token filter: use it in an analyzer applied to the desired field and do a fuzzy search on it.
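A sketch of that idea for ES 1.x (analyzer and filter names here are illustrative, not from the original mapping): index the field through an edge n-gram analyzer so prefixes become real terms, keep a plain search analyzer so the query text is not n-grammed, and then an ordinary match query with fuzziness can stand in for the wildcard:

"settings": {
  "analysis": {
    "filter": {
      "autocomplete_edge": {
        "type": "edgeNGram",
        "min_gram": 2,
        "max_gram": 15
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "autocomplete_edge"]
      }
    }
  }
}

"suggest": {
  "type": "string",
  "index_analyzer": "autocomplete",
  "search_analyzer": "standard"
}

{
  "query": {
    "match": {
      "name.suggest": {
        "query": "avacado",
        "fuzziness": 1
      }
    }
  }
}

Here "avacado" is within edit distance 1 of the indexed term "avocado" (the full-length edge n-gram), so the misspelling matches without any wildcard, and the shorter n-grams play the role of the trailing *.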

Elasticsearch wildcard search on not_analyzed field

I have an index with the following settings and mapping:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_keyword": {
            "tokenizer": "keyword",
            "filter": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "analyzer": "analyzer_keyword",
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
I am struggling to implement wildcard search on the name field. My example data looks like this:
[
{"name": "SVF-123"},
{"name": "SVF-234"}
]
When I perform the following query:
http://localhost:9200/my_index/product/_search -d '
{
  "query": {
    "filtered" : {
      "query" : {
        "query_string" : {
          "query": "*SVF-1*"
        }
      }
    }
  }
}'
It returns both SVF-123 and SVF-234. I think it still tokenizes the data; it should return only SVF-123.
Could you please help with this?
Thanks in advance.
There are a couple of things going wrong here.
First, you are saying that you don't want terms analyzed at index time. Then, there's an analyzer configured (that's used at search time) that generates incompatible terms (they are lowercased).
By default, all terms end up in the _all field, processed with the standard analyzer. That is where you end up searching. Since it tokenizes on "-", you end up with an OR of "*SVF" and "1*".
Try doing a terms facet on _all and on name to see what's going on.
Here's a runnable Play and gist: https://www.found.no/play/gist/3e5fcb1b4c41cfc20226 (https://gist.github.com/alexbrasetvik/3e5fcb1b4c41cfc20226)
You need to make sure the terms you index are compatible with what you search for. You probably want to disable _all, since it can muddy what's going on.
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create indexes
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
  "settings": {
    "analysis": {
      "text": [
        "SVF-123",
        "SVF-234"
      ],
      "analyzer": {
        "analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed",
          "analyzer": "analyzer_keyword"
        }
      }
    }
  }
}'

# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"name":"SVF-123"}
{"index":{"_index":"play","_type":"type"}}
{"name":"SVF-234"}
'

# Do searches

# See all the generated terms.
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
  "facets": {
    "name": {
      "terms": {
        "field": "name"
      }
    },
    "_all": {
      "terms": {
        "field": "_all"
      }
    }
  }
}
'

# Analyzed, so no match
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
  "query": {
    "match": {
      "name": {
        "query": "SVF-123"
      }
    }
  }
}
'

# Not analyzed according to `analyzer_keyword`, so matches. (Note: term, not match)
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
  "query": {
    "term": {
      "name": {
        "value": "SVF-123"
      }
    }
  }
}
'

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
  "query": {
    "term": {
      "_all": {
        "value": "svf"
      }
    }
  }
}
'
My solution adventure
I started my case as you can see in my question. Whenever I changed one part of my settings, one part would start to work but another part would stop working. Let me give my solution history:
1.) I indexed my data with the defaults, which means my data was analyzed, and that caused a problem on my side. For example:
When the user starts to search for a keyword like SVF-1, the system runs this query:
{
  "query": {
    "filtered" : {
      "query" : {
        "query_string" : {
          "analyze_wildcard": true,
          "query": "*SVF-1*"
        }
      }
    }
  }
}
and the results are:
SVF-123
SVF-234
This is normal, because the name field of my documents is analyzed: the query is split into the tokens SVF and 1, and SVF matches both documents even though 1 does not. I skipped this approach and created a mapping that makes my fields not_analyzed:
{
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed"
        },
        "site": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
but my problem continued.
2.) After lots of research I wanted to try another way, and decided to use a wildcard query. My query is:
{
  "query": {
    "wildcard" : {
      "name" : {
        "value" : "*SVF-1*"
      }
    }
  },
  "filter": {
    "term": { "site": "pro_en_GB" }
  }
}
This query worked, but with one problem: my fields are not_analyzed anymore and I am running a wildcard query, so case sensitivity becomes an issue. If I search for svf-1, it returns nothing, yet users may well type the lowercase version of the query.
3.) I changed my document structure to:
{
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed"
        },
        "nameLowerCase": {
          "type": "string",
          "index": "not_analyzed"
        },
        "site": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
I added one more field for name, called nameLowerCase. When indexing a document, I set it like this:
{
  "name": "SVF-123",
  "nameLowerCase": "svf-123",
  "site": "pro_en_GB"
}
Here, I convert the query keyword to lowercase and run the search against the new nameLowerCase field, while displaying the name field.
The final version of my query is:
{
  "query": {
    "wildcard" : {
      "nameLowerCase" : {
        "value" : "*svf-1*"
      }
    }
  },
  "filter": {
    "term": { "site": "pro_en_GB" }
  }
}
Now it works. There is also a way to solve this problem using multi_field (sketched below); my query contains a dash (-), which caused me some problems with it.
Lots of thanks to @Alex Brasetvik for his detailed explanation and effort.
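For reference, the multi_field variant mentioned above would look roughly like this (a sketch; the sub-field name "lower" is illustrative). The lowercased sub-field keeps the whole value as a single token via the analyzer_keyword analyzer from the question, so the wildcard stays case-insensitive without maintaining a second source field:

"name": {
  "type": "multi_field",
  "fields": {
    "name": { "type": "string", "index": "not_analyzed" },
    "lower": { "type": "string", "analyzer": "analyzer_keyword" }
  }
}

{
  "query": {
    "wildcard" : {
      "name.lower" : { "value" : "*svf-1*" }
    }
  }
}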
Adding to Hüseyin's answer, we can use AND as the default operator, so SVF and 1* will be joined using the AND operator, giving us the correct results.
"query": {
"filtered" : {
"query" : {
"query_string" : {
"default_operator": "AND",
"analyze_wildcard": true,
"query": "*SVF-1*"
}
}
}
}
@Viduranga Wijesooriya, as you stated, "default_operator": "AND" will check for the presence of both SVF and 1, but an exact match alone is still not possible. It will, however, filter the results more appropriately, leaving all combinations of SVF and 1 and sorting them by relevance, which promotes SVF-1 up the order.
For pulling out the exact result:
"settings": {
"analysis": {
"analyzer": {
"analyzer_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"type": {
"properties": {
"name": {
"type": "string",
"analyzer": "analyzer_keyword"
}
}
}
}
and the query is:
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string" : {
            "fields": ["name"],
            "query" : "*svf-1*",
            "analyze_wildcard": true
          }
        }
      ]
    }
  }
}
The result:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "play",
        "_type": "type",
        "_id": "AVfXzn3oIKphDu1OoMtF",
        "_score": 1,
        "_source": {
          "name": "SVF-123"
        }
      }
    ]
  }
}
