How to build an N-Gram relationship in Elasticsearch

How to build an N-Gram relationship in Elasticsearch - node.js

I am new to Elasticsearch, and I am looking to build a Front-End app which has a list of proverbs. As the user browses these proverbs, I want them to find related N-Gram proverbs, or analytic proverbs from the Proverb DB. For example when clicking on
"A watched pot never boils" would bring the following suggestions:
1-Gram suggestion:
"Two pees in a pot"
2-Gram suggestion:
"A Watched pot tastes bitter"
Analytical suggestion: "Too many cooks spoil the broth"
Is there a way to do that in ES, or do I need to build my own logic ?

The 1-gram suggestion works out of the box and the 2-gram suggestions can easily be achieved with shingle.
Here is an attempt
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"2-grams": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingles"
]
}
},
"filter": {
"shingles": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"analyzer": "standard",
"fields": {
"2gram": {
"type": "text",
"analyzer": "2-grams"
}
}
}
}
}
}
Next index some documents:
PUT test/_doc/1
{
"text": "Two pees in a pot"
}
PUT test/_doc/2
{
"text": "A Watched pot tastes bitter"
}
Finally, you can search for 1-gram suggestions using the following query and you'll get both documents in the response:
POST test/_search
{
"query": {
"match": {
"text": "A watched pot never boils"
}
}
}
You can also search for 2-gram suggestions using the following query and only the second document will come up:
POST test/_search
{
"query": {
"match": {
"text.2gram": "A watched pot never boils"
}
}
}
PS: Not sure how the "analytical" suggestion works, though, feel free to provide more insights, and I'll update.

Related

how to implement algolia autocomplete on a single index, but i want results to show based on facets

I have an index on algolia, each document like this.
{
"title": "sample title",
"slug": "sample slug",
"content": "Head towards Rajinder Da Dhaba for some insanely delicious Kebabs!!",
"Tags": ["fashion", "shoes"],
"created": "2017-03-30T12:10:08.815Z",
"city": "delhi",
"user": {
"_id": "58b6f3ea884fdc682a820dad",
"description": "Roughly, somewhere between insanity and zen. Mostly the guy at the window seat!",
"displayName": "Jon Doe"
},
"type": "Post",
"places": [
{
"name": "Rajinder Da Dhaba",
"slug": "Rajinder-Da-Dhaba-safdarjung-9e9ffe",
"location": {
"_geoloc": [
{
"name": "Safdarjung",
"_id": "59611a2c2094b56a39afcbce",
"coordinates": {
"lng": 77.2030268,
"lat": 28.5685586
}
}
]
}
}
],
"objectID": "58dcf5a0355b590560d6ad68",
}
I want to implement autocomplete on this.
However, when i see the demos present in algolia dashboard, i found out that it returns the complete documents.
I want to only match on user.displayName, place.name, and title
and return only these fields as suggestions in the autocomplete results instead of complete documents, which match.
I know I can create separate indexes for users, places;
But is this possible with only a single index??

Did you had a look at http://algolia.com/doc/tutorials/search-ui/autocomplete/auto-complete/ ?
It shows how to have a custom display from an index.
To match on on user.displayName, place.name, and title
you can configure the "searchable attributes" from the algolia dashboard.

index and searchs analysers in elastic search: troubles in hitting exact string as first result

I am doing tests with elastic search in indexing wikipedia's topics.
Below my settings.
Results I expect is to have first result matching the exact string - especially if string is made by one word only.
Instead:
Searching for "g"
curl "http://localhost:9200/my_index/_search?q=name:g&pretty=True"
returns
[Changgyeonggung, Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphioparaomelitokatakechymenokichlepikossyphophattoperisteralektryonoptekephalliokigklopeleiolagoiosiraiobaphetraganopterygon, ..] as first results (yes, serendipity time! that is a greek dish if you are curious [http://nifty.works/about/BgdKMmwV6B3r4pXJ/] :)
I thought because the results weight more "G" letters respect to other words.. but:
Searching for "google":
curl "http://localhost:9200/my_index/_search?q=name:google&pretty=True"
returns
[Googlewhack, IGoogle, Google+, Google, ..] as first results, and I would expect Google to be the first.
What is wrong in my settings for not hitting exact keyword if exists?
I used index and search analyzers for the reason suggested in this answer:[https://stackoverflow.com/a/15932838/305883]
Settings
# make index with mapping
curl -X PUT localhost:9200/test-ngram -d '
{
"settings": {
"analysis": {
"analyzer": {
"index_analyzer": {
"type" : "custom",
"tokenizer": "lowercase",
"filter": ["asciifolding", "title_ngram"]
},
"search_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "stop", "asciifolding"]
}
},
"filter": {
"title_ngram" : {
"type" : "nGram",
"min_gram" : 1,
"max_gram" : 10
}
}
}
},
"mappings": {
"topic": {
"properties": {
"name": {
"type": "string",
"boost": 10.0,
"index": "analyzed",
"index_analyzer": "index_analyzer",
"search_analyzer": "search_analyzer"
}
}
}
}
}
'

That's because relevance works in a different way by default (check the part about TF/IDF
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html)
If you want to have exact term match on the top of the results while also matching substrings etc, you need to index name as multifield like this:
"name": {
"type": "string",
"index": "analyzed",
// other analyzer stuff here
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
Then in the boolean query you need to query both name and name.raw and boost results from name.raw

ElasticSearch: Suggestion Completion Multi Search

I am using the suggestion api within ES with completion. My implementation works (code below) but I would like to search for multiple words within a query. In the example below if I query search "word" it finds "wordpress" and outputs "Found". What I am am trying to accomplish is querying with something like "word blog magazine" which are all tags and have an output of "Found". Any help would be appreciated!
Mapping:
curl -XPUT "http://localhost:9200/test_index/" -d'
{
"mappings": {
"product": {
"properties": {
"description": {
"type": "string"
},
"tags": {
"type": "string"
},
"title": {
"type": "string"
},
"tag_suggest": {
"type": "completion",
"index_analyzer": "simple",
"search_analyzer": "simple",
"payloads": false
}
}
}
}
}'
Add document:
curl -XPUT "http://localhost:9200/test_index/product/1" -d'
{
"title": "Product1",
"description": "Product1 Description",
"tags": [
"blog",
"magazine",
"responsive",
"two columns",
"wordpress"
],
"tag_suggest": {
"input": [
"blog",
"magazine",
"responsive",
"two columns",
"wordpress"
],
"output": "Found"
}
}'
_suggest query:
curl -XPOST "http://localhost:9200/test_index/_suggest" -d'
{
"product_suggest":{
"text":"word",
"completion": {
"field" : "tag_suggest"
}
}
}'
The results are as we would expect:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"product_suggest": [
{
"text": "word",
"offset": 0,
"length": 4,
"options": [
{
"text": "Found",
"score": 1
},
]
}
]
}

If you're willing to switch to using edge ngrams (or full ngrams if you need them), I think it will solve your problem.
I wrote up a pretty detailed explanation of how to do this in this blog post:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
But I'll give you a quick and dirty version here. The trick is to use ngrams together with the _all field and the match AND operator.
So with this mapping:
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {
"doc": {
"_all": {
"type": "string",
"analyzer": "ngram_analyzer",
"search_analyzer": "standard"
},
"properties": {
"word": {
"type": "string",
"include_in_all": true
},
"definition": {
"type": "string",
"include_in_all": true
}
}
}
}
}
and some documents:
PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"word":"democracy", "definition":"government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"word":"republic", "definition":"a state in which the supreme power rests in the body of citizens entitled to vote and is exercised by representatives chosen directly or indirectly by them."}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"word":"oligarchy", "definition":"a form of government in which all power is vested in a few persons or in a dominant class or clique; government by the few."}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"word":"plutocracy", "definition":"the rule or power of wealth or of the wealthy."}
{"index":{"_index":"test_index","_type":"doc","_id":5}}
{"word":"theocracy", "definition":"a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."}
{"index":{"_index":"test_index","_type":"doc","_id":6}}
{"word":"monarchy", "definition":"a state or nation in which the supreme power is actually or nominally lodged in a monarch."}
{"index":{"_index":"test_index","_type":"doc","_id":7}}
{"word":"capitalism", "definition":"an economic system in which investment in and ownership of the means of production, distribution, and exchange of wealth is made and maintained chiefly by private individuals or corporations, especially as contrasted to cooperatively or state-owned means of wealth."}
{"index":{"_index":"test_index","_type":"doc","_id":8}}
{"word":"socialism", "definition":"a theory or system of social organization that advocates the vesting of the ownership and control of the means of production and distribution, of capital, land, etc., in the community as a whole."}
{"index":{"_index":"test_index","_type":"doc","_id":9}}
{"word":"communism", "definition":"a theory or system of social organization based on the holding of all property in common, actual ownership being ascribed to the community as a whole or to the state."}
{"index":{"_index":"test_index","_type":"doc","_id":10}}
{"word":"feudalism", "definition":"the feudal system, or its principles and practices."}
{"index":{"_index":"test_index","_type":"doc","_id":11}}
{"word":"monopoly", "definition":"exclusive control of a commodity or service in a particular market, or a control that makes possible the manipulation of prices."}
{"index":{"_index":"test_index","_type":"doc","_id":12}}
{"word":"oligopoly", "definition":"the market condition that exists when there are few sellers, as a result of which they can greatly influence price and other market factors."}
I can apply partial matching across both fields (would work with as many fields as you want) like this:
POST /test_index/_search
{
"query": {
"match": {
"_all": {
"query": "theo go",
"operator": "and"
}
}
}
}
which in this case, returns:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.7601639,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "5",
"_score": 0.7601639,
"_source": {
"word": "theocracy",
"definition": "a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."
}
}
]
}
}
Here is the code I used here (there's more in the blog post):
http://sense.qbox.io/gist/e4093c25a8257499f54ced5a09f35b1eb48e4e3c
Hope that helps.

Elasticsearch query_string combined with match_phrase

I think it's best if I describe my intent and try to break it down to code.
I want users to have the ability of complex queries should they choose to that query_string offers. For example 'AND' and 'OR' and '~', etc.
I want to have fuzziness in effect, which has made me do things I feel dirty about like "#{query}~" to the sent to ES, in other words I am specifying fuzzy query on the user's behalf because we offer transliteration which could be difficult to get the exact spelling.
At times, users search a number of words that are suppose to be in a phrase. query_string searches them individually and not as a phrase. For example 'he who will' should bring me the top match to be when those three words are in that order, then give me whatever later.
Current query:
{
"indices_boost": {},
"aggregations": {
"by_ayah_key": {
"terms": {
"field": "ayah.ayah_key",
"size": 6236,
"order": {
"average_score": "desc"
}
},
"aggregations": {
"match": {
"top_hits": {
"highlight": {
"fields": {
"text": {
"type": "fvh",
"matched_fields": [
"text.root",
"text.stem_clean",
"text.lemma_clean",
"text.stemmed",
"text"
],
"number_of_fragments": 0
}
},
"tags_schema": "styled"
},
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"_source": {
"include": [
"text",
"resource.*",
"language.*"
]
},
"size": 5
}
},
"average_score": {
"avg": {
"script": "_score"
}
}
}
}
},
"from": 0,
"size": 0,
"_source": [
"text",
"resource.*",
"language.*"
],
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "inna alatheena",
"fuzziness": 1,
"fields": [
"text^1.6",
"text.stemmed"
],
"minimum_should_match": "85%"
}
}
],
"should": [
{
"match": {
"text": {
"query": "inna alatheena",
"type": "phrase"
}
}
}
]
}
}
}
Note: alatheena searched without the ~ will not return anything although I have allatheena in the indices. So I must do a fuzzy search.
Any thoughts?

I see that you're doing ES indexing of Qur'anic verses, +1 ...
Much of your problem domain, if I understood it correctly, can be solved simply by storing lots of transliteration variants (and permutations of their combining) in a separate field on your Aayah documents.
First off, you should make a char filter that replaces all double letters with single letters [aa] => [a], [ll] => [l]
Maybe also make a separate field containing all of [a, e, i] (because of their "vocative"/transcribal ambiguity) replaced with € or something similar, and do the same while querying in order to get as many matches as possible...
Also, TH in "allatheena" (which as a footnote may really be Dhaal, Thaa, Zhaa, Taa+Haa, Taa+Hhaa, Ttaa+Hhaa transcribed ...) should be replaced by something, or both the Dhaal AND the Thaa should be transcribed multiple times.
Then, because it's Qur'anic script, all Alefs without diacritics, Hamza, Madda, etc should be treated as Alef (or Hamzat) ul-Wasl, and that should also be considered when indexing / searching, because of Waqf / Wasl in reading arabic. (consider all the Wasl`s in the first Aayah of Surat Al-Alaq for example)
Dunno if this is answering your question in any way, but I hope it's of some assistance in implementing your application nontheless.

You should use Dis Max Query to achieve that.
A query that generates the union of documents produced by its
subqueries, and that scores each document with the maximum score for
that document as produced by any subquery, plus a tie breaking
increment for any additional matching subqueries.
This is useful when searching for a word in multiple fields with
different boost factors (so that the fields cannot be combined
equivalently into a single search field). We want the primary score to
be the one associated with the highest boost.
Quick example how to use it:
POST /_search
{
"query": {
"dis_max": {
"tie_breaker": 0.7,
"boost": 1.2,
"queries": [
{
"match": {
"text": {
"query": "inna alatheena",
"type": "phrase",
"boost": 5
}
}
},
{
"match": {
"text": {
"query": "inna alatheena",
"type": "phrase",
"fuzziness": "AUTO",
"boost": 3
}
}
},
{
"query_string": {
"default_field": "text",
"query": "inna alatheena"
}
}
]
}
}
}
It will run all of your queries, and the one, which scored highest compared to others, will be taken. So just define your rules using it. You should achieve what you wanted.

Elasticsearch term filter on inner object field not matching

I have just organized my document structure to have a more OO design (e.g. moved top level properties like venueId and venueName into a venue object with id and name fields).
However I can now not get a simple term filter working for fields on the child venue inner object.
Here is my mapping:
{
"deal": {
"properties": {
"textId": {"type":"string","name":"textId","index":"no"},
"displayId": {"type":"string","name":"displayId","index":"no"},
"active": {"name":"active","type":"boolean","index":"not_analyzed"},
"venue": {
"type":"object",
"path":"full",
"properties": {
"textId": {"type":"string","name":"textId","index":"not_analyzed"},
"regionId": {"type":"string","name":"regionId","index":"not_analyzed"},
"displayId": {"type":"string","name":"displayId","index":"not_analyzed"},
"name": {"type":"string","name":"name"},
"address": {"type":"string","name":"address"},
"area": {
"type":"multi_field",
"fields": {
"area": {"type":"string","index":"not_analyzed"},
"area_search": {"type":"string","index":"analyzed"}}},
"location": {"type":"geo_point","lat_lon":true}}},
"tags": {
"type":"multi_field",
"fields": {
"tags":{"type":"string","index":"not_analyzed"},
"tags_search":{"type":"string","index":"analyzed"}}},
"days": {
"type":"multi_field",
"fields": {
"days":{"type":"string","index":"not_analyzed"},
"days_search":{"type":"string","index":"analyzed"}}},
"value": {"type":"string","name":"value"},
"title": {"type":"string","name":"title"},
"subtitle": {"type":"string","name":"subtitle"},
"description": {"type":"string","name":"description"},
"time": {"type":"string","name":"time"},
"link": {"type":"string","name":"link","index":"no"},
"previewImage": {"type":"string","name":"previewImage","index":"no"},
"detailImage": {"type":"string","name":"detailImage","index":"no"}}}
}
Here is an example document:
GET /production/deals/wa-au-some-venue-weekends-some-deal
{
"_index":"some-index-v1",
"_type":"deals",
"_id":"wa-au-some-venue-weekends-some-deal",
"_version":1,
"exists":true,
"_source" : {
"id":"921d5fe0-8867-4d5c-81b4-7c1caf11325f",
"textId":"wa-au-some-venue-weekends-some-deal",
"displayId":"some-venue-weekends-some-deal",
"active":true,
"venue":{
"id":"46a7cb64-395c-4bc4-814a-a7735591f9de",
"textId":"wa-au-some-venue",
"regionId":"wa-au",
"displayId":"some-venue",
"name":"Some Venue",
"address":"sdgfdg",
"area":"Swan Valley & Surrounds"},
"tags":["Lunch"],
"days":["Saturday","Sunday"],
"value":"$1",
"title":"Some Deal",
"subtitle":"",
"description":"",
"time":"5pm - Late"
}
}
And here is an 'explain' test on that same document:
POST /production/deals/wa-au-some-venue-weekends-some-deal/_explain
{
"query": {
"filtered": {
"filter": {
"term": {
"venue.regionId": "wa-au"
}
}
}
}
}
{
"ok":true,
"_index":"some-index-v1",
"_type":"deals",
"_id":"wa-au-some-venue-weekends-some-deal",
"matched":false,
"explanation":{
"value":0.0,
"description":"ConstantScore(cache(venue.regionId:wa-au)) doesn't match id 0"
}
}
Is there any way to get more useful debugging info?
Is there something wrong with the explain result description? Simply saying "doesn't match id 0" does not really make sense to me... the field is called 'regionId' (not 'id') and the value is definitely not 0...???

That happens because the type you submitted the mapping for is called deal, while the type you indexed the document in is called deals.
If you look at the mapping for your type deals, you'll see that was automatically generated and the field venue.regionId is analyzed, thus you most likely have two tokens in your index: wa and au. Only searching for those tokens on that type you would get back that document.
Anything else looks just great! Only a small character is wrong ;)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to build an N-Gram relationship in Elasticsearch - node.js

Related

how to implement algolia autocomplete on a single index, but i want results to show based on facets

index and searchs analysers in elastic search: troubles in hitting exact string as first result

ElasticSearch: Suggestion Completion Multi Search

Elasticsearch query_string combined with match_phrase

Elasticsearch term filter on inner object field not matching

Categories

Resources