ElasticSearch: Suggestion Completion Multi Search - search

I am using the suggest API in ES with the completion suggester. My implementation works (code below), but I would like to search for multiple words within a single query. In the example below, if I query for "word", it finds "wordpress" and outputs "Found". What I am trying to accomplish is querying with something like "word blog magazine", which are all tags, and getting an output of "Found". Any help would be appreciated!
Mapping:
curl -XPUT "http://localhost:9200/test_index/" -d'
{
"mappings": {
"product": {
"properties": {
"description": {
"type": "string"
},
"tags": {
"type": "string"
},
"title": {
"type": "string"
},
"tag_suggest": {
"type": "completion",
"index_analyzer": "simple",
"search_analyzer": "simple",
"payloads": false
}
}
}
}
}'
Add document:
curl -XPUT "http://localhost:9200/test_index/product/1" -d'
{
"title": "Product1",
"description": "Product1 Description",
"tags": [
"blog",
"magazine",
"responsive",
"two columns",
"wordpress"
],
"tag_suggest": {
"input": [
"blog",
"magazine",
"responsive",
"two columns",
"wordpress"
],
"output": "Found"
}
}'
_suggest query:
curl -XPOST "http://localhost:9200/test_index/_suggest" -d'
{
"product_suggest":{
"text":"word",
"completion": {
"field" : "tag_suggest"
}
}
}'
The results are as we would expect:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"product_suggest": [
{
"text": "word",
"offset": 0,
"length": 4,
"options": [
{
"text": "Found",
"score": 1
}
]
}
]
}

If you're willing to switch to using edge ngrams (or full ngrams if you need them), I think it will solve your problem.
I wrote up a pretty detailed explanation of how to do this in this blog post:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
But I'll give you a quick and dirty version here. The trick is to use ngrams together with the _all field and the match AND operator.
So with this mapping:
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {
"doc": {
"_all": {
"type": "string",
"analyzer": "ngram_analyzer",
"search_analyzer": "standard"
},
"properties": {
"word": {
"type": "string",
"include_in_all": true
},
"definition": {
"type": "string",
"include_in_all": true
}
}
}
}
}
and some documents:
PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"word":"democracy", "definition":"government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"word":"republic", "definition":"a state in which the supreme power rests in the body of citizens entitled to vote and is exercised by representatives chosen directly or indirectly by them."}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"word":"oligarchy", "definition":"a form of government in which all power is vested in a few persons or in a dominant class or clique; government by the few."}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"word":"plutocracy", "definition":"the rule or power of wealth or of the wealthy."}
{"index":{"_index":"test_index","_type":"doc","_id":5}}
{"word":"theocracy", "definition":"a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."}
{"index":{"_index":"test_index","_type":"doc","_id":6}}
{"word":"monarchy", "definition":"a state or nation in which the supreme power is actually or nominally lodged in a monarch."}
{"index":{"_index":"test_index","_type":"doc","_id":7}}
{"word":"capitalism", "definition":"an economic system in which investment in and ownership of the means of production, distribution, and exchange of wealth is made and maintained chiefly by private individuals or corporations, especially as contrasted to cooperatively or state-owned means of wealth."}
{"index":{"_index":"test_index","_type":"doc","_id":8}}
{"word":"socialism", "definition":"a theory or system of social organization that advocates the vesting of the ownership and control of the means of production and distribution, of capital, land, etc., in the community as a whole."}
{"index":{"_index":"test_index","_type":"doc","_id":9}}
{"word":"communism", "definition":"a theory or system of social organization based on the holding of all property in common, actual ownership being ascribed to the community as a whole or to the state."}
{"index":{"_index":"test_index","_type":"doc","_id":10}}
{"word":"feudalism", "definition":"the feudal system, or its principles and practices."}
{"index":{"_index":"test_index","_type":"doc","_id":11}}
{"word":"monopoly", "definition":"exclusive control of a commodity or service in a particular market, or a control that makes possible the manipulation of prices."}
{"index":{"_index":"test_index","_type":"doc","_id":12}}
{"word":"oligopoly", "definition":"the market condition that exists when there are few sellers, as a result of which they can greatly influence price and other market factors."}
I can apply partial matching across both fields (would work with as many fields as you want) like this:
POST /test_index/_search
{
"query": {
"match": {
"_all": {
"query": "theo go",
"operator": "and"
}
}
}
}
which in this case, returns:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.7601639,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "5",
"_score": 0.7601639,
"_source": {
"word": "theocracy",
"definition": "a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."
}
}
]
}
}
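To see why only document 5 comes back, here is a rough Python simulation of the analysis chain above. It is only a sketch under some assumptions: the regex tokenizer is a crude stand-in for the standard tokenizer, and the helper names (edge_ngrams, index_terms, matches_all) are made up for this illustration and are not part of Elasticsearch or any client library.

```python
import re


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, longest + 1)}


def index_terms(text):
    """Approximate index-time analysis: split into words, lowercase, edge n-grams."""
    terms = set()
    for token in re.findall(r"\w+", text.lower()):
        terms |= edge_ngrams(token)
    return terms


def matches_all(query, text):
    """Approximate a match query with operator "and".

    The query is only tokenized and lowercased (like the search_analyzer),
    and every query term must be present among the indexed n-grams.
    """
    indexed = index_terms(text)
    return all(term in indexed for term in re.findall(r"\w+", query.lower()))


if __name__ == "__main__":
    theocracy = "theocracy a form of government in which God or a deity is recognized"
    plutocracy = "plutocracy the rule or power of wealth or of the wealthy."
    print(matches_all("theo go", theocracy))
    print(matches_all("theo go", plutocracy))
```

The point of using a different search analyzer is visible here: the query terms are not n-grammed, so "theo" has to be a prefix of some indexed word, and "go" has to be a prefix of another (or the same) word in the document.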
Here is the code I used (there's more in the blog post):
http://sense.qbox.io/gist/e4093c25a8257499f54ced5a09f35b1eb48e4e3c
Hope that helps.

Related

How to build an N-Gram relationship in Elasticsearch

I am new to Elasticsearch, and I am looking to build a front-end app which has a list of proverbs. As the user browses these proverbs, I want them to find related N-gram proverbs, or analytic proverbs, from the proverb DB. For example, clicking on
"A watched pot never boils" would bring the following suggestions:
1-Gram suggestion:
"Two pees in a pot"
2-Gram suggestion:
"A Watched pot tastes bitter"
Analytical suggestion: "Too many cooks spoil the broth"
Is there a way to do that in ES, or do I need to build my own logic ?
The 1-gram suggestion works out of the box and the 2-gram suggestions can easily be achieved with shingle.
Here is an attempt:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"2-grams": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingles"
]
}
},
"filter": {
"shingles": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"analyzer": "standard",
"fields": {
"2gram": {
"type": "text",
"analyzer": "2-grams"
}
}
}
}
}
}
Next index some documents:
PUT test/_doc/1
{
"text": "Two pees in a pot"
}
PUT test/_doc/2
{
"text": "A Watched pot tastes bitter"
}
Finally, you can search for 1-gram suggestions using the following query and you'll get both documents in the response:
POST test/_search
{
"query": {
"match": {
"text": "A watched pot never boils"
}
}
}
You can also search for 2-gram suggestions using the following query and only the second document will come up:
POST test/_search
{
"query": {
"match": {
"text.2gram": "A watched pot never boils"
}
}
}
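To make it clearer why only the second document matches on text.2gram, here is a small Python sketch that mimics what the 2-grams analyzer produces. Splitting on whitespace is a simplification of the standard tokenizer, and the shingles helper is a hypothetical name used only for this illustration.

```python
def shingles(text, size=2):
    """Return the set of lowercased word shingles of the given size."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


if __name__ == "__main__":
    query = "A watched pot never boils"
    for doc in ("Two pees in a pot", "A Watched pot tastes bitter"):
        # A document matches the 2-gram query when it shares at least one shingle.
        print(doc, "->", sorted(shingles(query) & shingles(doc)))
```

The first proverb shares single words with the query ("a", "pot") but no pair of adjacent words, which is why it only shows up in the plain match on text.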
PS: Not sure how the "analytical" suggestion works, though, feel free to provide more insights, and I'll update.

Microsoft.Azure.Search (SDK v3.0.3) doesn't return all the facets correctly

When I use the "SearchAsync" and "Search" methods of Microsoft.Azure.Search (v3.0.3) to return the indexed items, the SDK doesn't return all the facets.
However, when I try the same thing using Postman, it returns all the facets correctly.
Could this be a bug in the SDK? I believe it is, since a direct call to an SDK method doesn't return all the facets correctly, but I couldn't find any record of such a bug. If so, is there a fix for it in the SDK? Any help is appreciated.
UPDATE:
After spending some more time, I have found out that the bug is not specific to the .NET SDK.
Both the .NET SDK and the REST API appear to have this problem, and neither returns all the facets. Is there a known bug for this, and what is the fix?
Please see the following example:
There should be 2 Coaching facets, but only 1 is returned by the Azure Search service.
The new search query (facet specialisms added):
https://MYPROJECT-search.search.windows.net/indexes/myproject-directory-qa/docs?api-version=2016-09-01&$count=true&facet=specialisms&$filter=listingType eq 'Therapist'
"Coaching:Development coaching", --> This doesn't return as a facet.
"Coaching:Executive coaching", -->This returns fine.
"@search.facets": {
"specialisms@odata.type": "#Collection(Microsoft.Azure.Search.V2016_09_01.QueryResultFacet)",
"specialisms": [
{
"count": 5,
"value": "Anxiety, depression and trauma:Depression"
},
{
"count": 4,
"value": "Addiction, self-harm and eating disorders:Obsessions"
},
{
"count": 4,
"value": "Anxiety, depression and trauma:Post-traumatic stress"
},
{
"count": 4,
"value": "Coaching:Executive coaching"
},
{
"count": 4,
"value": "Identity, culture and spirituality:Self esteem"
},
{
"count": 4,
"value": "Relationships, family and children:Pregnancy related issues"
},
{
"count": 4,
"value": "Stress and work:Redundancy"
},
{
"count": 3,
"value": "Addiction, self-harm and eating disorders:Eating disorders"
},
{
"count": 3,
"value": "Anxiety, depression and trauma:Bereavement"
},
{
"count": 3,
"value": "Anxiety, depression and trauma:Loss"
}
]
},
{
"@search.score": 1,
"contactId": "df394997-6e94-e711-80ed-3863bb34db00",
"location": {
"type": "Point",
"coordinates": [
-2.58586,
51.47873
],
"crs": {
"type": "name",
"properties": {
"name": "EPSG:4326"
}
}
},
"profileImageUrl": "https://myprojectwebqa.blob.core.windows.net/profileimage/3e31457c-5113-4062-b960-30f038ce7bfc.jpg",
"locationText": "Bristol",
"listingType": "Therapist",
"disabledAccess": true,
"flexibleHours": true,
"offersConcessionaryRates": false,
"homeVisits": true,
"howIWillWork": "<p>Some test data</p>",
"specialisms": [
"Health related issues:Asperger syndrome",
"Health related issues:Chronic fatigue syndrome/ME",
"Addiction, self-harm and eating disorders:Addictions",
"Addiction, self-harm and eating disorders:Eating disorders",
"Addiction, self-harm and eating disorders:Obsessions",
"Anxiety, depression and trauma:Bereavement",
"Anxiety, depression and trauma:Depression",
"Anxiety, depression and trauma:Loss",
"Coaching:Development coaching",
"Coaching:Executive coaching",
"Identity, culture and spirituality:Self esteem",
"Identity, culture and spirituality:Sexuality",
"Relationships, family and children:Infertility",
"Relationships, family and children:Relationships",
"Stress and work:Redundancy"
],
"clientele": [
"Adults",
"Children",
"Groups"
],
"approaches": [
"CBT",
"Cognitive",
"Psychoanalytic",
"Psychosynthesis"
],
"sessionTypes": [
"Home visits",
"Long-term face to face work"
],
"hourlyRate": 50,
"fullName": "Test Name",
"id": "ZWUwNGIyNjYtYjQ5Ny1lNzExLTgwZTktMzg2M2JiMzY0MGI4"
}
Please see the details below.
In my case, I found out that by default the Azure Search service returns only the top 10 values for each facet. That is why I couldn't see all my facets.
After updating my search query as follows, the problem was fixed and I can now see all my facets in the search results. Note the facet part, updated to facet=specialisms, count:9999:
https://MYPROJECTNAME-search.search.windows.net/indexes/MYPROJECTNAME-directory-qa/docs?api-version=2016-09-01&$count=true&facet=specialisms, count:9999&facet=clientele, count:9999&$filter=listingType eq 'Therapist'
For reference, the Microsoft documentation (link below) describes the count parameter as:
"max # of facet terms; default is 10"
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents
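If you build the request URL in code, a small helper keeps the facet count explicit and takes care of URL encoding. This is only a sketch in Python: build_search_url is a hypothetical helper, not part of any Azure SDK, and the service and index names are the placeholders from the query above.

```python
from urllib.parse import urlencode


def build_search_url(service, index, facets, facet_count, filter_expr,
                     api_version="2016-09-01"):
    """Build a search URL that asks for up to `facet_count` values per facet."""
    params = [("api-version", api_version), ("$count", "true")]
    # One facet parameter per field; ",count:N" overrides the default of 10.
    params += [("facet", f"{name},count:{facet_count}") for name in facets]
    params.append(("$filter", filter_expr))
    base = f"https://{service}.search.windows.net/indexes/{index}/docs"
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url(
        "MYPROJECTNAME-search",
        "MYPROJECTNAME-directory-qa",
        ["specialisms", "clientele"],
        9999,
        "listingType eq 'Therapist'",
    ))
```

A very large count such as 9999 works as a catch-all, but it may be worth picking a value closer to the real number of distinct facet values if you know it.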

Index and search analyzers in Elasticsearch: trouble getting the exact string as the first result

I am running tests with Elasticsearch, indexing Wikipedia topics.
My settings are below.
I expect the first result to match the exact string, especially if the string consists of a single word.
Instead:
Searching for "g"
curl "http://localhost:9200/my_index/_search?q=name:g&pretty=True"
returns
[Changgyeonggung, Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphioparaomelitokatakechymenokichlepikossyphophattoperisteralektryonoptekephalliokigklopeleiolagoiosiraiobaphetraganopterygon, ..] as first results (yes, serendipity time! that is a greek dish if you are curious [http://nifty.works/about/BgdKMmwV6B3r4pXJ/] :)
I thought this was because the results give more weight to words containing more "g" letters than other words, but:
Searching for "google":
curl "http://localhost:9200/my_index/_search?q=name:google&pretty=True"
returns
[Googlewhack, IGoogle, Google+, Google, ..] as first results, and I would expect Google to be the first.
What is wrong in my settings that prevents the exact keyword from being the first hit when it exists?
I used index and search analyzers for the reason suggested in this answer:[https://stackoverflow.com/a/15932838/305883]
Settings
# make index with mapping
curl -X PUT localhost:9200/test-ngram -d '
{
"settings": {
"analysis": {
"analyzer": {
"index_analyzer": {
"type" : "custom",
"tokenizer": "lowercase",
"filter": ["asciifolding", "title_ngram"]
},
"search_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "stop", "asciifolding"]
}
},
"filter": {
"title_ngram" : {
"type" : "nGram",
"min_gram" : 1,
"max_gram" : 10
}
}
}
},
"mappings": {
"topic": {
"properties": {
"name": {
"type": "string",
"boost": 10.0,
"index": "analyzed",
"index_analyzer": "index_analyzer",
"search_analyzer": "search_analyzer"
}
}
}
}
}
'
That's because relevance works differently by default (check the part about TF/IDF:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html).
If you want the exact term match at the top of the results while also matching substrings etc., you need to index name as a multi-field like this:
"name": {
"type": "string",
"index": "analyzed",
// other analyzer stuff here
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
Then, in a bool query, you need to query both name and name.raw and boost the results from name.raw.
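As a rough illustration, the request body for such a query could be built like this in Python. It is a sketch, not a tested recipe: exact_first_query is a hypothetical helper, and the boost value of 10 is an arbitrary choice that you would need to tune.

```python
import json


def exact_first_query(term, field="name", raw_boost=10.0):
    """Build a bool query that matches n-grams but ranks exact values first."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Partial matches through the analyzed (n-gram) field.
                    {"match": {field: term}},
                    # Exact match on the not_analyzed sub-field, boosted.
                    {"term": {field + ".raw": {"value": term, "boost": raw_boost}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(exact_first_query("Google"), indent=2))
```

Keep in mind that a not_analyzed field stores the value verbatim, so the term clause only matches when the case is identical ("Google", not "google").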

Returning the "search term" along with result - Elasticsearch

In the elasticsearch module I have built, is it possible to return the "input search term" in the search results ?
For example :
GET /signals/_search
{
"query": {
"match": {
"focused_content": "stock"
}
}
}
This returns
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.057534903,
"hits": [
{
"_index": "signals",
"_type": "signal",
"_id": "13",
"_score": 0.057534903,
"_source": {
"username": "abc@abc.com",
"tags": [
"News"
],
"content_url": "http://www.wallstreetscope.com/morning-stock-highlights-western-digital-corporation-wdc-fibria-celulose-sa-fbr-ametek-inc-ame-cott-corporation-cot-graftech-international-ltd-gti/25375462/",
"source": null,
"focused_content": "Morning Stock Highlights: Western Digital Corporation (WDC), Fibria Celulose SA (FBR), Ametek Inc. (AME), Cott Corporation (COT), GrafTech International Ltd. (GTI) - WallStreet Scope",
"time_stamp": "2015-08-12"
}
}
]
}
}
Is it possible to have the input search term "stock" along with each of the results (like an additional JSON Key along with "content_url","source","focused_content","time_stamp") to identify which search term had brought that result ?
Thanks in Advance !
All I can think of would be using the highlighting feature. It would bring back an additional highlight key for each hit, highlighting the parts that matched.
It won't bring back the exact matching terms, though. You'd have to deal with them in your application. You could use the pre/post tags functionality to wrap them in some special markers, so your app could recognize that it was a match.
You can use highlights on all fields, like @Evaldas suggested. This will return the result along with the value of each field that matched, with the matching text surrounded by customisable tags (default is <em>).
GET /signals/_search
{
"highlight": {
"fields": {
"username": {},
"tags": {},
"source": {},
"focused_content": {},
"time_stamp": {}
}
},
"query": {
"match": {
"focused_content": "stock"
}
}
}
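On the application side, the matched terms can then be pulled out of the highlight fragments. Here is a minimal Python sketch, assuming each hit is a plain dict shaped like the search response and that the default <em> tags are in use; matched_terms is a hypothetical helper name.

```python
import re


def matched_terms(hit, pre_tag="<em>", post_tag="</em>"):
    """Collect the distinct highlighted terms of one search hit, lowercased."""
    pattern = re.compile(re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag))
    terms = set()
    # "highlight" maps each field name to a list of fragment strings.
    for fragments in hit.get("highlight", {}).values():
        for fragment in fragments:
            terms.update(match.lower() for match in pattern.findall(fragment))
    return sorted(terms)


if __name__ == "__main__":
    hit = {
        "_id": "13",
        "highlight": {
            "focused_content": ["Morning <em>Stock</em> Highlights: Western Digital"]
        },
    }
    print(matched_terms(hit))
```

If you set custom pre_tags and post_tags in the highlight request, pass the same strings to the helper so the pattern still lines up.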

Elasticsearch index short words + make indexes applying EdgeNGram

I am using Elasticsearch with an EdgeNGram filter which is set as follows:
"edgeNGram": {
"type": "edgeNGram",
"min_gram": 3,
"max_gram": 15,
},
The problem is that when I make a query using very short words, they are completely omitted from the search. Let's say I type in "Vitamin C": this gives me results for the first term "Vitamin" only. Is there any way to tell Elasticsearch not to apply the EdgeNGram filter when indexing words shorter than 3 characters?
Thank you.
EDIT:
These are my settings:
ELASTICSEARCH_INDEX_SETTINGS = {
"settings": {
"analysis": {
"analyzer": {
"sk_hunspell": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"sk_lowercase", "sk_SK", "stopwords_SK",
"edgeNGram", "asciifolding",
"remove_duplicities",
]
},
},
"filter": {
"sk_SK": {
"type": "hunspell",
"locale": "sk_SK",
"dedup": True,
"recursion_level": 0,
"ignore_case": True,
},
"sk_lowercase": {
"type": "lowercase",
},
"stopwords_SK": {
"type": "stop",
"stopwords": STOPWORDS_SK,
},
"remove_duplicities": {
"type": "unique",
"only_on_same_position": True
},
"edgeNGram": {
"type": "edgeNGram",
"min_gram": 3,
"max_gram": 15,
"token_chars": ["letter", "digit"],
},
},
}
}
}
In the database I store information about vitamins, minerals and medicinal plants (their use, collecting, blooming, health benefits etc.). The information is written in Slovak (the names of the plants and minerals are also stored in Czech and Latin).
This idea may be a hack, but you could pad words shorter than 3 characters with a special character before inserting them into the index, so that they reach length 3.
When you accept the user's query, you would also have to pad their words shorter than three characters with the same special character.
You would need to create a custom tokenizer for this.
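If writing a custom tokenizer is too much, the padding could also be done in the application before the text is sent to Elasticsearch. The following Python sketch shows the idea; pad_short_words is a hypothetical helper, and the underscore is just an assumed padding character that you would need to check against your tokenizer and filters (it must survive analysis unchanged).

```python
def pad_short_words(text, min_len=3, pad_char="_"):
    """Right-pad every whitespace-separated word shorter than min_len."""
    return " ".join(word.ljust(min_len, pad_char) for word in text.split())


if __name__ == "__main__":
    # Apply the same function to documents at index time and to user queries.
    print(pad_short_words("Vitamin C"))
```

The same function has to be applied on both sides, otherwise the padded tokens in the index will never line up with what the user typed.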

Resources