elasticsearch: distinct token matching - search

tl;dr: I want to have a query that matches every token once at max
Given I have an elasticsearch index with the following words:
["stackoverflow", "overflow", "awesome", "some"]
Is there any elasticsearch query, that matches
"stackoverflow" and "awesome" on the sentence "stackoverflow community is awesome" and doesn't match "overflow" and "some"?
I can't do it with score only, because there's also a misspelling detection included.
What I'm searching for is something like a consuming matching. Unfortunately, I couldn't find anything suitable so far :(
Thanks!
more details:
The indexed documents look like this
{"name": "stackoverflow",
"type": "brand"},
{"name": "awesome",
"type": "descriptor"},
{"name": "overflow",
"type": "brand"},
{"name": "some",
"type": "descriptor"}
My query looks like this:
{
"min_score": 1,
"query": {
"match": {
"name": {
"query": "stakoverflow community is awesom",
"fuzziness": 2
}
}
},
"rescore": {
"window_size": 10,
"query": {
"rescore_query": {
"match": {
"name": "stakoverflow community is awesom"
}
},
"query_weight": 0.9,
"rescore_query_weight": 1.1
}
}
}
So I basically try to catch misspellings in the first query and prefer non-misspelling in the rescore.
What I'd like to achieve:
For each token, I'd like to have at most 1 match:
INPUT stakoverflow community is awesom
OUTPUT stackoverflow <nothing> <nothing> awesome
My problem is, that I also get overflow and some returned. Overflow might even have a better score than awesome, because its not a misspelling.

Related

How to query by array of objects in Contentful

I have an content type entry in Contentful that has fields like this:
"fields": {
"title": "How It Works",
"slug": "how-it-works",
"countries": [
{
"sys": {
"type": "Link",
"linkType": "Entry",
"id": "3S5dbLRGjS2k8QSWqsKK86"
}
},
{
"sys": {
"type": "Link",
"linkType": "Entry",
"id": "wHfipcJS6WUSaKae0uOw8"
}
}
],
"content": [
{
"sys": {
"type": "Link",
"linkType": "Entry",
"id": "72R0oUMi3uUGMEa80kkSSA"
}
}
]
}
I'd like to run a query that would only return entries if they contain a particular country.
I played around with this query:
https://cdn.contentful.com/spaces/aoeuaoeuao/entries?content_type=contentPage&fields.countries=3S5dbLRGjS2k8QSWqsKK86
However get this error:
The equals operator cannot be used on fields.countries.en-AU because it has type Object.
I'm playing around with postman, but will be using the .NET API.
Is it possible to search for entities, and filter on arrays that contain Objects?
Still learning the API, so I'm guessing it should be pretty straight forward.
Update:
I looked at the request the Contentful Web CMS makes, as this functionality is possible there. They use query params like this:
filters.0.key=fields.countries.sys.id&filters.0.val=3S5dbLRGjS2k8QSWqsKK86
However, this did not work in the delivery API, and might only be an internal query format.
Figured this out. I used the following URL:
https://cdn.contentful.com/spaces/aoeuaoeua/entries?content_type=contentPage&fields.countries.sys.id=wHfipcJS6WUSaKae0uOw8
Note the query parameter fields.countries.sys.id

Couchdb 2 _find query not using index

I'm struggling with something that should be easy but it's making no sense to me, I have these 2 documents in a database:
{ "name": "foo", "type": "typeA" },
{ "name": "bar", "type": "typeB" }
And I'm posting this to _find:
{
"selector": {
"type": "typeA"
},
"sort": ["name"]
}
Which works as expected but I get a warning that there's no matching index, so I've tried posting various combinations of the following to _index which makes no difference:
{
"index": {
"fields": ["type"]
}
}
{
"index": {
"fields": ["name"]
}
}
{
"index": {
"fields": ["name", "type"]
}
}
If I remove the sort by name and only index the type it works fine except it's not sorted, is this a limitation with couchdbs' mango implementation or am I missing something?
Using a view and map function works fine but I'm curious what mango is/isn't doing here.
With just the type index, I think it will normally be almost as efficient unless you have many documents of each type (as it has to do the sorting stage in memory.)
But since fields are ordered, it would be necessary to do:
{
"index": {
"fields": ["type", "name"]
}
}
to have a contiguous slice of this index for each type that is already ordered by name. But the query planner may not determine that this index applies.
As an example, the current pouchdb-find (which should be similar) needs the more complicated but equivalent query:
{
selector: {type: 'typeA', name: {$gte: null} },
sort: ['type','name']
}
to choose this index and build a plan that doesn't resort to building in memory for any step.

Elasticsearch query_string combined with match_phrase

I think it's best if I describe my intent and try to break it down to code.
I want users to have the ability of complex queries should they choose to that query_string offers. For example 'AND' and 'OR' and '~', etc.
I want to have fuzziness in effect, which has made me do things I feel dirty about like "#{query}~" to the sent to ES, in other words I am specifying fuzzy query on the user's behalf because we offer transliteration which could be difficult to get the exact spelling.
At times, users search a number of words that are suppose to be in a phrase. query_string searches them individually and not as a phrase. For example 'he who will' should bring me the top match to be when those three words are in that order, then give me whatever later.
Current query:
{
"indices_boost": {},
"aggregations": {
"by_ayah_key": {
"terms": {
"field": "ayah.ayah_key",
"size": 6236,
"order": {
"average_score": "desc"
}
},
"aggregations": {
"match": {
"top_hits": {
"highlight": {
"fields": {
"text": {
"type": "fvh",
"matched_fields": [
"text.root",
"text.stem_clean",
"text.lemma_clean",
"text.stemmed",
"text"
],
"number_of_fragments": 0
}
},
"tags_schema": "styled"
},
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"_source": {
"include": [
"text",
"resource.*",
"language.*"
]
},
"size": 5
}
},
"average_score": {
"avg": {
"script": "_score"
}
}
}
}
},
"from": 0,
"size": 0,
"_source": [
"text",
"resource.*",
"language.*"
],
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "inna alatheena",
"fuzziness": 1,
"fields": [
"text^1.6",
"text.stemmed"
],
"minimum_should_match": "85%"
}
}
],
"should": [
{
"match": {
"text": {
"query": "inna alatheena",
"type": "phrase"
}
}
}
]
}
}
}
Note: alatheena searched without the ~ will not return anything although I have allatheena in the indices. So I must do a fuzzy search.
Any thoughts?
I see that you're doing ES indexing of Qur'anic verses, +1 ...
Much of your problem domain, if I understood it correctly, can be solved simply by storing lots of transliteration variants (and permutations of their combining) in a separate field on your Aayah documents.
First off, you should make a char filter that replaces all double letters with single letters [aa] => [a], [ll] => [l]
Maybe also make a separate field containing all of [a, e, i] (because of their "vocative"/transcribal ambiguity) replaced with € or something similar, and do the same while querying in order to get as many matches as possible...
Also, TH in "allatheena" (which as a footnote may really be Dhaal, Thaa, Zhaa, Taa+Haa, Taa+Hhaa, Ttaa+Hhaa transcribed ...) should be replaced by something, or both the Dhaal AND the Thaa should be transcribed multiple times.
Then, because it's Qur'anic script, all Alefs without diacritics, Hamza, Madda, etc should be treated as Alef (or Hamzat) ul-Wasl, and that should also be considered when indexing / searching, because of Waqf / Wasl in reading arabic. (consider all the Wasl`s in the first Aayah of Surat Al-Alaq for example)
Dunno if this is answering your question in any way, but I hope it's of some assistance in implementing your application nontheless.
You should use Dis Max Query to achieve that.
A query that generates the union of documents produced by its
subqueries, and that scores each document with the maximum score for
that document as produced by any subquery, plus a tie breaking
increment for any additional matching subqueries.
This is useful when searching for a word in multiple fields with
different boost factors (so that the fields cannot be combined
equivalently into a single search field). We want the primary score to
be the one associated with the highest boost.
Quick example how to use it:
POST /_search
{
"query": {
"dis_max": {
"tie_breaker": 0.7,
"boost": 1.2,
"queries": [
{
"match": {
"text": {
"query": "inna alatheena",
"type": "phrase",
"boost": 5
}
}
},
{
"match": {
"text": {
"query": "inna alatheena",
"type": "phrase",
"fuzziness": "AUTO",
"boost": 3
}
}
},
{
"query_string": {
"default_field": "text",
"query": "inna alatheena"
}
}
]
}
}
}
It will run all of your queries, and the one, which scored highest compared to others, will be taken. So just define your rules using it. You should achieve what you wanted.

How to search through data with arbitrary amount of fields?

I have the web-form builder for science events. The event moderator creates registration form with arbitrary amount of boolean, integer, enum and text fields.
Created form is used for:
register a new member to event;
search through registered members.
What is the best search tool for second task (to search memebers of event)? Is ElasticSearch well for this task?
I wrote a post about how to index arbitrary data into Elasticsearch and then to search it by specific fields and values. All this, without blowing up your index mapping.
The post is here: http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch/
In short, you will need to do the following steps to get what you want:
Create a special index described in the post.
Flatten the data you want to index using the flattenData function:
https://gist.github.com/smnh/30f96028511e1440b7b02ea559858af4.
Create a document with the original and flattened data and index it into Elasticsearch:
{
"data": { ... },
"flatData": [ ... ]
}
Optional: use Elasticsearch aggregations to find which fields and types have been indexed.
Execute queries on the flatData object to find what you need.
Example
Basing on your original question, let's assume that the first event moderator created a form with following fields to register members for the science event:
name string
age long
sex long - 0 for male, 1 for female
In addition to this data, the related event probably has some sort of id, let's call it eventId. So the final document could look like this:
{
"eventId": "2T73ZT1R463DJNWE36IA8FEN",
"name": "Bob",
"age": 22,
"sex": 0
}
Now, before we index this document, we will flatten it using the flattenData function:
flattenData(document);
This will produce the following array:
[
{
"key": "eventId",
"type": "string",
"key_type": "eventId.string",
"value_string": "2T73ZT1R463DJNWE36IA8FEN"
},
{
"key": "name",
"type": "string",
"key_type": "name.string",
"value_string": "Bob"
},
{
"key": "age",
"type": "long",
"key_type": "age.long",
"value_long": 22
},
{
"key": "sex",
"type": "long",
"key_type": "sex.long",
"value_long": 0
}
]
Then we will wrap this data in a document as I've showed before and index it.
Then, the second event moderator, creates another form having a new field, field with same name and type, and also a field with same name but with different type:
name string
city string
sex string - "male" or "female"
This event moderator decided that instead of having 0 and 1 for male and female, his form will allow choosing between two strings - "male" and "female".
Let's try to flatten the data submitted by this form:
flattenData({
"eventId": "F1BU9GGK5IX3ZWOLGCE3I5ML",
"name": "Alice",
"city": "New York",
"sex": "female"
});
This will produce the following data:
[
{
"key": "eventId",
"type": "string",
"key_type": "eventId.string",
"value_string": "F1BU9GGK5IX3ZWOLGCE3I5ML"
},
{
"key": "name",
"type": "string",
"key_type": "name.string",
"value_string": "Alice"
},
{
"key": "city",
"type": "string",
"key_type": "city.string",
"value_string": "New York"
},
{
"key": "sex",
"type": "string",
"key_type": "sex.string",
"value_string": "female"
}
]
Then, after wrapping the flattened data in a document and indexing it into Elasticsearch we can execute complicated queries.
For example, to find members named "Bob" registered for the event with ID 2T73ZT1R463DJNWE36IA8FEN we can execute the following query:
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "flatData",
"query": {
"bool": {
"must": [
{"term": {"flatData.key": "eventId"}},
{"match": {"flatData.value_string.keyword": "2T73ZT1R463DJNWE36IA8FEN"}}
]
}
}
}
},
{
"nested": {
"path": "flatData",
"query": {
"bool": {
"must": [
{"term": {"flatData.key": "name"}},
{"match": {"flatData.value_string": "bob"}}
]
}
}
}
}
]
}
}
}
ElasticSearch automatically detects the field content in order to index it correctly, even if the mapping hasn't been defined previously. So, yes : ElasticSearch suits well these cases.
However, you may want to fine tune this behavior, or maybe the default mapping applied by ElasticSearch doesn't correspond to what you need : in this case, take a look at the default mapping or, for even further control, the dynamic templates feature.
If you let your end users decide the keys you store things in, you'll have an ever-growing mapping and cluster state, which is problematic.
This case and a suggested solution is covered in this article on common problems with Elasticsearch.
Essentially, you want to have everything that can possibly be user-defined as a value. Using nested documents, you can have a key-field and differently mapped value fields to achieve pretty much the same.

Elasticsearch term filter on inner object field not matching

I have just organized my document structure to have a more OO design (e.g. moved top level properties like venueId and venueName into a venue object with id and name fields).
However I can now not get a simple term filter working for fields on the child venue inner object.
Here is my mapping:
{
"deal": {
"properties": {
"textId": {"type":"string","name":"textId","index":"no"},
"displayId": {"type":"string","name":"displayId","index":"no"},
"active": {"name":"active","type":"boolean","index":"not_analyzed"},
"venue": {
"type":"object",
"path":"full",
"properties": {
"textId": {"type":"string","name":"textId","index":"not_analyzed"},
"regionId": {"type":"string","name":"regionId","index":"not_analyzed"},
"displayId": {"type":"string","name":"displayId","index":"not_analyzed"},
"name": {"type":"string","name":"name"},
"address": {"type":"string","name":"address"},
"area": {
"type":"multi_field",
"fields": {
"area": {"type":"string","index":"not_analyzed"},
"area_search": {"type":"string","index":"analyzed"}}},
"location": {"type":"geo_point","lat_lon":true}}},
"tags": {
"type":"multi_field",
"fields": {
"tags":{"type":"string","index":"not_analyzed"},
"tags_search":{"type":"string","index":"analyzed"}}},
"days": {
"type":"multi_field",
"fields": {
"days":{"type":"string","index":"not_analyzed"},
"days_search":{"type":"string","index":"analyzed"}}},
"value": {"type":"string","name":"value"},
"title": {"type":"string","name":"title"},
"subtitle": {"type":"string","name":"subtitle"},
"description": {"type":"string","name":"description"},
"time": {"type":"string","name":"time"},
"link": {"type":"string","name":"link","index":"no"},
"previewImage": {"type":"string","name":"previewImage","index":"no"},
"detailImage": {"type":"string","name":"detailImage","index":"no"}}}
}
Here is an example document:
GET /production/deals/wa-au-some-venue-weekends-some-deal
{
"_index":"some-index-v1",
"_type":"deals",
"_id":"wa-au-some-venue-weekends-some-deal",
"_version":1,
"exists":true,
"_source" : {
"id":"921d5fe0-8867-4d5c-81b4-7c1caf11325f",
"textId":"wa-au-some-venue-weekends-some-deal",
"displayId":"some-venue-weekends-some-deal",
"active":true,
"venue":{
"id":"46a7cb64-395c-4bc4-814a-a7735591f9de",
"textId":"wa-au-some-venue",
"regionId":"wa-au",
"displayId":"some-venue",
"name":"Some Venue",
"address":"sdgfdg",
"area":"Swan Valley & Surrounds"},
"tags":["Lunch"],
"days":["Saturday","Sunday"],
"value":"$1",
"title":"Some Deal",
"subtitle":"",
"description":"",
"time":"5pm - Late"
}
}
And here is an 'explain' test on that same document:
POST /production/deals/wa-au-some-venue-weekends-some-deal/_explain
{
"query": {
"filtered": {
"filter": {
"term": {
"venue.regionId": "wa-au"
}
}
}
}
}
{
"ok":true,
"_index":"some-index-v1",
"_type":"deals",
"_id":"wa-au-some-venue-weekends-some-deal",
"matched":false,
"explanation":{
"value":0.0,
"description":"ConstantScore(cache(venue.regionId:wa-au)) doesn't match id 0"
}
}
Is there any way to get more useful debugging info?
Is there something wrong with the explain result description? Simply saying "doesn't match id 0" does not really make sense to me... the field is called 'regionId' (not 'id') and the value is definitely not 0...???
That happens because the type you submitted the mapping for is called deal, while the type you indexed the document in is called deals.
If you look at the mapping for your type deals, you'll see that was automatically generated and the field venue.regionId is analyzed, thus you most likely have two tokens in your index: wa and au. Only searching for those tokens on that type you would get back that document.
Anything else looks just great! Only a small character is wrong ;)

Resources