Searching after indexing in ElasticSearch - search

I want to index 1 billion records. each record has 2 attributes (attribute1 and attribute2).
each record that has same value in attribute1 must be merge. for example, I have two record
attribute1 attribute2
1 4
1 6
my elastic document must be
{
"attribute1": "1"
"attribute2": "4,6"
}
due to huge amount of data, I must to read a bulk (about 1000 records) and merge them based on the above rule (in memory) and then search them in ElasticSearch and merge them with search result and then index/reindex them.
In summary I have to Search and Index per bulk respectively.
I implemented this rule but in some cases Elastic does not return all results and some documents have been indexed duplicately.
after each Index I Refresh ElasticSearch so that it be ready for next search. but in some case it doesn’t work.
my index setting is followed as:
{
"test_index": {
"settings": {
"index": {
"refresh_interval": "-1",
"translog": {
"flush_threshold_size": "1g"
},
"max_result_window": "1000000",
"creation_date": "1464577964635",
"store": {
"throttle": {
"type": "merge"
}
}
},
"number_of_replicas": "0",
"uuid": "TZOse2tLRqGk-vHRMGc2GQ",
"version": {
"created": "2030199"
},
"warmer": {
"enabled": "false"
},
"indices": {
"memory": {
"index_buffer_size": "40%"
}
},
"number_of_shards": "5",
"merge": {
"policy": {
"max_merge_size": "2g"
}
}
}
}
how can I resolve this problem?
Is there any other setting to handle this situation?

In your bulk commands, you need to use the index operation for the first occurence and then update with a script to update your attribute2 property:
{ "index" : { "_index" : "test_index", "_type" : "test_type", "_id" : "1" } }
{ "attribute1" : "1", "attribute2": [4] }
{ "update" : { "_index" : "test_index", "_type" : "test_type", "_id" : "1" } }
{ "script" : { "inline": "ctx._source.attribute2 += attr2", "params" : {"attr2" : 6}}}
After the first index operation your document will look like
{
"attribute1": "1"
"attribute2": [4]
}
After the second update operation, your document will look like
{
"attribute1": "1"
"attribute2": [4, 6]
}
Note that it is also possible to only use update operations with doc_as_upsert and script.

Related

Elasticsearch Search/filter by occurrence or order in an array

I am having a data field in my index in which,
I want only doc 2 as result i.e logically where b comes before
a in the array field data.
doc 1:
data = ['a','b','t','k','p']
doc 2:
data = ['p','b','i','o','a']
Currently, I am trying terms must on [a,b] then checking the order in another code snippet.
Please suggest any better way around.
My understanding is that the only way to do that would be to make use of Span Queries, however it won't be applicable on an array of values.
You would need to concatenate the values into a single text field with whitespace as delimiter, reingest the documents and make use of Span Near query on that field:
Please find the below mapping, sample document, the query and response:
Mapping:
PUT my_test_index
{
"mappings": {
"properties": {
"data":{
"type": "text"
}
}
}
}
Sample Documents:
POST my_test_index/_doc/1
{
"data": "a b"
}
POST my_test_index/_doc/2
{
"data": "b a"
}
Span Query:
POST my_test_index/_search
{
"query": {
"span_near" : {
"clauses" : [
{ "span_term" : { "data" : "a" } },
{ "span_term" : { "data" : "b" } }
],
"slop" : 0, <--- This means only `a b` would return but `a c b` won't.
"in_order" : true <--- This means a should come first and the b
}
}
}
Note that slop controls the maximum number of intervening unmatched positions permitted.
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.36464313,
"hits" : [
{
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.36464313,
"_source" : {
"data" : "a b"
}
}
]
}
}
Let me know if this helps!

Finding duplicates in Elasticsearch

I'm trying to find entries in my data which are equal in more than one aspect. I currently do this using a complex query which nests aggregations:
{
"size": 0,
"aggs": {
"duplicateFIELD1": {
"terms": {
"field": "FIELD1",
"min_doc_count": 2 },
"aggs": {
"duplicateFIELD2": {
"terms": {
"field": "FIELD2",
"min_doc_count": 2 },
"aggs": {
"duplicateFIELD3": {
"terms": {
"field": "FIELD3",
"min_doc_count": 2 },
"aggs": {
"duplicateFIELD4": {
"terms": {
"field": "FIELD4",
"min_doc_count": 2 },
"aggs": {
"duplicate_documents": {
"top_hits": {} } } } } } } } } } } }
This works to an extent as the result I get when no duplicates are found look something like this:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 27524067,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"duplicateFIELD1" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 27524027,
"buckets" : [
{
"key" : <valueFromField1>,
"doc_count" : 4,
"duplicateFIELD2" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
},
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
]
}
},
{
"key" : <valueFromField1>,
"doc_count" : 4,
"duplicateFIELD2" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
},
{
"key" : <valueFromField2>,
"doc_count" : 2,
"duplicateFIELD3" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : <valueFromField3>,
"doc_count" : 2,
"duplicateFIELD4" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
]
}
},
...
I'm skipping some of the output which looks rather similar.
I can now scan through this complex deeply nested data structure and find that no documents are stored in all of these nested buckets. But this seems rather cumbersome. I guess there might be a better (more straight-forward) way of doing this.
Also, if I want to check more than four fields, this nested structure will grow and grow and grow. So it does not scale very well and I want to avoid this.
Can I improve my solution so that I do get a simple list of all documents which are duplicates? (Maybe the ones which are duplicates of each other grouped together somehow.) or is there a completely different approach (such as without aggregation) which does not have the drawbacks I described here?
EDIT: I found an approach using the script feature of ES here, but in my version of ES this returns just an error message. Maybe someone can point out to me how to do it in ES 5.0? My trials up to now did not work.
EDIT: I found a way to use a script for my approach which uses the modern way (language "painless"):
{
"size": 0,
"aggs": {
"duplicateFOO": {
"terms": {
"script": {
"lang": "painless",
"inline": "doc['FIELD1'].value + doc['FIELD2'].value + doc['FIELD3'].value + doc['FIELD4'].value"
},
"min_doc_count": 2
}
}
}
}
This seems to work for very small amounts of data and results in an error for realistic amounts of data (circuit_breaking_exception: [request] Data too large, data for [<reused_arrays>] would be larger than limit of [6348236390/5.9gb]). Any idea on how I can fix this? Probably adjust some configuration of the ES to make it use larger internal buffers or similar?
There does not seem to be a proper solution for my situation which avoids the nesting in a general way.
Fortunately three of my four fields have a very limited value range; the first can only be 1 or 2, the second can be 1, 2, or 3 and the third can be 1, 2, 3, or 4. Since these are just 24 combinations I currently go with filtering one 24th out of the complete data set before applying the aggregation, then of just one (the remaining fourth field). I then have to apply all actions 24 times (once with each combination of the three limited fields mentioned above), but this is still more feasible than handling the complete data set at once.
The query (i. e. one of the 24 queries) I send now look something like this:
{
"size": 0,
"query": {
"bool": {
"must": [
{ "match": { "FIELD1": 2 } },
{ "match": { "FIELD2": 3 } },
{ "match": { "FIELD3": 4 } } ] } },
"aggs": {
"duplicateFIELD4": {
"terms": {
"field": "FIELD4",
"min_doc_count": 2 } } } }
The results for this of course are not nested anymore. But this cannot be done if more than one field holds arbitrary values of a larger range.
I also found out that, if nesting must be done, the fields with the most limited value range (e. g. just two values like "1 or 2") should be innermost, and the one with the largest value range should be outermost. This improves performance greatly (but still not enough in my case). Doing it wrong can let you end up with an unusable query (no response within hours, and finally an out of memory on the server side).
I now think that aggregating properly is the key to solve a problem like mine. The approach using a script to have a flat bucket list (as described in my question) is bound to overload the server as it cannot distribute the task in any way. In the case that no double is found at all, it has to hold a bucket for each document in memory (with just one document in it). Even if just a few doubles can be found, this cannot be done for larger data sets. If nothing else is possible, one will need to split the data set into groups artificially. E. g. one can create 16 sub-data sets by building a hash out of the relevant fields and use the last 4 bits to put the document in on of the 16 groups. Each group can then be handled separately; doubles are bound to fall into one group using this technique.
But independently from these general thoughts, the ES API should provide any means to paginate through the result of aggregations. It's a pity that there is no such option (yet).
Your last approach seems to be the best one. And you can update your elasticsearch settings as following:
indices.breaker.request.limit: "75%"
indices.breaker.total.limit: "85%"
I have chosen 75% because the default is 60% and it is 5.9gb in your elasticsearch and your query is becoming ~6.3gb which is around 71.1% based on your log.
circuit_breaking_exception: [request] Data too large, data for [<reused_arrays>] would be larger than limit of [6348236390/5.9gb]
And finally indices.breaker.total.limit must be greater than indices.breaker.fielddata.limit according to elasticsearch document.
An Idea that might work in a Logstash scenario is using copy fields:
Copy all combinations to a separate fields and concat them:
mutate {
add_field => {
"new_field" => "%{oldfield1} %{oldfield2}"
}
}
aggregate over the new field.
Have a look here: https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html
I don't know if add_field supports array (others do if you look at the documentation). If it does not you could try to add several new fields and use merge to have just one field.
If you can do this at index time it would certanly be better.
You only need the combinations (A_B) and not all Permutations (A_B, B_A)

Elastic.co/Elastic search - Relevance feedback with multiple Boosting Queries

I'm trying to implement relevance feedback for Elastic Search (Elastic.co).
I'm aware of boosting queries, which allow for the specification of postiive and negative terms, with the idea being to discount the negative terms, while not excluding them as would be the case in a boolean must_not.
However, I'm trying to achieve tiered boosting, of both positive and negative terms.
That is, I want to take a list of binned positive and negative terms and generate a query such that there are different positive and negative boost tiers, each containing their own query terms.
something like (pseudo query):
query{
{
terms: [very relevant terms]
pos_boost: 3
}
{
terms: [relevant terms]
pos_boost: 2
}
{
terms: [irrelevant terms]
neg_boost: 0.6
}
{
terms: [very irrelevant terms]
neg_boost: 0.3
}
}
My question is whether or not this can be achieved with nested boosting queries, or if I'm better off with multiple should clauses.
My concern is that I'm not sure if a boost of 0.2 in the should clause of a bool query still gives the document a positive increase in the score or not, as I want to discount the document, rather than provide any increase in score.
With boosting queries, the concern is that I can't control the degree to which positive terms are weighted.
Any help, or suggestions for other implementations, would be greatly appreciated. (What I really wanted to do was create a language model for relevant documents and use that to rank, but I don't see how that can easily be achieved in elastic.)
Seems that you can combine bool query and use boosting query clauses tweaking boost values.
POST so/boost/ {"text": "apple computers"}
POST so/boost/ {"text": "apple pie recipe"}
POST so/boost/ {"text": "apple tree garden"}
POST so/boost/ {"text": "apple iphone"}
POST so/boost/ {"text": "apple company"}
GET so/boost/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"text": "apple"
}
}
],
"should": [
{
"match": {
"text": {
"query": "pie",
"boost": 2
}
}
},
{
"match": {
"text": {
"query": "tree",
"boost": 2
}
}
},
{
"match": {
"text": {
"query": "iphone",
"boost": -0.5
}
}
}
]
}
}
}
Alternately, if you want to encode your language model into your collection at index-time, you can try the approach described here: Elasticsearch: Influence scoring with custom score field in document
To boost the elastic search document(priority based search query) based on custom/variable boost value at query time i.e. conditional boosting.
Java Coding example:
customerKeySearch = QueryBuilders.constantScoreQuery(QueryBuilders.termQuery(keys.type", "xxx"));
customerTypeSearch = QueryBuilders.constantScoreQuery(QueryBuilders.termQuery("keys.keyValues.value", "xxxx"));
keyValueQuery = QueryBuilders.boolQuery().must(customerKeySearch).must(customerTypeSearch).boost(2f);
customerKeySearch = QueryBuilders.constantScoreQuery(QueryBuilders.termQuery(keys.type", "xxx"));
customerTypeSearch = QueryBuilders.constantScoreQuery(QueryBuilders.termQuery("keys.keyValues.value", "xxxx"));
keyValueQuery = QueryBuilders.boolQuery().must(customerKeySearch).must(customerTypeSearch).boost(6f);
Description and search query:
elastic search has its internal score calculation technic so we need to disable this mechanism by setting disableCoord(true) property to true in java for BoleanQuery to apply custom boost effect.
Following Boolean query is running query for boosting the documents in elastic search index based on boost value.
{
"bool" : {
"should" : [ {
"bool" : {
"must" : [ {
"constant_score" : {
"query" : {
"term" : {
"keys.type" : "XXX"
}
}
}
}, {
"constant_score" : {
"query" : {
"term" : {
"keys.keyValues.value" : "XXXX"
}
}
}
} ],
"boost" : 2.0
}
}, {
"bool" : {
"must" : [ {
"constant_score" : {
"query" : {
"term" : {
"keys.type" : "XXX"
}
}
}
}, {
"constant_score" : {
"query" : {
"term" : {
"keys.keyValues.value" : "500072388315"
}
}
}
} ],
"boost" : 6.0
}
}, {
"bool" : {
"must" : [ {
"constant_score" : {
"query" : {
"term" : {
"keys.type" : "XXX"
}
}
}
}, {
"constant_score" : {
"query" : {
"term" : {
"keys.keyValues.value" : "XXXXXX"
}
}
}
} ],
"boost" : 10.0
}
} ],
"disable_coord" : true
}
}

Elasticsearch two level sort in aggregation list

Currently I am sorting aggregations by document score, so most relevant items come first in aggregation list like below:
{
'aggs' : {
'guilds' : {
'terms' : {
'field' : 'guilds.title.original',
'order' : [{'max_score' : 'desc'}],
'aggs' : {
'max_score' : {
'script' : 'doc.score'
}
}
}
}
}
}
I want to add another sort option to the order terms order array in my JSON. but when I do that like this :
{
'order' : [{'max_score' : 'desc'}, {"_count" : "desc"},
}
The second sort does not work. For example when all of the scores are equal it then should sort based on query but it does not work.
As a correction to Andrei's answer ... to order aggregations by multiple criteria, you MUST create an array as shown in Terms Aggregation: Order and you MUST be using ElasticSearch 1.5 or later.
So, for Andrei's answer, the correction is:
"order" : [ { "max_score": "desc" }, { "_count": "desc" } ]
As Andrei has it, ES will not complain but it will ONLY use the last item listed in the "order" element.
I don't know how your 'aggs' is even working because I tried it and I had parsing errors in three places: "order" is not allowed to have that array structure, your second "aggs" should be placed outside the first "terms" aggs and, finally, the "max_score" aggs should have had a "max" type of "aggs". In my case, to make it work (and it does actually order properly), it should look like this:
"aggs": {
"guilds": {
"terms": {
"field": "guilds.title.original",
"order": {
"max_score": "desc",
"_count": "desc"
}
},
"aggs": {
"max_score": {
"max": {
"script": "doc.score"
}
}
}
}
}

Query all unique values of a field with Elasticsearch

How do I search for all unique values of a given field with Elasticsearch?
I have such a kind of query like select full_name from authors, so I can display the list to the users on a form.
You could make a terms facet on your 'full_name' field. But in order to do that properly you need to make sure you're not tokenizing it while indexing, otherwise every entry in the facet will be a different term that is part of the field content. You most likely need to configure it as 'not_analyzed' in your mapping. If you are also searching on it and you still want to tokenize it you can just index it in two different ways using multi field.
You also need to take into account that depending on the number of unique terms that are part of the full_name field, this operation can be expensive and require quite some memory.
For Elasticsearch 1.0 and later, you can leverage terms aggregation to do this,
query DSL:
{
"aggs": {
"NAME": {
"terms": {
"field": "",
"size": 10
}
}
}
}
A real example:
{
"aggs": {
"full_name": {
"terms": {
"field": "authors",
"size": 0
}
}
}
}
Then you can get all unique values of authors field.
size=0 means not limit the number of terms(this requires es to be 1.1.0 or later).
Response:
{
...
"aggregations" : {
"full_name" : {
"buckets" : [
{
"key" : "Ken",
"doc_count" : 10
},
{
"key" : "Jim Gray",
"doc_count" : 10
},
]
}
}
}
see Elasticsearch terms aggregations.
Intuition:
In SQL parlance:
Select distinct full_name from authors;
is equivalent to
Select full_name from authors group by full_name;
So, we can use the grouping/aggregate syntax in ElasticSearch to find distinct entries.
Assume the following is the structure stored in elastic search :
[{
"author": "Brian Kernighan"
},
{
"author": "Charles Dickens"
}]
What did not work: Plain aggregation
{
"aggs": {
"full_name": {
"terms": {
"field": "author"
}
}
}
}
I got the following error:
{
"error": {
"root_cause": [
{
"reason": "Fielddata is disabled on text fields by default...",
"type": "illegal_argument_exception"
}
]
}
}
What worked like a charm: Appending .keyword with the field
{
"aggs": {
"full_name": {
"terms": {
"field": "author.keyword"
}
}
}
}
And the sample output could be:
{
"aggregations": {
"full_name": {
"buckets": [
{
"doc_count": 372,
"key": "Charles Dickens"
},
{
"doc_count": 283,
"key": "Brian Kernighan"
}
],
"doc_count": 1000
}
}
}
Bonus tip:
Let us assume the field in question is nested as follows:
[{
"authors": [{
"details": [{
"name": "Brian Kernighan"
}]
}]
},
{
"authors": [{
"details": [{
"name": "Charles Dickens"
}]
}]
}
]
Now the correct query becomes:
{
"aggregations": {
"full_name": {
"aggregations": {
"author_details": {
"terms": {
"field": "authors.details.name"
}
}
},
"nested": {
"path": "authors.details"
}
}
},
"size": 0
}
Working for Elasticsearch 5.2.2
curl -XGET http://localhost:9200/articles/_search?pretty -d '
{
"aggs" : {
"whatever" : {
"terms" : { "field" : "yourfield", "size":10000 }
}
},
"size" : 0
}'
The "size":10000 means get (at most) 10000 unique values. Without this, if you have more than 10 unique values, only 10 values are returned.
The "size":0 means that in result, "hits" will contain no documents. By default, 10 documents are returned, which we don't need.
Reference: bucket terms aggregation
Also note, according to this page, facets have been replaced by aggregations in Elasticsearch 1.0, which are a superset of facets.
The existing answers did not work for me in Elasticsearch 5.X, for the following reasons:
I needed to tokenize my input while indexing.
"size": 0 failed to parse because "[size] must be greater than 0."
"Fielddata is disabled on text fields by default." This means by default you cannot search on the full_name field. However, an unanalyzed keyword field can be used for aggregations.
Solution 1: use the Scroll API. It works by keeping a search context and making multiple requests, each time returning subsequent batches of results. If you are using Python, the elasticsearch module has the scan() helper function to handle scrolling for you and return all results.
Solution 2: use the Search After API. It is similar to Scroll, but provides a live cursor instead of keeping a search context. Thus it is more efficient for real-time requests.

Resources