dedup in elasticsearch for duplicate documents

How can duplicate documents be found and removed using a dedup option?
How can this be achieved at index creation time?
We have tried this, but only at query time. Below is a sample:
GET /employeeid/info/_search
{
"size": 0,
"aggs": {
"duplicateCount": {
"terms": {
"field": "name",
"min_doc_count": 2
},
"aggs": {
"duplicateDocuments": {
"top_hits": {}
}
}
}
}
}
Can we achieve the same at indexing time?

Related

How to perform sub aggregation that will calculate fields with no value per bucket?

Currently building the following Elasticsearch 6.8 query/aggregation:
{
"sort": [
{
"DateCreated": {
"order": "desc"
}
}
],
"query": {
"bool": {
"must": [
{
"match": {
"InternalEntityId": "ExampleValue1111"
}
},
{
"match": {
"Direction": "Inbound"
}
}
]
}
},
"aggs": {
"top_ext": {
"terms": {
"field": "ExternalAddress.keyword"
},
"aggs": {
"top_date": {
"top_hits": {
"sort": [
{
"DateCreated": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
How do we do the following in the same search:
Count, per bucket, the hits that have no value for a field (a must_not exists style query).
Ideally, each bucket returned by the top_ext aggregation would carry a count of its records that have no value.
Thanks!

You can do two things here:
1. Sort the "top_ext" terms aggregation buckets in ascending order of doc count and use the first n zero-size buckets.
2. Apply a bucket_selector aggregation alongside the top_hits sub-aggregation so that only buckets with a zero doc count are kept.
Here is a query DSL that uses both approaches. (You can plug in the other required elements of the query; I have focused mainly on the aggregation part here.)
GET kibana_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"outer": {
"terms": {
"field": "products.category.keyword",
"size": 10,
"order": {
"_count": "asc"
}
},
"aggs": {
"inner": {
"top_hits": {
"size": 10
}
},
"restrictedBuckets": {
"bucket_selector": {
"buckets_path": {
"docCount": "_count"
},
"script": "params.docCount<1"
}
}
}
}
}
}
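If the goal is the one stated in the question, a per-bucket count of documents that have no value for some field, a `missing` sub-aggregation is another option worth trying. This is a different technique from the bucket_selector query above, not a restatement of it. The sketch below only builds the request body as a Python dict; the field name `SomeField` and the aggregation name `no_value` are hypothetical placeholders, not names from the original post.

```python
from typing import Any, Dict


def build_missing_count_query(bucket_field: str, missing_field: str) -> Dict[str, Any]:
    """Build a search body: terms buckets, each with a count of docs lacking missing_field."""
    return {
        "size": 0,
        "aggs": {
            "top_ext": {
                "terms": {"field": bucket_field},
                "aggs": {
                    # "no_value" is a hypothetical name. The missing aggregation
                    # counts the documents in each bucket that have no value
                    # for the given field.
                    "no_value": {"missing": {"field": missing_field}},
                },
            }
        },
    }


# "SomeField" is an assumed field name used only for illustration.
body = build_missing_count_query("ExternalAddress.keyword", "SomeField")
```

With a body like this, each bucket in the response should carry a `no_value.doc_count` next to its own `doc_count`. A top_hits sub-aggregation can sit beside `no_value` in the same `aggs` block.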

Need pagination on aggregation grouping in Elasticsearch

We are applying aggregation and grouping and need pagination for the results.
let body = {
size: item_per_page,
"query": {
"bool": {
"must": [{
"terms": {
"log_action_master_id": action_type
}
}, {
"match": {
[search_by]: searchParams.user_id
}
}, {
"match": {
"unit_id": searchParams.unit_id
}
},
{
"range": {
[search_date]: {
gte: from,
lte: to
}
}
}
]
}
},
"aggs": {
"group": {
"terms": {
"field": "id",
"size": item_per_page,
"order": { "_key": sortdirction }
},
},
"types_count": {
"value_count": {
"field": "id.keyword"
}
},
},
};

You can use the options below:
Composite aggregation: combines multiple sources into a single set of buckets and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You fetch "n" records, then pass the returned after_key as "after" and fetch the next "n" records.
GET index22/_search
{
"size": 0,
"aggs": {
"ValueCount": {
"value_count": {
"field": "id.keyword"
}
},
"pagination": {
"composite": {
"size": 2,
"sources": [
{
"TradeRef": {
"terms": {
"field": "id.keyword"
}
}
}
]
}
}
}
}
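The linear after_key paging described for the composite aggregation can be sketched as a small Python generator. This is a minimal sketch under assumptions: `search` is a hypothetical callable that takes a request body and returns the parsed response as a dict (for example a thin wrapper around whatever client you use), and the source name `key` is arbitrary.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def composite_pages(search: SearchFn, field: str, page_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield one list of buckets per page, following after_key until exhausted."""
    after: Optional[Dict[str, Any]] = None
    while True:
        composite: Dict[str, Any] = {
            "size": page_size,
            "sources": [{"key": {"terms": {"field": field}}}],
        }
        if after is not None:
            # The after_key of the previous response goes into "after".
            composite["after"] = after
        body = {"size": 0, "aggs": {"pagination": {"composite": composite}}}
        result = search(body)["aggregations"]["pagination"]
        buckets = result.get("buckets", [])
        if not buckets:
            return
        yield buckets
        after = result.get("after_key")
        if after is None:
            return
```

Because each request depends on the previous response, pages can only be walked in order, which matches the limitation noted above.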
Include partition: groups the field’s values into a number of partitions at query time and processes only one partition in each request. Terms are distributed evenly across the partitions, so you must know the number of terms beforehand. You can use a cardinality aggregation to get that count.
GET index22/_search
{
"size": 0,
"aggs": {
"TradeRef": {
"terms": {
"field": "id.keyword",
"include": {
"partition": 0,
"num_partitions": 3
}
}
}
}
}
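Choosing num_partitions from a cardinality count can be sketched as follows. This is only an illustration: `cardinality` is assumed to come from a separate cardinality aggregation (which is itself approximate), the aggregation name `page` is hypothetical, and `size` should leave some headroom because the partitions are only roughly equal in size.

```python
import math
from typing import Any, Dict, Iterator


def partition_requests(field: str, cardinality: int, terms_per_partition: int,
                       size: int) -> Iterator[Dict[str, Any]]:
    """Yield one terms-aggregation request body per partition."""
    if terms_per_partition < 1:
        raise ValueError("terms_per_partition must be at least 1")
    num_partitions = max(1, math.ceil(cardinality / terms_per_partition))
    for partition in range(num_partitions):
        yield {
            "size": 0,
            "aggs": {
                "page": {
                    "terms": {
                        "field": field,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        # Should exceed the expected number of terms per partition.
                        "size": size,
                    }
                }
            },
        }


requests = list(partition_requests("id.keyword", cardinality=25,
                                   terms_per_partition=10, size=20))
```

Each request body is then sent on its own, and the client works through the partitions one at a time.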
Bucket sort aggregation: sorts the buckets of its parent multi-bucket aggregation. Each bucket may be sorted based on its _key, _count, or its sub-aggregations. It only applies to buckets returned from the parent aggregation, so you will need to set the terms size to 10,000 (the max value) and truncate the buckets in bucket_sort. You can paginate using from and size just like in a query. If you have more than 10,000 terms you won't be able to use it, since it only selects from the buckets returned by the terms aggregation.
GET index22/_search
{
"size": 0,
"aggs": {
"valueCount":{
"value_count": {
"field": "TradeRef.keyword"
}
},
"TradeRef": {
"terms": {
"field": "TradeRef.keyword",
"size": 10000
},
"aggs": {
"my_bucket": {
"bucket_sort": {
"sort": [
{
"_key": {
"order": "asc"
}
}
],
"from": 2,
"size": 1
}
}
}
}
}
}
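The from and size values for a given page follow the usual offset arithmetic. A minimal sketch, assuming 1-based page numbers (the helper name is hypothetical):

```python
from typing import Dict


def bucket_sort_window(page: int, per_page: int) -> Dict[str, int]:
    """Return the from/size pair for bucket_sort, given a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must both be at least 1")
    return {"from": (page - 1) * per_page, "size": per_page}


# Page 3 with one bucket per page gives the same window as the query above.
window = bucket_sort_window(page=3, per_page=1)
```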
In terms of performance, composite aggregation is the better choice.

Is there a Group BY function for finding result with elastic search query?

I have tried to implement a group-by with Elasticsearch, but I didn't get the expected result. Please help me fix this issue. The indexed data is:
data = [
{ "fruit":"apple", "taste":5, "timestamp":100},
{ "fruit":"pear", "taste":5, "timestamp":110},
{ "fruit":"apple", "taste":4, "timestamp":200},
{ "fruit":"pear", "taste":8, "timestamp":90},
{ "fruit":"banana", "taste":5, "timestamp":100}]
My query is,
myQuery = {"query": {
"match_all": {}
},
"aggs": {
"group_by_fruit": {
"terms": {
"field": "fruit.keyword"
},
}
}
}
It shows all 5 records in the output, but I need to get only 3 records. The expected result is:
[
{ "fruit":"apple", "taste":4, "timestamp":200},
{ "fruit":"pear", "taste":8, "timestamp":90},
{ "fruit":"banana", "taste":5, "timestamp":100}]

If you want to get, for each distinct fruit value, the document with the largest timestamp, you should use a top_hits aggregation.
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"top_tags": {
"terms": {
"field": "fruit.keyword",
"size": <MAX_NUMBER_OF_DISTINCT_FRUITS>
},
"aggs": {
"group_by_fruit": {
"top_hits": {
"sort": [
{
"timestamp": {
"order": "desc"
}
}
],
"size" : 1
}
}
}
}
}
}
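Since the query sets size to 0, the documents come back inside the aggregation buckets rather than in the top-level hits. A minimal sketch of pulling one document per bucket out of the parsed response, assuming the aggregation names used in the query above (`top_tags` and `group_by_fruit`):

```python
from typing import Any, Dict, List


def latest_per_group(response: Dict[str, Any],
                     terms_agg: str = "top_tags",
                     hits_agg: str = "group_by_fruit") -> List[Dict[str, Any]]:
    """Collect the _source of the first top_hits entry from every terms bucket."""
    docs = []
    for bucket in response["aggregations"][terms_agg]["buckets"]:
        hits = bucket[hits_agg]["hits"]["hits"]
        if hits:
            docs.append(hits[0]["_source"])
    return docs
```

The function takes the response as a plain dict, so it does not depend on any particular client library.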

Elasticsearch : Alternate approach to get frequency count without using aggregation

We are trying to get the frequency count for search terms using aggregations. Since there are three keys for which we need the frequency count, search performance degrades. How can we get the frequency count without aggregations? Please suggest an alternative approach.
Query:
{
"aggs": {
"name_exct": {
"filter": {
"term": {
"name_exct": "test"
}
},
"aggs": {
"name_exct_count": {
"terms": {
"field": "name_exct"
}
}
}
},
"CITY": {
"filter": {
"term": {
"CITY": "US"
}
},
"aggs": {
"CITY_count": {
"terms": {
"field": "CITY"
}
}
}
}
}
}

Division of two fields in Elasticsearch

Currently I am trying to group documents by one field and then get the sum of other fields for each group. I want a new value that is the division of the summed fields. My current query is shown below.
In my query I aggregate on the field "a_name" and sum "spend" and "gain". I want a new field that is the ratio of the sums (spend/gain).
I tried adding a script but I am getting NaN. To use scripts at all, I first had to enable them in the elasticsearch.yml file:
script.engine.groovy.inline.aggs: on
Query
GET /index1/table1/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"query_string": {
"query": "*",
"analyze_wildcard": true
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"account_id": 29
}
}
],
"must_not": []
}
}
}
},
"aggs": {
"custom_name": {
"terms": {
"field": "a_name"
},
"aggs": {
"spe": {
"sum": {
"field": "spend"
}
},
"gained": {
"sum": {
"field": "gain"
}
},
"rati": {
"sum": {
"script": "doc['spend'].value/doc['gain'].value"
}
}
}
}
}
}
This particular query shows 'NaN' in the output. If I replace the division with multiplication, the query works.
Essentially, what I am looking for is to divide my two aggregations, "spe" and "gained".
Thanks!

It is possible that doc.gain is 0 in some of your documents. You may try changing the script to this instead:
"script": "doc['gain'].value != 0 ? doc['spend'].value / doc['gain'].value : 0"
UPDATE
If you want to compute the ratio of the results of two other metric aggregations, you can do so using a bucket_script aggregation (only available from ES 2.0 onward, though).
{
...
"aggs": {
"custom_name": {
"terms": {
"field": "a_name"
},
"aggs": {
"spe": {
"sum": {
"field": "spend"
}
},
"gained": {
"sum": {
"field": "gain"
}
},
"ratio": {
"bucket_script": {
"buckets_path": {
"totalSpent": "spe",
"totalGained": "gained"
},
"script": "totalSpent / totalGained"
}
}
}
}
}
}
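The two scripts in this answer compute different quantities: the per-document script sums spend/gain over the documents, while bucket_script divides the summed spend by the summed gain. A small arithmetic sketch in plain Python makes the difference concrete. The rows are made-up (spend, gain) pairs, not data from the question, and the zero guards are an assumption about how you might want to handle empty or zero gain.

```python
from typing import List, Optional, Tuple

Row = Tuple[float, float]  # (spend, gain)


def sum_of_ratios(rows: List[Row]) -> float:
    """Per-document division, then sum; a zero gain contributes 0, as in the guarded script."""
    return sum(spend / gain if gain != 0 else 0.0 for spend, gain in rows)


def ratio_of_sums(rows: List[Row]) -> Optional[float]:
    """Sum each field first, then divide once; None when the total gain is 0."""
    total_spend = sum(spend for spend, _ in rows)
    total_gain = sum(gain for _, gain in rows)
    if total_gain == 0:
        return None
    return total_spend / total_gain


rows = [(10.0, 2.0), (30.0, 10.0)]  # hypothetical values for one bucket
per_document = sum_of_ratios(rows)  # 10/2 + 30/10
per_bucket = ratio_of_sums(rows)    # (10 + 30) / (2 + 10)
```

For these rows the first value is 8.0 and the second is about 3.33, so only the second corresponds to "divide my two aggregations". The total gain of a bucket can also be 0, in which case a bucket_script division needs the same kind of guard.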
