Faceted Search and Aggregations Do Not Work Properly - search

I have an index for a Store with a type for Items, i.e. /store/items.
among other properties, Items have a Title (analyzed text), a
Description (analyzed text), and Tags (not_analyzed text).
I want to be able to show the facets over Tags with counts, so if a
facet of the Tag "Yellow" has a count of 12, for example, then when the
user adds that Tag to the filter she will see only 12 items.
I am using a Filtered Query with Aggs, as shown below, on Elasticsearch
1.1.1 on a single node:
GET _search {
"query": {
"filtered": {
"query": {
"multi_match": {
"query": "Large Widgets",
"fields": [
"title^3",
"description"
]
}
},
"filter": {
"terms": {
"tags": [
"Colorful"
],
"execution": "and"
}
}
}
},
"aggs": {
"available_tags": {
"terms": {
"field": "tags"
}
},
"size": 20
}
}
I have two problems:
No matter what value I pass for the aggs/size I get 10 aggregations.
I want to get more than 10.
The hits count that comes back when adding the new tag to the filter
doesn't match the doc_count that came with the aggregations, for
example, the aggregations might show a doc_count of 12 for the tag
"Yellow", but if I add "Yellow" to the filter terms so that it reads
"tags": [ "Colorful", "Yellow" ]I get 17 hits instead of the expected 12.
This usually does not happen at the first level, but only in subsequent drill down.
am I doing something wrong? is there a bug somewhere?
this is a cross post from the Elasticsearch mailing list which didn't get enough attention

Shard_size cannot be smaller than the size, so you may have a larger size than shard_size. In which case Elasticsearch will override it to be equal to shard_size.
Does the filtered results for "Colorful" "and" "Yellow" equal 17 documents instead of the 12 "Yellow" documents?

Related

Moving specific Index Data into a new Index within Elasticsearch

I have several million docs, that I need to move into a new index, but there is a condition on which docs should flow into the index. Say I have a field named, offsets, that needs to be queried against. The values I need to query for are: [1,7,99,32, ....., 10000432] (very large list) in the offset field..
Does anyone have thoughts on how I can move the specific docs, with those values in the list into a new elasticsearch index.? My first though was reindexing with a query, but there is no pattern for the offsets list..
Would it be a python loop appending each doc to a new index? Looking for any guidance.
Thanks
Are the documents really large, or can you add them into an jsonl file for bulk ingestion?
In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?
I'd do it in Pandas, but here is an idea in ES parlance.
Whatever you do, do use the _bulk API, or the job will never finish.
You can run a query based upon as file as per
GET my_index/_search?_file="myquery_file"
You can put all the ids into a file, myquery_file, as below:
{
"query": {
"ids" : {
"values" : ["1", "4", "100"]
}
},
"format": "jsonl"
}
and output as jsonl to ingest.
You can do the above for the reindex API.
{
"source": {
"index": "source",
**"query": {
"match": {
"company": "cat"
}
}**
},
"dest": {
"index": "dest",
"routing": "=cat"
}
}
Unfortunately,
I was facing a time crunch, and had to throw in a personalized loop to query a very specific subset of indices..
df = pd.read_csv('C://code//part_1_final.csv')
offsets = df['OFFSET'].tolist()
# Offsets are the "unique" values I need to identify the docs by.. There is no pattern in these values, thus I must go one by one..
missedDocs = []
for i in offsets:
print(i)
try:
client.reindex({
"source": {
"index": "<source_index>,
"query": {
"bool": {
"must": [
{ "match" : {"<index_filed_1>": "1" }},
{ "match" : {"<index_with_that_needs_values_to_match": i }}
]
}
}
},
"dest": {
"index": "<dest_index>"
}
})
except KeyError:
print('error')
#missedDocs.append(query)
print('DOC ERROR')

How to include two analyzers into a single SEARCH statement?

I have a feeds collection with documents like this:
{
"created": 1510000000,
"find": [
"title of the document",
"body of the document"
],
"filter": [
"/example.com",
"-en"
]
}
created contains an epoch timestamp
find contains an array of fulltext snippets, e.g. the title and the body of a text
filter is an array with further search tokens, such as hashtags, domains, locales
Problem is that find contains fulltext snippets, which we want to tokenize, e.g. with a text analyzer, but filter contains final tokens which we want to compare as a whole, e.g. with the identity analyzer.
Goal is to combine find and filter into a single custom analyzer or to combine two analyzers using two SEARCH statements or something to that end.
I did manage to query by either find or by filter successfully, but do not manage to query by both. This is how I query by filter:
I created a feeds_search view:
{
"writebufferIdle": 64,
"type": "arangosearch",
"links": {
"feeds": {
"analyzers": [
"identity"
],
"fields": {
"find": {},
"filter": {},
"created": {}
},
"includeAllFields": false,
"storeValues": "none",
"trackListPositions": false
}
},
"consolidationIntervalMsec": 10000,
"writebufferActive": 0,
"primarySort": [],
"writebufferSizeMax": 33554432,
"consolidationPolicy": {
"type": "tier",
"segmentsBytesFloor": 2097152,
"segmentsBytesMax": 5368709120,
"segmentsMax": 10,
"segmentsMin": 1,
"minScore": 0
},
"cleanupIntervalStep": 2,
"commitIntervalMsec": 1000,
"id": "362444",
"globallyUniqueId": "hD6FBD6EE239C/362444"
}
and I created a sample query:
FOR feed IN feeds_search
SEARCH ANALYZER(feed.created < 9990000000 AND feed.created > 1500000000
AND (feed.find == "title of the document")
AND (feed.`filter` == "/example.com" OR feed.`filter` == "-uk"), "identity")
SORT feed.created
LIMIT 20
RETURN feed
The sample query works, because find contains the full text (identity analyzer). As soon as I switch to a text analyzer, single word tokens work for find, but filter no longer works.
I tried using a combination of SEARCH and FILTER, which gives me the desired result, but I assume it probably performs worse than having the SEARCH analyzer do the whole thing. I see that analyzers is an array in the view syntax, but I seem not to be able to set individual fields for each analyzer.
The analyzers can be added as a property to each field in fields. What is specified in analyzers is the default and is used in case a more specific analyzer is not set for a given field.
"analyzers": [
"identity"
],
"fields": {
"find": {
"analyzers": [
"text_en"
]
},
"filter": {},
"created": {}
},
Credits: Simran at ArangoDB

How to search through data with arbitrary amount of fields?

I have the web-form builder for science events. The event moderator creates registration form with arbitrary amount of boolean, integer, enum and text fields.
Created form is used for:
register a new member to event;
search through registered members.
What is the best search tool for second task (to search memebers of event)? Is ElasticSearch well for this task?
I wrote a post about how to index arbitrary data into Elasticsearch and then to search it by specific fields and values. All this, without blowing up your index mapping.
The post is here: http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch/
In short, you will need to do the following steps to get what you want:
Create a special index described in the post.
Flatten the data you want to index using the flattenData function:
https://gist.github.com/smnh/30f96028511e1440b7b02ea559858af4.
Create a document with the original and flattened data and index it into Elasticsearch:
{
"data": { ... },
"flatData": [ ... ]
}
Optional: use Elasticsearch aggregations to find which fields and types have been indexed.
Execute queries on the flatData object to find what you need.
Example
Basing on your original question, let's assume that the first event moderator created a form with following fields to register members for the science event:
name string
age long
sex long - 0 for male, 1 for female
In addition to this data, the related event probably has some sort of id, let's call it eventId. So the final document could look like this:
{
"eventId": "2T73ZT1R463DJNWE36IA8FEN",
"name": "Bob",
"age": 22,
"sex": 0
}
Now, before we index this document, we will flatten it using the flattenData function:
flattenData(document);
This will produce the following array:
[
{
"key": "eventId",
"type": "string",
"key_type": "eventId.string",
"value_string": "2T73ZT1R463DJNWE36IA8FEN"
},
{
"key": "name",
"type": "string",
"key_type": "name.string",
"value_string": "Bob"
},
{
"key": "age",
"type": "long",
"key_type": "age.long",
"value_long": 22
},
{
"key": "sex",
"type": "long",
"key_type": "sex.long",
"value_long": 0
}
]
Then we will wrap this data in a document as I've showed before and index it.
Then, the second event moderator, creates another form having a new field, field with same name and type, and also a field with same name but with different type:
name string
city string
sex string - "male" or "female"
This event moderator decided that instead of having 0 and 1 for male and female, his form will allow choosing between two strings - "male" and "female".
Let's try to flatten the data submitted by this form:
flattenData({
"eventId": "F1BU9GGK5IX3ZWOLGCE3I5ML",
"name": "Alice",
"city": "New York",
"sex": "female"
});
This will produce the following data:
[
{
"key": "eventId",
"type": "string",
"key_type": "eventId.string",
"value_string": "F1BU9GGK5IX3ZWOLGCE3I5ML"
},
{
"key": "name",
"type": "string",
"key_type": "name.string",
"value_string": "Alice"
},
{
"key": "city",
"type": "string",
"key_type": "city.string",
"value_string": "New York"
},
{
"key": "sex",
"type": "string",
"key_type": "sex.string",
"value_string": "female"
}
]
Then, after wrapping the flattened data in a document and indexing it into Elasticsearch we can execute complicated queries.
For example, to find members named "Bob" registered for the event with ID 2T73ZT1R463DJNWE36IA8FEN we can execute the following query:
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "flatData",
"query": {
"bool": {
"must": [
{"term": {"flatData.key": "eventId"}},
{"match": {"flatData.value_string.keyword": "2T73ZT1R463DJNWE36IA8FEN"}}
]
}
}
}
},
{
"nested": {
"path": "flatData",
"query": {
"bool": {
"must": [
{"term": {"flatData.key": "name"}},
{"match": {"flatData.value_string": "bob"}}
]
}
}
}
}
]
}
}
}
ElasticSearch automatically detects the field content in order to index it correctly, even if the mapping hasn't been defined previously. So, yes : ElasticSearch suits well these cases.
However, you may want to fine tune this behavior, or maybe the default mapping applied by ElasticSearch doesn't correspond to what you need : in this case, take a look at the default mapping or, for even further control, the dynamic templates feature.
If you let your end users decide the keys you store things in, you'll have an ever-growing mapping and cluster state, which is problematic.
This case and a suggested solution is covered in this article on common problems with Elasticsearch.
Essentially, you want to have everything that can possibly be user-defined as a value. Using nested documents, you can have a key-field and differently mapped value fields to achieve pretty much the same.

Elastic Search filtering in facets

I want to simulate a parent child relation in elastic search and perform some analytics work over it. My use case is something like this
I have a shop owner like this
"_source": {
"shopId": 5,
"distributorId": 4,
"stateId": 1,
"partnerId": 2,
}
and now have child records (for each day) like this:
"_source": {
"shopId": 5,
"date" : 2013-11-13,
"transactions": 150,
"amount": 1980,
}
The parent is a record per store, while the child is the transactions each store does for
day. Now I want to do some complex query like
Find out total transaction for each day for the last 30 days where distributor is 5
POST /newdb/shopsDaily/_search
{
"query": {
"match_all": {}
},
"filter": {
"has_parent": {
"type": "shop",
"query": {
"match": {
"distributorId": "5"
}
}
}
},
"facets": {
"date": {
"histogram": {
"key_field": "date",
"value_field": "transactions",
"interval": 100
}
}
}
}
But the result I get do not take the filtering into account which I applied.
So I changed the query to this:
POST /newdb/shopDaily/_search
{
"query": {"filtered": {
"query": {"match_all": {}},
"filter": { "has_parent": {
"type": "shop",
"query": {"match": {
"distributorId": "13"
}}
}}
}},
"facets": {
"date": {
"histogram": {
"key_field": "date",
"value_field": "transactions",
"interval": 100
}
}
}
}
And then the final histogram facet took filtering into count.
So, when I browsed though I found out this is due to using filtered(which can only be used inside query clause and not outside like filter) rather than filter,
but it also mentioned that to have fast search you should use filter. Will searching as I did in second step (when I used filtered instead of filter) effect the performance of elastic search? If so, how can I make my facets honor filters and not effect the performance?
Thanks for you time
filters in Filtered query (filters in query clause) are cached, hence faster. These type of filters affect both search result and facet counts.
Filters outside the query clause are not considered during facet calculations. They are considered only for search results. Facet is calculated only on the query clause. If you want filtered facets then you need to set filters to each of the facet clauses.

Query all unique values of a field with Elasticsearch

How do I search for all unique values of a given field with Elasticsearch?
I have such a kind of query like select full_name from authors, so I can display the list to the users on a form.
You could make a terms facet on your 'full_name' field. But in order to do that properly you need to make sure you're not tokenizing it while indexing, otherwise every entry in the facet will be a different term that is part of the field content. You most likely need to configure it as 'not_analyzed' in your mapping. If you are also searching on it and you still want to tokenize it you can just index it in two different ways using multi field.
You also need to take into account that depending on the number of unique terms that are part of the full_name field, this operation can be expensive and require quite some memory.
For Elasticsearch 1.0 and later, you can leverage terms aggregation to do this,
query DSL:
{
"aggs": {
"NAME": {
"terms": {
"field": "",
"size": 10
}
}
}
}
A real example:
{
"aggs": {
"full_name": {
"terms": {
"field": "authors",
"size": 0
}
}
}
}
Then you can get all unique values of authors field.
size=0 means not limit the number of terms(this requires es to be 1.1.0 or later).
Response:
{
...
"aggregations" : {
"full_name" : {
"buckets" : [
{
"key" : "Ken",
"doc_count" : 10
},
{
"key" : "Jim Gray",
"doc_count" : 10
},
]
}
}
}
see Elasticsearch terms aggregations.
Intuition:
In SQL parlance:
Select distinct full_name from authors;
is equivalent to
Select full_name from authors group by full_name;
So, we can use the grouping/aggregate syntax in ElasticSearch to find distinct entries.
Assume the following is the structure stored in elastic search :
[{
"author": "Brian Kernighan"
},
{
"author": "Charles Dickens"
}]
What did not work: Plain aggregation
{
"aggs": {
"full_name": {
"terms": {
"field": "author"
}
}
}
}
I got the following error:
{
"error": {
"root_cause": [
{
"reason": "Fielddata is disabled on text fields by default...",
"type": "illegal_argument_exception"
}
]
}
}
What worked like a charm: Appending .keyword with the field
{
"aggs": {
"full_name": {
"terms": {
"field": "author.keyword"
}
}
}
}
And the sample output could be:
{
"aggregations": {
"full_name": {
"buckets": [
{
"doc_count": 372,
"key": "Charles Dickens"
},
{
"doc_count": 283,
"key": "Brian Kernighan"
}
],
"doc_count": 1000
}
}
}
Bonus tip:
Let us assume the field in question is nested as follows:
[{
"authors": [{
"details": [{
"name": "Brian Kernighan"
}]
}]
},
{
"authors": [{
"details": [{
"name": "Charles Dickens"
}]
}]
}
]
Now the correct query becomes:
{
"aggregations": {
"full_name": {
"aggregations": {
"author_details": {
"terms": {
"field": "authors.details.name"
}
}
},
"nested": {
"path": "authors.details"
}
}
},
"size": 0
}
Working for Elasticsearch 5.2.2
curl -XGET http://localhost:9200/articles/_search?pretty -d '
{
"aggs" : {
"whatever" : {
"terms" : { "field" : "yourfield", "size":10000 }
}
},
"size" : 0
}'
The "size":10000 means get (at most) 10000 unique values. Without this, if you have more than 10 unique values, only 10 values are returned.
The "size":0 means that in result, "hits" will contain no documents. By default, 10 documents are returned, which we don't need.
Reference: bucket terms aggregation
Also note, according to this page, facets have been replaced by aggregations in Elasticsearch 1.0, which are a superset of facets.
The existing answers did not work for me in Elasticsearch 5.X, for the following reasons:
I needed to tokenize my input while indexing.
"size": 0 failed to parse because "[size] must be greater than 0."
"Fielddata is disabled on text fields by default." This means by default you cannot search on the full_name field. However, an unanalyzed keyword field can be used for aggregations.
Solution 1: use the Scroll API. It works by keeping a search context and making multiple requests, each time returning subsequent batches of results. If you are using Python, the elasticsearch module has the scan() helper function to handle scrolling for you and return all results.
Solution 2: use the Search After API. It is similar to Scroll, but provides a live cursor instead of keeping a search context. Thus it is more efficient for real-time requests.

Resources