I have an Elasticsearch instance with about 3.5 million records, and I want a fast way to return (n) records per property value in any given search.
Example
Document 1:
{
"id": 1,
"gender": "male",
"name": "Joe"
}
Document 2:
{
"id": 2,
"gender": "male",
"name": "John"
}
Document 3:
{
"id": 3,
"gender": "female",
"name": "Jill"
}
Document 4:
{
"id": 4,
"gender": "female",
"name": "Joan"
}
Assuming a match_all search, I only want to return 1 document for each value of the gender property.
For instance, return only doc 1 and doc 3.
This would obviously be spread across a much larger result set, but the result should still scale to (n) docs per unique property value.
Any help with this is much appreciated.
Use a terms aggregation on the "gender" field together with a "top_hits" sub-aggregation to return n hits for each "gender" value:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"by_gender": {
"terms": {
"field": "gender"
},
"aggs": {
"first_hit": {
"top_hits": {"size":1}
}
}
}
}
}
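To make the shape of the result concrete, here is a minimal in-memory sketch in plain JavaScript of what the terms + top_hits combination computes: bucket the documents by a field, then keep at most n per bucket. The helper name topHitsPerValue is hypothetical and is not part of Elasticsearch. Note that the real top_hits aggregation orders hits by score unless told otherwise, whereas this sketch simply keeps input order.

```javascript
// Hypothetical helper (not an Elasticsearch API): bucket docs by `field`
// and keep at most `n` docs per distinct value, in input order.
function topHitsPerValue(docs, field, n) {
  const buckets = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (!buckets.has(key)) buckets.set(key, []);
    const hits = buckets.get(key);
    if (hits.length < n) hits.push(doc);
  }
  return buckets;
}

// The four sample documents from the question.
const people = [
  { id: 1, gender: "male", name: "Joe" },
  { id: 2, gender: "male", name: "John" },
  { id: 3, gender: "female", name: "Jill" },
  { id: 4, gender: "female", name: "Joan" },
];

const firstPerGender = topHitsPerValue(people, "gender", 1);
for (const [gender, hits] of firstPerGender) {
  console.log(gender, hits.map((d) => d.id));
}
```

With n = 1 this keeps doc 1 for "male" and doc 3 for "female", which is the result the question asks for. On a real index, keep in mind that a terms aggregation returns only the top 10 buckets by default, so a field with more distinct values needs a larger "size" on the terms aggregation.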
Related
I have some documents resembling the following structure:
{
"name": "item_1",
"category": ["a", "b"]
},
{
"name": "item_2",
"category": ["c"]
},
{
"name": "item_3",
"category": ["a", "c"]
},
{
"name": "item_4",
"category": ["a"]
},
{
"name": "item_5",
"category": ["a"]
}
I'm trying to get a sorted list of the most used values for the category field in all documents within the collection.
So in this example, the return value I'm expecting should be something like this:
[
{
"category": "a",
"count": 4
},
{
"category": "c",
"count": 2
},
{
"category": "b",
"count": 1
}
]
Is there a way to make such a query in mongoose?
Demo - https://mongoplayground.net/p/sBpwwvowXLH
Use an aggregation query: $unwind the category array into separate documents, then $group them back by category and get the count with $sum.
db.collection.aggregate([
{
"$unwind": "$category"
},
{
"$group": {
"_id": "$category",
"count": { "$sum": 1 }
}
}
])
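The pipeline above counts the categories but does not sort them, and it returns the category under _id rather than under a "category" key. As a rough sketch of the full computation the question asks for, here is the same logic in plain JavaScript over an in-memory array (the function name countCategories is hypothetical). In the real pipeline, a trailing { "$sort": { "count": -1 } } stage would play the role of the sort call here.

```javascript
// Hypothetical in-memory equivalent of $unwind + $group + $sort.
function countCategories(items) {
  const counts = new Map();
  for (const item of items) {
    // $unwind: visit each array element as if it were its own document
    for (const category of item.category) {
      // $group with { $sum: 1 }: increment the counter for this category
      counts.set(category, (counts.get(category) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .map(([category, count]) => ({ category, count }))
    .sort((a, b) => b.count - a.count); // $sort: most used first
}

// The five sample documents from the question.
const items = [
  { name: "item_1", category: ["a", "b"] },
  { name: "item_2", category: ["c"] },
  { name: "item_3", category: ["a", "c"] },
  { name: "item_4", category: ["a"] },
  { name: "item_5", category: ["a"] },
];

console.log(countCategories(items));
```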
I have a huge collection of documents in Elasticsearch, and I want to group the documents and aggregate their values.
Sample document:
[
{
"_id": "123",
"meter_id": "1001",
"voltage": {
"voltage": 50
},
"date": "2020-05-09T06:03:56Z"
},
{
"_id": "1234",
"meter_id": "1002",
"voltage": {
"voltage": 40
},
"date": "2020-04-10T06:03:56Z"
}
]
Now I want to match the documents in this collection that fall within a specific date range, for example dates between 2020-04-10 and 2020-05-09. The documents matching this criterion should be grouped by their common meter_id (e.g. 1001) into a single result holding the average voltage of all matching documents.
POST _bulk
{"index":{"_index":"voltage","_type":"_doc"}}
{"meter_id":"1001","voltage":{"voltage":50},"date":"2020-04-09T06:03:56Z"}
{"index":{"_index":"voltage","_type":"_doc"}}
{"meter_id":"1001","voltage":{"voltage":60},"date":"2020-05-08T08:03:56Z"}
{"index":{"_index":"voltage","_type":"_doc"}}
{"meter_id":"1001","voltage":{"voltage":60},"date":"2020-05-01T08:03:56Z"}
GET voltage/_search
{
"size": 0,
"query": {
"range": {
"date": {
"gte": "2020-04-10",
"lte": "2020-05-09",
"format": "yyyy-MM-dd"
}
}
},
"aggs": {
"by_meter_id": {
"terms": {
"field": "meter_id.keyword"
},
"aggs": {
"avg_voltage": {
"avg": {
"field": "voltage.voltage"
}
}
}
}
}
}
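As a sanity check on what this search should return for the three bulk-indexed readings, here is a small plain-JavaScript sketch of the same filter-then-average logic. The function name averageVoltageByMeter is hypothetical, and the sketch assumes both range bounds are inclusive whole days, which is a simplification of how Elasticsearch resolves date-only bounds.

```javascript
// Hypothetical in-memory equivalent of: range filter on `date`,
// terms bucket on `meter_id`, avg on `voltage.voltage`.
function averageVoltageByMeter(readings, fromDay, toDay) {
  const sums = new Map();
  for (const r of readings) {
    // "YYYY-MM-DD" prefixes of ISO timestamps compare correctly as strings.
    const day = r.date.slice(0, 10);
    if (day < fromDay || day > toDay) continue; // assumed inclusive day bounds
    const s = sums.get(r.meter_id) || { total: 0, count: 0 };
    s.total += r.voltage.voltage;
    s.count += 1;
    sums.set(r.meter_id, s);
  }
  const averages = {};
  for (const [meterId, s] of sums) averages[meterId] = s.total / s.count;
  return averages;
}

// The three readings from the _bulk request above.
const readings = [
  { meter_id: "1001", voltage: { voltage: 50 }, date: "2020-04-09T06:03:56Z" },
  { meter_id: "1001", voltage: { voltage: 60 }, date: "2020-05-08T08:03:56Z" },
  { meter_id: "1001", voltage: { voltage: 60 }, date: "2020-05-01T08:03:56Z" },
];

console.log(averageVoltageByMeter(readings, "2020-04-10", "2020-05-09"));
```

The reading from 2020-04-09 falls outside the range, so only the two 60 V readings contribute and meter 1001 averages to 60.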
I'll explain: I have this function
function (doc) {
if(doc.MovieId == "1721")
emit(doc.Rating, 1);
}
but it returns some documents that are not relevant (for example, they don't have the Rating field). My document _id is composed of partitionName:id, so I thought I could use if(doc.MovieId == "1721" && doc._id.contains("ratings"){...} but it doesn't work.
Is there a way to do this?
-----EDIT 1-----
The docs in the circle are not relevant.
Do you need the schema of the JSON document?
-----EDIT 2-----
the following documents are NOT RELEVANT
1.
{
"_id": "movies : 1721",
"_rev": "1-d7e0e3c8152d6978073d280e0aef7457",
"MovieId": "1721",
"Title": "Titanic (1997)",
"Genres": [
"Drama",
"Romance"
]
}
2.
{
"_id": "tags : 1490",
"_rev": "1-14c20c9cfb3ee1964a298777f80333d5",
"MovieId": "1721",
"UserId": "474",
"Tag": "shipwreck",
"Timestamp": "1138031879"
}
3.
{
"_id": "tags : 2791",
"_rev": "1-e4d6c9573fcdae726a69d5fc6255de27",
"MovieId": "1721",
"UserId": "537",
"Tag": "romance",
"Timestamp": "1424141922"
}
Documents like this are RELEVANT:
{
"_id": "ratings : 31662",
"_rev": "1-446665286337faaf51e23e40b527ec2d",
"MovieId": "1721",
"UserId": "219",
"Rating": "0.5",
"Timestamp": "1214043346"
}
The following view should emit only documents whose _id starts with "ratings :":
function (doc) {
var id_prefix = "ratings :";
if(doc._id.substr(0, id_prefix.length) === id_prefix && doc.MovieId == "1721")
emit(doc.Rating, 1);
}
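To check the prefix test outside CouchDB, the map function can be run against a few of the sample documents with a stand-in emit. This is only a local sketch: the emit stub and the emitted array are assumptions for testing, since CouchDB provides its own emit inside the view server.

```javascript
// Stand-in for CouchDB's emit(), collecting rows so they can be inspected.
const emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// The map function from the answer, unchanged.
function map(doc) {
  var id_prefix = "ratings :";
  if (doc._id.substr(0, id_prefix.length) === id_prefix && doc.MovieId == "1721")
    emit(doc.Rating, 1);
}

// Trimmed copies of the sample documents from the question.
const docs = [
  { _id: "movies : 1721", MovieId: "1721", Title: "Titanic (1997)" },
  { _id: "tags : 1490", MovieId: "1721", UserId: "474", Tag: "shipwreck" },
  { _id: "ratings : 31662", MovieId: "1721", UserId: "219", Rating: "0.5" },
];

docs.forEach(map);
console.log(emitted);
```

Only the "ratings : 31662" document passes both conditions, so a single row with key "0.5" is emitted. The "movies" and "tags" documents share the same MovieId but are rejected by the _id prefix check.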
I am working on a MongoDB distinct query. I have one collection with repeated entries (they differ only in created_at), and I want to fetch the results without the repeated values.
Sample JSON
{
"posts": [{
"id": "580a2eb915a0161010c2a562",
"name": "\"Ah Me Joy\" Porter",
"created_at": "15-10-2016"
}, {
"id": "580a2eb915a0161010c2a562",
"name": "\"Ah Me Joy\" Porter",
"created_at": "25-10-2016"
}, {
"id": "580a2eb915a0161010c2a562",
"name": "\"Ah Me Joy\" Porter",
"created_at": "01-10-2016"
}, {
"id": "580a2eb915a0161010c2bf572",
"name": "Hello All",
"created_at": "05-10-2016"
}]
}
Mongodb Query
db.getCollection('posts').find({"id" : ObjectId("580a2eb915a0161010c2a562")})
So I would like to know how to write a distinct query in MongoDB. Please go through my post and let me know.
Try the following:
db.getCollection('posts').distinct("id")
It will return all the unique IDs in the collection posts as follows:
["580a2eb915a0161010c2a562", "580a2eb915a0161010c2bf572"]
From MongoDB docs:
The example use the inventory collection that contains the following documents:
{ "_id": 1, "dept": "A", "item": { "sku": "111", "color": "red" }, "sizes": [ "S", "M" ] }
{ "_id": 2, "dept": "A", "item": { "sku": "111", "color": "blue" }, "sizes": [ "M", "L" ] }
{ "_id": 3, "dept": "B", "item": { "sku": "222", "color": "blue" }, "sizes": "S" }
{ "_id": 4, "dept": "A", "item": { "sku": "333", "color": "black" }, "sizes": [ "S" ] }
To Return Distinct Values for a Field (dept):
db.inventory.distinct( "dept" )
The method returns the following array of distinct dept values:
[ "A", "B" ]
Reference:
https://docs.mongodb.com/v3.2/reference/method/db.collection.distinct/
As I understand it, you want distinct results that eliminate the duplicate ids in that collection.
Using distinct in MongoDB returns a list of the distinct values:
db.getCollection('posts').distinct("id");
["580a2eb915a0161010c2a562", "580a2eb915a0161010c2bf572"]
Since distinct returns only the values, you should look into MongoDB aggregation if you want one full document per id:
db.posts.aggregate(
{ "$group" : { "_id" : "$id", "name" : {"$first" : "$name"}, "created_at" : {"$first" : "$created_at"} }}
)
The output will be a list of results with the documents that have a duplicate id eliminated.
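For illustration, here is roughly what that $group stage with $first does, written as plain JavaScript over the sample posts. The function name firstPerId is hypothetical. One caveat worth stating: without a preceding $sort stage, $first simply takes whichever document arrives first, so the created_at that is kept depends on the input order.

```javascript
// Hypothetical in-memory equivalent of:
// { $group: { _id: "$id", name: { $first: "$name" }, created_at: { $first: "$created_at" } } }
function firstPerId(posts) {
  const groups = new Map();
  for (const post of posts) {
    // $first: only the first document seen for each id contributes its fields
    if (!groups.has(post.id)) {
      groups.set(post.id, { _id: post.id, name: post.name, created_at: post.created_at });
    }
  }
  return [...groups.values()];
}

// The sample posts from the question.
const posts = [
  { id: "580a2eb915a0161010c2a562", name: '"Ah Me Joy" Porter', created_at: "15-10-2016" },
  { id: "580a2eb915a0161010c2a562", name: '"Ah Me Joy" Porter', created_at: "25-10-2016" },
  { id: "580a2eb915a0161010c2a562", name: '"Ah Me Joy" Porter', created_at: "01-10-2016" },
  { id: "580a2eb915a0161010c2bf572", name: "Hello All", created_at: "05-10-2016" },
];

console.log(firstPerId(posts));
```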
Background: I've implemented a partial search on a name field by indexing the tokenized name (name field) as well as a trigram analyzed name (ngram field).
I've boosted the name field to have exact token matches bubble up to the top of the results.
Problem: I am trying to implement a query that limits the nGram matches to ones that only match some threshold (say 80%) of the query string. I understand that minimum_should_match seems to be what I am looking for, but my problem is forming the query to actually produce those results.
My exact token matches are boosted to the top but I still get every document that has a single matching trigram in the ngram field.
GIST: Index settings and mapping
Index Settings
{
"my_index": {
"settings": {
"index": {
"number_of_shards": "5",
"max_result_window": "30000",
"creation_date": "1475853851937",
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": "3",
"max_gram": "3"
}
},
"analyzer": {
"ngram_analyzer": {
"filter": [
"lowercase",
"ngram_filter"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "AuCjcP5sSb-m59bYrprFcw",
"version": {
"created": "2030599"
}
}
}
}
}
Index Mappings
{
"my_index": {
"mappings": {
"my_type": {
"properties": {
"acw": {
"type": "integer"
},
"pcg": {
"type": "integer"
},
"date": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"dob": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"id": {
"type": "string"
},
"name": {
"type": "string",
"boost": 10
},
"ngram": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"bdk": {
"type": "integer"
},
"mmw": {
"type": "integer"
},
"mpi": {
"type": "integer"
},
"sex": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Solution Attempts
[GIST: Query Attempts] unlinkifying due to 2 link limit :(
(https://gist.github.com/jordancardwell/2e690013666e7e1da6ef1acee314b4e6)
I tried a multi-match query, which gives me correct search results, but I haven't had luck omitting results for names that only match a single trigram (say "odo" trigram inside "theodophilus")
//this matches 'frodo' and sends results to the top, since `name` field is boosted
// but also matches 'theodore' and 'rodolpho'
{
"size":100,
"from":0,
"query":{
"multi_match":{
"query":"frodo",
"fields":[
"name",
"ngram"
],
"type":"best_fields"
}
}
}
// I then tried to throw in the `minimum_should_match` option,
// hoping it would filter out large strings that only had one matching trigram, for instance
{
"size":100,
"from":0,
"query":{
"multi_match":{
"query":"frodo",
"fields":[
"name",
"ngram"
],
"type":"best_fields",
"minimum_should_match": "90%"
}
}
}
I've tried playing around in Sense to manually write the match queries that this produces, so that I can apply minimum_should_match only to the ngram field, but I can't seem to get the syntax right.
// I then tried to construct a custom query to just return the `minimum_should_match`d results on the ngram field
// I started with a query produced by using bodybuilder to `and` and `or` my other search criteria together
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
//each separate field's criteria `must`/`and`ed together
{
"query": {
"bool": {
"filter": {
"bool": {
"should": [
//each criterion for a specific field `should`/`or`ed together
{
//my attempt at getting `ngram` field results..
// should theoretically only return when field
// contains nothing but matching ngrams
// (i.e. exact matches and other fluke matches)
"query": {
"match": {
"ngram": {
"query": "frodo",
"minimum_should_match": "100%"
}
}
}
}
//... other criteria to be `should`/`or`ed together
]
}
}
}
}
}
//... other criteria to be `must`/`and`ed together
]
}
}
}
}
}
Can anyone see what I'm doing wrong?
It seems like this should be fairly straightforward to accomplish, but I must be missing something obvious.
UPDATE
I ran a query with _explain=true (using sense UI) to try to understand my results.
I queried for a match on the ngram field for "frod" with minimum_should_match = 100%, yet I still get every record that matches at least one ngram.
(e.g. rodolpho even though it doesn't contain fro)
GIST: test query and results
note: cross-posted from [discuss.elastic.co]
will make a link later, can't post more than 2 yet : /
(https://discuss.elastic.co/t/ngram-partial-match-limiting-ngram-results-in-multiple-field-query/62526)
I used your settings and mappings to create an index, and your queries seem to be working fine for me. I would suggest running an explain on one of the "unexpected" documents being returned to see why it is matched and returned with the other results.
Here is what I did:
Run the analyze api on your analyzer to see how the query will be split into tokens.
curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
"analyzer" : "ngram_analyzer",
"text" : "frodo"
}'
frodo will be split into 3 tokens with your analyzer.
{
"tokens": [
{
"token": "fro",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "rod",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "odo",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
}
]
}
I indexed 3 documents for testing (using only the ngram field). Here are the docs:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"_score": 1,
"_source": {
"ngram": "theodore"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"ngram": "frodo"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "3",
"_score": 1,
"_source": {
"ngram": "rudolpho"
}
}
]
}
}
The first query you mentioned matches frodo and theodore, but not rudolpho as you described. That makes sense, since rudolpho does not produce any trigrams that match the trigrams from frodo:
frodo -> fro, rod, odo
rudolpho -> rud, udo, dol, olp, lph, pho
Using your second query, I get back only frodo (neither of the other two).
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.53148466,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 0.53148466,
"_source": {
"ngram": "frodo"
}
}
]
}
}
I then ran an explain (localhost:9200/my_index/my_type/2/_explain) on the other two docs (theodore and rudolpho), and I see this (I have clipped the response):
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0,
"description": "no match on required clause ((ngram:fro ngram:rod ngram:odo)~2)",
"details": [
The above is expected, since at least two out of the three tokens from frodo must match.
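The arithmetic behind that "~2" can be reproduced with a small plain-JavaScript sketch. It assumes single-word, lowercased input split into 3-character grams (mirroring min_gram = max_gram = 3 above) and assumes the percentage form of minimum_should_match is rounded down, which is how the Elasticsearch documentation describes it. The function names are hypothetical, and this is a simplification of the real analyzer and scorer, not a reimplementation.

```javascript
// Split a single lowercase word into overlapping 3-character grams.
function trigrams(word) {
  const w = word.toLowerCase();
  const grams = [];
  for (let i = 0; i + 3 <= w.length; i++) grams.push(w.slice(i, i + 3));
  return grams;
}

// Number of query grams that must match for a percentage such as 90.
// Assumption: the computed value is rounded down.
function requiredMatches(clauseCount, percent) {
  return Math.floor((clauseCount * percent) / 100);
}

// True if enough of the query's trigrams also occur in the field value.
function matchesNgram(query, fieldValue, percent) {
  const queryGrams = trigrams(query);
  const fieldGrams = new Set(trigrams(fieldValue));
  const hits = queryGrams.filter((g) => fieldGrams.has(g)).length;
  return hits >= requiredMatches(queryGrams.length, percent);
}

console.log(trigrams("frodo"));
console.log(requiredMatches(3, 90));
for (const name of ["frodo", "theodore", "rudolpho"]) {
  console.log(name, matchesNgram("frodo", name, 90));
}
```

With three query trigrams, 90% works out to 2.7 and is rounded down to 2, which agrees with the "~2" in the explain output. Under these assumptions theodore shares only "odo" with frodo and rudolpho shares nothing, so both fall below the threshold, while "100%" would require all three trigrams to match.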