I have a set of 2.8 million docs with sets of tags that I'm querying with ElasticSearch, but many of these docs can be grouped together by one ID. I want to query my data using the tags, and then aggregate them by the ID that repeats. Often my search results have tens of thousands of documents, but I only want to aggregate the top 100 results of the search. How can I constrain an aggregation to only the top 100 results from a query?
Sampler Aggregation:
A filtering aggregation used to limit any sub aggregations' processing
to a sample of the top-scoring documents.
"aggs": {
"bestDocs": {
"sampler": {
// "field": "<FIELD>", <-- optional, Controls diversity using a field
"shard_size":100
},
"aggs": {
"bestBuckets": {
"terms": {
"field": "id"
}
}
}
}
}
This query limits the sub-aggregation to the top-scoring 100 documents collected on each shard (shard_size is applied per shard) and then buckets them by ID.
Optionally, you can use the field or script and max_docs_per_value settings to control the maximum number of documents collected on any one shard which share a common value.
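For example, here is a hedged sketch of a diversity-controlled sample (the author field is hypothetical; in recent Elasticsearch versions this combination lives in the separate diversified_sampler aggregation):
"sampler": {
    "shard_size": 100,
    "field": "author",           // hypothetical diversity field
    "max_docs_per_value": 3      // at most 3 sampled docs per distinct author
}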
The size parameter can be set to define how many term buckets should be returned out of the overall terms list.
By default, the node coordinating the search process will request each shard to provide its own top size term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate (it could be that the term counts are slightly off and it could even be that a term that should have been in the top size buckets was not returned).
If set to 0, the size will be set to Integer.MAX_VALUE.
Here is an example code to return top 100:
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 100
            }
        }
    }
}
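If the cross-shard inaccuracy described above is a concern, the terms aggregation also accepts a shard_size parameter (the value here is an assumption) so that each shard returns more candidate buckets before the final reduce:
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 100,
                "shard_size" : 500
            }
        }
    }
}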
You can refer to the terms aggregation documentation for more information.
You can use the min_doc_count parameter to only return buckets containing at least that many documents:
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "min_doc_count" : 100
            }
        }
    }
}
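Note, however, that min_doc_count only filters out buckets with fewer than the given number of documents; it does not restrict the aggregation to the top results of the query, so it is not a direct substitute for the sampler approach above.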
Take this query:
{ 'location' : { '$near' : [x,y], '$maxDistance' : this.field } }
I want to assign to $maxDistance the value of the specified field from the currently evaluated document. Is that possible?
Yes it's possible. You just use $geoNear instead. Beware the catches and read carefully.
Presuming that your intent is to store a field such as "travelDistance" to indicate on the document that any such searches must be "within" that supplied distance from the queried point to be valid, we simply query and evaluate the condition with $redact:
db.collection.aggregate([
    // Project a computed "distance" field onto every candidate document
    { "$geoNear": {
        "near": {
            "type": "Point",
            "coordinates": [x,y]
        },
        "spherical": true,
        "distanceField": "distance"
    }},
    // Keep only documents whose distance falls within their own travelDistance
    { "$redact": {
        "$cond": {
            "if": { "$lte": [ "$distance", "$travelDistance" ] },
            "then": "$$KEEP",
            "else": "$$PRUNE"
        }
    }}
])
The only catch is that $geoNear just like $near will only return a certain number of documents "near" in the first place. You can tune that with the options, but unlike the general query form, this is basically guaranteeing that the eventual returned results will be less than the specified "nearest" numbers.
As long as you are aware of that, then this is perfectly valid.
It is in fact the general way to deal with qualifying what is "near" within a radius.
Also be aware of the "distance" according to how you have the coordinates stored. With legacy coordinate pairs the distances will be in radians, which you will probably need to convert to kilometers or miles.
If using GeoJSON, then the distances are always considered in meters, as the standard format.
All the math notes are in the documentation.
N.B. Read the $geoNear documentation carefully. Options like "spherical" are required for "2dsphere" indexes, such as you should have for real-world coordinates. Also, "limit" may need to be applied to increase past the default 100-document result, for further trimming.
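As a hedged illustration of those notes (the collection name and values are assumptions), the distanceMultiplier option can convert the radian distances of a spherical query over legacy pairs into kilometers, while limit raises the default 100-document cap:
db.collection.aggregate([
    { "$geoNear": {
        "near": [x,y],
        "spherical": true,
        "distanceField": "distance",
        "distanceMultiplier": 6371,  // approx. earth radius in km: radians -> km
        "limit": 500                 // raise the default 100-document cap
    }}
])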
As the comments mention spring mongo, then here is basically the same thing done for that:
Aggregation aggregation = newAggregation(
    new AggregationOperation() {
        @Override
        public DBObject toDBObject(AggregationOperationContext context) {
            return new BasicDBObject("$geoNear",
                new BasicDBObject(
                    "near", new BasicDBObject(
                        "type","Point")
                    .append("coordinates", Arrays.asList(20,30))
                )
                .append("spherical",true)
                .append("distanceField","distance")
            );
        }
    },
    new AggregationOperation() {
        @Override
        public DBObject toDBObject(AggregationOperationContext context) {
            return new BasicDBObject("$redact",
                new BasicDBObject(
                    "$cond", Arrays.asList(
                        new BasicDBObject("$lte", Arrays.asList("$distance", "$travelDistance")),
                        "$$KEEP",
                        "$$PRUNE"
                    )
                )
            );
        }
    }
);
I am using a geospatial index in Cloudant for retrieving all documents inside a polygon. Now I want to calculate some basic statistics for those documents (e.g. average age and sum of earnings in a region).
Is it possible to query the geo index and then pass the result on to the MapReduce function?
How can I achieve this, preferably inside the database? Can I avoid querying for the document ids inside the polygon first and then sending the retrieved ids off to perform the MapReduce (I am working with large data sets)?
What is working so far is querying the index as well as using the view (separately).
My geo index
function (doc) {
if (doc.geometry && doc.geometry.coordinates) {
st_index(doc.geometry);
}
}
My view
function (doc) {
var beitrag = doc.properties.beitrag;
var schadenaufwand = doc.schadenaufwand;
if(beitrag !== null && typeof beitrag === 'number' ) {
emit(doc._id, doc.properties.beitrag);
}
}
A sample GeoJSON document (the original data looks similar):
{
"_id": "01bff77f642fc4249e787d2ded011504",
"_rev": "1-25a9a1a15939d5b21af3fbcc5c2d6ed1",
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [
7.2316,
40.99
]
},
"properties": {
"age": 34,
"earnings": 982.7
}
}
This question is similar, but did not really help me: Cloudant - apply a view/mapReduce to a geospatial query
This demo could be something in the right direction: https://examples.cloudant.com/simplegeo_places/_design/geo/index.html
It seems like it would be a useful feature, but the answer to this is 'no'. The Geo indexer can't perform aggregations over the data.
I think you'll have to do as you were thinking -- use the returned list of doc ids to distribute the calculation in another map-reduce system.
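For example, here is a minimal client-side sketch of that approach (Node.js; the account, database, design document, and index names are all assumptions, and authentication is omitted): query the geo index with include_docs=true, then compute the statistics in application code.
const https = require('https');

// Assumed names: database "mydb", design doc "geo", geo index "geoidx".
// The polygon is passed as WKT via the "g" parameter of Cloudant's geo API.
const url = 'https://ACCOUNT.cloudant.com/mydb/_design/geo/_geo/geoidx' +
    '?g=' + encodeURIComponent('POLYGON ((7.1 40.9, 7.4 40.9, 7.4 41.1, 7.1 40.9))') +
    '&include_docs=true';

https.get(url, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
        var rows = JSON.parse(body).rows || [];
        var count = rows.length;
        var avgAge = rows.reduce(function (s, r) { return s + r.doc.properties.age; }, 0) / (count || 1);
        var sumEarnings = rows.reduce(function (s, r) { return s + r.doc.properties.earnings; }, 0);
        console.log({ count: count, avgAge: avgAge, sumEarnings: sumEarnings });
    });
});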
I am using mongoose with node.js for MongoDB. I need to make 20 parallel find queries against my database, each with a limit of 4 documents, as shown below; only the brand_id changes for each brand.
areamodel.find({ brand_id: brand_id }, { '_id': 1 }, { limit: 4 }, function(err, docs) {
    if (err) {
        console.log(err);
    } else {
        console.log('fetched');
    }
});
Now, to run all these queries in parallel, I thought about putting all 20 brand_ids in an array of strings and then using an $in query to get the results, but I don't know how to specify the limit of 4 for every matched array element.
I wrote the code below with aggregation, but I don't know where to specify the limit for each element of my array.
var brand_ids = ["brandid1", "brandid2", "brandid3", "brandid4", "brandid5", "brandid6", "brandid7", "brandid8", "brandid9", "brandid10", "brandid11", "brandid12", "brandid13", "brandid14", "brandid15", "brandid16", "brandid17", "brandid18", "brandid19", "brandid20"];
areamodel.aggregate(
    { $match: { 'brand_id': { $in: brand_ids } } },
    { $project: { _id: 1 } },
    function(err, docs) {
        if (err) {
            console.error(err);
        } else {
        }
    }
);
Can anyone please tell me how I can solve my problem using only one query?
UPDATE - Why I don't think $group will be helpful for me.
Suppose my brand_ids array contains these strings:
brand_ids = ["id1", "id2", "id3", "id4", "id5"]
and my database has the documents below:
{
    "brand_id": "id1",
    "name": "Levis",
    "loc": "india"
},
{
    "brand_id": "id1",
    "name": "Levis",
    "loc": "america"
},
{
    "brand_id": "id2",
    "name": "Lee",
    "loc": "india"
},
{
    "brand_id": "id2",
    "name": "Lee",
    "loc": "america"
}
Desired JSON output
{
"name": "Levis"
},
{
"name": "Lee"
}
For the above example, suppose I have 25000 documents with "name" as "Levis" and 25000 documents where "name" is "Lee"; if I use $group, then all 50000 documents will be scanned and grouped by "name".
But according to the solution I want, once the first documents with "Levis" and "Lee" are found, I shouldn't have to look through the remaining thousands of documents.
Update - I think if anyone can answer this, I can probably get to my solution.
Consider a case where I have 1000 total documents in my MongoDB, and suppose that out of those 1000, 100 will pass my match query.
Now if I apply limit 4 to this query, will it take the same time to execute as the query without any limit, or not?
Why I am thinking about this case:
Because if my query takes the same time, then I don't think $group will increase my time, as all the documents will be scanned anyway.
But if the query with the limit takes less time than the query without it, then:
If I can apply limit 4 to each array element, my question is solved.
If I cannot apply a limit per array element, then I don't think $group will be useful, as in that case I would have to scan all the documents to get the results.
FINAL UPDATE - As I read in the answer below and in the MongoDB docs, using $limit does not affect the time taken by the query; it only reduces the amount of data sent over the network. So I think if anyone can tell me how to apply a limit on array fields (using $group or anything else), my problem will be solved.
mongodb: will limit() increase query speed?
Solution
Actually my thinking about MongoDB was very wrong: I thought adding a limit to a query would decrease the time it takes, but that is not the case, which is why I stumbled for so many days before trying the answer that Gregory NEUT and JohnnyHK gave me. Thanks a lot to both of you; I would have found the solution on day one if I had known this. I really appreciate the help.
I propose you use the $group aggregation stage to group all the data matched by $match by brand_id, and then limit each group of data using $slice.
Look at this Stack Overflow post:
db.collection.aggregate([
    {
        $sort: {
            created: -1
        }
    },
    {
        $group: {
            _id: '$city',
            title: {
                $push: '$title'
            }
        }
    },
    {
        $project: {
            _id: 0,
            city: '$_id',
            mostRecentTitle: {
                $slice: ['$title', 0, 2]
            }
        }
    }
])
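Adapted to the question's schema, a hedged sketch might look like this (field names are taken from the question; note that $slice inside $project requires MongoDB 3.2 or later):
areamodel.aggregate([
    { $match: { brand_id: { $in: brand_ids } } },
    // Collect the _id values for each brand into one array per brand_id
    { $group: { _id: '$brand_id', ids: { $push: '$_id' } } },
    // Keep only the first 4 _ids in each brand's array
    { $project: { _id: 0, brand_id: '$_id', ids: { $slice: ['$ids', 4] } } }
], function(err, docs) {
    if (err) {
        console.error(err);
    } else {
        console.log(docs); // e.g. [{ brand_id: 'brandid1', ids: [...] }, ...]
    }
});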
I propose using distinct, since that will return all different brand names in your collection. (I assume this is what you are trying to achieve?)
db.runCommand ( { distinct: "areamodel", key: "name" } )
MongoDB docs
In mongoose I think it is: areamodel.db.db.command({ distinct: "areamodel", key: "name" }) (untested)
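Mongoose models also expose a distinct helper directly, which should be equivalent (likewise untested here):
areamodel.distinct('name', function(err, names) {
    if (err) return console.error(err);
    console.log(names); // e.g. ['Levis', 'Lee', ...]
});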
The query receives a pair of coordinates, a maximum distance radius, a "skip" integer and a "limit" integer. The function should return the closest and newest locations for the given position. There is no visible error in my code, yet when I call the query again it returns repeated results. The "skip" variable is updated according to the results returned.
Example:
1) I make the query with skip = 0, limit = 10. I receive 10 non-repeated locations.
2) The query is called again, now with skip = 10, limit = 10. I receive another 10 locations, including repeated results from the first query.
QUERY
Locations.find({ coordinates :
{ $near : [ x , y ],
$maxDistance: maxDistance }
})
.sort('date_created')
.skip(skip)
.limit(limit)
.exec(function(err, locations) {
console.log("[+]Found Locations");
callback(locations);
});
SCHEMA
var locationSchema = new Schema({
date_created: { type: Date },
coordinates: [],
text: { type: String }
});
I have tried looking everywhere for a solution. Could it be down to the versions? I use mongoose 4.x.x and MongoDB 2.5.6, I believe. Any ideas?
There are a couple of things to consider in the sort of results that you want, the first being that you have a "secondary" sort criterion in "date_created" to deal with.
The basic problem is that the $near operator and like operators in MongoDB do not at present "project" any field to indicate the "distance" from the queried location; they simply "default sort" the data. So in order to do that "secondary" sort, a field with the "distance" needs to be present. There are therefore other options for this.
The second consideration is that "skip" and "limit" style paging is horrible for performance on large sets of data and should be avoided where you can. It's better to select data based on the "range" where it occurs rather than "skip" through all the results you have previously displayed.
The first thing to do here is use a command that can "project" the distance into the document along with the other information. The aggregation command of $geoNear is good for this, and especially since we want to do other sorting:
var seenIds = [],
lastDistance = null,
lastDate = null;
Locations.aggregate(
[
{ "$geoNear": {
"near": [x,y],
"maxDistance": maxDistance
"distanceField": "dist",
"limit": 10
}},
{ "$sort": { "dist": 1, "date_created": -1 }
],
function(err,results) {
results.forEach(function(result) {
if ( ( result.dist != lastDistance ) || ( result.date_created != lastDate ) ) {
seenIds = [];
lastDistance = result.dist;
lastDate = result.date_created;
}
seenIds.push(result._id);
});
// save those variables to session or other persistence
// do something with results
}
)
That is the first iteration of your results, where you fetch the first 10. Note the logic inside the loop: each document in the results is inspected for a change in either "date_created" or the projected "dist" field now present in the document, and where a change occurs the "seenIds" array is wiped of all current entries. So on each iteration all the variables are tested and possibly updated, and where there is no change, items are added to the list of "seenIds".
All three of those variables need to be stored somewhere awaiting the next request. For web applications the session store is ideal, but approaches vary; you just want those values to be recalled when the next request starts.
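As a hedged sketch of that persistence (assuming an Express app with express-session already configured; the geoPaging key is made up for illustration):
// Inside an Express route handler (express-session assumed configured):
app.get('/locations', function(req, res) {
    // Restore the paging state from the previous request, if any
    var state = req.session.geoPaging || { seenIds: [], lastDistance: null, lastDate: null };
    var seenIds = state.seenIds,
        lastDistance = state.lastDistance,
        // note: most session stores serialize Dates to strings, so re-cast
        lastDate = state.lastDate ? new Date(state.lastDate) : null;

    // ... run the aggregation shown below, then save the updated state:
    req.session.geoPaging = { seenIds: seenIds, lastDistance: lastDistance, lastDate: lastDate };
});
On the next and subsequent iterations we then alter the query a bit: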
Locations.aggregate(
[
{ "$geoNear": {
"near": [x,y],
"maxDistance": maxDistance,
"minDistance": lastDistance,
"distanceField": "dist",
"limit": 10,
"query": {
"_id": { "$nin": seenIds },
"date_created": { "$lt": lastDate }
}
}},
{ "$sort": { "dist": 1, "date_created": -1 }
],
function(err,results) {
results.forEach(function(result) {
if ( ( result.dist != lastDistance ) || ( result.date_created != lastDate ) ) {
seenIds = [];
lastDistance = result.dist;
lastDate = result.date_created;
}
seenIds.push(result._id);
});
// save those variables to session or other persistence
// do something with results
}
)
So there the "minDistance" parameter is entered as you want to exclude any of the "nearer" results that have already been seen, and the additional checks are placed in the query with the "date_created" needing to be "less than" the "lastDistance" recorded as well since we are in descending order of sort, with the final "sure" filter in excluding any "_id" values that were recorded within the list because the values had not changed.
Now with geospatial data that "seenIds" list is not likely to grow much, as you are generally not going to find many things at exactly the same distance; but it is the general process for paging a sorted list of data like this, so it is worth understanding the concept.
So if you want to be able to use a secondary field to sort on with geospatial data while also considering the "near" distance, then this is the general approach: project a distance value into the document results, and keep track of the last-seen sort values along with the "_id" values seen at those values.
The general concept is "advancing the minimum distance" to enable each page of results to get gradually "further away" from the source point of origin used in the query.
Let's say that I store documents like this in ElasticSearch:
{
'name':'user name',
'age':43,
'location':'CA, USA',
'bio':'into java, scala, python ..etc.',
'tags':['java','scala','python','django','lift']
}
And let's say that I search using location=CA, how can I sort the results according to the number of the items in 'tags'?
I would like to list the people with the most tags on the first page.
You can do it by indexing an additional field containing the number of tags, on which you can then easily sort your results. Otherwise, if you are willing to pay a small performance cost at query time, there's a nice solution that doesn't require reindexing your data: you can sort based on a script like this:
{
"query" : {
"match_all" : {}
},
"sort" : {
"_script" : {
"script" : "doc['tags'].values.length",
"type" : "number",
"order" : "asc"
}
}
}
As you can read from the script based sorting section:
Note, it is recommended, for single custom based script based sorting,
to use custom_score query instead as sorting based on score is faster.
That means that it'd be better to use a custom score query to influence your score, and then sort by score, like this:
{
"query" : {
"custom_score" : {
"query" : {
"match_all" : {}
},
"script" : "_score * doc['tags'].values.length"
}
}
}
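Since results are sorted by _score in descending order by default, the documents with the most tags will then come first.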