Aggregation Timing Out - node.js

I am using aggregation to query my schema for counts over date ranges. My problem is that I am not getting any response from the server (it times out every time). Other Mongoose queries work fine (find, save, etc.), and with aggregation it depends on the pipeline: when I only use $match I get a response, but when I add $unwind I don't get any.
Connection Code:
var promise = mongoose.connect('mongodb://<username>:<password>@<db>.mlab.com:<port>/<db-name>', {
    useMongoClient: true,
    replset: {
        ha: true, // Make sure the high availability checks are on
        haInterval: 5000 // Run every 5 seconds
    }
});
promise.then(function(db){
    console.log('DB Connected');
}).catch(function(e){
    console.log('DB Not Connected');
    console.error(e.message);
    process.exit(1);
});
Schema:
var ProspectSchema = new Schema({
    contact_name: {
        type: String,
        required: true
    },
    company_name: {
        type: String,
        required: true
    },
    contact_info: {
        type: Array,
        required: true
    },
    description: {
        type: String,
        required: true
    },
    product: {
        type: Schema.Types.ObjectId, ref: 'Product'
    },
    progression: {
        type: String
    },
    creator: {
        type: String
    },
    sales: {
        type: Schema.Types.ObjectId,
        ref: 'User'
    },
    technical_sales: {
        type: Schema.Types.ObjectId,
        ref: 'User'
    },
    actions: [{
        type: {type: String},
        description: {type: String},
        date: {type: Date}
    }],
    sales_connect_id: {
        type: String
    },
    date_created: {
        type: Date,
        default: Date.now
    }
});
Aggregation code:
exports.getActionsIn = function(start_date, end_date) {
    var start = new Date(start_date);
    var end = new Date(end_date);
    return Prospect.aggregate([
        {
            $match: {
                // "actions": {
                //     $elemMatch: {
                //         "type": { "$exists": true }
                //     }
                // }
                "actions.date": {
                    $gte: start,
                    $lte: end
                }
            }
        },
        {
            $project: {
                _id: 0,
                actions: 1
            }
        },
        {
            $unwind: "actions"
        },
        {
            $group: {
                _id: "actions.date",
                count: {
                    $sum: 1
                }
            }
        }
        // ,{
        //     $project: {
        //         _id: 0,
        //         date: {
        //             $dateToString: {
        //                 format: "%d/%m/%Y",
        //                 date: "actions.date"
        //             }
        //         }
        //         // ,
        //         // count: "$count"
        //     }
        // }
    ]).exec();
}
Calling the Aggregation:
router.get('/test', function(req, res, next){
    var start_date = req.query.start_date;
    var end_date = req.query.end_date;
    ProspectCont.getActionsIn(start_date, end_date).then(function(value, err){
        if (err) console.log(err);
        res.json(value);
    });
})
My main problem is that I get no response at all. I can work with an error message; the issue is that I am not getting any, so I don't know what is wrong.
Mongoose Version: 4.11.8
P.S. I tried multiple variations of the aggregation pipeline, so this isn't my first try. I have an aggregation working on the main prospects schema, but not on the actions sub-document.

You have several problems here, mostly due to missing concepts. Lazy readers can skip to the bottom for the full pipeline example, but the main body here is the explanation of why things are done as they are.
You are trying to select on a date range. The very first thing to check on any long-running operation is that you have a valid index. You might have one, or you might not. But you should issue (from the shell):
db.prospects.createIndex({ "actions.date": 1 })
Just to be sure. You probably really should add this to the schema definition so you know this should be deployed. So add to your defined schema:
ProspectSchema.index({ "actions.date": 1 })
When querying with a "range" on elements of an array, you need to understand that those are "multiple conditions" which you are expecting to match elements "between". Whilst you generally can get away with querying a "single property" of an array using "Dot Notation", you are missing that the application of $gte and $lte is like specifying the property several times with $and explicitly.
Whenever you have such "multiple conditions" you always mean to use $elemMatch. Without it, you are simply testing every value in the array to see if it is greater than or less than ( being some may be greater and some may be lesser ). The $elemMatch operator makes sure that "both" are applied to the same "element", and not just all array values as "Dot notation" exposes them:
{ "$match": {
"actions": {
"$elemMatch": { "date": { "$gte": start, "$lte: end } }
}
}}
That will now only match documents where the "array elements" fall between the specified dates. Without it, you are selecting and processing a lot more data which is irrelevant to the selection.
Array Filtering: Marked in bold because its prominence cannot be ignored. Any initial $match works just like any "query" in that its "job" is to "select documents" valid to the expression. This however does not have any effect on the contents of the array in the documents returned.
Whenever you have such a condition for document selection, you nearly always intend to "filter" such content from the array itself. This is a separate process, and really should be performed before any other operations that work with the content. Especially $unwind.
So you really should add a $filter in either an $addFields or $project, as appropriate to your intent, "immediately" following any document selection:
{ "$project": {
"_id": 0,
"actions": {
"$filter": {
"input": "$actions",
"as": "a",
"in": {
"$and": [
{ "$gte": [ "$$a.date", start ] },
{ "$lte": [ "$$a.date", end ] }
]
}
}
}
}}
Now the array content, which you already know "must" have contained at least one valid item due to the initial query conditions, is "reduced" down to only those entries that actually match the date range that you want. This removes a lot of overhead from later processing.
Note the different "logical variants" of $gte and $lte in use within the $filter condition. These evaluate to return a boolean for expressions that require them.
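As a purely hypothetical illustration, with start = 2017-08-01 and end = 2017-08-31, a document like:
{ "actions": [
    { "type": "call", "date": ISODate("2017-08-14T10:00:00Z") },
    { "type": "email", "date": ISODate("2017-07-02T09:00:00Z") }
]}
emerges from the $filter stage holding only the in-range entry:
{ "actions": [
    { "type": "call", "date": ISODate("2017-08-14T10:00:00Z") }
]}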
Grouping: It's probably just an attempt at getting a result, but the code you have does not really do anything with the dates in question. Since typical date values are recorded with millisecond precision, you generally want to reduce them.
Commented code suggests usage of $dateToString within a $project. It is strongly recommended that you do not do that. If you intend such a reduction, then supply that expression directly to the grouping key within $group instead:
{ "$group": {
"_id": {
"$dateToString": {
"format": "%Y-%m-%d",
"date": "$actions.date"
}
},
"count": { "$sum": 1 }
}}
I personally don't like returning a "string" when a natural Date object serializes properly for me already. So I like to use the "math" approach to "round" dates instead:
{ "$group": {
"_id": {
"$add": [
{ "$subtract": [
{ "$subtract": [ "$actions.date", new Date(0) ] },
{ "$mod": [
{ "$subtract": [ "$actions.date", new Date(0) ] },
1000 * 60 * 60 * 24
]}
],
new Date(0)
]
},
"count": { "$sum": 1 }
}}
That returns a valid Date object "rounded" to the current day. Mileage may vary on preferred approaches, but it's the one I like. And it takes the least bytes to transfer.
The usage of new Date(0) represents the "epoch date". So when you $subtract one BSON Date from another, you end up with the millisecond difference between the two as an integer. When you $add an integer value to a BSON Date, you get a new BSON Date representing the sum of the two millisecond values. This is the basis of converting to numeric, rounding to the nearest start of day, and then converting back to a Date value.
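For intuition, the same rounding written in plain JavaScript (illustrative only; the pipeline does this server-side) looks like:
// Illustrative only: the same day-rounding math in plain JavaScript.
// Subtracting the epoch yields milliseconds, the modulo strips the partial
// day, and adding back to the epoch yields midnight UTC of that day.
var dayMs = 1000 * 60 * 60 * 24;
var date = new Date("2017-08-14T13:23:00.000Z");
var sinceEpoch = date.valueOf();                           // $subtract from new Date(0)
var rounded = new Date(sinceEpoch - (sinceEpoch % dayMs)); // $mod, then $add
console.log(rounded); // 2017-08-14T00:00:00.000Z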
By making that statement directly within the $group rather than in a separate $project, you avoid what effectively gets interpreted as "go through all the data and return this calculated value, then go back through it again". It's much the same as working through a pile of objects: marking them with a pen first and then counting them as a separate step, versus counting as you mark.
As a single pipeline stage it saves considerable resources, since you do the accumulation at the same time as calculating the value to accumulate on. When you think it through, much like the provided analogy, it just makes a lot of sense.
As a full pipeline example you would put the above together as:
Prospect.aggregate([
    { "$match": {
        "actions": {
            "$elemMatch": { "date": { "$gte": start, "$lte": end } }
        }
    }},
    { "$project": {
        "_id": 0,
        "actions": {
            "$filter": {
                "input": "$actions",
                "as": "a",
                "cond": {
                    "$and": [
                        { "$gte": [ "$$a.date", start ] },
                        { "$lte": [ "$$a.date", end ] }
                    ]
                }
            }
        }
    }},
    { "$unwind": "$actions" },
    { "$group": {
        "_id": {
            "$dateToString": {
                "format": "%Y-%m-%d",
                "date": "$actions.date"
            }
        },
        "count": { "$sum": 1 }
    }}
])
And honestly, if after making sure an index is in place and following that pipeline you still have timeout problems, then reduce the date selection until you get a reasonable response time.
If it's still taking too long ( or the date reduction is not reasonable ) then your hardware simply is not up to the task. If you really have a lot of data then you have to be reasonable with expectations. So scale up or scale out, but those things are outside the scope of any question here.
As it stands those improvements should make a significant difference over any attempt shown so far. Mostly due to a few fundamental concepts that are being missed.
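If in doubt that the index is actually being used, a standard shell diagnostic (substituting real dates for start and end) is:
db.prospects.find({
    "actions": { "$elemMatch": { "date": { "$gte": start, "$lte": end } } }
}).explain("executionStats")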

Related

Use $lookup with a Conditional Join

Provided I have the following documents:
User
{
    uuid: string,
    isActive: boolean,
    lastLogin: datetime,
    createdOn: datetime
}
Projects
{
    id: string,
    users: [
        {
            uuid: string,
            otherInfo: ...
        },
        { ... more users }
    ]
}
And I want to select all users that haven't logged in for 2 weeks and are inactive, or haven't logged in for 5 weeks and don't have projects.
Now, the 2 weeks part is working fine, but I cannot seem to figure out how to do the "5 weeks and don't have projects" part.
I came up with something like the below, but the last part does not work because $exists obviously is not a top-level operator.
Anyone ever did anything like this?
Thanks!
return await this.collection
    .aggregate([
        {
            $match: {
                $and: [
                    {
                        $expr: {
                            $allElementsTrue: {
                                $map: {
                                    input: [`$lastLogin`, `$createdOn`],
                                    in: { $lt: [`$$this`, twoWeeksAgo] }
                                }
                            }
                        }
                    },
                    {
                        $or: [
                            {
                                isActive: false
                            },
                            {
                                $and: [
                                    {
                                        $expr: {
                                            $allElementsTrue: {
                                                $map: {
                                                    input: [`$lastLogin`, `$createdOn`],
                                                    in: { $lt: [`$$this`, fiveWeeksAgo] }
                                                }
                                            }
                                        }
                                    },
                                    {
                                        // No projects exist on this user
                                        $exists: {
                                            $lookup: {
                                                from: _.get(Config, `env.collection.projects`),
                                                let: {
                                                    currentUser: `$$ROOT`
                                                },
                                                pipeline: [
                                                    {
                                                        $project: {
                                                            _id: 0,
                                                            users: {
                                                                $filter: {
                                                                    input: `$users`,
                                                                    as: `user`,
                                                                    cond: {
                                                                        $eq: [`$$user.uuid`, `$currentUser.uuid`]
                                                                    }
                                                                }
                                                            }
                                                        }
                                                    }
                                                ]
                                            }
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        }
    ])
    .toArray();
Not certain why you thought $expr was needed in the initial $match, but really:
const getResults = () => {
    const now = Date.now();
    const twoWeeksAgo = new Date(now - (1000 * 60 * 60 * 24 * 7 * 2));
    const fiveWeeksAgo = new Date(now - (1000 * 60 * 60 * 24 * 7 * 5));
    // as long as mongoDriverCollectionReference points to a "Collection" object
    // for the "users" collection
    return mongoDriverCollectionReference.aggregate([
        // No $expr, since you can actually use an index. $expr cannot do that
        { "$match": {
            "$or": [
                // Active and "logged in"/created in the last 2 weeks
                {
                    "isActive": true,
                    "$or": [
                        { "lastLogin": { "$gte": twoWeeksAgo } },
                        { "createdOn": { "$gte": twoWeeksAgo } }
                    ]
                },
                // Also want those who...
                // Not Active and "logged in"/created in the last 5 weeks
                // we'll "tag" them later
                {
                    "isActive": false,
                    "$or": [
                        { "lastLogin": { "$gte": fiveWeeksAgo } },
                        { "createdOn": { "$gte": fiveWeeksAgo } }
                    ]
                }
            ]
        }},
        // Now we do the "conditional" stuff, just to return a matching result or not
        { "$lookup": {
            "from": _.get(Config, `env.collection.projects`), // there are a lot cleaner ways to register models than this
            "let": {
                "uuid": {
                    "$cond": {
                        "if": "$isActive", // this is boolean after all
                        "then": null, // don't really want to match
                        "else": "$uuid" // okay to match the 5-week results
                    }
                }
            },
            "pipeline": [
                // Nothing complex here as null will return nothing. Just do $in for
                // the array; note the $expr wrapper, which is required inside a
                // $lookup pipeline to compare against a "let" variable
                { "$match": { "$expr": { "$in": [ "$$uuid", "$users.uuid" ] } } },
                // Don't really need the detail, so just reduce any matches to one result of [null]
                { "$group": { "_id": null } }
            ],
            "as": "projects"
        }},
        // Now test if the $lookup returned something where it mattered
        { "$match": {
            "$or": [
                { "isActive": true }, // remember we selected the active ones already
                {
                    "projects.0": { "$exists": false } // So now we only need to know the "inactive" returned no array result.
                }
            ]
        }}
    ]).toArray(); // returns a Promise
};
It's pretty simple once you see it: calculated expressions via $expr are actually really bad and not what you want in a first pipeline stage. Also "not what you need", since createdOn and lastLogin really should not have been merged into an array for $allElementsTrue, which would just be an AND condition, where your described logic would really mean OR. So the $or does just fine here.
So does the $or on the separation of conditions for the isActive of true/false. Again it's either "two weeks" OR "five weeks". And this certainly does not need $expr since standard inequality range matching works fine, and uses an "index".
Then you really just want to do the "conditional" things in the let for $lookup instead of your "does it exist" thinking. All you really need to know ( since the range selection of dates is actually already done ) is whether active is now true or false. Where it's active ( meaning by your logic you don't care about projects ) simply make the $$uuid used within the $match pipeline stage a null value so it will not match and the $lookup returns an empty array. Where false ( also already matching the date conditions from earlier ) then you use the actual value and "join" ( where there are projects of course ).
Then it's just a simple matter of keeping the active users, and then only testing the remaining false values for active to see if the "projects" array from the $lookup actually returned anything. If it did not, then they just don't have projects.
Probably should note here is since users is an "array" within the projects collection, you use $in for the $match condition against the single value to the array.
Note that for brevity we can use $group inside the inner pipeline to only return one result instead of possibly many matches to actual matched projects. You don't care about the content or the "count", but simply if one was returned or nothing. Again following the presented logic.
This gets you your desired results, and it does so in a manner that is efficient and actually uses indexes where available.
Also return await certainly does not do what you think it does, and in fact it's an ESLint warning message ( I suggest you enable ESLint in your project ) since it's not a smart thing to do. It does nothing really, as you would need to await getResults() ( as per the example naming ) anyway, as the await keyword is not "magic" but just a prettier way of writing then(). As well as hopefully being easier to understand, once you understand what async/await is really for syntactically that is.
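As a usage sketch (assuming an async caller somewhere up the stack), that simply means:
// Usage sketch: await at the call site; `return await` inside getResults() adds nothing
const results = await getResults();
console.log(results);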

Get max of virtual field of each type of record

I am using MongoDB with Mongoose. I have a collection of orders; each order has a currency (usd, eur, ils, and so on) and a percent.
My Node application reads a value from another service, and my Order collection has a virtual field called price that is calculated from that value and the percent of the order document.
import mongoose, { Schema } from 'mongoose';
import { priceValue } from '../services/price-value';
const orderSchema = new Schema({
    currency: {
        type: 'String',
        required: true
    },
    percent: {
        type: 'Number',
        required: true
    }
}, {
    toObject: {
        virtuals: true
    },
    toJSON: {
        virtuals: true
    }
});
orderSchema.virtual('price').get(function() {
    return priceValue * this.percent;
});
export default mongoose.model('Order', orderSchema);
I need to find the max prices for each currency. Something like, for each CURRENCY call:
db.orders.find({ currency: CURRENCY }).sort({percent: -1}).limit(1);
collect the results in the node application and calculate the virtual price field.
But this feels incorrect. What is the proper way to do it?
Instead of using "virtual fields" you would delegate this to the aggregation framework, which is a lot better than querying for each possible "currency" value.
Where priceValue is a simple "constant" value, then you just supply it as such to the pipeline expression:
db.orders.aggregate([
    { "$sort": { "currency": 1, "percent": -1 } },
    { "$group": {
        "_id": "$currency",
        "percent": {
            "$first": {
                "$multiply": [
                    { "$divide": [ "$percent", 100 ] },
                    priceValue
                ]
            }
        }
    }}
])
So you $sort to keep things in order of "currency" and descending on each "percent" within that "currency". Then you $group on each distinct "currency", taking the $first from each grouping boundary.
Then all you need to do is apply $multiply between the priceValue constant and the "percent" value after a $divide by 100, since the value actually stored is a whole number.
You should also be aware that returned documents from aggregation pipelines do not have the same schema attached methods as from the model used. The basic reason is you typically "change the shape" of the documents returned, and therefore the "schema" no longer applies.
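For instance, here is a minimal usage sketch against the Order model from the question (the grouped results are plain objects, so the price virtual is not present on them):
// Sketch: aggregate() results are plain objects, not Mongoose documents,
// so the schema's `price` virtual does not apply; the computed value
// is carried in `percent` instead.
Order.aggregate([
    { "$sort": { "currency": 1, "percent": -1 } },
    { "$group": {
        "_id": "$currency",
        "percent": {
            "$first": {
                "$multiply": [ { "$divide": [ "$percent", 100 ] }, priceValue ]
            }
        }
    }}
]).then(function(results) {
    // e.g. [ { _id: 'usd', percent: <max price for usd> }, ... ]
    console.log(results);
});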
According to this answer, you need aggregation; it's a single call to the database:
db.orders.aggregate([
    { "$sort": { "currency": 1, "percent": -1 } }, // or "percent": 1 for the minimum instead
    { "$group": {
        "_id": "$currency",
        "percent": { "$first": "$percent" }
    }}
])
The result should look something like this; it's grouped uniquely by currency, each entry holding the max percent:
[{_id: 'usd', percent: 12}, {_id: 'eur', percent: 34}]
Then you can use your
import { priceValue } from '../services/price-value';
to compute the price.
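A minimal sketch of that last step (mirroring the question's virtual, which multiplies priceValue by percent):
// Sketch: apply the service value to each grouped result
const prices = results.map(r => ({
    currency: r._id,
    price: priceValue * r.percent
}));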

how to combine array of object result in mongodb

How can I combine the matched documents' subdocuments together as one and return them as an array of objects? I have tried $group but it doesn't seem to work.
My query (this returns an array of objects; in this case there are two):
User.find({
    'business_details.business_location': {
        $near: coords,
        $maxDistance: maxDistance
    },
    'deal_details.deals_expired_date': {
        $gte: new Date()
    }
}, {
    'deal_details': 1
}).limit(limit).exec(function(err, locations) {
    if (err) {
        return res.status(500).json(err)
    }
    console.log(locations)
});
The console.log(locations) result:
// give me the result below
[{
    _id: 55c0b8c62fd875a93c8ff7ea, // first document
    deal_details: [{
        deals_location: '101.6833,3.1333',
        deals_price: 12.12 // 1st deal
    }, {
        deals_location: '101.6833,3.1333',
        deals_price: 34.3 // 2nd deal
    }],
    business_details: {}
}, {
    _id: 55a79898e0268bc40e62cd3a, // second document
    deal_details: [{
        deals_location: '101.6833,3.1333',
        deals_price: 12.12 // 3rd deal
    }, {
        deals_location: '101.6833,3.1333',
        deals_price: 34.78 // 4th deal
    }, {
        deals_location: '101.6833,3.1333',
        deals_price: 34.32 // 5th deal
    }],
    business_details: {}
}]
What I want to do is combine both of these deal_details fields together and return them as one array of objects. It would contain 5 deals in a single array instead of two separate arrays.
I have tried to do it in my backend (Node.js) using concat or push; however, when there are more than 2 matched documents I have problems concatenating them together. Is there any way to combine all matched documents and return them as one, like I mentioned above?
What you are probably missing here is the $unwind pipeline stage, which is what you typically use to "de-normalize" array content, particularly when your grouping operation intends to work across documents in your query result:
User.aggregate(
    [
        // Your basic query conditions
        { "$match": {
            "business_details.business_location": {
                "$near": coords,
                "$maxDistance": maxDistance
            },
            "deal_details.deals_expired_date": {
                "$gte": new Date()
            }
        }},
        // Limit query results here
        { "$limit": limit },
        // Unwind the array
        { "$unwind": "$deal_details" },
        // Group on the common location
        { "$group": {
            "_id": "$deal_details.deals_location",
            "prices": {
                "$push": "$deal_details.deals_price"
            }
        }}
    ],
    function(err, results) {
        if (err) throw err;
        console.log(JSON.stringify(results, undefined, 2));
    }
);
Which gives output like:
{
    "_id": "101.6833,3.1333",
    "prices": [
        12.12,
        34.3,
        12.12,
        34.78,
        34.32
    ]
}
Depending on how many documents actually match the grouping.
Alternately, you might want to look at the $geoNear pipeline stage, which gives a bit more control, especially when dealing with content in arrays.
Also beware that with "location" data in an array, only the "nearest" result is being considered here and not "all" of the array content. So other items in the array may not be actually "near" the queried point. That is more of a design consideration though as any query operation you do will need to consider this.
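If you go that route, a rough sketch might look like the following (an assumption-laden sketch: it presumes a geospatial index on business_details.business_location and spherical coordinates; adjust to your index type):
// Sketch only: $geoNear must be the first stage and replaces the $near query
User.aggregate([
    { "$geoNear": {
        "near": coords,
        "distanceField": "distance",
        "maxDistance": maxDistance,
        "spherical": true,
        "query": {
            "deal_details.deals_expired_date": { "$gte": new Date() }
        }
    }},
    { "$limit": limit },
    { "$unwind": "$deal_details" },
    { "$group": {
        "_id": "$deal_details.deals_location",
        "prices": { "$push": "$deal_details.deals_price" }
    }}
], function(err, results) {
    if (err) throw err;
    console.log(results);
});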
You can merge them with reduce:
locations = locations.reduce(function(prev, location) {
    return prev.concat(location.deal_details);
}, []);

Insert or update object element in array [duplicate]

I'm new to MongoDB and Mongoose and I'm trying to use it to save stock ticks for daytrading analysis. So I imagined this Schema:
var symbolSchema = new Schema({
    name: String,
    code: String
});
var quoteSchema = new Schema({
    date: { type: Date, default: Date.now },
    open: Number,
    high: Number,
    low: Number,
    close: Number,
    volume: Number
});
var intradayQuotesSchema = new Schema({
    id_symbol: { type: Schema.Types.ObjectId, ref: "symbol" },
    day: Date,
    quotes: [quoteSchema]
});
From my link I receive information like this every minute:
date | symbol | open | high | low | close | volume
2015-03-09 13:23:00|AAPL|127,14|127,17|127,12|127,15|19734
I have to:
Find the ObjectId of the symbol (AAPL).
Discover if the intradayQuote document of this symbol already exists (symbol and date combination)
Discover if the minute OHLCV data of this symbol exists on the quotes array (because it could be repeated)
Update or create the document and update or create the quotes inside the array
I'm able to accomplish this task without verifying if the quote already exists, but this method can create repeated entries inside the quotes array:
symbol.find({"code":mySymbol}, function(err, stock) {
intradayQuote.findOneAndUpdate({
{ id_symbol:stock[0]._id, day: myDay },
{ $push: { quotes: myQuotes } },
{ upsert: true },
myCallback
});
});
I already tried:
$addToSet instead of $push, but unfortunately this doesn't seem to work with arrays of documents
{ id_symbol: stock[0]._id, day: myDay, 'quotes["date"]': myDate } in the conditions of findOneAndUpdate; but unfortunately if Mongo doesn't find it, it creates a new document for the minute instead of appending to the quotes array.
Is there a way to get this working without using one more query (I'm already using 2)? Should I rethink my Schema to facilitate this job? Any help will be appreciated. Thanks!
Basically put, the $addToSet operator cannot work for you because your data is not a true "set", which by definition is a collection of "completely distinct" objects.
The other piece of logical sense here is that you would be working on the data as it arrives, either as a single object or a feed. I'll presume it's a feed of many items in some form and that you can use some sort of stream processor to arrive at this structure per document received:
{
    "date": new Date("2015-03-09T13:23:00.000Z"),
    "symbol": "AAPL",
    "open": 127.14,
    "high": 127.17,
    "low": 127.12,
    "close": 127.15,
    "volume": 19734
}
Convert to a standard decimal format as well as a UTC date, since any locale settings really should be the domain of your application once data is retrieved from the datastore.
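As a hypothetical sketch of that conversion (assuming the pipe-separated, comma-decimal line format shown in the question; the parseTick name is made up):
// Hypothetical helper: one feed line -> the document shape above.
// Assumes "YYYY-MM-DD HH:mm:ss|SYMBOL|open|high|low|close|volume"
// with comma decimal separators, treated as UTC.
function parseTick(line) {
    var parts = line.split('|');
    var num = function(s) { return parseFloat(s.replace(',', '.')); };
    return {
        date: new Date(parts[0].replace(' ', 'T') + '.000Z'),
        symbol: parts[1],
        open: num(parts[2]),
        high: num(parts[3]),
        low: num(parts[4]),
        close: num(parts[5]),
        volume: parseInt(parts[6], 10)
    };
}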
I would also at least flatten out your "intraDayQuoteSchema" a little by removing the reference to the other collection and just putting the data in there. You would still need a lookup on insertion, but the overhead of the additional populate on read would seem to be more costly than the storage overhead:
var intradayQuotesSchema = new Schema({
    symbol: {
        name: String,
        code: String
    },
    day: Date,
    quotes: [quoteSchema]
});
It depends on your usage patterns, but it's likely to be more effective that way.
The rest really comes down to what is acceptable to you for processing each update as it arrives:
stream.on("data", function(data) {
    var symbolCode = data.symbol,
        myDay = new Date(
            data.date.valueOf() -
            ( data.date.valueOf() % ( 1000 * 60 * 60 * 24 ) ));
    delete data.symbol;
    symbol.findOne({ "code": symbolCode }, function(err, stock) {
        intraDayQuote.findOneAndUpdate(
            { "symbol.code": symbolCode, "day": myDay },
            { "$setOnInsert": {
                "symbol.name": stock.name,
                "quotes": [data]
            }},
            { "upsert": true },
            function(err, doc) {
                intraDayQuote.findOneAndUpdate(
                    {
                        "symbol.code": symbolCode,
                        "day": myDay,
                        "quotes.date": data.date
                    },
                    { "$set": { "quotes.$": data } },
                    function(err, doc) {
                        intraDayQuote.findOneAndUpdate(
                            {
                                "symbol.code": symbolCode,
                                "day": myDay,
                                "quotes.date": { "$ne": data.date }
                            },
                            { "$push": { "quotes": data } },
                            function(err, doc) {
                                // all three steps complete
                            }
                        );
                    }
                );
            }
        );
    });
});
If you don't actually need the modified document in the response, then you would get some benefit by implementing the Bulk Operations API here and sending all updates as a single package within one database request:
stream.on("data",function(data) {
var symbol = data.symbol,
myDay = new Date(
data.date.valueOf() -
( data.date.valueOf() % 1000 * 60 * 60 * 24 ));
delete data.symbol;
symbol.findOne({ "code": symbol },function(err,stock) {
var bulk = intraDayQuote.collection.initializeOrderedBulkOp();
bulk.find({ "symbol.code": symbol , "day": myDay })
.upsert().updateOne({
"$setOnInsert": {
"symbol.name": stock.name
"quotes": [data]
}
});
bulk.find({
"symbol.code": symbol,
"day": myDay,
"quotes.date": data.date
}).updateOne({
"$set": { "quotes.$": data }
});
bulk.find({
"symbol.code": symbol,
"day": myDay,
"quotes.date": { "$ne": data.date }
}).updateOne({
"$push": { "quotes": data }
});
bulk.execute(function(err,result) {
// maybe do something with the response
});
});
});
The point is that only one of the statements there will actually modify data, and since this is all sent in the same request there is less back and forth between the application and server.
The alternative is that it might just be simpler to have the actual quote data referenced in another collection. This then just becomes a simple matter of processing upserts:
var intradayQuotesSchema = new Schema({
    symbol: {
        name: String,
        code: String
    },
    day: Date,
    quotes: [{ type: Schema.Types.ObjectId, ref: "quote" }]
});
// and in the stream processor
stream.on("data", function(data) {
    var symbolCode = data.symbol,
        myDay = new Date(
            data.date.valueOf() -
            ( data.date.valueOf() % ( 1000 * 60 * 60 * 24 ) ));
    delete data.symbol;
    symbol.findOne({ "code": symbolCode }, function(err, stock) {
        quote.update(
            { "date": data.date },
            { "$setOnInsert": data },
            { "upsert": true },
            function(err, num, raw) {
                if (!raw.updatedExisting) {
                    intraDayQuote.update(
                        { "symbol.code": symbolCode, "day": myDay },
                        {
                            "$setOnInsert": {
                                "symbol.name": stock.name
                            },
                            "$addToSet": { "quotes": data }
                        },
                        { "upsert": true },
                        function(err, num, raw) {
                        }
                    );
                }
            }
        );
    });
});
It really comes down to how important it is to you to have the quote data nested within the "day" document. The main distinction is whether you want to query those documents based on some of those "quote" fields, or otherwise live with the overhead of using .populate() to pull in the "quotes" from the other collection.
Of course if referenced and the quote data is important to your query filtering, then you can always just query that collection for the _id values that match and use an $in query on the "day" documents to only match days that contain those matched "quote" documents.
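A hedged sketch of that two-step approach (the volume filter here is purely hypothetical):
// Hypothetical filter: find matching quote _ids first, then only the
// "day" documents that reference them.
quote.find({ "volume": { "$gte": 20000 } }, { "_id": 1 }, function(err, quotes) {
    var ids = quotes.map(function(q) { return q._id; });
    intraDayQuote.find({ "quotes": { "$in": ids } }, function(err, days) {
        // `days` now holds only documents referencing matching quotes
    });
});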
It's a big decision where it matters most which path you take based on how your application uses the data. Hopefully this should guide you on the general concepts behind doing what you want to achieve.
P.S. Unless you are "sure" that your source data is always a date rounded to an exact "minute", you probably want to employ the same kind of date rounding math as used to get the discrete "day" as well.
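A minimal sketch of that, mirroring the day-rounding math above:
// Round an incoming timestamp down to the exact minute
var minuteMs = 1000 * 60;
var myMinute = new Date(
    data.date.valueOf() - ( data.date.valueOf() % minuteMs ));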

How to make a query using Mongoose that gets N results, but combines any documents it finds that meet certain criteria?

I have a Comments collection in Mongoose, and a query that returns the most recent five (an arbitrary number) Comments.
Every Comment is associated with another document. What I would like to do is make a query that returns the most recent 5 comments, with comments associated with the same other document combined.
So instead of a list like this:
results = [
    { _id: 123, associated: 12 },
    { _id: 122, associated: 8 },
    { _id: 121, associated: 12 },
    { _id: 120, associated: 12 },
    { _id: 119, associated: 17 }
]
I'd like to return a list like this:
results = [
    { _id: 124, associated: 3 },
    { _id: 125, associated: 19 },
    [
        { _id: 123, associated: 12 },
        { _id: 121, associated: 12 },
        { _id: 120, associated: 12 },
    ],
    { _id: 122, associated: 8 },
    { _id: 119, associated: 17 }
]
Please don't worry too much about the data format: it's just a sketch to try to show the sort of thing I want. I want a result set of a specified size, but with some results grouped according to some criterion.
Obviously one way to do this would be to just make the query, crawl and modify the results, then recursively make the query again until the result set is as long as desired. That way seems awkward. Is there a better way to go about this? I'm having trouble phrasing it in a Google search in a way that gets me anywhere near anyone who might have insight.
Here's an aggregation pipeline query that will do what you are asking for:
db.comments.aggregate([
    { $group: { _id: "$associated", maxID: { $max: "$_id" }, cohorts: { $push: "$$ROOT" } } },
    { $sort: { "maxID": -1 } },
    { $limit: 5 }
])
Lacking any other fields from the sample data to sort by, I used $_id.
If you'd like results that are a little closer in structure to the sample result set you provided you could add a $project to the end:
db.comments.aggregate([
    { $group: { _id: "$associated", maxID: { $max: "$_id" }, cohorts: { $push: "$$ROOT" } } },
    { $sort: { "maxID": -1 } },
    { $limit: 5 },
    { $project: { _id: 0, cohorts: 1 } }
])
That will print only the result set. Note that even comments that do not share an association object will be in an array. It will be an array of length 1.
If you are concerned about limiting the results in the grouping as Neil Lunn is suggesting, perhaps a $match in the beginning is a smart idea.
db.comments.aggregate([
    { $match: { createDate: { $gte: new Date(new Date() - 5 * 60000) } } },
    { $group: { _id: "$associated", maxID: { $max: "$_id" }, cohorts: { $push: "$$ROOT" } } },
    { $sort: { "maxID": -1 } },
    { $limit: 5 },
    { $project: { _id: 0, cohorts: 1 } }
])
That will only include comments made in the last 5 minutes assuming you have a createDate type field. If you do, you might also consider using that as the field to sort by instead of "_id". If you do not have a createDate type field, I'm not sure how best to limit the comments that are grouped as I do not know of a "current _id" in the way that there is a "current time".
I honestly think you are asking a lot here and cannot really see the utility myself, but I'm always happy to have that explained to me if there is something useful I have missed.
Bottom line is you want comments from the last five distinct users by date, and then some sort of grouping of additional comments by those users. The last part is where I see difficulty in the rules no matter how you want to attack this, but I'll try to keep this to the most brief form.
No way this happens in a single query of any sort. But there are things that can be done to make it an efficient server response:
var async = require('async'), // required for the waterfall/each flow
    DataStore = require('nedb'),
    store = new DataStore();
async.waterfall(
    [
        function(callback) {
            Comment.aggregate(
                [
                    { "$match": { "postId": thisPostId } },
                    { "$sort": { "associated": 1, "createdDate": -1 } },
                    { "$group": {
                        "_id": "$associated",
                        "date": { "$first": "$createdDate" }
                    }},
                    { "$sort": { "date": -1 } },
                    { "$limit": 5 }
                ],
                callback);
        },
        function(docs, callback) {
            async.each(docs, function(doc, callback) {
                Comment.aggregate(
                    [
                        { "$match": { "postId": thisPostId, "associated": doc._id } },
                        { "$sort": { "createdDate": -1 } },
                        { "$limit": 5 },
                        { "$group": {
                            "_id": "$associated",
                            "docs": {
                                "$push": {
                                    "_id": "$_id", "createdDate": "$createdDate"
                                }
                            },
                            "firstDate": { "$first": "$createdDate" }
                        }}
                    ],
                    function(err, results) {
                        if (err) return callback(err);
                        async.each(results, function(result, callback) {
                            store.insert(result, function(err, result) {
                                callback(err);
                            });
                        }, function(err) {
                            callback(err);
                        });
                    }
                );
            }, callback);
        }
    ],
    function(err) {
        if (err) throw err;
        store.find({}).sort({ "firstDate": -1 }).exec(function(err, docs) {
            if (err) throw err;
            console.log(JSON.stringify(docs, undefined, 4));
        });
    }
);
Now I stuck more document properties in both the document and the array, but the simplified form based on your sample would then come out like this:
results = [
    { "_id": 3, "docs": [124] },
    { "_id": 19, "docs": [125] },
    { "_id": 12, "docs": [123, 121, 120] },
    { "_id": 8, "docs": [122] },
    { "_id": 17, "docs": [119] }
]
So the essential idea is to first find your distinct "users" who were the last to comment, by basically chopping off the last 5. Without filtering on some kind of range here, that would go over the entire collection to get those results, so it would be best to restrict this in some way, as in the last hour or last few hours or something sensible as required. Just add those conditions to the $match along with the current post that is associated with the comments.
Once you have those 5, then you want to get any possible "grouped" details for multiple comments by those users. Again, some sort of limit is generally advised for a timeframe, but as a general case this is just looking for the most recent comments by the user on the current post and restricting that to 5.
The execution here is done in parallel, which will use more resources but is fairly effective considering there are only 5 queries to run anyway. In contrast to your example output, the array here is inside the document result, and it contains the original document id values for each comment for reference. Any other content related to the document would be pushed into the array as well as required (ie The content of the comment).
The other little trick here is using nedb as a means for storing the output of each query in an "in memory" collection. This need only really be a standard hash data structure, but nedb gives you a way of doing that while maintaining the MongoDB statement form that you may be used to.
Once all results are obtained you just return them as your output, and sorted as shown to retain the order of who commented last. The actual comments are grouped in the array for each item and you can traverse this to output how you like.
Bottom line here is that you are asking for a compounded version of the "top N results problem", which is something often asked of MongoDB. I've written about ways to tackle this before to show how it's possible in a single aggregation pipeline stage, but it really is not practical for anything more than a relatively small result set.
If you really want to join in the insanity, then you can look at Mongodb aggregation $group, restrict length of array for one of the more detailed examples. But for my money, I would run on parallel queries any day. Node.js has the right sort of environment to support them, so you would be crazy to do it otherwise.
