MongoDB Aggregation append method for optional $match pipeline operator - node.js

I'm using Node.js + Mongoose with MongoDB 2.6. I have a static function on a model that sums the value of all items in the collection. Each item is assigned to a project via a projectNo property. I need the static function to give me the total for the whole collection, and, if a projectNo argument is passed, to add a $match stage to the aggregation pipeline. This will save me from having to write two static functions that essentially do the same thing.
To spice things up a bit, I use Bluebird's promisifyAll method to make the aggregation framework return a promise.
My static function that sums the entire collection:
db.collection.aggregateAsync([
  { $group: { _id: null, amount: { $sum: "$amount" } } }
])
My static function that sums only the records with a matching projectNo:
db.collection.aggregateAsync([
  { $match: { projectNo: projectNo } },
  { $group: { _id: null, amount: { $sum: "$amount" } } }
])
I really want to use the Aggregate.append method to append the $match stage only if req.params.projectNo is included.
When I try to add it to the async aggregation it throws an error, which makes sense, because it's just a promise. If I try this:
db.collection.aggregateAsync([
  { $group: { _id: null, amount: { $sum: "$amount" } } }
]).then(function (aggregate) {
  aggregate.append({ $match: { projectNo: projectNo } })
})
I get an error (append is undefined). How should I go about doing this? Or should I just live with the fact that I have two functions that do the same thing?

I read the Mongoose source code to see exactly how to use the aggregate.append method. If you're building the aggregation using the chained methods, you can use append to add any pipeline stages.
So what I did instead was put the aggregation pipeline stages into an array. If there is a projectNo, I add the $match stage to the array using unshift(). I used unshift because you usually want the $match stage first, to limit the number of documents before the rest of the pipeline stages run.
var pipeline = [{ $group: { _id: null, amount: { $sum: "$amount" } } }];
if (req.params.projectNo) {
  pipeline.unshift({ $match: { projectNo: req.params.projectNo } });
}
db.collection.aggregateAsync(pipeline);
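For completeness, the whole static might look roughly like this (a sketch: the itemSchema/Item names and the getTotalAmount static are made up, and it assumes promisifyAll has already been applied so that aggregateAsync exists, as in the question):
itemSchema.statics.getTotalAmount = function (projectNo) {
  var pipeline = [{ $group: { _id: null, amount: { $sum: "$amount" } } }];
  if (projectNo) {
    // $match first, so $group only sees the matching documents
    pipeline.unshift({ $match: { projectNo: projectNo } });
  }
  return this.aggregateAsync(pipeline);
};

// Usage, e.g. inside a route handler:
// Item.getTotalAmount(req.params.projectNo).then(function (result) { ... });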
I usually make things way more complicated than I need to...

Related

How can I optimize this query in MongoDB?

Here is the query:
const tags = await mongo
  .collection("positive")
  .aggregate<{ word: string; count: number }>([
    {
      $lookup: {
        from: "search_history",
        localField: "search_id",
        foreignField: "search_id",
        as: "history",
        pipeline: [
          {
            $match: {
              created_at: { $gt: prevSunday.toISOString() },
            },
          },
          {
            $group: {
              _id: "$url",
            },
          },
        ],
      },
    },
    {
      $match: {
        history: { $ne: [] },
      },
    },
    {
      $group: {
        _id: "$word",
        url: {
          $addToSet: "$history._id",
        },
      },
    },
    {
      $project: {
        _id: 0,
        word: "$_id",
        count: {
          $size: {
            $reduce: {
              input: "$url",
              initialValue: [],
              in: {
                $concatArrays: ["$$value", "$$this"],
              },
            },
          },
        },
      },
    },
    {
      $sort: {
        count: -1,
      },
    },
    {
      $limit: 50,
    },
  ])
  .toArray();
I think I need an index, but I'm not sure how or where to add one.
Perhaps performance of this operation should be revisited after we confirm that it satisfies the desired application logic and that the approach itself is reasonable.
When it comes to performance, there is nothing that can be done to improve efficiency on the positive collection if the intention is to process every document. By definition, processing all documents requires a full collection scan.
To efficiently support the $lookup on the search_history collection, you may wish to confirm that an index on { search_id: 1, created_at: 1, url: 1 } exists. Providing the .explain("allPlansExecution") output would allow us to better understand the current performance characteristics.
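If that index does not exist yet, it could be created like so (shell syntax; adjust to your driver of choice):
db.search_history.createIndex({ search_id: 1, created_at: 1, url: 1 })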
Desired Logic
Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful with respect to understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:
For every single document in the positive collection, add a new field called history.
This new field is a list of url values from the search_history collection where the corresponding document has a matching search_id value and was created_at after last Sunday.
The aggregation then filters to only keep documents where the new history field has at least one entry.
The next stage then groups the results together by word. The $addToSet operator is used here, but it may be generating an array of arrays rather than de-duplicated urls.
The final 3 stages of the aggregation seem to be focused on calculating the number of urls and returning the top 50 results by word sorted on that size in descending order.
Is this what you want? In particular the following aspects may be worth confirming:
Is it your intention to process every document in the positive collection? This may be the case, but it's impossible to tell without any schema/use-case context.
Is the size calculation of the urls correct? It seems like you may need to use a $map when doing the $addToSet in the $group instead of using $reduce in the subsequent $project; one possible adjustment is sketched below.
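If de-duplication is the intent, one possible adjustment (a sketch, untested against your data, and using $setUnion rather than the $map mentioned above, to the same de-duplicating effect) is to merge the arrays with $setUnion so repeated urls are only counted once:
{
  $project: {
    _id: 0,
    word: "$_id",
    count: {
      $size: {
        $reduce: {
          input: "$url",
          initialValue: [],
          // unlike $concatArrays, $setUnion de-duplicates while merging
          in: { $setUnion: ["$$value", "$$this"] },
        },
      },
    },
  },
}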
The best thing to do is to limit the number of documents passed to each stage.
MongoDB uses indexes in an aggregation only in the first stage, and only if that stage is a $match, using at most one index.
So the best thing to do is to have a $match on an indexed field that is very restrictive.
Moreover, please note that $limit, $skip and $sample are not panaceas because they still scan the entire collection.
A way to efficiently limit the number of documents selected in the first stage is to use "pagination". You can make it work like this:
Once every X requests
Count the number of docs in the collection
Divide this in chunks of Yk max
Find the _ids of the docs at positions Y, 2Y, 3Y, etc. with skip and limit
Cache the results in redis/memcache (or as global variable if you really cannot do otherwise)
Every request
Get the current chunk to scan by reading the redis keys used and nbChunks
Get the _ids cached in redis that delimit the next aggregation, id:${used%nbChunks} and id:${(used%nbChunks)+1} respectively
Aggregate using $match with _id: { $gte: ObjectId(id0), $lt: ObjectId(id1) }
Increment used; if used > X, then update the chunks
Further optimisation
If using redis, prefix every key with ${cluster.worker.id}: to avoid hot keys.
Notes
Step 3 of the chunk setup can be a really long and intensive process, so do it only when necessary, say every X~1k requests.
If you are scanning the last chunk, do not put the $lt.
Once this process is implemented, your job is to find the sweet spot of X and Y that suits your needs, constrained by Y being large enough to retrieve the maximum number of documents without taking too long, and X keeping the chunks roughly equal as the collection grows.
This process is a bit long to implement, but once it is in place, time complexity is ~O(Y) rather than ~O(N). Indeed, with the $match as the first stage and _id being an indexed field, this first stage is really fast and limits the scan to at most Y documents.
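A minimal sketch of the whole process (the redis client, the collection handle, and the id:<n>/nbChunks key scheme are assumptions for illustration, not a drop-in implementation):
const { ObjectId } = require("mongodb");

// Once every ~X requests: recompute the chunk boundary _ids.
async function rebuildChunks(collection, redis, Y) {
  const total = await collection.countDocuments();
  const nbChunks = Math.ceil(total / Y);
  for (let i = 0; i < nbChunks; i++) {
    // skip/limit is expensive, which is exactly why this runs rarely
    const [doc] = await collection.find({}, { projection: { _id: 1 } })
      .sort({ _id: 1 }).skip(i * Y).limit(1).toArray();
    await redis.set(`id:${i}`, String(doc._id));
  }
  await redis.set("nbChunks", String(nbChunks));
}

// Every request: aggregate over the current chunk only.
async function aggregateChunk(collection, redis, used, restOfPipeline) {
  const nbChunks = Number(await redis.get("nbChunks"));
  const id0 = await redis.get(`id:${used % nbChunks}`);
  const id1 = await redis.get(`id:${(used % nbChunks) + 1}`);
  const range = { $gte: new ObjectId(id0) };
  if (id1) range.$lt = new ObjectId(id1); // last chunk: no $lt
  return collection
    .aggregate([{ $match: { _id: range } }, ...restOfPipeline])
    .toArray();
}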
Hope it helps =) Make sure to ask more if needed =)

How do I sort the output of a bucket aggregation stage?

I have a MongoDB aggregation pipeline consisting of a match and a bucket stage. The match just specifies the type of document to be bucketed; then the bucket bins the documents based on their timestamp. The problem I am encountering is that the results are all out of (time) order. There is an ascending index on type and a descending index on data.tod.
I have tried adding a sort stage between the two stages, but it seems to be ignored: {$sort: {'data.tod': -1}}
I next tried a sort after the bucket, {$sort: {T: -1}}, which also had no effect on the output.
let cursor = self.collection.aggregate([
  {
    $match: {
      type: 'image',
    }
  }, {
    $bucket: {
      groupBy: '$data.tod',
      boundaries: boundsObj.array,
      default: 'ungrouped',
      output: {
        data: {
          $addToSet: {
            T: '$data.tod',
            SDN: '$data.shortDirName'
          }
        }
      }
    }
  }
], null);
Sorting before the group stage is actually the answer; your problem is $addToSet, which does not preserve document order.
From the official MongoDB docs:
Order of the elements in the output array is unspecified.
Assuming we want to keep the set-like data field produced by the bucket stage: its elements are inner values, and the $sort stage orders documents, not the arrays within them.
What you need to do is unwind that field, sort it, and then re-group using the $push operator instead, which does preserve order.
{$unwind: '$data'},
{$sort: {'data.T': -1}},
{$group: {_id: '$_id', 'data': {$push: '$data'}}}
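Putting it together with the original pipeline (note the sort is on the inner T field that the $bucket output defines, and on _id to keep each bucket's elements together):
let cursor = self.collection.aggregate([
  { $match: { type: 'image' } },
  {
    $bucket: {
      groupBy: '$data.tod',
      boundaries: boundsObj.array,
      default: 'ungrouped',
      output: {
        data: { $addToSet: { T: '$data.tod', SDN: '$data.shortDirName' } }
      }
    }
  },
  // Flatten each bucket's set, order it, then rebuild with $push,
  // which preserves the sorted order
  { $unwind: '$data' },
  { $sort: { _id: 1, 'data.T': -1 } },
  { $group: { _id: '$_id', data: { $push: '$data' } } }
]);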

Upsert and $inc Sub-document in Array

The following schema is intended to record total views and views for a very specific day only.
const usersSchema = new Schema({
  totalProductsViews: { type: Number, default: 0 },
  productsViewsStatistics: [{
    day: { type: String, default: new Date().toISOString().slice(0, 10), unique: true },
    count: { type: Number, default: 0 }
  }],
});
So today's views will be stored in a different subdocument from yesterday's. To implement this I tried to use upsert, so that a subdocument is created each day a product is viewed, with counts incremented and recorded per day. I tried the following function, but it doesn't seem to work the way I intended.
usersSchema.statics.increaseProductsViews = async function (id) {
  // Based on day only.
  const todayDate = new Date().toISOString().slice(0, 10);
  const result = await this.findByIdAndUpdate(id, {
    $inc: {
      totalProductsViews: 1,
      'productsViewsStatistics.$[sub].count': 1
    },
  },
  {
    upsert: true,
    arrayFilters: [{ 'sub.day': todayDate }],
    new: true
  });
  console.log(result);
  return result;
};
What do I miss to get the functionality I want? Any help will be appreciated.
What you are trying to do here actually requires you to understand some concepts you may not have grasped yet. The two primary ones being:
You cannot use any positional update as part of an upsert since it requires data to be present
Adding items into arrays mixed with "upsert" is generally a problem that you cannot do in a single statement.
It's a little unclear whether "upsert" is your actual intention anyway, or if you just presumed that was what you had to add in order to get your statement to work. It does complicate things if that is your intent, even if it's unlikely given the findByIdAndUpdate() usage, which would imply you were actually expecting the "document" to always be present.
At any rate, it's clear you actually expect to "update the array element when found, OR insert a new array element where not found". This is actually a two-write process, and three when you consider the "upsert" case as well.
For this, you actually need to invoke the statements via bulkWrite():
usersSchema.statics.increaseProductsViews = async function (_id) {
  // Based on day only.
  const todayDate = new Date().toISOString().slice(0, 10);
  await this.bulkWrite([
    // Try to match an existing element and update it ( do NOT upsert )
    {
      "updateOne": {
        "filter": { _id, "productsViewsStatistics.day": todayDate },
        "update": {
          "$inc": {
            "totalProductsViews": 1,
            "productsViewsStatistics.$.count": 1
          }
        }
      }
    },
    // Try to $push where the element is not there but the document is ( do NOT upsert )
    {
      "updateOne": {
        "filter": { _id, "productsViewsStatistics.day": { "$ne": todayDate } },
        "update": {
          "$inc": { "totalProductsViews": 1 },
          "$push": { "productsViewsStatistics": { "day": todayDate, "count": 1 } }
        }
      }
    },
    // Finally attempt the upsert where the "document" was not there at all,
    // only if you actually mean it - so optional
    {
      "updateOne": {
        "filter": { _id },
        "update": {
          "$setOnInsert": {
            "totalProductsViews": 1,
            "productsViewsStatistics": [{ "day": todayDate, "count": 1 }]
          }
        },
        "upsert": true
      }
    }
  ])

  // return the modified document if you really must
  return this.findById(_id); // Not atomic, but the lesser of all evils
}
So there's a real good reason why the positional filtered [<identifier>] operator does not apply here. The main one is that its intended purpose is to update multiple matching array elements, and you only ever want to update one. There is a specific operator for that in the positional $ operator, which does exactly that. Its condition, however, must be included within the query predicate ( the "filter" property in the updateOne statements ), just as demonstrated in the first two statements of the bulkWrite() above.
So the main problem with using the positional filtered [<identifier>] is that, just as the first two statements show, you cannot actually alternate between the $inc and the $push depending on whether the document actually contains an array entry for the day. At best, no update will be applied when the current day is not matched by the expression in arrayFilters.
At worst, an actual "upsert" will throw an error due to MongoDB not being able to decipher the "path name" from the statement, and of course you simply cannot $inc something that does not exist as a "new" array element. That needs a $push.
That leaves you with the mechanic that you also cannot do both the $inc and the $push within a single statement. MongoDB will error because you are attempting to "modify the same path", which is an illegal operation. Much the same applies to $setOnInsert, since whilst that operator only applies to "upsert" operations, it does not preclude the other operations from happening.
Thus the logical steps fall back to what the comments in the code also describe:
Attempt to match where the document contains an existing array element, then update that element. Using $inc in this case.
Attempt to match where the document exists but the array element is not present, and then $push a new element for the given day with the default count, updating other counters appropriately.
IF you actually did intend to upsert documents ( not array elements, because that's what the above steps are for ), then finally actually attempt an upsert creating new properties including a new array.
Finally there is the issue of the bulkWrite(). Whilst this is a single request to the server with a single response, it still is effectively three ( or two if that's all you need ) operations. There is no way around that and it is better than issuing chained separate requests using findByIdAndUpdate() or even updateOne().
Of course, the main operational difference from the perspective of the code you attempted to implement is that this method does not return the modified document. There is no way to get a "document response" from any "Bulk" operation at all.
As such, the actual "bulk" process will only ever modify a document with one of the three statements submitted, based on the presented logic and, most importantly, the order of those statements, which matters. But if you actually wanted to "return the document" after modification, then the only way to do that is with a separate request to fetch the document.
The only caveat here is that there is the small possibility that other modifications could have occurred to the document other than the "array upsert" since the read and update are separated. There really is no way around that, without possibly "chaining" three separate requests to the server and then deciding which "response document" actually applied the update you wanted to achieve.
So with that context it's generally considered the lesser of evils to do the read separately. It's not ideal, but it's the best option available from a bad bunch.
As a final note, I would strongly suggest actually storing the day property as a BSON Date instead of as a string. It actually takes fewer bytes to store and is far more useful in that form. As such, the following constructor is probably the clearest and least hacky:
const todayDate = new Date(new Date().setUTCHours(0,0,0,0))
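Inside the static shown earlier, the day filters then keep working as plain equality matches, because every view on the same UTC day normalizes to the same midnight timestamp (a sketch):
const todayDate = new Date(new Date().setUTCHours(0, 0, 0, 0));
// Same shape as before, but comparing Date values instead of strings
const filter = { _id, "productsViewsStatistics.day": todayDate };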

Mongoose/Mongodb previous and next in embedded document

I'm learning Mongodb/Mongoose/Express and have come across a fairly complex query (relative to my current level of understanding anyway) that I'm not sure how best to approach. I have a collection - to keep it simple let's call it entities - with an embedded actions array:
name: String,
actions: [{
  name: String,
  date: Date
}]
What I'd like to do is to return an array of documents with each containing the most recent action (or most recent to a specified date), and the next action (based on the same date).
Would this be possible with one find() query, or would I need to break this down into multiple queries and merge the results to generate one result array? I'm looking for the most efficient route possible.
Provided that your "actions" are inserted with the "most recent" being the last entry in the list, and usually this will be the case unless you are specifically updating items and changing dates, then all you really want to do is "project" the last item of the array. This is what the $slice projection operation is for:
Model.find({}, { "actions": { "$slice": -1 } }, function (err, docs) {
  // each doc's "actions" array holds only the last item
});
If indeed you are "updating" array items and changing dates, but you want to query for the most recent on a regular basis, then you are probably best off keeping the array ordered. You can do this with a few modifiers such as:
Model.update(
  {
    "_id": ObjectId("541f7bbb699e6dd5a7caf2d6"),
  },
  {
    "$push": { "actions": { "$each": [], "$sort": { "date": 1 } } }
  },
  function (err, numAffected) {
  }
);
Which is actually more of a trick that you can do with the $sort modifier to simply sort the existing array elements without adding or removing anything. In versions prior to 2.6 you also need the $slice "update" modifier in here, but it can be set to a value larger than the expected number of array elements if you do not actually want to restrict the possible size, though capping it is probably a good idea.
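In that pre-2.6 form the same update would look like this, where the -100 cap is an arbitrary value chosen purely for illustration:
Model.update(
  {
    "_id": ObjectId("541f7bbb699e6dd5a7caf2d6"),
  },
  {
    "$push": {
      "actions": {
        "$each": [],
        "$sort": { "date": 1 },
        "$slice": -100  // keep (at most) the last 100 elements
      }
    }
  },
  function (err, numAffected) {
  }
);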
Unfortunately, if you were "updating" via a $set statement, then you cannot do this "sorting" in a single update statement, as MongoDB will not allow both types of operations on the array at once. But if you can live with that, then this is a way to keep the array ordered so the first query form works.
If it just seems too hard to keep an array ordered by date, then you can in fact retrieve the largest value by means of the .aggregate() method. This allows greater manipulation of the documents than is available to basic queries, at a little more cost:
Model.aggregate([
  // Unwind the array to de-normalize as documents
  { "$unwind": "$actions" },
  // Sort the contents per document _id and inner date
  { "$sort": { "_id": 1, "actions.date": 1 } },
  // Group back with the "last" element only
  { "$group": {
    "_id": "$_id",
    "name": { "$last": "$name" },
    "actions": { "$last": "$actions" }
  }}
],
function (err, docs) {
})
And that will "pull apart" the array using the $unwind operator, then process the results with a $sort stage to order the contents by "date". In the $group pipeline stage the "_id" means to use the original document key to "collect" on, and the $last operator picks the field values from the "last" document ( de-normalized ) on that grouping boundary.
So there are various things that you can do, but of course the best way is to keep your array ordered and use the basic projection operators to simply get the last item in the list.

Mongoose Query: compare two values on same document

How can I query a Mongo collection using Mongoose to find all the documents that have a specific relation between two of their own properties?
For example, how can I query a characters collections to find all those characters that have their currentHitPoints value less than their maximumHitPoints value? Or all those projects that have their currentPledgedMoney less than their pledgeGoal?
I tried something like this:
mongoose.model('Character')
  .find({
    player: _currentPlayer
  })
  .where('status.currentHitpoints').lt('status.maximumHitpoints')
  .exec(callback)
but I am getting errors since the lt argument must be a Number. The same goes if I use $.status.maximumHitpoints (I was hoping Mongoose would be able to resolve it like it does when doing collection operations).
Is this something that can be done within a Query? I would expect so, but can't find out how. Otherwise I can filter the whole collection with underscore but I suspect that is going to have a negative impact on performance.
PS: I also tried using similar approaches with the find call, no dice.
MongoDB 3.6 and above supports aggregation expressions within the query language:
db.monthlyBudget.find( { $expr: { $gt: [ "$spent" , "$budget" ] } } )
https://docs.mongodb.com/manual/reference/operator/query/expr/
Thanks to Aniket's suggestion in the question's comments, I found that the same can be done with Mongoose using the following syntax:
mongoose.model('Character')
  .find({
    player: _currentPlayer
  })
  .$where('this.status.currentHitpoints < this.status.maximumHitpoints')
  .exec(callback)
Notice the $where method is used instead of the where method.
EDIT: To expand on Derick's comment below, a more performance-sensitive solution would be to have a boolean property inside your Mongoose schema containing the result of the comparison, and to update it every time the document is saved. This can be easily achieved through the use of a Mongoose Schema Plugin, so you would have something like:
var CharacterSchema = new mongoose.Schema({
  // ...
  status: {
    hitpoints: Number,
    maxHitpoints: Number,
    isInFullHealth: { type: Boolean, default: false }
  }
})
.plugin(function (schema, options) {
  schema.pre('save', function (next) {
    // Recompute the flag on every save (field names must match the schema)
    this.status.isInFullHealth = (this.status.hitpoints >= this.status.maxHitpoints);
    next();
  })
})
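With that plugin in place, the comparison becomes a plain (and indexable) equality match on the precomputed flag:
mongoose.model('Character')
  .find({
    player: _currentPlayer,
    'status.isInFullHealth': false
  })
  .exec(callback)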
mongoose.model('Character')
  .find({
    player: _currentPlayer,
    $expr: { $lt: ['$status.currentHitpoints', '$status.maximumHitpoints'] }
  })
The above query finds the records whose status.currentHitpoints value is less than their status.maximumHitpoints value.
Starting in MongoDB 5.0, the $eq, $lt, $lte, $gt, and $gte comparison operators placed in an $expr operator can use an index on the from collection referenced in a $lookup stage.
Example
The following operation uses $expr to find documents where the spent amount exceeds the budget:
db.monthlyBudget.find( { $expr: { $gt: [ "$spent" , "$budget" ] } } )