I have a couple hundred thousand documents in my collection, each with a timestamp field.
I want to count the number of records per day for the last month.
I'm running a Mongoose aggregate command to do that, which is unfortunately taking longer than I expected.
Following is the aggregate function:
function dailyStats()
{
    var lastMonth = new Date();
    lastMonth.setMonth(lastMonth.getMonth() - 1);
    MyModel.aggregate(
        {
            $match: {timestamp: {$gte: lastMonth}}
        },
        {
            $group: {
                // Data count grouped by date
                _id: {$dateToString: {format: "%Y-%m-%d", date: "$timestamp"}},
                count: {$sum: 1}
            }
        },
        {$sort: {"_id": 1}},
        function (err, docs)
        {
            console.log(docs);
        });
}
Now, the whole point of callbacks is non-blocking code. However, when this function is executed, it takes around 20-25 seconds, and for that whole time my Node application doesn't respond to other API requests!
At first I thought the CPU was so busy that nothing else could respond, so I ran a small Node app alongside it, and that one works fine!
So I don't understand why this application does not respond to other requests until the MongoDB driver returns with the result.
Here is the query:
const tags = await mongo
  .collection("positive")
  .aggregate<{ word: string; count: number }>([
    {
      $lookup: {
        from: "search_history",
        localField: "search_id",
        foreignField: "search_id",
        as: "history",
        pipeline: [
          {
            $match: {
              created_at: { $gt: prevSunday.toISOString() },
            },
          },
          {
            $group: {
              _id: "$url",
            },
          },
        ],
      },
    },
    {
      $match: {
        history: { $ne: [] },
      },
    },
    {
      $group: {
        _id: "$word",
        url: {
          $addToSet: "$history._id",
        },
      },
    },
    {
      $project: {
        _id: 0,
        word: "$_id",
        count: {
          $size: {
            $reduce: {
              input: "$url",
              initialValue: [],
              in: {
                $concatArrays: ["$$value", "$$this"],
              },
            },
          },
        },
      },
    },
    {
      $sort: {
        count: -1,
      },
    },
    {
      $limit: 50,
    },
  ])
  .toArray();
I think I need an index, but I'm not sure how or where to add one.
Performance of this operation should perhaps be revisited only after we confirm that it satisfies the desired application logic and that the approach itself is reasonable.
When it comes to performance, there is nothing that can be done to improve efficiency on the positive collection if the intention is to process every document. By definition, processing all documents requires a full collection scan.
To efficiently support the $lookup on the search_history collection, you may wish to confirm that an index on { search_id: 1, created_at: 1, url: 1 } exists. Providing the .explain("allPlansExecution") output would allow us to better understand the current performance characteristics.
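As a rough sketch of how such an index could be created from Node.js, assuming the official mongodb driver and an already connected Db handle (the helper name ensureSearchHistoryIndex is made up for illustration):

```javascript
// Compound index with the field order suggested above.
const SEARCH_HISTORY_INDEX = { search_id: 1, created_at: 1, url: 1 };

// Hypothetical helper: `db` is assumed to be a connected Db object from the
// official "mongodb" Node.js driver. createIndex() returns a Promise that
// resolves to the name of the index.
function ensureSearchHistoryIndex(db) {
  return db.collection("search_history").createIndex(SEARCH_HISTORY_INDEX);
}
```

Calling createIndex for an index that already exists with the same definition is a no-op, so a helper like this can run at application startup.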
Desired Logic
Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful with respect to understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:
For every single document in the positive collection, add a new field called history.
This new field is a list of url values from the search_history collection where the corresponding document has a matching search_id value and was created_at after last Sunday.
The aggregation then filters to only keep documents where the new history field has at least one entry.
The next stage then groups the results together by word. The $addToSet operator is used here, but it may be generating an array of arrays rather than de-duplicated urls.
The final 3 stages of the aggregation seem to be focused on calculating the number of urls and returning the top 50 results by word sorted on that size in descending order.
Is this what you want? In particular the following aspects may be worth confirming:
Is it your intention to process every document in the positive collection? This may be the case, but it's impossible to tell without any schema/use-case context.
Is the size calculation of the urls correct? It seems like you may need to use a $map when doing the $addToSet for the $group instead of using $reduce for the subsequent $project.
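If the intent is to count the distinct urls per word, one possible rewrite is sketched below. Note that it uses $unwind before the $group rather than the $map mentioned above, and that it assumes de-duplicated counts are what you want. The function name and the sinceIso parameter are illustrative; the collection and field names are taken from the query in the question.

```javascript
// Sketch only: builds the aggregation pipeline as a plain array.
// `sinceIso` is assumed to be an ISO date string such as prevSunday.toISOString().
function buildTopWordsPipeline(sinceIso) {
  return [
    {
      $lookup: {
        from: "search_history",
        localField: "search_id",
        foreignField: "search_id",
        as: "history",
        pipeline: [
          { $match: { created_at: { $gt: sinceIso } } },
          { $group: { _id: "$url" } },
        ],
      },
    },
    // $unwind emits one document per history entry and, by default, drops
    // documents whose history array is empty, so the separate
    // { history: { $ne: [] } } match is no longer needed.
    { $unwind: "$history" },
    // history._id is now a single url, so $addToSet collects distinct urls
    // instead of arrays of urls.
    { $group: { _id: "$word", urls: { $addToSet: "$history._id" } } },
    { $project: { _id: 0, word: "$_id", count: { $size: "$urls" } } },
    { $sort: { count: -1 } },
    { $limit: 50 },
  ];
}
```

The resulting array could be passed to mongo.collection("positive").aggregate(...) in place of the original stages.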
The best thing to do is to limit the number of documents passed to each stage.
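MongoDB can generally use an index for an aggregation only at the start of the pipeline, typically when the first stage is a $match (a $lookup stage uses indexes on the joined collection separately).
So the best thing to do is to have a very restrictive $match on an indexed field as the first stage.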
Moreover, please note that $limit, $skip and $sample are not panaceas, because they can still end up scanning a large part of the collection.
A way to efficiently limit the number of documents selected in the first stage is to use a form of "pagination". You can make it work like this:
Once every X requests:
1) Count the number of docs in the collection
2) Divide this into chunks of Yk max
3) Find the _ids of the docs at positions Y, 2Y, 3Y etc. with skip and limit
4) Cache the results in redis/memcache (or in a global variable if you really cannot do otherwise)
Every request:
1) Get the current chunk to scan by reading the redis keys used and nbChunks
2) Get the _ids cached in redis that delimit the next aggregation, id:${used%nbChunks} and id:${(used%nbChunks)+1} respectively
3) Aggregate using $match with {_id: {$gte: ObjectId(id0), $lt: ObjectId(id1)}}
4) Increment used; if used > X, update the chunks
Further optimisation
If using redis, prefix every key with ${cluster.worker.id}: to avoid hot keys.
Notes
Step 3) of the chunk setup can be a really long and intensive process, so do it only when necessary, let's say every X~1k requests.
If you are scanning the last chunk, do not include the $lt bound.
Once this process is implemented, your job is to find the sweet spot for X and Y that suits your needs: Y must be large enough to retrieve as many documents as possible without taking too long, and X must keep the chunks roughly equal as the collection grows.
This process takes a while to implement, but once it is in place, time complexity is ~O(Y) and not ~O(N). Indeed, since the $match is the first stage and _id is an indexed field, this first stage is really fast and limits the scan to at most Y documents.
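The per-request step can be sketched roughly as follows. This is an illustration under assumptions, not a complete implementation: boundaries stands in for the cached values and is assumed to hold the first _id of each chunk (positions 0, Y, 2Y, ...), plain strings are used where real code would wrap the values in ObjectId, and the function name is hypothetical.

```javascript
// Builds the leading $match stage for the chunk selected by the request counter.
// `boundaries`: cached first _id of every chunk, in ascending order (assumption).
// `used`: number of requests served since the chunks were last computed.
function buildChunkMatch(boundaries, used) {
  const nbChunks = boundaries.length;
  const i = used % nbChunks;
  const range = { $gte: boundaries[i] };
  // The last chunk gets no $lt bound, so documents inserted after the
  // boundaries were cached are still scanned.
  if (i + 1 < nbChunks) {
    range.$lt = boundaries[i + 1];
  }
  return { $match: { _id: range } };
}
```

The returned stage would go first in the pipeline, followed by the stages of the original aggregation.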
Hope it helps =) Feel free to ask more if needed =)
I am currently trying to query a MongoDB collection using Mongoose, and I am having trouble trying to convert this query into a useable Mongoose query.
The MongoDB CLI query is db.events.find({}, {'first.points': 1, '_id': 0}).
This works fine and returns what I would expect when I run it on the command line. I have tried several ways of converting it to a Mongoose query; my attempts so far are:
Attempt #1
Events.find({}).populate('first').exec(function(err, events){
    if(err) { console.log(err) }
    console.log(events);
});
This does not work and throws the error Cast to ObjectId failed for value "10" at path "_id" for model "Event" when the node server is started.
Attempt #2
Event.find({'first.points': "10"}).populate('first').exec(function(err, events)
This does not throw any errors, and it does return the values I would expect, however I am trying to return all the first.points values for all events, and I cannot seem to do this.
Attempt #3
Event.find({'first.points': "$all"}).populate('first').exec(function(err, events)
This also does not work and was my most recent attempt. It again throws an error, this time saying Cast to number failed for value "$all" at path "first.points" for model "Event".
I am not sure what else to try; I don't know how to return all of the values without specifying which to look for.
EDIT
The model for Events is included below
var eventsSchema = new mongoose.Schema({
    name: String, // The event name, with data type String
    date: Date, // Date with data type Date
    first: {
        points: Number, // The points awarded to the first place with data type Number
        house: String
    },
    second: {
        points: Number,
        house: String
    },
    third: {
        points: Number,
        house: String
    },
    fourth: {
        points: Number,
        house: String
    }
});
Any help is appreciated.
Credit to naga - elixir - jar for this answer.
Event.find({}, {'first.points': 1, '_id': 0}, function(err, events) {...})
This code returns the values that I needed, without errors and in the correct format.
Note: I have converted this from an arrow function for clarity. The arrow-function version is Event.find({}, {'first.points': 1, '_id': 0}, (err, events) => {...})
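Mongoose 7 and later no longer accept a callback in find(), so on newer versions the same projection is written with promises. A minimal sketch, assuming Event is the model compiled from the schema above; the helper name findFirstPoints is hypothetical:

```javascript
// Projection copied from the answer above: keep only first.points, drop _id.
const FIRST_POINTS_PROJECTION = { "first.points": 1, _id: 0 };

// `EventModel` is assumed to be a Mongoose model, e.g.
// mongoose.model("Event", eventsSchema). find() returns a Query, lean() makes
// it yield plain objects, and exec() returns a Promise of the results.
function findFirstPoints(EventModel) {
  return EventModel.find({}, FIRST_POINTS_PROJECTION).lean().exec();
}

// Usage (inside an async function):
//   const events = await findFirstPoints(Event);
//   // each element has the shape { first: { points: <Number> } }
```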
Is there a recommended way to update multiple items in MongoDB with one query? I know that this is possible:
db.collection('mycollection').update({active: 1}, {$set: {active:0}}, {multi: true});
But in my case I want to update several documents with "unique" changes.
e.g. I want to combine these two queries into one:
db.collection('mycollection').update({
    id: 'my id'
}, {
    $set: {
        name: "new name"
    }
});
db.collection('mycollection').update({
    id: 'my second id'
}, {
    $set: {
        name: "new name two"
    }
});
Why? I have a system that imports daily updates. The imports are usually large, around 200,000 updates a day, so I am currently executing the update query 200,000 times, which takes a long time.
In case it matters: I am using Mongo 3 and Node.js.
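One option worth checking, assuming the server is MongoDB 3.2 or later and the Node.js driver in use exposes collection.bulkWrite, is to send the individual updates in batches instead of one call each. A sketch, where buildRenameOps and the shape of the changes array are made up for illustration:

```javascript
// Hypothetical helper: turns [{ id, name }, ...] into bulkWrite operations,
// one updateOne per document with its own "unique" change.
function buildRenameOps(changes) {
  return changes.map(({ id, name }) => ({
    updateOne: {
      filter: { id: id },
      update: { $set: { name: name } },
    },
  }));
}

// Usage sketch, assuming `db` is a connected Db handle:
//   const ops = buildRenameOps([
//     { id: "my id", name: "new name" },
//     { id: "my second id", name: "new name two" },
//   ]);
//   await db.collection("mycollection").bulkWrite(ops, { ordered: false });
```

Each update is still applied separately on the server, but the number of round trips drops from one per update to one per batch. An index on id keeps every individual filter cheap.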
I have an application that's saving data every second to MongoDB. This is important data, but holding data every second forever isn't necessary. After some time, I'd like to run a process (background worker) to clean up this data into hourly chunks, which includes every piece of data (1 per second) for each hour of that day. Kinda like Time Machine does on Mac.
From researching and thinking about it, I can think of a couple of ways to make this happen:
Mongo aggregators (not sure exactly how this would work)
Node background process with momentjs and sort by date, hour, etc. (really long time)
What's the best way to do this with MongoDB?
I think the Date Aggregation Operators could be a better option for your case. Given a schema like the one below
var dataSchema = new Schema({
    // other fields are here...
    updated: Date,
});
var Data = mongoose.model('Data', dataSchema);
just save the data with a normal Date.
Then you can retrieve the hourly chunks through an aggregate operation in Mongoose. Sample code:
Data.aggregate([
    {$match: {updated: {$gte: start_date_hour, $lte: end_date_hour}}},
    {$group: {
        _id: {
            year: {$year: "$updated"},
            month: {$month: "$updated"},
            day: {$dayOfMonth: "$updated"},
            hour: {$hour: "$updated"}
            // other fields should be here to meet your requirement
        }
    }},
    // the grouping keys live under _id, so sort on those
    {$sort: {"_id.year": 1, "_id.month": 1, "_id.day": 1, "_id.hour": 1}}
], callback);
For more arguments of aggregate, please refer to this doc.
I am working through a MEAN stack tutorial. It contains the following code as a route in index.js. The name of my Mongo collection is brandcollection.
/* GET Brand Complaints page. */
router.get('/brands', function(req, res) {
    var db = req.db;
    var collection = db.get('brandcollection');
    collection.find({},{},function(e,docs){
        res.render('brands', {
            "brands" : docs
        });
    });
});
I would like to modify this code but I don't fully understand how the .find method is being invoked. Specifically, I have the following questions:
What objects are being passed to function(e, docs) as its arguments?
Is function(e, docs) part of the MongoDB syntax? I have looked at the docs on Mongo CRUD operations and couldn't find a reference to it. And it seems like the standard syntax for a Mongo .find operation is collection.find({},{}).someCursorLimit(). I have not seen a reference to a third parameter in the .find operation, so why is one allowed here?
If function(e, docs) is not a MongoDB operation, is it part of the Monk API?
It is clear from the tutorial that this block of code returns all of the documents in the collection and places them in an object as an attribute called "brands." However, what role specifically does function(e, docs) play in that process?
Any clarification would be much appreciated!
The first parameter is the query.
The second parameter (which is optional) is the projection, i.e. which fields of the matched documents you want returned.
collection.find( { qty: { $gt: 25 } }, { item: 1, qty: 1 },function(e,docs){})
would return only the item and qty fields of the matched documents.
The third parameter is the callback function, which is called after the query is complete. function(e, docs) is the Node.js error-first callback convention used by the MongoDB driver for Node.js. The first parameter e is the error: if an error occurs, it is given in e. If the query is successful, the matched documents are given as an array in the second parameter docs (the names can be anything you want).
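To make the callback's role concrete, here is a small self-contained illustration of the error-first convention. fakeFind and its in-memory rows are made up for the example and are not the driver's implementation; a real driver also invokes the callback asynchronously, once the database has replied.

```javascript
// Made-up in-memory data standing in for a collection.
const rows = [
  { item: "pen", qty: 30 },
  { item: "ink", qty: 10 },
];

// Stand-in for collection.find(query, projection, callback).
function fakeFind(minQty, callback) {
  if (typeof minQty !== "number") {
    // Failure: the error goes in the first argument and no docs are passed.
    callback(new Error("minQty must be a number"));
    return;
  }
  // Success: the error slot is null and the results are the second argument.
  callback(null, rows.filter((row) => row.qty > minQty));
}

fakeFind(25, function (e, docs) {
  if (e) {
    console.error(e.message);
    return;
  }
  console.log(docs); // the rows with qty > 25
});
```

In the tutorial's route, the callback is the place where the result becomes available, which is why res.render is called inside it.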
The cursor has various methods which can be used to manipulate the matched documents before mongoDB returns them.
collection.find( { qty: { $gt: 25 } }, { item: 1, qty: 1 })
returns a cursor, on which you can perform various operations:
collection.find( { qty: { $gt: 25 } }, { item: 1, qty: 1 }).skip(10).limit(5).toArray(function(e,docs){
...
})
meaning you will skip the first 10 matched documents and then return a maximum of 5 documents.
All this stuff is given in the docs. I think it's better to use mongoose instead of the native driver because of the features and the popularity.