I am trying to aggregate some records in a Mongo database using the Node driver. I first match on the org, frd, and sl fields (all indexed). If I only include a few companies in the array that I match the org field against, the query runs fine and works as expected. However, when I include all of the clients in the array, I always get:
MongoError: getMore: cursor didn't exist on server, possible restart or timeout?
I have tried playing with the allowDiskUse and batchSize settings, but nothing seems to work. With all the client strings in the array, the aggregation runs for ~5 hours before throwing the cursor error. Any ideas? Below is the pipeline along with the actual aggregate command.
Setting up the aggregation pipeline:
var aggQuery = [
    {
        // all clients, from the last three days, and scored
        $match: {
            org: { $in: array }, // this is the array I am talking about
            frd: { $gte: _.last(util.lastXDates(3)) },
            sl: true
        }
    },
    {
        // group by ISP and make fields for calculation
        $group: {
            _id: "$gog",
            count: { $sum: 1 },
            countRisky: {
                $sum: { $cond: { if: { $gte: ["$scr", 65] }, then: 1, else: 0 } }
            },
            countTimeZoneRisky: {
                $sum: { $cond: { if: { $eq: ["$gmt", "$gtz"] }, then: 0, else: 1 } }
            }
        }
    },
    {
        // show records with count >= 500
        $match: { count: { $gte: 500 } }
    },
    {
        // rename _id to ISP, only show relevant fields
        $project: {
            _id: 0,
            ISP: "$_id",
            percentRisky: {
                $multiply: [{ $divide: ["$countRisky", "$count"] }, 100]
            },
            percentTimeZoneDiscrepancy: {
                $multiply: [{ $divide: ["$countTimeZoneRisky", "$count"] }, 100]
            },
            count: 1
        }
    },
    {
        // sort by percent risky and then by count
        $sort: { percentRisky: 1, count: 1 }
    }
];
Running the aggregation:
var cursor = reportingCollections.hitColl.aggregate(aggQuery, {
allowDiskUse: true,
cursor: {
batchSize: 40000
}
});
console.log('Writing data to csv ' + currentFileNamePrefix + '!');
//iterate through cursor and write documents to CSV
cursor.each(function (err, document) {
//write each document to csv file
//maybe start a nuclear war
});
You're calling the aggregate method, which doesn't return a cursor by default (unlike, e.g., find()). To get a cursor back, you must add the cursor option to the options object. However, a timeout setting for the aggregation cursor is (currently) not supported; the native Node.js driver only supports the batchSize setting.
You would set the batch option like this:
var cursor = coll.aggregate(query, { cursor: { batchSize: 100 } }, writeResultsToCsv);
To circumvent such problems, I'd recommend running the aggregation or map-reduce directly through the mongo client, where you can add the notimeout option.
The default cursor timeout is 10 minutes (obviously useless for long, time-consuming queries), and as far as I know there is currently no way to set a different one, only to disable it entirely via the aforementioned option. The timeout hits you especially at high batch sizes, because processing the incoming docs takes more than 10 minutes, and by the time you ask the mongo server for more, the cursor has already been deleted.
I don't know your use case, but if it's a web view, it should only run fast queries/aggregations.
BTW, I don't think this changed with 3.0.*.
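To illustrate the batch-size angle with the code from the question (a sketch, reusing aggQuery, reportingCollections, and currentFileNamePrefix from above; the aggregation cursor in the 2.x driver is a readable stream, so it emits data/end events):
var cursor = reportingCollections.hitColl.aggregate(aggQuery, {
    allowDiskUse: true,
    cursor: { batchSize: 1000 } // far smaller than 40000: each getMore round-trip stays well inside the 10-minute window
});
cursor.on('data', function (doc) {
    // write doc to the CSV stream here
});
cursor.on('end', function () {
    console.log('Finished writing ' + currentFileNamePrefix + '!');
});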
I need to update/save documents with sizes between 100KB and 800KB. Update operations like console.time('save'); await doc.findByIdAndUpdate(...).lean(); console.timeEnd('save'); are taking 5s-10s to finish. The updates contain ~50KB at most.
The large document property which is being updated has a structure like so:
{
largeProp: [{
key1: { key1A:val, key1B:val, ... 10 more ... },
key2: { key1A:val, key1B:val, ... 10 more ... },
key3: { key1A:val, key1B:val, ... 10 more ... },
...300 more...
}, ...100 more... ]
}
I'm using a Node.js server on an Ubuntu VM with mongoose.js, with MongoDB hosted on a separate server. The MongoDB server does not show any unusual load (it usually stays under 7% CPU), but my Node.js server hits 100% CPU usage with just this update operation (after a .findById() and some quick logic, 8ms-52ms). The .findById() takes about 500ms-1s for this same object.
I need these saves to be much faster, and I don't understand why this is so slow.
I did not do much more profiling on the Mongoose query. Instead I tested out a native MongoDB query and it significantly improved the speed, so I will be using native MongoDB going forward.
const {ObjectId} = mongoose.Types;
let result = await mongoose.connection.collection('collection1')
.aggregate([
{ $match: { _id: ObjectId(gameId) } },
{ $lookup: {
localField:'field1',
from:'collection2',
foreignField:'_id',
as:'field1'
}
},
{ $unwind: '$field1' },
{ $project: {
        _id: 1,
        status: 1,
        createdAt: 1,
        slowArrProperty: { $slice: ["$positions", -1] },
        updatedAt: 1
    }
},
{ $unwind: "$slowArrProperty" }
]).toArray();
if (result.length < 1) return {};
return result[0];
This query, along with some restructuring of my data model, solved my issue. Specifically, for the document property that was very large and causing issues, I used { $slice: ["$positions", -1] } above to return only one of the objects in the array at a time.
Just from switching to native MongoDB queries (within the mongoose wrapper), I saw between 60x and 3000x improvements on query speeds.
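On the write side, the same idea may help: a hypothetical sketch (gameId and the positions field come from the query above; newPositionEntry is an invented name for the ~50KB delta) that appends server-side with $push instead of loading, mutating, and resaving the whole 100KB-800KB document through Mongoose:
const { ObjectId } = mongoose.Types;

// Hypothetical: send only the delta; the server mutates the array in place,
// so Node never hydrates or serializes the full document.
await mongoose.connection.collection('collection1').updateOne(
    { _id: ObjectId(gameId) },
    { $push: { positions: newPositionEntry } }
);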
I have a pretty simple $lookup aggregation query like the following:
{'$lookup':
{'from': 'edge',
'localField': 'gid',
'foreignField': 'to',
'as': 'from'}}
When I run this on a match with enough documents I get the following error:
Command failed with error 4568: 'Total size of documents in edge
matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" }
}, {} ] } } exceeds maximum document size' on server
All attempts to limit the number of documents fail. allowDiskUse: true does nothing. Sending a cursor in does nothing. Adding in a $limit into the aggregation also fails.
How could this be?
Then I see the error again. Where did that $match, $and, and $eq come from? Is the aggregation pipeline behind the scenes farming out the $lookup call to another aggregation, one it runs on its own that I have no ability to provide limits for or use cursors with?
What is going on here?
As stated earlier in the comments, the error occurs because $lookup, which by default produces a target "array" within the parent document from the results of the foreign collection, selects documents whose total size causes the parent to exceed the 16MB BSON limit.
The counter for this is to process with an $unwind which immediately follows the $lookup pipeline stage. This actually alters the behavior of $lookup such that instead of producing an array in the parent, the results become a "copy" of each parent for every document matched.
Pretty much just like regular usage of $unwind, except that instead of processing as a "separate" pipeline stage, the unwinding action is actually added to the $lookup pipeline operation itself. Ideally you also follow the $unwind with a $match condition, which creates a matching argument that is also added to the $lookup. You can actually see this in the explain output for the pipeline.
The topic is actually covered (briefly) in a section of Aggregation Pipeline Optimization in the core documentation:
$lookup + $unwind Coalescence
New in version 3.2.
When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.
Best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit. Done as briefly as possible to both break and work around the BSON Limit:
const MongoClient = require('mongodb').MongoClient;
const uri = 'mongodb://localhost/test';
function data(data) {
console.log(JSON.stringify(data, undefined, 2))
}
(async function() {
let db;
try {
db = await MongoClient.connect(uri);
console.log('Cleaning....');
// Clean data
await Promise.all(
["source","edge"].map(c => db.collection(c).remove() )
);
console.log('Inserting...')
await db.collection('edge').insertMany(
Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 }))
);
await db.collection('source').insert({ _id: 1 })
console.log('Fattening up....');
await db.collection('edge').updateMany(
{},
{ $set: { data: "x".repeat(100000) } }
);
// The full pipeline. Failing test uses only the $lookup stage
let pipeline = [
{ $lookup: {
from: 'edge',
localField: '_id',
foreignField: 'gid',
as: 'results'
}},
{ $unwind: '$results' },
{ $match: { 'results._id': { $gte: 1, $lte: 5 } } },
{ $project: { 'results.data': 0 } },
{ $group: { _id: '$_id', results: { $push: '$results' } } }
];
// List and iterate each test case
let tests = [
'Failing.. Size exceeded...',
'Working.. Applied $unwind...',
'Explain output...'
];
for (let [idx, test] of Object.entries(tests)) {
console.log(test);
try {
let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline),
options = (( +idx === tests.length-1 ) ? { explain: true } : {});
await new Promise((end,error) => {
let cursor = db.collection('source').aggregate(currpipe,options);
for ( let [key, value] of Object.entries({ error, end, data }) )
cursor.on(key,value);
});
} catch(e) {
console.error(e);
}
}
} catch(e) {
console.error(e);
} finally {
db.close();
}
})();
After inserting some initial data, the listing will attempt to run an aggregate merely consisting of $lookup which will fail with the following error:
{ MongoError: Total size of documents in edge matching pipeline { $match: { $and : [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size
This is basically telling you that the BSON limit was exceeded on retrieval.
By contrast, the next attempt adds the $unwind and $match pipeline stages.
The explain output:
{
"$lookup": {
"from": "edge",
"as": "results",
"localField": "_id",
"foreignField": "gid",
"unwinding": { // $unwind now is unwinding
"preserveNullAndEmptyArrays": false
},
"matching": { // $match now is matching
"$and": [ // and actually executed against
{ // the foreign collection
"_id": {
"$gte": 1
}
},
{
"_id": {
"$lte": 5
}
}
]
}
}
},
// $unwind and $match stages removed
{
"$project": {
"results": {
"data": false
}
}
},
{
"$group": {
"_id": "$_id",
"results": {
"$push": "$results"
}
}
}
And that attempt of course succeeds, because since the results are no longer being placed into the parent document, the BSON limit cannot be exceeded.
This really happens as a result of adding $unwind alone, but the $match is included to show that it, too, gets added into the $lookup stage, and that the overall effect is to "limit" the results returned in an effective way, since it is all done inside that $lookup operation and no results other than those matching are actually returned.
By constructing the pipeline this way, you can query for "referenced data" that would otherwise exceed the BSON limit, and then, if you want, $group the results back into an array format once they have been effectively filtered by the "hidden query" actually performed by $lookup.
MongoDB 3.6 and Above - Additional for "LEFT JOIN"
As all the content above notes, the BSON limit is a "hard" limit that you cannot breach, and this is generally why the $unwind is necessary as an interim step. There is, however, the limitation that the "LEFT JOIN" becomes an "INNER JOIN" by virtue of the $unwind where it cannot preserve the content. Also, even preserveNullAndEmptyArrays would negate the "coalescence" and still leave the array intact, causing the same BSON limit problem.
MongoDB 3.6 adds new syntax to $lookup that allows a "sub-pipeline" expression to be used in place of the "local" and "foreign" keys. So instead of using the "coalescence" option as demonstrated, as long as the produced array does not also breach the limit, it is possible to put conditions in that pipeline which return the array "intact", possibly with no matches, as would be indicative of a "LEFT JOIN".
The new expression would then be:
{ "$lookup": {
"from": "edge",
"let": { "gid": "$gid" },
"pipeline": [
{ "$match": {
"_id": { "$gte": 1, "$lte": 5 },
"$expr": { "$eq": [ "$$gid", "$to" ] }
}}
],
"as": "from"
}}
In fact, this is basically what MongoDB does "under the covers" with the previous syntax, since 3.6 uses $expr "internally" to construct the statement. The difference, of course, is that there is no "unwinding" option present in how the $lookup actually gets executed.
If no documents are actually produced as a result of the "pipeline" expression, then the target array within the master document will in fact be empty, just as a "LEFT JOIN" actually does and would be the normal behavior of $lookup without any other options.
However, the output array MUST NOT cause the document where it is being created to exceed the BSON limit. So it really is up to you to ensure that any "matching" content from the conditions stays under this limit, or the same error will persist, unless of course you actually use $unwind to effect the "INNER JOIN".
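For example, a sketch of one way to bound the produced array inside the sub-pipeline itself (the $limit value of 50 is arbitrary; everything else is from the expression above):
{ "$lookup": {
    "from": "edge",
    "let": { "gid": "$gid" },
    "pipeline": [
        { "$match": { "$expr": { "$eq": [ "$$gid", "$to" ] } } },
        { "$limit": 50 } // arbitrary cap so the array written into the parent stays under 16MB
    ],
    "as": "from"
}}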
I had the same issue with the following Node.js query, because the 'redemptions' collection has more than 400,000 documents. I am using MongoDB server 4.2 and Node.js driver 3.5.3.
db.collection('businesses').aggregate([
    {
        $lookup: { from: 'redemptions', localField: "_id", foreignField: "business._id", as: "redemptions" }
    },
    {
        $project: {
            _id: 1,
            name: 1,
            email: 1,
            totalredemptions: { $size: "$redemptions" }
        }
    }
])
I modified the query as below, and now it runs very fast.
db.collection('businesses').aggregate([
    {
        $lookup: {
            from: 'redemptions',
            let: { businessId: "$_id" },
            pipeline: [
                { $match: { $expr: { $eq: ["$business._id", "$$businessId"] } } },
                { $group: { _id: "$_id", totalCount: { $sum: 1 } } },
                { $project: { _id: 0, totalCount: 1 } }
            ],
            as: "redemptions"
        }
    },
    {
        $project: {
            _id: 1,
            name: 1,
            email: 1,
            totalredemptions: { $size: "$redemptions" }
        }
    }
])
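As a side note, since MongoDB 3.4 a $count stage can replace the $group/$project pair inside the sub-pipeline. A sketch of that variant ($ifNull and $arrayElemAt unpack the single counter document, covering businesses with zero redemptions):
db.collection('businesses').aggregate([
    {
        $lookup: {
            from: 'redemptions',
            let: { businessId: "$_id" },
            pipeline: [
                { $match: { $expr: { $eq: ["$business._id", "$$businessId"] } } },
                { $count: "totalCount" } // the array holds at most one { totalCount: n } document
            ],
            as: "redemptions"
        }
    },
    {
        $project: {
            _id: 1,
            name: 1,
            email: 1,
            totalredemptions: { $ifNull: [{ $arrayElemAt: ["$redemptions.totalCount", 0] }, 0] }
        }
    }
])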
I have some documents with below format
{
"count_used" : Int64,
"last_used" : Date,
"enabled" : Boolean,
"data" : String
}
I use the last_used field to sort on within an aggregate query, so that the least recently used document is served first. After the aggregate, I use updateOne with the query below to bump the last_used field, so every read request changes the last_used date stamp.
updateOne(
{ _id: result['_id']},
{
$currentDate: { 'last_used': true },
$inc: { 'count_used': 1 }
}
)
The problem I have is that clients will make 5 concurrent requests, and instead of receiving documents 1, 2, 3, 4, 5 they will get 1, 1, 1, 2, 2, for example.
I'm guessing this is because there is no lock between the read and the update, so several reads get same results before the update is performed.
Is there any way around this?
Update: below is the aggregate Node.js code.
db.collection('testing').aggregate([
    {
        $project: {
            last_used: 1,
            count_used: 1,
            doc_type: 1,
            _id: 1,
            data: 1,
            is_type: { $setIsSubset: [[type], '$doc_type'] }
        }
    },
    { $match: { $or: query } },
    {
        $sort: {
            is_type: -1,
            last_used: 1
        }
    },
    { $limit: 1 }
]).next(function (error, result) {
    db.collection('adverts').updateOne(
        { _id: result['_id'] },
        { $currentDate: { 'last_used': true }, $inc: { 'count_used': 1 } }
    );
});
The problem is that multiple read requests come in from the same client, which issues several concurrent requests; they hit the db at the same time and read the same data before the updateOne fires.
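One way around the read-then-update race, sketched here as an assumption rather than a tested fix, is findOneAndUpdate, which applies the sort, the read, and the bump as a single atomic operation, so concurrent requests cannot pick the same document. This assumes the computed is_type preference can be expressed as a plain filter; the doc_type filter below is hypothetical:
db.collection('testing').findOneAndUpdate(
    { doc_type: type }, // hypothetical stand-in for the $or/is_type logic above
    { $currentDate: { last_used: true }, $inc: { count_used: 1 } },
    { sort: { last_used: 1 }, returnOriginal: true }, // least recently used first, as in the aggregate
    function (error, result) {
        // result.value is the selected document as it looked before the bump
    }
);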
I have a Comments collection in Mongoose, and a query that returns the most recent five (an arbitrary number) Comments.
Every Comment is associated with another document. What I would like to do is make a query that returns the most recent 5 comments, with comments associated with the same other document combined.
So instead of a list like this:
results = [
{ _id: 123, associated: 12 },
{ _id: 122, associated: 8 },
{ _id: 121, associated: 12 },
{ _id: 120, associated: 12 },
{ _id: 119, associated: 17 }
]
I'd like to return a list like this:
results = [
{ _id: 124, associated: 3 },
{ _id: 125, associated: 19 },
[
{ _id: 123, associated: 12 },
{ _id: 121, associated: 12 },
{ _id: 120, associated: 12 },
],
{ _id: 122, associated: 8 },
{ _id: 119, associated: 17 }
]
Please don't worry too much about the data format: it's just a sketch to try to show the sort of thing I want. I want a result set of a specified size, but with some results grouped according to some criterion.
Obviously one way to do this would be to just make the query, crawl and modify the results, then recursively make the query again until the result set is as long as desired. That way seems awkward. Is there a better way to go about this? I'm having trouble phrasing it in a Google search in a way that gets me anywhere near anyone who might have insight.
Here's an aggregation pipeline query that will do what you are asking for:
db.comments.aggregate([
{ $group: { _id: "$associated", maxID: { $max: "$_id"}, cohorts: { $push: "$$ROOT"}}},
{ $sort: { "maxID": -1 } },
{ $limit: 5 }
])
Lacking any other fields from the sample data to sort by, I used $_id.
If you'd like results that are a little closer in structure to the sample result set you provided you could add a $project to the end:
db.comments.aggregate([
{ $group: { _id: "$associated", maxID: { $max: "$_id"}, cohorts: { $push: "$$ROOT"}}},
{ $sort: { "maxID": -1 } },
{ $limit: 5 },
{ $project: { _id: 0, cohorts: 1 }}
])
That will print only the result set. Note that even comments that do not share an association will be in an array; it will simply be an array of length 1.
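For instance, with the five sample comments from the question, the projected output would look roughly like this (a sketch; the order inside each cohorts array depends on document encounter order):
results = [
    { cohorts: [ { _id: 123, associated: 12 }, { _id: 121, associated: 12 }, { _id: 120, associated: 12 } ] },
    { cohorts: [ { _id: 122, associated: 8 } ] },
    { cohorts: [ { _id: 119, associated: 17 } ] }
]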
If you are concerned about limiting the results in the grouping as Neil Lunn is suggesting, perhaps a $match in the beginning is a smart idea.
db.comments.aggregate([
{ $match: { createDate: { $gte: new Date(new Date() - 5 * 60000) } } },
{ $group: { _id: "$associated", maxID: { $max: "$_id"}, cohorts: { $push: "$$ROOT"}}},
{ $sort: { "maxID": -1 } },
{ $limit: 5 },
{ $project: { _id: 0, cohorts: 1 }}
])
That will only include comments made in the last 5 minutes assuming you have a createDate type field. If you do, you might also consider using that as the field to sort by instead of "_id". If you do not have a createDate type field, I'm not sure how best to limit the comments that are grouped as I do not know of a "current _id" in the way that there is a "current time".
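If you do have such a field, the variant is a small change, sketched here: group as before, but carry the newest createDate out of each group and sort on that instead of on _id:
db.comments.aggregate([
    { $match: { createDate: { $gte: new Date(new Date() - 5 * 60000) } } },
    { $group: { _id: "$associated", maxDate: { $max: "$createDate" }, cohorts: { $push: "$$ROOT" } } },
    { $sort: { maxDate: -1 } },
    { $limit: 5 },
    { $project: { _id: 0, cohorts: 1 } }
])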
I honestly think you are asking a lot here and cannot really see the utility myself, but I'm always happy to have that explained to me if there is something useful I have missed.
Bottom line is you want comments from the last five distinct users by date, and then some sort of grouping of additional comments by those users. The last part is where I see difficulty in rules no matter how you want to attack this, but I'll try to keep this to the most brief form.
No way this happens in a single query of any sort. But there are things that can be done to make it an efficient server response:
var async = require('async'), // needed for the waterfall/each flow below
    DataStore = require('nedb'),
    store = new DataStore();
async.waterfall(
    [
function(callback) {
Comment.aggregate(
[
{ "$match": { "postId": thisPostId } },
{ "$sort": { "associated": 1, "createdDate": -1 } },
{ "$group": {
"_id": "$associated",
"date": { "$first": "$createdDate" }
}},
{ "$sort": { "date": -1 } },
{ "$limit": 5 }
],
callback);
},
function(docs,callback) {
async.each(docs,function(doc,callback) {
Comment.aggregate(
[
{ "$match": { "postId": thisPostId, "associated": doc._id } },
{ "$sort": { "createdDate": -1 } },
{ "$limit": 5 },
{ "$group": {
"_id": "$associated",
"docs": {
"$push": {
"_id": "$_id", "createdDate": "$createdDate"
}
},
"firstDate": { "$first": "$createdDate" }
}}
],
function(err,results) {
if (err) return callback(err);
async.each(results,function(result,callback) {
store.insert( result, function(err, result) {
callback(err);
});
},function(err) {
callback(err);
});
}
);
},
callback);
}
],
function(err) {
if (err) throw err;
store.find({}).sort({ "firstDate": -1 }).exec(function(err,docs) {
if (err) throw err;
console.log( JSON.stringify( docs, undefined, 4 ) );
});
}
);
Now I stuck more document properties in both the document and the array, but the simplified form based on your sample would then come out like this:
results = [
{ "_id": 3, "docs": [124] },
{ "_id": 19, "docs": [125] },
{ "_id": 12, "docs": [123,121,120] },
{ "_id": 8, "docs": [122] },
{ "_id": 17, "docs": [119] }
]
So the essential idea is to first find your distinct "users" who were the last to comment, basically by chopping off the last 5. Without filtering on some kind of range here, this would scan the entire collection to get those results, so it would be best to restrict it in some way, such as to the last hour or last few hours, or whatever is sensible as required. Just add those conditions to the $match along with the current post that is associated with the comments.
Once you have those 5, then you want to get any possible "grouped" details for multiple comments by those users. Again, some sort of limit is generally advised for a timeframe, but as a general case this is just looking for the most recent comments by the user on the current post and restricting that to 5.
The execution here is done in parallel, which will use more resources but is fairly effective considering there are only 5 queries to run anyway. In contrast to your example output, the array here is inside the document result, and it contains the original document id values for each comment for reference. Any other content related to the document would be pushed into the array as well as required (ie The content of the comment).
The other little trick here is using nedb as a means for storing the output of each query in an "in memory" collection. This need only really be a standard hash data structure, but nedb gives you a way of doing that while maintaining the MongoDB statement form that you may be used to.
Once all results are obtained you just return them as your output, and sorted as shown to retain the order of who commented last. The actual comments are grouped in the array for each item and you can traverse this to output how you like.
Bottom line here is that you are asking for a compounded version of the "top N results problem", which is something often asked of MongoDB. I've written about ways to tackle this before to show how it's possible in a single aggregation pipeline stage, but it really is not practical for anything more than a relatively small result set.
If you really want to join in the insanity, then you can look at Mongodb aggregation $group, restrict length of array for one of the more detailed examples. But for my money, I would run on parallel queries any day. Node.js has the right sort of environment to support them, so you would be crazy to do it otherwise.
I have the following mongodb query in node.js which gives me a list of unique zip codes with a count of how many times the zip code appears in the database.
collection.aggregate( [
{
$group: {
_id: "$Location.Zip",
count: { $sum: 1 }
}
},
{ $sort: { _id: 1 } },
{ $match: { count: { $gt: 1 } } }
], function ( lookupErr, lookupData ) {
if (lookupErr) {
res.send(lookupErr);
return;
}
res.send(lookupData.sort());
});
How can this query be modified to return one specific zip code? I've tried the condition clause but have not been able to get it to work.
Aggregations that require filtered results can be done with the $match operator. Without tweaking what you already have, I would suggest just sticking in a $match for the zip code you want returned at the top of the aggregation list.
collection.aggregate( [
    {
        $match: {
            "Location.Zip": 47421
        }
    },
    {
        $group: {
        ...
This example will result in every aggregation operation after the $match working only on the data set returned by matching the Location.Zip key (the field your $group uses) against the value 47421.
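Put together with the pipeline from the question, the whole call would look roughly like this (a sketch; Location.Zip is taken from the question's $group stage, and 47421 is just an example value whose type must match how the zips are stored):
collection.aggregate( [
    { $match: { "Location.Zip": 47421 } }, // filter first so every later stage sees only this zip
    { $group: { _id: "$Location.Zip", count: { $sum: 1 } } }
], function ( lookupErr, lookupData ) {
    if (lookupErr) {
        res.send(lookupErr);
        return;
    }
    res.send(lookupData);
});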
In the $match pipeline stage, add:
{ $match: { count: { $gt: 1 },
_id : "10002" //replace 10002 with the zip code you want
}}
As a side note, you should put the $match operator first and in general as high in the aggregation chain as you can.