I need to update/save documents that are between 100KB and 800KB in size. Update operations like console.time('save'); await doc.findByIdAndUpdate(...).lean(); console.timeEnd('save'); are taking 5s to 10s to finish, even though each update contains ~50KB at most.
The large document property which is being updated has a structure like so:
{
largeProp: [{
key1: { key1A:val, key1B:val, ... 10 more ... },
key2: { key1A:val, key1B:val, ... 10 more ... },
key3: { key1A:val, key1B:val, ... 10 more ... },
...300 more...
}, ...100 more... ]
}
I'm using a Node.js server on an Ubuntu VM with mongoose.js, with MongoDB hosted on a separate server. The MongoDB server does not show any unusual load; it usually stays under 7% CPU. However, my Node.js server hits 100% CPU usage with just this update operation (which runs after a .findById() and some quick logic taking 8ms-52ms). The .findById() itself takes about 500ms - 1s for this same object.
I need these saves to be much faster, and I don't understand why this is so slow.
I did not do much more profiling on the Mongoose query. Instead I tested a native MongoDB query, and it significantly improved the speed, so I will be using native MongoDB queries going forward.
const { ObjectId } = mongoose.Types;
let result = await mongoose.connection.collection('collection1')
  .aggregate([
    { $match: { _id: ObjectId(gameId) } },
    { $lookup: {
        localField: 'field1',
        from: 'collection2',
        foreignField: '_id',
        as: 'field1'
      }
    },
    { $unwind: '$field1' },
    { $project: {
        _id: 1,
        status: 1,
        createdAt: 1,
        slowArrProperty: { $slice: ["$positions", -1] },
        updatedAt: 1
      }
    },
    { $unwind: "$slowArrProperty" }
  ]).toArray();
if (result.length < 1) return {};
return result[0];
This query, along with some restructuring of my data model, solved my issue. Specifically, for the document property that was very large and causing problems, I used the { $slice: ["$positions", -1] } above to return only one of the objects in the array at a time.
Just from switching to native MongoDB queries (within the mongoose wrapper), I saw between 60x and 3000x improvements on query speeds.
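A similar approach can be taken on the write side: rather than saving the whole hydrated document through Mongoose, a targeted native update touches only the element that changed. A minimal sketch, using the positions field from the $slice above and a hypothetical newPosition payload:
// A sketch only (field and variable names assumed): push just the new element
// instead of re-saving the whole 100KB-800KB document through Mongoose.
await mongoose.connection.collection('collection1').updateOne(
  { _id: ObjectId(gameId) },
  { $push: { positions: newPosition } } // or $set a single nested field
);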
I am encountering a delay of 5 to 10 seconds from when the operation happens in MongoDB until I capture it in a Change Stream in NodeJS.
Are these times normal? What parameters could I check to see if any of them are impacting this?
Here are a couple of examples and some suspicions (to be tested).
Here we try to catch changes only in the fields of the Users collection that interest us. I do not know whether filtering like this to avoid unwanted events may be causing the delay in receiving the change stream, and whether it would be better to receive more events and filter the updated fields in code.
I also do not know whether the "and" on the operation type should be placed first, or whether that is irrelevant.
userChangeStreamQuery: [{
$match: {
$and: [
{$or:[
{ "updateDescription.updatedFields.name": { $exists: true } },
{ "updateDescription.updatedFields.email": { $exists: true } },
{ "updateDescription.updatedFields.organization": { $exists: true } },
{ "updateDescription.updatedFields.displayName": { $exists: true } },
{ "updateDescription.updatedFields.image": { $exists: true } },
{ "updateDescription.updatedFields.organizationName": { $exists: true } },
{ "updateDescription.updatedFields.locationName": { $exists: true } }
]},
{ operationType: "update" }]
}
}],
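For reference, this is roughly how such a pipeline could be attached with the native driver; a minimal sketch, assuming a users collection, a connected db handle, and a previously stored resumeToken:
// A sketch: watch() accepts the same $match pipeline shown above.
const changeStream = db.collection('users').watch(userChangeStreamQuery,
  resumeToken ? { resumeAfter: resumeToken } : {});
changeStream.on('change', change => {
  // For updates, updateDescription.updatedFields contains only the changed fields.
  console.log(change.operationType, change.updateDescription);
});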
For this other one, which waits for events on the Plans collection, my concern is that it has no aggregation pipeline defined, and it is only when the event is received that the code filters whether the operation is of type 'insert', 'update', or 'delete'. This one is giving us a delay of 7~10 seconds.
startChangeStream({
streamId: 'plans',
collection: 'plans',
query: '',
resumeTokens
});
...
const startChangeStream = ({ streamId, collection, query, resumeTokens }) => {
const resumeToken = resumeTokens ? resumeTokens[streamId] || undefined : undefined;
nativeMongoDbFactory.setChangeStream({
streamId,
collection,
query,
resumeToken
});
}
In no case are these massive operations; normally they are operations performed by a user through web forms.
When the collection is sharded, the mongos server needs to wait until all shards have data to return when using change streams. If some shards have no data to write, the idle primary mongod writes a no-op to the oplog every 10 seconds (idlewriteperiodms). That is why your delay is 7~10 seconds.
Playground with data sample: https://mongoplayground.net/p/OJKVFHLamig
Yet, when I run the exact same collection and aggregation in Mongoose within Node, it instead returns the total document count and everything else is null:
[ { _id: null, myCount: 130111, site: null } ]
I've looked at all other variables, every comma in my production code and there's nothing else that explains this behaviour.
Is Mongoose unfit to use for the mongo aggregation framework or am I missing something about the syntax?
Schema:
import mongoose from 'mongoose';
import { SiteModel } from './site.schema';
const JobModel = new mongoose.Schema({
_id: String,
// other properties that are strings
title: String,
site: { SiteModel },
});
export default mongoose.model('jobs', JobModel);
// SITE MODEL:
export const SiteModel = new mongoose.Schema({
_id: String,
title: String,
city: String,
UNID: String,
})
The models are incomplete as I'm only using it for reading purposes, the database is used by another app live and I'm merely building some reports/searching some data on it.
I am, however, pulling all the data that I need, and it worked without a hiccup up until this.
EDIT 3: LOG ENTRY:
{ aggregate: "sites", pipeline:
[ { $group: { _id: "$site.UNID", myCount: { $sum: 1 },
score: { $first: "$score" } } },
{ $limit: 20 },
{ $project: { site: "$_id", myCount: 1 } } ]
I was using the wrong mongoose model to run the aggregation on, but the collections were similar enough that it was still returning results. I realized this by looking into the db logs and seeing exactly which collection my aggregation was running on.
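For anyone debugging something similar, a quick sanity check (a sketch; it assumes the model registration shown above) is to log which collection the model actually resolves to, or enable Mongoose's query debug logging:
// Prints every query Mongoose sends, including the target collection.
mongoose.set('debug', true);
// The collection the registered model is actually bound to.
console.log(mongoose.model('jobs').collection.collectionName);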
I have a pretty simple $lookup aggregation query like the following:
{'$lookup':
{'from': 'edge',
'localField': 'gid',
'foreignField': 'to',
'as': 'from'}}
When I run this on a match with enough documents I get the following error:
Command failed with error 4568: 'Total size of documents in edge
matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" }
}, {} ] } } exceeds maximum document size' on server
All attempts to limit the number of documents fail. allowDiskUse: true does nothing. Sending a cursor in does nothing. Adding in a $limit into the aggregation also fails.
How could this be?
Then I see the error again. Where did that $match and $and and $eq come from? Is the aggregation pipeline behind the scenes farming out the $lookup call to another aggregation, one it runs on its own that I have no ability to provide limits for or use cursors with??
What is going on here?
As stated earlier in comment, the error occurs because when performing the $lookup which by default produces a target "array" within the parent document from the results of the foreign collection, the total size of documents selected for that array causes the parent to exceed the 16MB BSON Limit.
The counter for this is to process with an $unwind which immediately follows the $lookup pipeline stage. This actually alters the behavior of $lookup in such that instead of producing an array in the parent, the results are instead a "copy" of each parent for every document matched.
Pretty much just like regular usage of $unwind, with the exception that instead of processing as a "separate" pipeline stage, the unwinding action is actually added to the $lookup pipeline operation itself. Ideally you also follow the $unwind with a $match condition, which also creates a matching argument to also be added to the $lookup. You can actually see this in the explain output for the pipeline.
The topic is actually covered (briefly) in a section of Aggregation Pipeline Optimization in the core documentation:
$lookup + $unwind Coalescence
New in version 3.2.
When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.
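In other words, the shape the optimizer looks for is simply a $lookup immediately followed by a $unwind on that $lookup's as field (field names here taken from the listing below):
// The two stages the optimizer can coalesce into a single $lookup execution.
[
  { $lookup: { from: 'edge', localField: '_id', foreignField: 'gid', as: 'results' } },
  { $unwind: '$results' }
]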
Best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit. Done as briefly as possible to both break and work around the BSON Limit:
const MongoClient = require('mongodb').MongoClient;
const uri = 'mongodb://localhost/test';
function data(data) {
console.log(JSON.stringify(data, undefined, 2))
}
(async function() {
let db;
try {
db = await MongoClient.connect(uri);
console.log('Cleaning....');
// Clean data
await Promise.all(
["source","edge"].map(c => db.collection(c).remove() )
);
console.log('Inserting...')
await db.collection('edge').insertMany(
Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 }))
);
await db.collection('source').insert({ _id: 1 })
console.log('Fattening up....');
await db.collection('edge').updateMany(
{},
{ $set: { data: "x".repeat(100000) } }
);
// The full pipeline. Failing test uses only the $lookup stage
let pipeline = [
{ $lookup: {
from: 'edge',
localField: '_id',
foreignField: 'gid',
as: 'results'
}},
{ $unwind: '$results' },
{ $match: { 'results._id': { $gte: 1, $lte: 5 } } },
{ $project: { 'results.data': 0 } },
{ $group: { _id: '$_id', results: { $push: '$results' } } }
];
// List and iterate each test case
let tests = [
'Failing.. Size exceeded...',
'Working.. Applied $unwind...',
'Explain output...'
];
for (let [idx, test] of Object.entries(tests)) {
console.log(test);
try {
let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline),
options = (( +idx === tests.length-1 ) ? { explain: true } : {});
await new Promise((end,error) => {
let cursor = db.collection('source').aggregate(currpipe,options);
for ( let [key, value] of Object.entries({ error, end, data }) )
cursor.on(key,value);
});
} catch(e) {
console.error(e);
}
}
} catch(e) {
console.error(e);
} finally {
db.close();
}
})();
After inserting some initial data, the listing will attempt to run an aggregate merely consisting of $lookup which will fail with the following error:
{ MongoError: Total size of documents in edge matching pipeline { $match: { $and : [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size
Which is basically telling you the BSON limit was exceeded on retrieval.
By contrast, the next attempt adds the $unwind and $match pipeline stages.
The Explain output:
{
"$lookup": {
"from": "edge",
"as": "results",
"localField": "_id",
"foreignField": "gid",
"unwinding": { // $unwind now is unwinding
"preserveNullAndEmptyArrays": false
},
"matching": { // $match now is matching
"$and": [ // and actually executed against
{ // the foreign collection
"_id": {
"$gte": 1
}
},
{
"_id": {
"$lte": 5
}
}
]
}
}
},
// $unwind and $match stages removed
{
"$project": {
"results": {
"data": false
}
}
},
{
"$group": {
"_id": "$_id",
"results": {
"$push": "$results"
}
}
}
And that result of course succeeds, because as the results are no longer being placed into the parent document then the BSON limit cannot be exceeded.
This really just happens as a result of adding $unwind only, but the $match is added for example to show that this is also added into the $lookup stage and that the overall effect is to "limit" the results returned in an effective way, since it's all done in that $lookup operation and no other results other than those matching are actually returned.
By constructing in this way you can query for "referenced data" that would exceed the BSON limit and then if you want $group the results back into an array format, once they have been effectively filtered by the "hidden query" that is actually being performed by $lookup.
MongoDB 3.6 and Above - Additional for "LEFT JOIN"
As all the content above notes, the BSON Limit is a "hard" limit that you cannot breach, and this is generally why the $unwind is necessary as an interim step. There is however the limitation that the "LEFT JOIN" becomes an "INNER JOIN" by virtue of the $unwind where it cannot preserve the content. Also, even preserveNullAndEmptyArrays would negate the "coalescence" and still leave the array intact, causing the same BSON Limit problem.
MongoDB 3.6 adds new syntax to $lookup that allows a "sub-pipeline" expression to be used in place of the "local" and "foreign" keys. So instead of using the "coalescence" option as demonstrated, as long as the produced array does not also breach the limit it is possible to put conditions in that pipeline which returns the array "intact", and possibly with no matches as would be indicative of a "LEFT JOIN".
The new expression would then be:
{ "$lookup": {
"from": "edge",
"let": { "gid": "$gid" },
"pipeline": [
{ "$match": {
"_id": { "$gte": 1, "$lte": 5 },
"$expr": { "$eq": [ "$$gid", "$to" ] }
}}
],
"as": "from"
}}
In fact this would be basically what MongoDB is doing "under the covers" with the previous syntax since 3.6 uses $expr "internally" in order to construct the statement. The difference of course is there is no "unwinding" option present in how the $lookup actually gets executed.
If no documents are actually produced as a result of the "pipeline" expression, then the target array within the master document will in fact be empty, just as a "LEFT JOIN" actually does and would be the normal behavior of $lookup without any other options.
However, the output array MUST NOT cause the document where it is being created to exceed the BSON Limit. So it really is up to you to ensure that any "matching" content returned by the conditions stays under this limit, or the same error will persist, unless of course you actually use $unwind to effect the "INNER JOIN".
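For example, one way to keep the joined array bounded under the new syntax is simply to add a stage such as $limit inside the sub-pipeline; a sketch, with an arbitrary cap purely for illustration:
{ "$lookup": {
  "from": "edge",
  "let": { "gid": "$gid" },
  "pipeline": [
    { "$match": { "$expr": { "$eq": [ "$$gid", "$to" ] } } },
    { "$limit": 50 }   // hypothetical cap so the produced array stays small
  ],
  "as": "from"
}}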
I had the same issue with the following Node.js query because the 'redemptions' collection has more than 400,000 documents. I am using MongoDB server 4.2 and Node.js driver 3.5.3.
db.collection('businesses').aggregate([
  {
    $lookup: { from: 'redemptions', localField: "_id", foreignField: "business._id", as: "redemptions" }
  },
  {
    $project: {
      _id: 1,
      name: 1,
      email: 1,
      "totalredemptions": { $size: "$redemptions" }
    }
  }
])
I have modified the query as below to make it work much faster.
db.collection('businesses').aggregate([
  query, // 'query' as in the original; presumably an initial $match stage
  {
    $lookup:
    {
      from: 'redemptions',
      let: { "businessId": "$_id" },
      pipeline: [
        { $match: { $expr: { $eq: ["$business._id", "$$businessId"] } } },
        { $group: { _id: "$_id", totalCount: { $sum: 1 } } },
        { $project: { "_id": 0, "totalCount": 1 } }
      ],
      as: "redemptions"
    }
  },
  {
    $project: {
      _id: 1,
      name: 1,
      email: 1,
      "totalredemptions": { $size: "$redemptions" }
    }
  }
])
I am in a situation where I have to update either two documents or none of them. How is it possible to implement such behavior with Mongo?
// nodejs mongodb driver
Bus.update({
"_id": { $in: [ObjectId("abc"), ObjectId("def")] },
"seats": { $gt: 0 }
}, {
$inc: { "seats": -1 }
}, { multi: true }, function(error, update) {
assert(update.result.nModified === 2)
})
The problem with the code above is that it will update even if only one bus matched. In my case I am trying to book a ticket for a bus in both directions, and the operation should fail if at least one of them is already fully booked.
Thank you
I am trying to aggregate some records in a mongo database using the node driver. I am first matching to org, fed, and sl fields (these are indexed). If I only include a few companies in the array that I am matching the org field to, the query runs fine and works as expected. However, when including all of the clients in the array, I always get:
MongoError: getMore: cursor didn't exist on server, possible restart or timeout?
I have tried playing with the allowDiskUse and batchSize settings, but nothing seems to work. With all the client strings in the array, the aggregation runs for ~5 hours before throwing the cursor error. Any ideas? Below is the pipeline along with the actual aggregate command.
setting up the aggregation pipeline:
var aggQuery = [
{
$match: { //all clients, from last three days, and scored
org:
{ $in : array } //this is the array I am talking about
,
frd: {
$gte: _.last(util.lastXDates(3))
},
sl : true
}
}
, {
$group: { //group by isp and make fields for calculation
_id: "$gog",
count: {
$sum: 1
},
countRisky: {
$sum: {
$cond: {
if :{
$gte: ["$scr", 65]
},
then: 1,
else :0
}
}
},
countTimeZoneRisky: {
$sum: {
$cond: {
if :{
$eq: ["$gmt", "$gtz"]
},
then: 0,
else :1
}
}
}
}
}
, {
$match: { //show records with count >= 500
count: {
$gte: 500
}
}
}
, {
$project: { //rename _id to isp, only show relevent fields
_id: 0,
ISP: "$_id",
percentRisky: {
$multiply: [{
$divide: ["$countRisky", "$count"]
},
100
]
},
percentTimeZoneDiscrancy: {
$multiply: [{
$divide: ["$countTimeZoneRisky", "$count"]
},
100
]
},
count: 1
}
}
, {
$sort: { //sort by percent risky and then by count
percentRisky: 1,
count: 1
}
}
];
Running the aggregation:
var cursor = reportingCollections.hitColl.aggregate(aggQuery, {
allowDiskUse: true,
cursor: {
batchSize: 40000
}
});
console.log('Writing data to csv ' + currentFileNamePrefix + '!');
//iterate through cursor and write documents to CSV
cursor.each(function (err, document) {
//write each document to csv file
//maybe start a nuclear war
});
You're calling the aggregate method, which doesn't return a cursor by default (the way e.g. find() does). To get a cursor back, you must add the cursor option to the options object. However, the timeout setting for the aggregation cursor is (currently) not supported; the native Node.js driver only supports the batchSize setting.
You would set the batchSize option like this:
var cursor = coll.aggregate(query, {cursor: {batchSize:100}}, writeResultsToCsv);
To circumvent such problems, I'd recommend running the aggregation or map-reduce directly through the mongo client. There you can add the notimeout option.
The default timeout is 10 minutes (obviously useless for long, time-consuming queries), and as far as I know there is currently no way to set a different one, only to make it infinite via the aforementioned option. The timeout hits you especially with high batch sizes, because it will take more than 10 minutes to process the incoming docs, and by the time you ask the mongo server for more, the cursor has already been deleted.
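As a rough illustration of the batch-size point (a sketch only; the CSV helper is hypothetical), a smaller batchSize means each getMore is issued sooner, so the cursor is less likely to sit idle past the server-side timeout:
var cursor = reportingCollections.hitColl.aggregate(aggQuery, {
  allowDiskUse: true,
  cursor: { batchSize: 1000 } // smaller batches -> more frequent getMore calls
});
cursor.each(function (err, doc) {
  if (err) throw err;
  if (doc === null) return; // cursor exhausted
  writeCsvLine(doc);        // hypothetical fast, non-blocking CSV writer
});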
IDK your use case, but if it's a web view, there should be only fast queries/aggregations.
BTW I think this didn't change with 3.0.*