Speed issue and memory error when querying in MongoDB - node.js

I have a collection that contains over 100,000 records. Server: node.js/express.js. DB: MongoDB.
On the client, a table with a pager is implemented; 10 records are requested at a time.
With 10,000 records everything of course worked faster, but now there is a problem with speed, and not only that.
My aggregation:
import { concat } from 'lodash';
...
let query = [{ $match: {} }];
query = concat(query, [{ $sort: { createdAt: -1 } }]);
query = concat(query, [
  { $skip: (pageNum - 1) * perPage }, // 0
  { $limit: perPage }                 // 10
]);
return User.aggregate(query)
  .collation({ locale: 'en', strength: 2 })
  .then((users) => ...;
Two cases:
The first fetch is very slow.
When I go to the last page I get this error:
MongoError: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.
Please tell me: am I building the aggregation incorrectly, is there a memory problem on the server as the error says (and additional nginx settings are needed; another person handles that), is the problem more complex, or is it perhaps something else altogether?
Added:
I noticed that the index is not used when sorting, although I expected it to be. Shouldn't it be?
The pipeline that is actually executed (from console.log):
[
  { "$match": {} },
  { "$lookup": { ... } },
  { "$project": { ..., "createdAt": 1, ... } },
  { "$match": {} },
  { "$sort": { "createdAt": -1 } },
  { "$skip": 0 },
  { "$limit": 10 }
]
Thanks for any answers, and sorry for my English :)

It does say that you've hit the memory limit, which makes sense considering that you're trying to sort through 100,000 records. I'd try return User.aggregate(query, { allowDiskUse: true }) // etc. and see if that helps your issue.
While this isn't the documentation for the Node.js driver specifically, this link summarizes what the allowDiskUse option does: in short, it allows MongoDB to go past the 100 MB memory limit and use your system's storage to temporarily hold some data while it performs the query.
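A minimal sketch of how this might look through Mongoose's Aggregate helpers (assuming a Mongoose version that exposes allowDiskUse(); the index and the pipeline reordering are suggestions, not something from the question):

// One-time: an index on the sort key; a $sort placed before $lookup/$project can use it
// db.users.createIndex({ createdAt: -1 })

return User.aggregate([
    { $match: {} },
    { $sort: { createdAt: -1 } },        // keep the sort early so an index on createdAt can be used
    { $skip: (pageNum - 1) * perPage },
    { $limit: perPage }
  ])
  .collation({ locale: 'en', strength: 2 })
  .allowDiskUse(true)                    // opt in to external sorting, as the error message suggests
  .then((users) => { /* ... */ });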

Related

mongodb caching with aggregation pipeline

I have a MongoDB aggregation pipeline like:
db.testing.aggregate(
  {
    $match : { hosting : "aws.amazon.com" }
  },
  {
    $group : { _id : "$hosting", total : { $sum : 1 } }
  },
  {
    $project : { title : 1, author : 1, <few other transformations> }
  },
  { $sort : { total : -1 } }
);
Now I want to enable paging. I have two options:
1. Use skip and limit in the pipeline:
{ $skip : pageNumber * pageSize }
{ $limit : pageSize }
External API-level caching for each page can reduce the time for repeated loads of the same page, but the first load of each page will still be painful because of the linear scan caused by the sort.
2. Handle pagination in the application:
Cache the findAll result, i.e. List findAll();
Pagination is then handled at the service layer and the result is published.
From the next request onward you refer to the cached result and send the desired set of records from the cache.
Question: the 2nd approach seems better if the database is not doing some magical optimization. My view on the 1st is that, since the pipeline involves sorting, every page request will scan the full collection, which is sub-optimal. What are your views? Which one would you choose? Is moving some DB logic to the service layer for the sake of optimization good practice?
It depends on your data.
MongoDB does not cache query results in order to return the cached results for identical queries: https://docs.mongodb.com/manual/faq/fundamentals/#does-mongodb-handle-caching
However, you may create a view (from the source collection plus your pipeline) and refresh it on demand. This gives you aggregated data with good performance for paging, while updating the content periodically, and you can create indexes for better performance (no need to develop extra logic in the service layer).
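One way to read this is as an on-demand materialized view: run the aggregation periodically and write the result to its own collection with $out, then page over that collection. A sketch for the mongo shell (collection names are illustrative; $out requires MongoDB 2.6+):

// Refresh the materialized data on demand (e.g. from a cron job or an admin endpoint)
db.testing.aggregate([
  { $match: { hosting: "aws.amazon.com" } },
  { $group: { _id: "$hosting", total: { $sum: 1 } } },
  { $out: "testing_summary" }   // hypothetical output collection
])

// Index the output collection and page over it cheaply
db.testing_summary.createIndex({ total: -1 })
db.testing_summary.find().sort({ total: -1 }).skip(pageNumber * pageSize).limit(pageSize)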
Also, if you always filter and $group by the hosting field, you may benefit from MongoDB's index use by moving the final $sort up so it sits right after the $match stage. In that case MongoDB will use the index for the filter + sort, and the paging is done in memory.
db.testing.createIndex({ hosting: -1 })
db.collection.aggregate([
  { $match: { hosting: "aws.amazon.com" } },
  { $sort: { hosting: -1 } },
  { $group: {
      _id: "$hosting",
      title: { $first: "$title" },
      author: { $first: "$author" },
      total: { $sum: 1 }
  }},
  { $project: { title: 1, author: 1, total: 1 } },
  { $skip: pageNumber * pageSize },
  { $limit: pageSize }
])

Combine multiple queries into a single $in query and specify a limit for each array field?

I am using Mongoose with node.js for MongoDB. I need to make 20 parallel find queries against my database with a limit of 4 documents each, as shown below; only brand_id changes for each brand.
areamodel.find({ brand_id: brand_id }, { '_id': 1 }, { limit: 4 }, function(err, docs) {
  if (err) {
    console.log(err);
  } else {
    console.log('fetched');
  }
});
Now, to run all these queries in parallel, I thought about putting all 20 brand_ids into an array of strings and using a single $in query to get the results, but I don't know how to specify the limit of 4 for every matched value of the array.
I wrote the code below with aggregation but don't know where to specify the limit for each element of my array.
var brand_ids = ["brandid1", "brandid2", "brandid3", "brandid4", "brandid5", "brandid6", "brandid7", "brandid8", "brandid9", "brandid10", "brandid11", "brandid12", "brandid13", "brandid14", "brandid15", "brandid16", "brandid17", "brandid18", "brandid19", "brandid20"];
areamodel.aggregate(
  { $project: { _id: 1 } },
  { $match: { 'brand_id': { $in: brand_ids } } },
  function(err, docs) {
    if (err) {
      console.error(err);
    } else {
    }
  }
);
Can anyone please tell me how I can solve my problem using only one query?
UPDATE: Why I don't think $group will be helpful for me.
Suppose my brand_ids array contains these strings:
brand_ids = ["id1", "id2", "id3", "id4", "id5"]
and my database has the documents below:
{
  "brand_id": "id1",
  "name": "Levis",
  "loc": "india"
},
{
  "brand_id": "id1",
  "name": "Levis",
  "loc": "america"
},
{
  "brand_id": "id2",
  "name": "Lee",
  "loc": "india"
},
{
  "brand_id": "id2",
  "name": "Lee",
  "loc": "america"
}
Desired JSON output:
{
  "name": "Levis"
},
{
  "name": "Lee"
}
For the above example, suppose I have 25,000 documents where "name" is "Levis" and 25,000 documents where "name" is "Lee"; if I use $group, then all 50,000 documents will be read and grouped by "name".
But with the solution I want, once the first document with "Levis" and the first with "Lee" are found, I would not have to look at the remaining thousands of documents.
Update: I think if anyone can answer the following, I can probably get to my solution.
Consider a case where I have 1,000 documents in total in my MongoDB, and suppose that out of those 1,000, 100 will pass my match query.
Now if I apply limit 4 to this query, will it take the same time to execute as the query without any limit, or not?
Why I am thinking about this case:
Because if my query takes the same time, then I don't think $group will increase my time, as all documents will be queried anyway.
But if the time taken with the limit is different, then:
If I can apply limit 4 to each array element, my question is solved.
If I cannot apply a limit per array element, then I don't think $group will be useful, since in that case I would have to scan all the documents to get the results.
FINAL UPDATE: As I read in the answer below and in the MongoDB docs, using $limit does not change the time taken by the query; it is the network bandwidth that is reduced. So I think if anyone can tell me how to apply a limit on array fields (using $group or anything else), my problem will be solved.
mongodb: will limit() increase query speed?
Solution
Actually my thinking about MongoDB was quite wrong: I thought adding a limit to queries decreases the time taken, but that is not the case. That's why I struggled for so many days before trying the answer that Gregory NEUT and JohnnyHK gave me. Thanks a lot to both of you; I would have found the solution on day one if I had known about this. I really appreciate the help.
I propose you use the $group aggregation stage to group all the data you get from the $match by brand_id, and then limit each group of data using $slice.
Look at this Stack Overflow post:
db.collection.aggregate([
  {
    $sort: { created: -1 }
  },
  {
    $group: {
      _id: '$city',
      title: { $push: '$title' }
    }
  },
  {
    $project: {
      _id: 0,
      city: '$_id',
      mostRecentTitle: { $slice: ['$title', 0, 2] }
    }
  }
])
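Adapted to the brand_id case in the question, a sketch might look like this (field names taken from the question; untested):

areamodel.aggregate([
  { $match: { brand_id: { $in: brand_ids } } },
  { $group: {
      _id: '$brand_id',
      docs: { $push: { name: '$name', loc: '$loc' } }
  }},
  { $project: {
      _id: 0,
      brand_id: '$_id',
      docs: { $slice: ['$docs', 0, 4] }   // keep at most 4 documents per brand
  }}
], function(err, results) {
  if (err) return console.error(err);
  console.log(results);
});

Note that this still reads every matching document before slicing, which is exactly the behaviour the question's updates are worried about.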
I propose using distinct, since that will return all the distinct brand names in your collection. (I assume this is what you are trying to achieve?)
db.runCommand({ distinct: "areamodel", key: "name" })
MongoDB docs
In Mongoose I think it is: areamodel.db.db.command({ distinct: "areamodel", key: "name" }) (untested)
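A sketch using Mongoose's own distinct helper may be simpler (assuming the model is areamodel and the field is name, as in the question):

areamodel.distinct('name', function(err, names) {
  if (err) return console.error(err);
  console.log(names);   // e.g. [ 'Levis', 'Lee' ]
});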

MongoDB (mongoose) retrieve substring

Let's say I have a blog with some very long posts.
I want to display a list of my posts in "preview mode", for instance only the first 50 characters of text.
A simple answer is to do this:
Post.find(
  (err, posts) => {
    if (err) return console.log(err);
    posts.forEach(
      post => {
        console.log('post preview:', post.data.substr(0, 50) + '...');
      }
    )
  }
)
This way we retrieve all the data from the collection.
If each post has more than 3 KB of data, retrieving 30 posts seems very inefficient in terms of data transfer and processing.
So I wondered: is there a way to retrieve an already-sliced string from the DB?
Or do you at least have a better solution for my issue?
Yes, you can use the $substr operator with a query like this:
db.collection.aggregate([
  {
    $project: {
      preview: { $substr: [ "$data", 0, 50 ] }
    }
  }
])
Edit:
$substr is deprecated as of MongoDB 3.4 because it only has a well-defined behavior for strings of ASCII characters. If you're facing UTF-8 issues, consider upgrading to MongoDB 3.4 in order to use the $substrCP operator,
so your query becomes:
db.collection.aggregate([
  {
    $project: {
      preview: { $substrCP: [ "$data", 0, 50 ] }
    }
  }
])
As of today, MongoDB 3.4 is only available as a development release, but a production version should be released soon.
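Since the question mentions Mongoose, the equivalent call through a model might look like this (a sketch, assuming a Post model with a data field as in the question):

Post.aggregate([
  { $project: { preview: { $substrCP: ['$data', 0, 50] } } }
]).exec(function (err, previews) {
  if (err) return console.log(err);
  previews.forEach(p => console.log('post preview:', p.preview + '...'));
});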

MongoDB update/insert document and Increment the matched array element

I use Node.js and MongoDB with monk.js, and I want to do the logging in a minimal way, with one document per hour, like:
final doc:
{ time: YYYY-MM-DD-HH, log: [ {action: action1, count: 1 }, {action: action2, count: 27 }, {action: action3, count: 5 } ] }
The complete document should be built up by incrementing one value at a time.
E.g. someone visits a webpage for the first time this hour, and the increment of action1 should create the following document with a query:
{ time: YYYY-MM-DD-HH, log: [ {action: action1, count: 1} ] }
Another user visits another webpage during the same hour, and the document should be extended to:
{ time: YYYY-MM-DD-HH, log: [ {action: action1, count: 1}, {action: action2, count: 1} ] }
and the values in count should be incremented as the different webpages are visited.
At the moment I create a doc for each action:
tracking.update({
  time: moment().format('YYYY-MM-DD_HH'),
  action: action,
  info: info
}, { $inc: { count: 1 } }, { upsert: true }, function (err) {});
Is this possible with monk.js / MongoDB?
EDIT:
Thank you. Your solution looks clean and elegant, but it looks like my server can't handle it, or I am too much of a noob to make it work.
I wrote an extremely dirty solution with the action name as the key:
tracking.update({ time: time, ts: ts }, JSON.parse('{ "$inc": {"' + action + '": 1}}'), { upsert: true }, function (err) {});
Yes, it is very possible, and a well considered question. The only variation I would make on the approach is to calculate the "time" value as a real Date object (quite useful in MongoDB, and easy to manipulate as well) and simply "round" the value with basic date math. You could use "moment.js" for the same result, but I find the math simple.
The other main consideration here is that mixing array "push" actions with possible "upsert" document actions can be a real problem, so it is best to handle this with "multiple" update statements, where only the condition you want will change anything.
The best way to do that is with MongoDB Bulk Operations.
Consider that your data comes in something like this:
{ "timestamp": 1439381722531, "action": "action1" }
where "timestamp" is an epoch timestamp value accurate to the millisecond. The handling of this looks like:
// Just adding for the listing, assuming already defined otherwise
var payload = { "timestamp": 1439381722531, "action": "action1" };

// Round to hour
var hour = new Date(
  payload.timestamp - ( payload.timestamp % ( 1000 * 60 * 60 ) )
);

// Init transaction
var bulk = db.collection.initializeOrderedBulkOp();

// Try to increment where array element exists in document
bulk.find({
  "time": hour,
  "log.action": payload.action
}).updateOne({
  "$inc": { "log.$.count": 1 }
});

// Try to upsert where document does not exist
bulk.find({ "time": hour }).upsert().updateOne({
  "$setOnInsert": {
    "log": [{ "action": payload.action, "count": 1 }]
  }
});

// Try to "push" where array element does not exist in matched document
bulk.find({
  "time": hour,
  "log.action": { "$ne": payload.action }
}).updateOne({
  "$push": { "log": { "action": payload.action, "count": 1 } }
});

bulk.execute();
So if you look through the logic there, you will see that it is only ever possible for "one" of those statements to be true for any given state of the document, whether it exists or not. Technically speaking, the statement with the "upsert" can actually match a document when it exists; however, the $setOnInsert operation used makes sure that no changes are made unless the action actually "inserts" a new document.
Since all operations are fired in "Bulk", the only time the server is contacted is on the .execute() call. So there is only "one" request to the server and only "one" response, despite the multiple operations.
In this way all the conditions are met:
Create a new document for the current period where one does not exist, and insert the initial data into the array.
Add a new item to the array where the current "action" classification does not exist, and add an initial count.
Increment the count property of the specified action within the array upon execution of the statement.
All in all: yes, possible, and also a great idea for storage, as long as the action classifications do not grow too large within a period (500 array elements should be used as a maximum guide) and the updating stays efficient and self-contained within a single document for each time sample.
The structure is also nice and well suited to other queries and possible additional aggregation purposes as well.
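On newer drivers the same three conditional updates can be sent in one round trip with collection.bulkWrite instead of the Bulk API. A sketch under the same assumptions as the code above (hour and payload already defined; assumes a driver and server that support bulkWrite, i.e. MongoDB 3.2+):

db.collection.bulkWrite([
  // Increment where the array element already exists
  { updateOne: {
      filter: { "time": hour, "log.action": payload.action },
      update: { "$inc": { "log.$.count": 1 } }
  }},
  // Create the document for this hour if it does not exist yet
  { updateOne: {
      filter: { "time": hour },
      update: { "$setOnInsert": { "log": [{ "action": payload.action, "count": 1 }] } },
      upsert: true
  }},
  // Push a new array element where this action is not present yet
  { updateOne: {
      filter: { "time": hour, "log.action": { "$ne": payload.action } },
      update: { "$push": { "log": { "action": payload.action, "count": 1 } } }
  }}
], { ordered: true }, function (err, result) {
  if (err) console.error(err);
});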

Mongo $min and $max, or Parallel sort

Hi, I want to get the min and the max value of a field in my DB.
I found this solution, which queries and sorts the results:
get max value in mongoose
I could do this twice and combine it with async.parallel to make it non-blocking. But I guess two DB queries may not be the best solution.
The second solution would be to use aggregate. But I don't want to group anything. I only want to use $match to filter (the filter criteria are always different and can be {}) and run the query over all documents in my collection.
http://docs.mongodb.org/manual/reference/operator/aggregation/min/
http://docs.mongodb.org/manual/reference/operator/aggregation/max/
Questions:
1. Can I run this in one query with aggregate, maybe with $project?
2. Is there another method than aggregate that works without grouping?
3. Will 1)/2) be more time-efficient than the first solution with sorting?
EDIT:
Solved with the first solution, but I think there is a more efficient way, because this needs two database operations:
async.parallel {
  min: (next) ->
    ImplantModel.findOne(newFilter).sort("serialNr").exec((err, result) ->
      return next err if err?
      return next null, 0 if !result?
      next(null, result.serialNr)
    )
  max: (next) ->
    ImplantModel.findOne(newFilter).sort("-serialNr").exec((err, result) ->
      return next err if err?
      return next null, 0 if !result?
      next(null, result.serialNr)
    )
}, (err, results) ->
  console.log results.min, ' ', results.max
  return callback(err) if err?
  return callback null, { min: results.min, max: results.max }
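For comparison, the single-aggregation variant I was asking about would be something like this (an untested sketch in plain JavaScript, using the same model and filter):

ImplantModel.aggregate([
  { $match: newFilter },   // can be {} to cover the whole collection
  { $group: { _id: null, min: { $min: "$serialNr" }, max: { $max: "$serialNr" } } }
]).exec(function (err, result) {
  // result looks like [ { _id: null, min: ..., max: ... } ]
});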
I don't know what it is about this question, and I'm sure to get no real love for the response, but I just could not let it go and get to sleep without resolving it.
So the first thing to say is that I think I owe the OP here $10, because my expected results are not the case.
The basic idea presented here is a comparison of:
Using parallel execution of queries to find the "maximum" (sorted total value) of a field, and also the minimum value by the same constraint.
The aggregation framework $max and $min grouping accumulators over the whole collection.
In "theory" these two options are doing exactly the same thing. And in "theory", even though parallel execution can happen "over the wire" with simultaneous requests to the server, there should still be an "overhead" inherent in those requests, plus the "aggregation" work in the client needed to bring both results together.
The tests here run a "series" execution that creates random data of a reasonable key length; then, to be "fair" in the comparison, the "key" data is also indexed.
The next "fairness" stage is to "warm up" the data by doing a sequential "fetch" of all items, to simulate loading as much of the "working set" of data into memory as the client machine is capable of.
We then run each test, in comparison and in series so they do not compete with each other for resources, for either the "parallel query" case or the "aggregation" case, with timers attached to the start and end of each execution.
Here is my testbed script, using the basic driver to keep things as lean as possible (a Node.js environment is assumed):
var async = require('async'),
    mongodb = require('mongodb'),
    MongoClient = mongodb.MongoClient;

var total = 1000000;

MongoClient.connect('mongodb://localhost/bigjunk',function(err,db) {
  if (err) throw err;

  var a = 10000000000000000000000;

  db.collection('bigjunk',function(err,coll) {
    if (err) throw err;

    async.series(
      [
        // Clean data
        function(callback) {
          console.log("removing");
          coll.remove({},callback);
        },

        // Insert data
        function(callback) {
          var count = 0,
              bulk = coll.initializeUnorderedBulkOp();

          async.whilst(
            function() { return count < total },
            function(callback) {
              var randVal = Math.floor(Math.random(a)*a).toString(16);
              //console.log(randVal);
              bulk.insert({ "rand": randVal });
              count++;

              if ( count % 1000 == 0 ) {
                if ( count % 10000 == 0 ) {
                  console.log("counter: %s",count); // log 10000
                }
                bulk.execute(function(err,res) {
                  bulk = coll.initializeUnorderedBulkOp();
                  callback();
                });
              } else {
                callback();
              }
            },
            callback
          );
        },

        // index the collection
        function(callback) {
          console.log("indexing");
          coll.createIndex({ "rand": 1 },callback);
        },

        // Warm up
        function(callback) {
          console.log("warming");
          var cursor = coll.find();

          cursor.on("error",function(err) {
            callback(err);
          });

          cursor.on("data",function(data) {
            // nuthin
          });

          cursor.on("end",function() {
            callback();
          });
        },

        /*
         * *** The tests **
         */

        // Parallel test
        function(callback) {
          console.log("parallel");
          console.log(Date.now());
          async.map(
            [1,-1],
            function(order,callback) {
              coll.findOne({},{ "sort": { "rand": order } },callback);
            },
            function(err,result) {
              console.log(Date.now());
              if (err) callback(err);
              console.log(result);
              callback();
            }
          );
        },

        // Aggregation test
        function(callback) {
          console.log(Date.now());
          coll.aggregate(
            { "$group": {
              "_id": null,
              "min": { "$min": "$rand" },
              "max": { "$max": "$rand" }
            }},
            function(err,result) {
              console.log(Date.now());
              if (err) callback(err);
              console.log(result);
              callback();
            }
          );
        }
      ],
      function(err) {
        if (err) throw err;
        db.close();
      }
    );
  });
});
And the results (compared to what I expected) are appalling in the "aggregate" case.
For 10,000 documents:
1438964189731
1438964189737
[ { _id: 55c4d9dc57c520412399bde4, rand: '1000bf6bda089c00000' },
{ _id: 55c4d9dd57c520412399c731, rand: 'fff95e4662e6600000' } ]
1438964189741
1438964189773
[ { _id: null,
min: '1000bf6bda089c00000',
max: 'fff95e4662e6600000' } ]
This indicates a difference of 6 ms for the parallel case, and a much bigger difference of 32 ms for the aggregation case.
Can this get better? No:
For 100,000 documents:
1438965011402
1438965011407
[ { _id: 55c4dd036902125223a05958, rand: '10003bab87750d00000' },
{ _id: 55c4dd066902125223a0a84a, rand: 'fffe9714df72980000' } ]
1438965011411
1438965011640
[ { _id: null,
min: '10003bab87750d00000',
max: 'fffe9714df72980000' } ]
And the results still clearly show 5 ms for the parallel case, which is close to the result with 10 times less data, while the aggregation case comes in 229 ms slower, nearly a factor of 10 (matching the increase in data) slower than the previous sample.
But wait, because it gets worse. Let's increase the sample to 1,000,000 entries:
1,000,000 document sample:
1438965648937
1438965648942
[ { _id: 55c4df7729cce9612303e39c, rand: '1000038ace6af800000' },
{ _id: 55c4df1029cce96123fa2195, rand: 'fffff7b34aa7300000' } ]
1438965648946
1438965651306
[ { _id: null,
min: '1000038ace6af800000',
max: 'fffff7b34aa7300000' } ]
This is actually the worst, because whilst the "parallel" case still continues to exhibit a 5 ms response time, the "aggregation" case now blows out to a whopping 2360 ms (wow, over two whole seconds). This can only be considered totally unacceptable as a differential from the alternative approach. That is 500 times the execution cycle, and in computing terms that is huge.
Conclusions
Never make a bet on something unless you know a sure winner.
Aggregation "should" win here, as the principles behind the results are basically the same as in the "parallel execution" case: the basic algorithm picks the results from the keys of the available index.
This is a "fail" (as my kids are fond of saying) where the aggregation pipeline needs to be taught by someone (my "semi-partner" is good at these things) to go back to "algorithm school" and re-learn the basics that are being used by its poorer cousin to produce much faster results.
So the basic lessons here are:
We think the "aggregate" accumulators should be optimized to do this, but at present they clearly are not.
If you want the fastest way to determine min/max on a collection of data (without distinct keys), then parallel query execution using the .sort() modifier is actually much faster than any alternative (with an index).
So for people wanting to do this over a collection of data: use parallel queries as shown here. It's much faster (until we can teach the operators to be better :>).
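For reference, the recommended pattern in isolation looks something like this (a sketch using the same driver-style calls as in the test script above):

async.map(
  [1, -1],
  function (order, callback) {
    // sort ascending for the minimum, descending for the maximum
    coll.findOne({}, { "sort": { "rand": order } }, callback);
  },
  function (err, results) {
    // results[0] holds the document with the smallest "rand", results[1] the largest
  }
);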
I should note here that all timings are relative to hardware, and it is mainly the "comparison" of timings that is valid here.
These results are from my (ancient) laptop:
Core i7 CPU (8x cores)
Windows 7 host (yes, could not be bothered to re-install)
8 GB RAM host
4 GB allocated VM (4x core allocation)
VirtualBox 4.3.28
Ubuntu 15.04
MongoDB 3.1.6 (Devel)
And the latest "stable" Node versions for the packages required in the listing here.

Resources