Querying a MongoDB collection with over 60k documents - Node.js

I have tried searching for similar answers and have applied the solutions, but they don't seem to work in my case. I am querying a Mongoose collection that contains 60k documents, and I need all 60k to apply combinatorics, so I can't apply limits. I could reduce the data volume by querying multiple times depending on a certain property, but that would be costly in terms of performance as well. I don't see what else to try. Can someone help me out?
I am using this simple code for now:
StagingData.find({})
  .lean()
  .exec(function (err, results) {
    console.log(results) //I don't get any output
  });
When I use:
let data = await StagingData.find({}).lean() //it takes forever
What should I do?

You might want to apply indexing first, precompute some values as a separate operation, use parallel processing, etc. For this you may want to jump to a different technology, maybe Elasticsearch, Spark, etc., depending on your case.
You may also want to identify what the bottleneck in your process is: memory or processor. Try experimenting with a smaller set of documents and see how quickly you get results. From this you might be able to infer how long it will take for the whole dataset.
You may also try breaking down your operation into smaller chunks and identifying the cost of processing each one, as in the sketch below.
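As a rough illustration of the chunking idea, here is a minimal sketch using a Mongoose query cursor, assuming the StagingData model from the question; handleBatch and the batch size are hypothetical placeholders for your own processing step:

// Stream StagingData in fixed-size batches instead of materializing all
// 60k documents with a single find(). handleBatch() is a placeholder.
const BATCH_SIZE = 1000;

async function processInBatches() {
  const cursor = StagingData.find({}).lean().cursor();
  let batch = [];

  for (let doc = await cursor.next(); doc != null; doc = await cursor.next()) {
    batch.push(doc);
    if (batch.length === BATCH_SIZE) {
      await handleBatch(batch); // your combinatorics / processing step
      batch = [];
    }
  }
  if (batch.length > 0) {
    await handleBatch(batch); // remaining documents
  }
}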

Related

Is it faster to use aggregation or manually filter through data with nodejs and mongoose?

I'm at a crossroads trying to decide what methodology to use. Basically, I have a MongoDB collection and I want to query it with specific params provided by the user, then I want to group the response according to the value of some of those parameters. For example, let's say my collection is animals, and if I query all animals I get something like this:
[
{type:"Dog",age:3,name:"Kahla"},
{type:"Cat",age:6,name:"mimi"},
...
]
Now I would like to return to the user a response that is grouped by the animal type, so that I end up with something like:
{
Dogs: [...dog docs],
Cats: [...cat docs],
Cows: [...],
}
So basically I have two ways of doing this. One is to just use Model.find() and fetch all the animals that match my specific queries, such as age or any other field, and then manually filter and format my JSON response before sending it back to the user with res.json({}) (I'm using Express, btw).
Or I can use Mongo's aggregation framework and $group to do this at the query level, hence returning from the DB an already grouped response to my request. The only inconvenience I've found with this so far is how the response is formatted; it ends up looking more like this:
[
{
"_id":"Dog",
"docs":[{dog docs...}]
},
{
"_id":"Cat",
"docs":[{...}]
}
]
The overall result is BASICALLY the same, but the formatting of the response is quite different, and my front-end client needs to adjust to how I'm sending the response. I don't really like the array of objects from the aggregation, and prefer a JSON-like object response with key names corresponding to the arrays as I see fit.
So the real question here is whether there is a significant advantage of one way over the other. Is the aggregation framework so fast that it will scale well if my collection grows to huge numbers? Is filtering through the data with JavaScript and mapping the response so I can shape it to my liking a very inefficient process, and hence is it better to use aggregation and adapt the front end to this response shape?
I'm assuming that by faster you mean the least time to serve a request. That said, let's divide the time required to process your request into:
Asynchronous Operations (Network Operations, File read/write etc)
Synchronous Operations
Synchronous operations are usually much faster than asynchronous ones. (This also depends on the nature of the operation and the amount of data being processed.) For example, if you loop over an iterable (e.g. an Array or Map) with fewer than 1000 elements, it won't take more than a few milliseconds.
On the other hand, asynchronous operations take more time. For example, if you run an HTTP request it will take a couple of milliseconds to get the response.
When you query MongoDB with Mongoose, it's an asynchronous call and it will take more time. So, the more queries you run against the database, the slower your API gets. MongoDB aggregation can help you reduce the total number of queries, which may help you make your APIs faster. But the catch is that aggregations are usually slower than normal find requests.
The summary is: if you can manually filter the data without adding any extra DB query, it's going to be faster.
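To make the two options from the question concrete, here is a hedged sketch assuming an Express app and an Animal Mongoose model shaped like the question's example documents; the /animals route and the age filter are made up for illustration:

// Hedged sketch: both grouping strategies behind one Express route.
app.get("/animals", async (req, res) => {
  // Option 1: group at the query level with the aggregation framework.
  const grouped = await Animal.aggregate([
    { $match: { age: { $gte: 2 } } },                        // user-supplied filters
    { $group: { _id: "$type", docs: { $push: "$$ROOT" } } }
  ]);
  // grouped: [{ _id: "Dog", docs: [...] }, { _id: "Cat", docs: [...] }]

  // Reshape the aggregation output into the keyed object the asker prefers.
  const byType = Object.fromEntries(grouped.map(g => [g._id, g.docs]));

  // Option 2: fetch matching documents and group them manually in Node.
  // const animals = await Animal.find({ age: { $gte: 2 } }).lean();
  // const byType = animals.reduce((acc, a) => {
  //   (acc[a.type] = acc[a.type] || []).push(a);
  //   return acc;
  // }, {});

  res.json(byType); // { Dog: [...], Cat: [...], ... }
});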

mongoose count query taking too much time, Need to reduce time

I am trying to get a total count of some documents using a Mongoose count query. When I count around 85k documents, it takes 12 seconds. I need to reduce the time to 2 or 3 seconds.
That is just an example; there could be several hundred thousand documents that have to be counted, and I think it would take too much time.
Here is the query I am using to count documents:
Donor.count(find_cond, function (er, doc) {
console.log(doc, "doc")
});
When it counts 10k to 20k documents it's fine, but when it goes beyond that it becomes far too time-consuming, and it shouldn't be.
I know it is a little late, but I will write this for future reference. After looking for a while, the best way I found to count documents is to use estimatedDocumentCount(), which uses collection metadata.
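For reference, a minimal sketch of that call on the question's Donor model; note that estimatedDocumentCount() ignores filters and estimates the size of the whole collection:

// Reads collection metadata instead of scanning documents, so it is fast,
// but it cannot take a filter condition.
const total = await Donor.estimatedDocumentCount();
console.log(total);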
Another way, for a very large collection (over 200k documents), is the Model.collection.stats() method, which returns an object with a "count" key, as in this example:
const stats = await User.collection.stats();
const userCount = stats.count
It's still not great but the performance is much much better than countDocuments().
Can you try something like this?
Donor.collection.createIndex({ field1: 1, field2: 1, field3: 1 });
Donor.find({ "field1": "val1", "field2": "val2" }).sort({ field3: -1 }).limit(100000).lean().count().exec();
An index allows fast retrieval of data from the database.
Performance can be improved by following the optimal equality -> sort -> range order for the fields of a compound index.
Also, objects returned when using lean() are plain JavaScript objects, whereas a normal query returns full Mongoose documents.
This article provides useful guidelines for MongoDB performance improvement.
Use an index on the field you are trying to get the count of.
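A hedged sketch of that advice, assuming the Donor model's schema (called donorSchema here) and a status field in the count condition, both of which are made-up examples:

// Declare the index on the schema (before the model is compiled) so that
// countDocuments() can use an index scan instead of a collection scan.
// "status" is only an assumed example field.
donorSchema.index({ status: 1 });

const count = await Donor.countDocuments({ status: "active" });
console.log(count);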

With bookshelf.js, how do you loop through all models in batches?

With bookshelf.js it is easy enough to fetch all records for a given model using Model.fetchAll and then loop through them, like so:
SomeModel.fetchAll().then(function(results) {
results.models.forEach(function(model) {
...
});
});
But this loads the entire result set all at once, which is impractical for very large result sets. Is there a simple way to load the results in smaller batches (e.g. only 1000 at a time)?
I know it's possible to do this by maintaining an offset counter and using limit() and offset() to roll my own version, but really I'm looking for something that hides the nuts and bolts, analogous to ActiveRecord's find_in_batches.
But I can't find anywhere in the docs or from a Google search whether a batched fetcher method even exists. Is there a simple way to do this?
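For what it's worth, here is a minimal sketch of the roll-your-own approach the question describes, maintaining an offset and using the knex query builder through Bookshelf's query(); SomeModel comes from the question, while the batch size, the id ordering column, and forEachInBatches are assumptions:

// Offset-based batching: fetch BATCH_SIZE rows at a time until none remain.
const BATCH_SIZE = 1000;

async function forEachInBatches(handle) {
  let offset = 0;
  while (true) {
    const results = await SomeModel
      .forge()
      .query(qb => qb.orderBy("id").limit(BATCH_SIZE).offset(offset))
      .fetchAll();

    if (results.length === 0) break;   // no more rows
    results.models.forEach(handle);    // same shape as the fetchAll loop above
    offset += BATCH_SIZE;
  }
}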

Does the size of a document affect performance of a find() query?

Can the size of a MongoDB document affect the performance of a find() query?
I'm running the following query on a collection, in the MongoDB shell
r.find({_id:ObjectId("5552966b380c2dbc29472755")})
The entire document is 3MB. When I run this query the operation takes about 8 seconds. The document has a "salaries" property which makes up the bulk of the document's size (about 2.9MB). So when I omit the salaries property and run the following query, it takes less than a second.
r.find({_id:ObjectId("5552966b380c2dbc29472755")},{salaries:0})
I only notice this performance difference when I run the find() query by itself. When I run a find().count() query there is no difference. It appears that performance degrades only when I want to fetch the entire document.
The collection is never updated (never changes in size), an index is set on _id and I've run repairDatabase() on the database. I've searched around the web but can't find a satisfactory answer to why there is a performance difference. Any insight and recommendations would be appreciated. Thanks.
I think the experiments you've just run are an answer to your own question.
Mongo will index the _id field by default, so document size shouldn't affect how long it takes to locate the document, but if it's 3MB then you will likely notice a difference in actually downloading that data. I imagine that's why it takes less time when you omit some of the fields.
To get a better sense of how long your query actually takes to run on the server (as opposed to transferring the result), try this in the shell:
r.find({ _id: ObjectId("5552966b380c2dbc29472755") }).explain("executionStats")
If salaries is the 3MB culprit, and it's structured data, then to speed things up you could try (A) splitting it up into separate Mongo documents or (B) querying based on sub-properties of that document, and in both cases you can build indexes to keep those queries fast.
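A hedged sketch of option B, assuming salaries is an array of sub-documents and that it has a year field; both the field name and the value are made up for illustration:

// Index a sub-property of salaries and project only the matching element,
// so the ~2.9MB field is never shipped in full.
r.createIndex({ "salaries.year": 1 })

r.find(
  { _id: ObjectId("5552966b380c2dbc29472755"), "salaries.year": 2015 },
  { "salaries.$": 1 }   // positional projection: return only the matched element
)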

Mongodb document insertion order

I have a MongoDB collection for tracking user audit data, so essentially this will be many millions of documents.
Audits are tracked by loginID (user) and their activities on items. Example: userA modified 'item#13' on date/time.
Case: I need to query with filters based on user and item. That's simple. This returns many thousands of documents per item. I need to list them by latest date/time (descending order).
Problem: How can I insert new documents at the top of the stack (like a capped collection)? Or is it possible to find records from the bottom of the stack (in reverse order)? I do NOT like the idea of find-and-sort, because when dealing with thousands and millions of documents, sorting is a bottleneck.
Any solutions?
Stack: mongodb, node.js, mongoose.
Thanks!
"the top of the stack?"
You're implying there is a stack, but there isn't - there's a tree, or more precisely, a B-tree.
"I do NOT like the idea of find and sorting"
So you want to sort without sorting? That doesn't seem to make much sense. Stacks are essentially in-memory data structures; they don't work well on disk because they require huge contiguous blocks. (In fact, huge stacks don't even work well in memory, and growing a stack requires copying the entire data set, so that would hardly work.)
"sorting is a bottleneck"
It shouldn't be, at least not for data that is stored closely together (data locality). Sorting is an O(m log n) operation, and since the _id field already encodes a timestamp, you already have a field that you can sort on. m is relatively small, so I don't see the problem here. Have you even tried that? With MongoDB 3.0, index intersection has become more powerful, so you might not even need _id in the compound index.
On my machine, getting the top items from a large collection, filtered by an index takes 1ms ("executionTimeMillis" : 1) if the data is in RAM. The sheer network overhead will be in the same league, even on localhost. I created the data with a simple network creation tool I built and queried it from the mongo console.
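To make this concrete, a hedged sketch in shell syntax; the audits collection and field names are assumed from the question's description:

// Compound index: equality fields first, then the (descending) sort field.
db.audits.createIndex({ loginID: 1, item: 1, _id: -1 })

// Latest activity first; the sort is satisfied by the index, so no
// in-memory sort over millions of documents is needed.
db.audits.find({ loginID: "userA", item: "item#13" }).sort({ _id: -1 }).limit(50)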
I have encountered the same problem. My solution is to create an additional collection which maintains the top 10 records. The good point is that you can query it quickly; the bad point is that you need to keep the additional collection updated.
I found this, which inspired me. I implemented my solution with Ruby + Mongoid.
My solution:
Collection definition:
class TrainingTopRecord
include Mongoid::Document
field :training_records, :type=>Array
belongs_to :training
index({training_id: 1}, {unique: true, drop_dups: true})
end
Maintenance process:
if t.training_top_records == nil
training_top_records = TrainingTopRecord.create! training_id: t.id
else
training_top_records = t.training_top_records
end
training_top_records.training_records = [] if training_top_records.training_records == nil
top_10_records = training_top_records.training_records
top_10_records.push({
'id' => r.id,
'return' => r.return
})
top_10_records.sort_by! {|record| -record['return']}
#limit training_records' size to 10
top_10_records.slice! 10, top_10_records.length - 10
training_top_records.save
MongoDB's ObjectId is structured in a way that has a natural ordering.
This means the last inserted item is fetched last.
You can override that by using: db.collectionName.find().sort({ $natural: -1 }) during a fetch.
Filters can then follow.
You will not need to create any additional indices since this works on _id, which is indexed by default.
This is possibly the only efficient way you can achieve what you want.
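A short hedged sketch of this answer's approach applied to the question's scenario; the audits collection and field names are assumed:

// Reverse natural (insertion) order, newest first, with the question's filters.
db.audits.find({ loginID: "userA", item: "item#13" }).sort({ $natural: -1 }).limit(50)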
