How do MongoDB queries work in Node.js? - node.js

I have a very easy search query in Node.js Express.js MongoDB with Mongoose:
await Model.find({}).limit(10);
My question is how the architecture works. Does it first fetch all of the Model's data and then limit it to 10, or does it select only 10 items from the database in the first place? I mean one of these two sequences:
1. Find all data for the Model and return it as a list (array), then keep the first 10 items and remove the others from the list.
2. Find the first 10 items and return them as a list (array).
The difference in performance is large. With the first approach, a collection of a million documents returns all 1 million items, which takes a huge 10 to 20 seconds, and only then limits them to 10, so we lose 10 to 20 seconds per query, and with more users the server will be overwhelmed. With the second approach, even with 100 million items, the query should always take about the same time.

The limit function specifies the maximum number of documents a cursor will return. In your example, the cursor returns only the first 10 items matching the query (option 2). You can find more information on how cursor.limit() works via the links below:
https://docs.mongodb.com/manual/reference/method/cursor.limit/
http://mongodb.github.io/node-mongodb-native/3.5/api/Cursor.html#limit
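As a rough in-memory analogy (not MongoDB's actual implementation), the sketch below shows why a limit on an unsorted query is cheap: a generator stands in for the cursor, and the limit helper stops pulling documents as soon as it has enough. The names scanCollection and limit are hypothetical and exist only for this illustration.

```javascript
// Hypothetical stand-in for a collection scan: yields documents one at a time
// and counts how many were actually touched.
function* scanCollection(size, stats) {
  for (let i = 0; i < size; i++) {
    stats.scanned++;
    yield { _id: i };
  }
}

// Hypothetical limit helper: stops consuming the source after n documents.
function limit(source, n) {
  const out = [];
  if (n <= 0) return out;
  for (const doc of source) {
    out.push(doc);
    if (out.length >= n) break; // the rest of the "collection" is never visited
  }
  return out;
}

const stats = { scanned: 0 };
const firstTen = limit(scanCollection(1000000, stats), 10);
console.log(firstTen.length, stats.scanned);
```

In this sketch the number of documents touched depends on the limit, not on the collection size. A query with a sort and no supporting index behaves differently, because the server has to look at every matching document before it can pick the first 10.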

Related

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at slightly longer trend, say 3 hours, I could retrieve the last 2160 entries (3 hours) however that is more time to pull from the server, and more time and memory to plot. As when looking at the larger trends, the small movements are noise and I would be better off retrieving the same number of documents (720) but only every 3rd, still giving me 3 hours of results, with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data however this seems extremely inefficient and I would rather the database handle this part.
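The client-side filtering described above can be sketched as follows. The helper name everyNth and the sample readings are made up for illustration.

```javascript
// Keep only the documents whose index is a multiple of n (n = 1 keeps everything).
function everyNth(docs, n) {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  return docs.filter((_, index) => index % n === 0);
}

// Made-up temperature readings, one per 5-second interval.
const readings = [20.1, 20.2, 20.4, 20.3, 20.5, 20.6, 20.8];
const thinned = everyNth(readings, 3);
console.log(thinned);
```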
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in MongoDB for your problem.
The way forward would be to archive your data smartly, in fragments.
You can store your data in collections that each hold no more than a week or a month of data. A new week or month means storing your data in a different collection. That way you won't be doing a full collection scan and won't be collecting every single document, as you mentioned in your problem. Your application code decides which collection to query.
If I were in your shoes, I would use a different tool, as MongoDB is better suited as a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, as in your case.
Alternative fragmentation (update):
Always write your current data to the collection "week0", and in the background run a weekly scheduler that moves the data from "week0" to the history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements.
I think the $bucketAuto stage might help you with it.
You can do something like this:
db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // the field to group by; in your example, the temperature field
      buckets: 5 // the number of documents you want back; for a sample of 500 documents, put 500 here
    }
  }
])
Each document in the result of the above query would look something like this:
{
  "_id": {
    "max": 3,
    "min": 1
  },
  "count": 2
}
If you had grouped by temperature, each document would hold the minimum and maximum temperature found in that sample.
You might have another problem. Docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should
not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
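A minimal sketch of that idea, under two loud assumptions: each document has an integer field holding epoch seconds (called epoch here, a hypothetical name), and the readings land exactly on multiples of the sampling interval. It uses MongoDB's $mod query operator, which takes [divisor, remainder]. The matches helper is only an in-memory check of what the filter selects.

```javascript
// Build a filter that keeps every n-th reading, assuming readings are aligned
// to multiples of intervalSeconds. The field name "epoch" is an assumption.
function everyNthFilter(intervalSeconds, n) {
  return { epoch: { $mod: [intervalSeconds * n, 0] } };
}

// In-memory imitation of the $mod match, for illustration only.
function matches(doc, filter) {
  const [divisor, remainder] = filter.epoch.$mod;
  return doc.epoch % divisor === remainder;
}

const filter = everyNthFilter(5, 3); // 5-second data, every 3rd reading
const docs = [0, 5, 10, 15, 20, 25, 30].map((epoch) => ({ epoch }));
const kept = docs.filter((d) => matches(d, filter)).map((d) => d.epoch);
console.log(JSON.stringify(filter), kept);
```

The filter could then be passed to something like db.collection(collection).find(filter).sort({ epoch: -1 }).limit(720). If the timestamps drift off the 5-second grid, this filter will silently skip readings, so it only fits data written on a fixed schedule.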

Find documents in MongoDB with non-typical limit

I have a problem and no idea how to resolve it.
I've got PointValues collection in MongoDB.
PointValue schema has 3 parameters:
dataPoint (ref to DataPoint schema)
value (Number)
time (Date)
There is one pointValue for every hour (24 per day).
I have an API method to get PointValues for a specified DataPoint and time range. The problem is that I need to limit the result to a maximum of 1000 points. The usual limit(1000) method isn't a good fit, because I need points covering the whole specified time range, with a time step that depends on the time range and the number of point values.
So... for example:
Request data for 1 year = 1 * 365 * 24 = 8760
It should return 1000 values but approx 1 value per (24 / (1000 / 365)) = ~9 hours
I have no idea which method I should use to filter that data in MongoDB.
Thanks for help.
Sampling exactly like that on the database would be quite hard to do and likely not very performant. But an option which gives you a similar result would be to use an aggregation pipeline which $group's the $first value by $year, $dayOfYear, and $hour (and $minute and $second if you need smaller intervals). That way you can sample values by time steps, but your choice of step lengths is limited to what you have date operators for. So "hourly" samples are easy, but "9-hourly" samples get complicated. If this query is performance-critical and frequent, you might want to consider creating additional collections with daily, hourly, minutely etc. DataPoints so you don't need to perform that aggregation on every request.
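A sketch of what such a pipeline could look like. The field names time, value and dataPoint come from the schema in the question; the function name and parameters are hypothetical. The $match and $sort stages are additions so that $first picks the earliest value in each hour.

```javascript
// Build an aggregation pipeline that keeps the first value of every hour
// for one data point within [from, to).
function hourlySamplePipeline(dataPointId, from, to) {
  return [
    { $match: { dataPoint: dataPointId, time: { $gte: from, $lt: to } } },
    { $sort: { time: 1 } }, // so that $first means "earliest in the hour"
    {
      $group: {
        _id: {
          year: { $year: "$time" },
          dayOfYear: { $dayOfYear: "$time" },
          hour: { $hour: "$time" },
        },
        time: { $first: "$time" },
        value: { $first: "$value" },
      },
    },
    { $sort: { time: 1 } }, // $group does not guarantee output order
  ];
}

const pipeline = hourlySamplePipeline(
  "some-data-point-id", // placeholder; use the type actually stored in dataPoint
  new Date("2020-01-01T00:00:00Z"),
  new Date("2021-01-01T00:00:00Z")
);
console.log(JSON.stringify(pipeline, null, 2));
```

Assuming a Mongoose model named PointValue, the result could be passed to PointValue.aggregate(pipeline). Adding $minute or $second to the _id gives smaller steps, as described above.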
But your documents are quite lightweight because the actual payload is in a different collection. So you might consider fetching all the results in the requested time range and then doing the skipping on the application layer. You might also consider combining this with the aggregation described above to pre-reduce the dataset: first use an aggregation pipeline to get hourly results into the application, and then skip through the result set in steps of 9 documents. Whether or not this makes sense depends on how many documents you expect.
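The application-layer skipping can be sketched like this, using the numbers from the question (8760 hourly points, at most 1000 returned). The helper names stepFor and downsample are hypothetical.

```javascript
// Smallest integer step that brings totalPoints down to at most maxPoints.
function stepFor(totalPoints, maxPoints) {
  if (maxPoints < 1) throw new RangeError("maxPoints must be at least 1");
  return Math.max(1, Math.ceil(totalPoints / maxPoints));
}

// Keep every step-th point, starting with the first one.
function downsample(points, maxPoints) {
  const step = stepFor(points.length, maxPoints);
  return points.filter((_, index) => index % step === 0);
}

// One year of hourly points: 365 * 24 = 8760 entries.
const hourly = Array.from({ length: 365 * 24 }, (_, hour) => ({ hour }));
const sampled = downsample(hourly, 1000);
console.log(stepFor(hourly.length, 1000), sampled.length);
```

Rounding the step up guarantees the result never exceeds the maximum, at the cost of returning somewhat fewer points than allowed.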
Also remember to create a sorted index on the time-field.

How to divide 1 mongoDB collection into 2 or more collections

I'm using MongoDB to scrape a dataset with Node.js. My collection has 0.2 million documents, and Node.js is crashing with a segmentation fault. Is there a way to split the collection into 2 or more collections so that Node.js doesn't crash?
Thanks!!
Did you try using limit to constrain the number of documents returned? You can take the total document count of the collection and then split it using limit and skip. For example, if the collection has 200 docs:
First time, limit 100 docs and skip 0.
Second time, limit 100 again, but this time skip 100.
This is one way I can think of. There may be other ways.
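The splitting described above can be sketched as a small helper that turns a total count and a page size into skip/limit pairs. The name pageRanges is hypothetical.

```javascript
// Split totalCount documents into consecutive { skip, limit } pages.
function pageRanges(totalCount, pageSize) {
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be a positive integer");
  }
  const ranges = [];
  for (let skip = 0; skip < totalCount; skip += pageSize) {
    // The last page may be shorter than pageSize.
    ranges.push({ skip, limit: Math.min(pageSize, totalCount - skip) });
  }
  return ranges;
}

const pages = pageRanges(200, 100);
console.log(pages);
```

Each pair could then drive one query, for example collection.find().skip(page.skip).limit(page.limit), so that only one page of documents is held in memory at a time.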

MongoDB cursor.toArray() has become the bottleneck

MongoDB's cursor.toArray() has become the bottleneck. I need to process 2 million documents and output them to a file. I am processing 10,000 at a time using the skip and limit options, but it didn't quite work, so I was looking for a driver that is more memory efficient. I have also tried processing 10 documents at a time, and it takes forever, so I am not sure whether .each() can solve the problem. Also, does .nextObject make a network call every time we retrieve a single document?
Node.js also has an internal limit of 1.5GB on memory so I am not sure how I can process these documents. I do believe that this problem can be solved just by using the mongo cursor in the right way at the application level and not doing any database level aggregations.
There shouldn't be any need to hold all the documents since you can write each document to the file as it is received from the server. If you use a cursor with .each and a batchSize, you can write each document to the file, holding no more than batchSize documents on the client side:
db.collection.find(query, { "batchSize" : 100 }).each(writeToFile)
From the Node.js driver API docs
the cursor will only hold a maximum of batch size elements at any given time if batch size is specified
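As a variation on the same idea, here is a sketch that writes documents one at a time using async iteration instead of .each. It assumes a driver version whose cursors support for await, and it accepts any async iterable, so it can be tried without a database. The function name writeDocsAsJsonLines is hypothetical.

```javascript
const { once } = require("events");

// Write each document as one JSON line. Only the current document (plus
// whatever the cursor has batched) is held in memory.
async function writeDocsAsJsonLines(cursor, out) {
  let written = 0;
  for await (const doc of cursor) {
    // Respect backpressure: wait for 'drain' when the stream's buffer is full.
    if (!out.write(JSON.stringify(doc) + "\n")) {
      await once(out, "drain");
    }
    written++;
  }
  return written;
}
```

With a real collection, the call might look like writeDocsAsJsonLines(db.collection("items").find(query).batchSize(100), fs.createWriteStream("out.jsonl")), where "items" and "out.jsonl" are placeholder names.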
Using skip and limit to break up results is a bad idea. A query with a skip of n and a limit of m generally has to scan at least n + m documents or index entries. If you paginate with skip and limit, you end up making the amount of work the query has to do quadratic in the size of (total number of results / limit), e.g. for 1000 docs and a limit of 100, the total docs scanned would be on the order of
100 + 200 + 300 + 400 + ... + 1000 = 100 (1 + 2 + 3 + 4 + ... + 10) = 5500
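The sum above can be checked with a few lines of code. This is a simplified cost model that assumes each page scans exactly skip + limit documents; the function name is hypothetical.

```javascript
// Total documents scanned when paging through totalDocs with skip and limit,
// assuming each page has to walk past everything it skips.
function docsScannedWithSkipLimit(totalDocs, pageSize) {
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be a positive integer");
  }
  let scanned = 0;
  for (let skip = 0; skip < totalDocs; skip += pageSize) {
    scanned += Math.min(skip + pageSize, totalDocs);
  }
  return scanned;
}

console.log(docsScannedWithSkipLimit(1000, 100));
console.log(docsScannedWithSkipLimit(2000000, 10000));
```

The second call uses the numbers from the question (2 million documents, 10,000 per page) and shows how quickly the total grows compared with a single pass over the cursor.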

Strange data access time in Azure Table Storage while using .Take()

This is our situation:
We store user messages in Table Storage. The PartitionKey is the UserId, and the RowKey is used as a message id.
When a user opens their message panel, we want to just .Take(x) messages; we don't care about the sort order. But we have noticed that the time it takes to get the messages varies a lot with the number of messages we take.
We did some small tests:
We did 50 * .Take(X) and compared the differences:
So we did .Take(1) 50 times and .Take(100) 50 times etc.
To make an extra check we did the same test 5 times.
Here are the results:
As you can see there are some HUGE differences. The difference between 1 and 2 is very strange. The same for 199-200.
Does anybody have any clue how this is happening? The Table Storage is on a live server btw, not development storage.
Many thanks.
Chart axes: X = number of Takes, Y = test number.
Update
The problem only seems to occur when I'm using a wireless network. When I'm using the cable, the times are normal.
Possibly the data is collected in batches of a certain number x. When you request x+1 rows, it would have to take two batches and then drop a certain number.
Try running your test with increments of 1 as the Take() parameter, to confirm or dismiss this assumption.
