fast and efficient pagination in mongodb using express and node - node.js

I have a collection named product in MongoDB with more than 2 million products in it. I just want to paginate from the first document to the last; no filter or sorting is needed.
I use skip() and limit(), but the response time keeps growing as the skip() value gets bigger.
app.get('/products', async (req, res) => {
    try {
        var query = isNaN(req.query.page) ? 0 : req.query.page <= 0 ? 0 : parseInt(req.query.page) || 0;
        const productPerPage = 20;
        // countDocuments() returns a promise and must be awaited
        var totalPages = await Product.countDocuments();
        totalPages = Math.floor(totalPages / productPerPage);
        query = query > totalPages ? totalPages : query;
        const data = await Product.find()
            .sort({ _id: 1 })
            .skip(query * productPerPage)
            .limit(productPerPage);
        res.status(200).render('products', { data, currPage: query, totalPages });
    } catch (error) {
        console.error(error.message);
        res.status(500).send("Internal Server Error");
    }
});
It works properly, but as the database gets larger the response time gets longer.

Using skip and limit to paginate means that the db server needs to load the matching documents, sort them, then apply the skip and limit.
Using the sort on {_id:1} allows the server to read the _id values from the index in pre-sorted order instead of doing an in-memory sort, but it will still need to scan all of the preceding values in order to find the first one to return. On the first call it starts at the first document and returns 20; on the second call it reads 40 documents, discards the first 20, and returns 20; and so on, until on the last call it reads all 2 million documents and discards 1,999,980 of them. This is why it is so much slower on the later pages.
There are alternate pagination methods that perform better. For example, instead of requesting just a page number, if the application were to include the previous _id value in the request, the route could query for Product.find({_id:{$gt: ObjectId(req.query.lastseen)}}).sort({_id:1}).limit(productPerPage) which would both have a more predictable runtime, and would not suffer from missing documents if one were deleted between calls.
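A minimal sketch of that range-based route, assuming the app uses Mongoose as the question's code suggests (the lastseen query parameter is the one described above, and the client passes back the last _id it received):

const mongoose = require('mongoose'); // assumed, since Product looks like a Mongoose model

app.get('/products', async (req, res) => {
    try {
        const productPerPage = 20;
        const filter = {};
        // If the client supplied the last _id it has seen, only fetch documents after it.
        if (req.query.lastseen) {
            filter._id = { $gt: new mongoose.Types.ObjectId(req.query.lastseen) };
        }
        const data = await Product.find(filter)
            .sort({ _id: 1 })
            .limit(productPerPage);
        // The client sends data[data.length - 1]._id back as "lastseen" for the next page.
        res.status(200).render('products', { data });
    } catch (error) {
        console.error(error.message);
        res.status(500).send("Internal Server Error");
    }
});

Each page then costs one index seek plus 20 documents, regardless of how deep into the collection the client has scrolled.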

Related

MongoDB: how do I make this function that sorts users to grab an individual player's position faster?

class User {
    constructor(doc) {
        this.username = doc.username;
        this.kills = doc.kills;
        this.deaths = doc.deaths;
    }
}

const res = await Calls.getAllUsers();
const users = res.map((doc) => new User(doc));
const sorted = users.sort((a, b) => b.kills - a.kills);
const whereIam = sorted.indexOf(users.find((u) => u.username === latest_user)) + 1;
Hi everyone, I am trying to work out a player's position (by kills).
They would run /stats and get a position based on the highest kills.
This takes more than five minutes to determine a player's position, with 26,000 documents and 2,000 new documents being created every day. What are the steps to make this faster, or what should I change?
You look to be getting all users and then doing the sort and search in memory. As your data set gets larger, that will get much more memory intensive.
So the first suggestion would be to get Mongo to do that for you in your query.
Adding to that, since Mongo will be doing the query, you'll want indexes in Mongo on the username and kills fields.
If there's a reason you want all the users in memory though, then you could consider a hash table to help with your user lookups, but there's no way around the sorting time. Make sure your sort is using a reasonable algorithm (that's a whole subject on its own, but Quicksort is the one with the best average performance).
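If you do keep everything in memory, a minimal sketch of the hash-table idea using a JavaScript Map (this reuses the users array and latest_user from the question; the sort still dominates the cost):

// Build a Map keyed by username for O(1) lookups instead of users.find(...).
const byName = new Map(users.map((u) => [u.username, u]));

// Sorting is still O(n log n); the engine's built-in sort handles it.
const sorted = [...users].sort((a, b) => b.kills - a.kills);

// The position is the index of the looked-up user in the sorted array, plus one.
const whereIam = sorted.indexOf(byName.get(latest_user)) + 1;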
In this specific scenario, you can follow these steps:
1. Create an index on { username: 1 }
2. Create an index on { kills: -1 }
3. Query with find({ username: latest_user }) to get this user's kills as latest_user_kills
4. Query with countDocuments({ kills: { $gt: latest_user_kills } }) to get the count of all users whose kills are higher than this latest_user's
5. The count from step 4, plus one, is the position of this individual player (see the sketch below)
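A rough sketch of those steps with the Node.js MongoDB driver (the "users" collection name and the connection handling are assumptions made for illustration):

const { MongoClient } = require("mongodb");

// Rank a player by kills without loading every user into memory.
// Assumes a "users" collection with { username, kills } documents and
// indexes on { username: 1 } and { kills: -1 }.
async function getPosition(db, latestUser) {
    const users = db.collection("users");
    // Step 3: look up this user's kill count.
    const user = await users.findOne({ username: latestUser }, { projection: { kills: 1 } });
    if (!user) return null;
    // Step 4: count how many players have strictly more kills.
    const better = await users.countDocuments({ kills: { $gt: user.kills } });
    // Step 5: players with more kills, plus one, gives the 1-based position.
    return better + 1;
}

// usage:
// const client = await MongoClient.connect("mongodb://localhost:27017");
// const position = await getPosition(client.db("game"), latest_user);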

How to read an individual column from DynamoDB without using Scan in Node.js?

I have 4.5 million records in my DynamoDB.
I want to read the id of each record in batches.
I am expecting something like offset and limit, the way we can read in MongoDB.
Are there any suggestions for doing this without the scan method in Node.js?
I have done enough research and I can only find the scan method, which buffers the complete records from DynamoDB and then starts scanning the records, which is not effective performance-wise.
Please do give me suggestions.
From my point of view, there's no problem doing scans because (according to the Scan doc):
DynamoDB paginates the results from Scan operations
You can use the ProjectionExpression parameter so that Scan only returns some of the attributes, rather than all of them
The default size for pages is 1MB, but you can also specify the max number of items per page with the Limit parameter.
So it's just basic pagination, the same thing MongoDB does with offset and limit.
Here is an example from the docs of how to perform Scan with the node.js SDK.
Now, if you want to get all the IDs in batches, you could wrap the whole thing with a Promise and resolve when there's no LastEvaluatedKey.
Below is pseudo-code of what you could do:
const AWS = require("aws-sdk"); // assumes the AWS SDK v2 is installed and configured

const performScan = () => new Promise((resolve, reject) => {
    const docClient = new AWS.DynamoDB.DocumentClient();
    const params = {
        TableName: "YOUR_TABLE_NAME",
        ProjectionExpression: "id",
        Limit: 100 // only if you want something other than the default 1MB. 100 means 100 items
    };
    let items = [];
    const scanExecute = () => {
        docClient.scan(params, (err, result) => {
            if (err) return reject(err);
            items = items.concat(result.Items);
            if (result.LastEvaluatedKey) {
                // keep scanning from where the last page stopped
                params.ExclusiveStartKey = result.LastEvaluatedKey;
                return scanExecute();
            }
            return resolve(items);
        });
    };
    scanExecute();
});

performScan().then(items => {
    // deal with it
});
The first thing to know about DynamoDB is that it is a key-value store with support for secondary indexes.
DynamoDB is a bad choice if the application often has to iterate over the entire data set without using indexes (primary or secondary), because the only way to do that is to use the Scan API.
DynamoDB table Scans are (a few things I can think of):
Expensive (I mean $$$)
Slow for big data sets
Liable to use up the provisioned throughput
If you know the primary key of all the items in DynamoDB (through some external knowledge, like the primary key being an auto-incremented value, or being referenced in another DB, etc.), then you can use BatchGetItem or Query, as sketched below.
So if it is a one-off thing then Scan is your only option; otherwise you should look into refactoring your application to remove this scenario.
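For completeness, a BatchGetItem call through the DocumentClient looks roughly like this when the keys are already known (the table name, the "id" key attribute and the knownIds array are placeholders for illustration; BatchGetItem accepts at most 100 keys per request):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Keys known from some external source (placeholder values).
const knownIds = ["a1", "a2", "a3"];

const params = {
    RequestItems: {
        YOUR_TABLE_NAME: {
            Keys: knownIds.map(id => ({ id })),
            ProjectionExpression: "id"
        }
    }
};

// Fetch up to 100 known items by primary key in one round trip.
docClient.batchGet(params, (err, data) => {
    if (err) return console.error(err);
    console.log(data.Responses.YOUR_TABLE_NAME);
});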

Nodejs & Mongo pagination random order

I am running an iOS app where I display a list of users that are currently online.
I have an API endpoint where I return 10 (or N) users randomly, so that you can keep scrolling and always see new users. Therefore I want to make sure I don't return a user that I have already returned before.
I cannot use a cursor or normal pagination, as the users have to be returned randomly.
I tried 2 things, but I am sure there is a better way:
At first, what I did was send the IDs of the users that were already seen in the parameters of the request.
But if the user keeps scrolling and has gone through 200 profiles, then the list gets long and it doesn't look clean.
Then, in the database, I tried adding a field to each user, "online_profiles_already_sent", where I would store an array of the IDs that were already sent to the user (I am using MongoDB).
I can't figure out how to do it in a better/cleaner way.
EDIT:
I found a way to do it with MySQL, using RAND(seed)
but I can't figure out if there is a way to do the same thing with Mongo
PHP MySQL pagination with random ordering
Thank you :)
I think the only way that you will be able to guarantee that users see unique users every time is to store the list of users that have already been seen. Even in the RAND example that you linked to, there is a possibility of intersection with a previous user list, because RAND won't necessarily exclude previously returned users.
Random Sampling
If you do want to go with random sampling, consider Random record from MongoDB, which suggests using an aggregation with the $sample operator. The implementation would look something like this:
const {
    MongoClient
} = require("mongodb");

const
    DB_NAME = "weather",
    COLLECTION_NAME = "readings",
    MONGO_DOMAIN = "localhost",
    MONGO_PORT = "32768",
    MONGO_URL = `mongodb://${MONGO_DOMAIN}:${MONGO_PORT}`;

(async function () {
    const client = await MongoClient.connect(MONGO_URL),
        db = await client.db(DB_NAME),
        collection = await db.collection(COLLECTION_NAME);

    const randomDocs = await collection
        .aggregate([{
            $sample: {
                size: 5
            }
        }])
        .map(doc => {
            return {
                id: doc._id,
                temperature: doc.main.temp
            };
        })
        .toArray(); // materialize the cursor so forEach below works on an array

    randomDocs.forEach(doc => console.log(`ID: ${doc.id} | Temperature: ${doc.temperature}`));

    client.close();
}());
Cache of Previous Users
If you go with maintaining a list of previously viewed users, you could write an implementation using the $nin filter and store the _id of previously viewed users.
Here is an example using a weather database that I have returning entries 5 at a time until all have been printed:
const {
    MongoClient
} = require("mongodb");

const
    DB_NAME = "weather",
    COLLECTION_NAME = "readings",
    MONGO_DOMAIN = "localhost",
    MONGO_PORT = "32768",
    MONGO_URL = `mongodb://${MONGO_DOMAIN}:${MONGO_PORT}`;

(async function () {
    const client = await MongoClient.connect(MONGO_URL),
        db = await client.db(DB_NAME),
        collection = await db.collection(COLLECTION_NAME);

    let previousEntries = [], // Track ids of things we have seen
        empty = false;

    while (!empty) {
        const findFilter = {};
        if (previousEntries.length) {
            findFilter._id = {
                $nin: previousEntries
            };
        }

        // Get items 5 at a time
        const docs = await collection
            .find(findFilter, {
                limit: 5,
                projection: {
                    main: 1
                }
            })
            .map(doc => {
                return {
                    id: doc._id,
                    temperature: doc.main.temp
                };
            })
            .toArray();

        // Keep track of already seen items
        previousEntries = previousEntries.concat(docs.map(doc => doc.id));

        // Are we still getting items?
        console.log(docs.length);
        empty = !docs.length;

        // Print out the docs
        docs.forEach(doc => console.log(`ID: ${doc.id} | Temperature: ${doc.temperature}`));
    }

    client.close();
}());
I have encountered the same issue and can suggest an alternate solution.
TL;DR: Grab all Object IDs of the collection on first landing, randomize them using NodeJS and use them later on.
Disadvantage: slow first landing if you have millions of records
Advantage: subsequent executions are probably quicker than the other solution
Let's get to the detailed explanation :)
To explain better, I will make the following assumptions
Assumptions:
Assume the programming language used is NodeJS
The solution works for other programming languages as well
Assume you have 4 total objects in your collection
Assume the pagination limit is 2
Steps:
On first execution:
Grab all Object IDs
Note: I did consider performance; this execution takes a split second for collections of size 10,000. If you are solving a million-record issue then maybe use some form of partition logic first / use the other solution listed
db.getCollection('my_collection').find({}, {_id:1}).map(function(item){ return item._id; });
OR
db.getCollection('my_collection').find({}, {_id:1}).map(function(item){ return item._id.valueOf(); });
Result:
ObjectId("FirstObjectID"),
ObjectId("SecondObjectID"),
ObjectId("ThirdObjectID"),
ObjectId("ForthObjectID"),
Randomize the retrieved array using NodeJS
Result:
ObjectId("ThirdObjectID"),
ObjectId("SecondObjectID"),
ObjectId("ForthObjectID"),
ObjectId("FirstObjectID"),
Store this randomized array:
If this is a server-side script that randomizes pagination for each user, consider storing it in a Cookie / Session
I suggest a Cookie (with a timeout that expires when the browser closes) for scaling purposes
On each retrieval:
Retrieve the stored array
Grab the pagination items (e.g. the first 2 items)
Find the objects for those items using find $in
db.getCollection('my_collection')
.find({"_id" : {"$in" : [ObjectId("ThirdObjectID"), ObjectId("SecondObjectID")]}});
Using NodeJS, sort the retrieved objects based on the retrieved pagination items (see the sketch below)
There you go! A randomized MongoDB query with pagination :)
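A rough NodeJS sketch of those steps (the Fisher-Yates shuffle and the getPage helper are illustrative; the shuffled id array is assumed to be kept in the user's cookie or session as described above):

// Fisher-Yates shuffle for the id array grabbed on first landing.
function shuffle(ids) {
    for (let i = ids.length - 1; i > 0; i--) {
        const j = Math.floor(Math.random() * (i + 1));
        [ids[i], ids[j]] = [ids[j], ids[i]];
    }
    return ids;
}

// On each request: slice out one page, fetch those documents with $in,
// then restore the randomized order (find with $in does not preserve it).
async function getPage(collection, shuffledIds, page, pageSize) {
    const pageIds = shuffledIds.slice(page * pageSize, (page + 1) * pageSize);
    const docs = await collection.find({ _id: { $in: pageIds } }).toArray();
    return pageIds.map(id => docs.find(doc => doc._id.equals(id)));
}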

Massive inserts with pg-promise

I'm using pg-promise and I want to make multiple inserts into one table. I've seen some solutions like Multi-row insert with pg-promise and How do I properly insert multiple rows into PG with node-postgres?, and I could use pgp.helpers.concat in order to concatenate multiple queries.
But now I need to insert a lot of measurements into a table, with more than 10,000 records, and https://github.com/vitaly-t/pg-promise/wiki/Performance-Boost says:
"How many records you can concatenate like this - depends on the size of the records, but I would never go over 10,000 records with this approach. So if you have to insert many more records, you would want to split them into such concatenated batches and then execute them one by one."
I read the whole article but I can't figure out how to "split" my inserts into batches and then execute them one by one.
Thanks!
UPDATE
Best is to read the following article: Data Imports.
As the author of pg-promise I was compelled to finally provide the right answer to the question, as the one published earlier didn't really do it justice.
In order to insert a massive/infinite number of records, your approach should be based on the method sequence, which is available within tasks and transactions.
var cs = new pgp.helpers.ColumnSet(['col_a', 'col_b'], {table: 'tableName'});

// returns a promise with the next array of data objects,
// while there is data, or an empty array when no more data left
function getData(index) {
    if (/* still have data for the index */) {
        // - resolve with the next array of data
    } else {
        // - resolve with an empty array, if no more data left
        // - reject, if something went wrong
    }
}

function source(index) {
    var t = this;
    return getData(index)
        .then(data => {
            if (data.length) {
                // while there is still data, insert the next bunch:
                var insert = pgp.helpers.insert(data, cs);
                return t.none(insert);
            }
            // returning nothing/undefined ends the sequence
        });
}

db.tx(t => t.sequence(source))
    .then(data => {
        // success
    })
    .catch(error => {
        // error
    });
This is the best approach to inserting a massive number of rows into the database, from the points of view of both performance and load throttling.
All you have to do is implement your function getData according to the logic of your app, i.e. where your large data is coming from, based on the index of the sequence, to return some 1,000 - 10,000 objects at a time, depending on the size of the objects and data availability.
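For illustration only, here is a getData that pages through a large pre-loaded array (in a real app the data would more likely come from a file, a stream or another database, as described above; allRecords and batchSize are made-up names):

// Illustrative only: serve 1,000-object slices from a pre-loaded array.
const allRecords = [/* ...large array of {col_a, col_b} objects... */];
const batchSize = 1000;

function getData(index) {
    const batch = allRecords.slice(index * batchSize, (index + 1) * batchSize);
    return Promise.resolve(batch); // an empty array ends the sequence
}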
See also some API examples:
spex -> sequence
Linked and Detached Sequencing
Streaming and Paging
Related question: node-postgres with massive amount of queries.
And in cases where you need to acquire generated id-s of all the inserted records, you would change the two lines as follows:
// return t.none(insert);
return t.map(insert + 'RETURNING id', [], a => +a.id);
and
// db.tx(t => t.sequence(source))
db.tx(t => t.sequence(source, {track: true}))
just be careful, as keeping too many record id-s in memory can create an overload.
I think the naive approach would work.
Try to split your data into multiple pieces of 10,000 records or less.
I would try splitting the array using the solution from this post.
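For example, a simple chunking helper could look like this (a sketch; the chunk size is arbitrary, and the splittedData name is carried over from the code below):

// Split a large array into pieces of at most `size` elements.
function chunk(array, size) {
    const result = [];
    for (let i = 0; i < array.length; i += size) {
        result.push(array.slice(i, i + size));
    }
    return result;
}

// e.g. const splittedData = chunk(values, 10000);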
Then, multi-row insert each array with pg-promise and execute them one by one in a transaction.
Edit : Thanks to #vitaly-t for the wonderful library and for improving my answer.
Also don't forget to wrap your queries in a transaction, or else it
will deplete the connections.
To do this, use the batch function from pg-promise to resolve all queries asynchronously :
// split your array here to get splittedData
var cs = new pgp.helpers.ColumnSet(['col_a', 'col_b'], {table: 'tmp'})

// values = [..,[{col_a: 'a1', col_b: 'b1'}, {col_a: 'a2', col_b: 'b2'}]]
let queries = []
for (var i = 0; i < splittedData.length; i++) {
    var query = pgp.helpers.insert(splittedData[i], cs)
    queries.push(query)
}

db.tx(function () {
    // the batch must be returned so the transaction waits for all queries
    return this.batch(queries)
})
    .then(function (data) {
        // all records inserted successfully!
    })
    .catch(function (error) {
        // error;
    });

Enforce limit on mongodb bulk API

I'd like to delete a large number of old documents from one collection and so it makes sense to use the bulk api. Deleting them is as simple as:
var bulk = db.myCollection.initializeUnorderedBulkOp();
bulk.find({
    _id: {
        $lt: oldestAllowedId
    }
}).remove();
bulk.execute();
The only problem is that this will attempt to delete every single document matching the criteria, and in this case that is millions of documents, so for performance reasons I don't want to delete them all at once. I want to enforce a limit on the operation, so that I can do something like bulk.limit(10000).execute(); and space the operations out by a few seconds, to prevent locking the database for longer than necessary. However, I have been unable to find any options that can be passed to bulk for limiting the number of operations it executes.
Is there a way to limit bulk operations in this manner?
Before anyone mentions it, I know that bulk will split operations into 1,000-document chunks automatically, but it will still execute all of those operations sequentially, as fast as it can. This results in a much larger performance impact than I can deal with right now.
You can iterate over the array of _id values of the documents that match your query using the .forEach method. The best way to return that array is with the .distinct() method. You then use "bulk" operations to remove your documents.
var bulk = db.myCollection.initializeUnorderedBulkOp();
var count = 0;
var ids = db.myCollection.distinct('_id', { '_id': { '$lt': oldestAllowedId } });

ids.forEach(function(id) {
    bulk.find({ '_id': id }).removeOne();
    count++;
    if (count % 1000 === 0) {
        // Execute per 1000 operations and re-init
        bulk.execute();
        // Here you can sleep for a while
        bulk = db.myCollection.initializeUnorderedBulkOp();
    }
});

// clean up the remaining queued operations
// (skip when count is an exact multiple of 1000, since the bulk is then empty)
if (count % 1000 !== 0) {
    bulk.execute();
}
