I am working on a project that deals with a large amount of data. Fetching all of the data from MongoDB at once is not an option, since it results in a bad user experience. I am building an infinite-scroll setup: on each scroll I want to fetch a fixed number of documents from MongoDB and concatenate the newly fetched data with the previously fetched data to show the results on my web page.
How do I do pagination in MongoDB using Node.js?
The MongoDB Node.js driver supports pagination through the limit and skip options.
// You can start by counting the number of documents in your collection
collection
  .countDocuments()
  .then(async (count) => {
    const step = 1000;
    let offset = 0;
    const limit = step;

    // then, using your offset and limit variables, you can limit the number
    // of results you get in each query (one page of results at a time)
    while (offset < count) {
      await processPage(offset, limit);
      offset += step;
    }
  })
  .catch((error) => console.error(error));

async function processPage(offset, limit) {
  // find() returns a cursor limited to one page of results
  const entities = collection.find(
    {},
    {
      limit: limit,
      skip: offset,
    }
  );

  for await (const entity of entities) {
    // do what you want with each entity
  }
}
You can find more details on the MongoDB documentation page.
https://www.mongodb.com/docs/drivers/node/current/fundamentals/crud/read-operations/limit/
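For the infinite-scroll use case from the question, the same idea can be exposed as a paginated endpoint. A minimal sketch, assuming an Express app and an already-connected collection (the route name and page size are arbitrary):

const express = require('express');
const app = express();

// GET /items?page=0, /items?page=1, etc. The client concatenates each
// page onto the results it already has as the user scrolls.
app.get('/items', async (req, res) => {
  const pageSize = 20;                             // fixed page size (assumption)
  const page = parseInt(req.query.page, 10) || 0;  // 0-based page index

  try {
    const items = await collection
      .find({})
      .skip(page * pageSize)
      .limit(pageSize)
      .toArray();
    res.json(items);
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

Keep in mind that skip still has to walk past all of the skipped documents, so for very deep pages a range query on an indexed field (for example, fetching documents with _id greater than the last one seen) tends to scale better.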
I would like to fetch all the data provided by this API, which is a CouchDB: https://skimdb.npmjs.com/registry/_all_docs. Its total size is around 1.9 million rows. I need all this data to do some offline local processing, and since the size of the data is huge I would like to fetch it in batches. How could I do that?
I tried fetching all the data from a Node app, but it took too much time and returned empty.
Thanks in advance.
I found a solution to my question:
There are many ways to do it; one efficient solution is to specify the start_key and the limit, which represents the batch size.
const registeryUrl = "https://skimdb.npmjs.com/registry/_all_docs?";
// In my case the start key id is '-'
const startkey = "-";

function fetchdata(startkey) {
  console.log(startkey);
  // batch size
  const limit = 1000;
  fetch(registeryUrl + "startkey_docid=" + startkey + "&limit=" + limit)
    .then((response) => response.json())
    .then((data) => {
      // check if there is still data to fetch from the api
      const isStill = (data["total_rows"] - data["offset"]) > limit;
      // the last element will be the next startkey
      // (pop it here so it is not processed twice, since startkey_docid is inclusive)
      const last_element = data["rows"].pop()["id"];
      /** process your data as you want */
      return [last_element, isStill];
    })
    .then(([last_element, isStill]) => {
      // while there is more data to be fetched
      if (isStill) {
        fetchdata(last_element);
      } else {
        console.log("finished fetching");
      }
    });
}

fetchdata(startkey);
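For reference, the same batching logic can also be written with async/await and a loop instead of recursion; a rough sketch, assuming the same endpoint, startkey_docid parameter, and batch size as above:

async function fetchAll(startkey) {
  const limit = 1000;        // batch size
  let key = startkey;
  let isStill = true;

  while (isStill) {
    const response = await fetch(registeryUrl + "startkey_docid=" + key + "&limit=" + limit);
    const data = await response.json();

    // check if there is still data to fetch from the api
    isStill = (data["total_rows"] - data["offset"]) > limit;
    // the last row's id becomes the next startkey
    key = data["rows"].pop()["id"];

    /** process your data as you want */
  }
  console.log("finished fetching");
}

fetchAll(startkey);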
So I want to delete all documents in a collection using the Admin SDK. The code below is taken from the official documentation here:
async function deleteCollection(db, collectionPath, batchSize) {
  const collectionRef = db.collection(collectionPath);
  const query = collectionRef.orderBy('__name__').limit(batchSize);

  return new Promise((resolve, reject) => {
    deleteQueryBatch(db, query, resolve).catch(reject);
  });
}

async function deleteQueryBatch(db, query, resolve) {
  const snapshot = await query.get();

  const batchSize = snapshot.size;
  if (batchSize === 0) {
    // When there are no documents left, we are done
    resolve();
    return;
  }

  // Delete documents in a batch
  const batch = db.batch();
  snapshot.docs.forEach((doc) => {
    batch.delete(doc.ref);
  });
  await batch.commit();

  // Recurse on the next process tick, to avoid
  // exploding the stack.
  process.nextTick(() => {
    deleteQueryBatch(db, query, resolve);
  });
}
As you can see, I need to provide a batchSize, which is a number. So what size should I provide, say, for example, if I have 100,000 documents in a collection?
Because, from the documentation, there is also a limitation in Firestore:
Maximum writes per second per database: 10,000 (up to 10 MiB per second)
So how do I decide the batch size?
The batchSize variable is used to define a Query with the limit() method. This Query is passed to the deleteQueryBatch() method where it is executed and where, based on the query result, a batched write is populated with some delete operations.
Since a batched write can contain up to 500 operations, the maximum value you can assign to batchSize is 500.
PS: I know that I am not exactly answering your question, which is more about the recommended batch size, but since 500 is the documented limit for a batched write, you should not encounter any problem with this limit, unless maybe you delete lexicographically close documents, as explained here in the doc.
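So for the 100,000-document example from the question, you can simply pass the maximum value; a minimal usage sketch (the collection name is made up):

// delete the whole collection in batched writes of 500 documents each
deleteCollection(db, 'myCollection', 500)
  .then(() => console.log('collection deleted'))
  .catch((error) => console.error(error));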
I am working on a Node.js application which uses the WordPress JSON API as a kind of headless CMS. When the application spins up, we query out to the WP database and pull in the information we need (using Axios), manipulate it, and store it temporarily.
Simple enough - but one of our post categories in the CMS has a rather large number of entries. For some godforsaken reason, WordPress has capped the API request limit to a maximum of 99 posts at a time, and requires that we write a loop that can send concurrent API requests until all the data has been pulled.
For instance, if we have 250 posts of some given type, I need to hit that route three separate times, specifying the specific "page" of data I want each time.
Per the docs, https://developer.wordpress.org/rest-api/using-the-rest-api/pagination/, I have access to a ?page= query string that I can use to send these requests concurrently. (i.e. ...&page=2)
I also have access to X-WP-Total in the headers object, which gives me the total number of posts within the given category.
However, these API calls are part of a nested promise chain, and the whole process needs to return a promise I can continue chaining off of.
The idea is to make it dynamic so it will always pull all of the data, and return it as one giant array of posts. Here's what I have, which is functional:
const request = require('axios');

module.exports = (request_url) => new Promise((resolve, reject) => {
  // START WITH SMALL ARBITRARY REQUEST TO GET TOTAL NUMBER OF POSTS FAST
  request.get(request_url + '&per_page=1').then(
    (apiData) => {
      // SETUP FOR PROMISE.ALL()
      let promiseArray = [];

      // COMPUTE HOW MANY REQUESTS WE NEED
      // ALWAYS ROUND TOTAL NUMBER OF PAGES UP TO GET ALL THE DATA
      const totalPages = Math.ceil(apiData.headers['x-wp-total'] / 99);

      for (let i = 1; i <= totalPages; i++) {
        promiseArray.push(request.get(`${request_url}&per_page=99&page=${i}`));
      }

      resolve(
        Promise.all(promiseArray)
          .then((resolvedArray) => {
            // PUSH IT ALL INTO A SINGLE ARRAY
            let compiledPosts = [];
            resolvedArray.forEach((axios_response) => {
              // AXIOS MAKES US ACCESS W/RES.DATA
              axios_response.data.forEach((post) => {
                compiledPosts.push(post);
              });
            });
            // RETURN AN ARRAY OF ALL POSTS REGARDLESS OF LENGTH
            return compiledPosts;
          })
          .catch((e) => { console.log('ERROR'); reject(e); })
      );
    }
  ).catch((e) => { console.log('ERROR'); reject(e); });
});
Any creative ideas to make this pattern better?
I have exactly the same question. In my case, I use Vue Resource:
this.$resource('wp/v2/media').query().then((response) => {
  let pagesNumber = Math.ceil(response.headers.get('X-WP-TotalPages'));
  for (let i = 1; i <= pagesNumber; i++) {
    this.$resource('wp/v2/media?page=' + i).query().then((response) => {
      this.medias.push(response.data);
      this.medias = _.flatten(this.medias);
      console.log(this.medias);
    });
  }
});
I'm pretty sure there is a better workaround to achieve this.
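For what it's worth, one way to tidy this up is to collect all the page requests and resolve them together with Promise.all, then flatten once at the end. A rough, untested sketch, assuming the same wp/v2/media endpoint and lodash's _.flatten:

this.$resource('wp/v2/media').query().then((response) => {
  const pagesNumber = Math.ceil(response.headers.get('X-WP-TotalPages'));
  const requests = [];
  for (let i = 1; i <= pagesNumber; i++) {
    requests.push(this.$resource('wp/v2/media?page=' + i).query());
  }
  // wait for every page before touching this.medias
  return Promise.all(requests);
}).then((responses) => {
  // each response holds one page of media; merge them into a single array
  this.medias = _.flatten(responses.map((r) => r.data));
});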
I'm wondering if it's a good idea (performance-wise) to store query results in variables and update them only every few minutes, since I have multiple database queries (MongoDB) in my Node application that don't need to be up to date, and some of them are a bit complex.
I'm thinking about something like this:
var queryResults = [];
myModel.find().exec(function(err, results) {
  queryResults = results;
});
Then:
var interval = 10 * 60 * 1000;
setInterval(function() {
  myModel.find().exec(function(err, results) {
    queryResults = results;
  });
}, interval);
And when I need to send the query results to my view engine:
app.get('/', function(req, res) {
  res.render('index.ejs', { entries: queryResults });
});
Is this a good way to cache and display the same query results to multiple clients?
You can use this module instead of creating your cache layer:
https://www.npmjs.com/package/memory-cache
You have to be careful not to put a huge amount of data into memory. If you want to push millions of results in there, it is probably not a good idea.
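For example, a rough sketch of the same pattern with memory-cache instead of a module-level variable (10-minute expiry, falling back to the database when the cache is empty; the cache key is arbitrary):

const cache = require('memory-cache');
const TEN_MINUTES = 10 * 60 * 1000;

app.get('/', function(req, res, next) {
  const cached = cache.get('entries');
  if (cached) {
    // serve the cached query results
    return res.render('index.ejs', { entries: cached });
  }
  myModel.find().exec(function(err, results) {
    if (err) return next(err);
    // keep the results in memory for 10 minutes
    cache.put('entries', results, TEN_MINUTES);
    res.render('index.ejs', { entries: results });
  });
});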
The performance of an individual findOne query is abnormally slow (upwards of 60-85 ms). Is there something fundamentally wrong with the design below? What steps should I take to make this operation faster?
Goal (fast count of items within a range, under 10-20 ms):
- Input max and min time
- Query the database for the document with the closest time to max and to min
- Return the "number" field of both query results
- Take the difference of the "number" fields to get the document count
Setup:
- MongoDB database
- 3000 documents, compound ascending index on the time_axis, latency_axis, number fields
[ { time_axis: 1397888153982, latency_axis: 5679,    number: 1},
  { time_axis: 1397888156339, latency_axis: 89,      number: 2},
  ...
  { time_axis: 1398036817121, latency_axis: 122407,  number: 2999},
  { time_axis: 1398036817122, latency_axis: 7149560, number: 3000} ]
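(For reference, a compound index like the one described can be created along these lines; a sketch using the Node.js driver, with the field names taken from the sample documents:)

// compound ascending index on time_axis, latency_axis, number
collection.createIndex(
  { time_axis: 1, latency_axis: 1, number: 1 },
  function(err, indexName) {
    if (err) return console.error(err);
    console.log("created index " + indexName);
  }
);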
NodeJs
exports.getCount = function (uri, collection_name, min, max, callback) {
  var low, high;
  var start = now();

  MongoClient.connect(uri, function(err, db) {
    if (err) {
      return callback(err, null);
    }
    var collection = db.collection(collection_name);

    async.parallel([
      function findLow(callback) {
        var query = { time_axis: { $gte: min } };
        var projection = { _id: 0, number: 1 };
        collection.findOne(query, projection, function(err, result) {
          low = result.number;
          console.log("min query time: " + (now() - start));
          callback();
        });
      },
      function findHigh(callback) {
        var query = { time_axis: { $gte: max } };
        var projection = { _id: 0, number: 1 };
        collection.findOne(query, projection, function(err, result) {
          high = result.number;
          console.log("max query time: " + (now() - start));
          callback();
        });
      }
    ],
    function calculateCount(err) {
      var count = high - low;
      db.close();
      console.log("total query time: " + (now() - start));
      callback(null, count);
    });
  });
};
Note: Thank you to Adio for the answer. It turns out the MongoDB connection only needs to be initialized once and handles connection pooling automatically. :)
Looking at your source code, I can see that you create a new connection every time you query MongoDB. Try providing an already created connection instead, and thus reuse the connection you create. Coming from the Java world, I think you should create some connection pooling.
You can also check this question and its answer.
" You open do MongoClient.connect once when your app boots up and reuse the db object. It's not a singleton connection pool each .connect creates a new connection pool. "
Try using the --prof option in Node.js to generate profiling results so you can find out where the time is spent, e.g. node --prof app.js.
You will get a v8.log file containing the profiling results. The tool for interpreting v8.log is linux-tick-processor, which can be found within the v8 project at v8/tools/linux-tick-processor.
To obtain linux-tick-processor:
git clone https://github.com/v8/v8.git
cd v8
make -j8 ia32.release
export D8_PATH=$PWD/out/ia32.release
export PATH=$PATH:$PWD/tools
linux-tick-processor /path/to/v8.log | vim -