How to dump large data sets in mongodb from node.js

How to dump large data sets in mongodb from node.js - node.js

I'm trying to dump approx 2.2 million objects in mongodb (using mongoose). The problem is when I save all the objects one by one It gets stuck. I've kepts a sample code below. If I run this code for 50,000 it works great. But if I increase data size to approx 500,000 it gets stuck.I want to know what is wrong with this approach and I want to find a better way to do this. I'm quite new to nodejs. I've tried loop's and everything no help finally I found this kind of solution. This one works fine for 50k objects but gets stuck for 2.2 Million objects. and I get this after sometime
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)
var connection = mongoose.createConnection("mongodb://localhost/entity");
var entitySchema = new mongoose.Schema({
name: String
, date: Date
, close : Number
, volume: Number
, adjClose: Number
});
var Entity = connection.model('entity', entitySchema)
var mongoobjs =["2.2 Millions obejcts here populating in code"] // works completely fine till here
async.map(mongoobjs, function(object, next){
Obj = new Entity({
name : object.name
, date: object.date
, close : object.close
, volume: object.volume
, adjClose: object.adjClose
});
Obj.save(next);
}, function(){console.log("Saved")});

Thanks cdbajorin
This seem to be much better way and a little faster batch approach for for doing this. So what I learned was that in my earlier approach, "new Entity(....)" was taking time and causing memory overflow. Still not sure why.
So, What I did was rather than using this line
Obj = new Entity({
name : object.name
, date: object.date
, close : object.close
, volume: object.volume
, adjClose: object.adjClose
});
I just created JSON objects and stored in an array.
stockObj ={
name : object.name
, date: object.date
, close : object.close
, volume: object.volume
, adjClose: object.adjClose
};
mongoobjs.push(stockObj); //array of objs.
and used this command... and Voila It worked !!!
Entity.collection.insert(mongoobjs, function(){ console.log("Saved succesfully")});

nodejs uses v8 which has the unfortunate property, from the perspective of developers coming from other interpreted languages, of severely restricting the amount of memory you can use to something like 1.7GB regardless of available system memory.
There is really only one way, afaik, to get around this - use streams. Precisely how you do this is up to you. For example, you can simply stream data in continuously, process it as it's coming in, and let the processed objects get garbage collected. This has the downside of being difficult to balance input to output.
The approach we've been favoring lately is to have an input stream bring work and save it to a queue (e.g. an array). In parallel you can write a function that is always trying to pull work off the queue. This makes it easy to separate logic and throttle the input stream in case work is coming in (or going out) too quickly.
Say for example, to avoid memory issues, you want to stay below 50k objects in the queue. Then your stream-in function could pause the stream or skip the get() call if the output queue has > 50k entries. Similarly, you might want to batch writes to improve server efficiency. So your output processor could avoid writing unless there are at least 500 objects in the queue or if it's been over 1 second since the last write.
This works because javascript uses an event loop which means that it will switch between asynchronous tasks automatically. Node will stream data in for some period of time then switch to another task. You can use setTimeout() or setInterval() to ensure that there is some delay between function calls, thereby allowing another asynchronous task to resume.
Specifically addressing your problem, it looks like you are individually saving each object. This will take a long time for 2.2 million objects. Instead, there must be a way to batch writes.

As an addition to answers provided in this thread, I was successful with
Bulk Insert (or batch insertion of ) 20.000+ documents (or objects)
Using low memory (250 MB) available within cheap offerings of Heroku
Using one instance, without any parallel processing
The Bulk operation as specified with MongoDB native driver was used, and the following is the code-ish that worked for me:
var counter = 0;
var entity= {}, entities = [];// Initialize Entities from a source such as a file, external database etc
var bulk = Entity.collection.initializeOrderedBulkOp();
var size = MAX_ENTITIES; //or `entities.length` Defined in config, mine was 20.0000
//while and -- constructs is deemed faster than other loops available in JavaScript ecosystem
while(size--){
entity = entities[size];
if( entity && entity.id){
// Add `{upsert:true}` parameter to create if object doesn't exist
bulk.find({id: entity.id}).update({$set:{value:entity.value}});
}
console.log('processing --- ', entity, size);
}
bulk.execute(function (error) {
if(error) return next(error);
return next(null, {message: 'Synced vector data'});
});
Entity is a mongoose model.
Old versions of mongodb may not support Entity type as it was made available from version 3+.
I hope this answer helps someone.
Thanks.

Related

How to handle multiple post requests at the same time while saving one of them on the db?

I am getting n post requests (on each webhook trigger) from a webhook. The data is identical on all requests that come from the same trigger - they all have the same 'orderId'. I'm interested in saving only one of these requests, so on each endpoint hit I'm checking if this specific orderId exists as a row in my database, otherwise - create it.
if (await orderIdExists === null) {
await Order.create(
{
userId,
status: PENDING,
price,
...
}
);
await sleep(3000)
function sleep(ms) {
return new Promise((resolve) => {
setTimeout(resolve, ms);
});
}
}
return res.status(HttpStatus.OK).send({success: true})
} catch (error) {
return res.status(HttpStatus.INTERNAL_SERVER_ERROR).send({success: false})
}
}
else {
return res.status(HttpStatus.UNAUTHORIZED).send(responseBuilder(false, responseErrorCodes.INVALID_API_KEY, {}, req.t));
}
}
Problem is before Sequelize manages to save the new created order in the db (all of the n post requests get to the enpoint in 1 sec - or less), I already get another endpoint hit from the other n post requests, while orderIdExists still equels null, So it ends up creating more identical orders. One (not so good solution) is to make orderId unique in the db, which prevents the creation of on order with the same orderId, but tries to anyway, which results in empty id incrementation in the db. Any idea would be greatly appreciated.
p.s. as you can see, i tried adding a 'sleep' function to no avail.

Your database is failing to complete its save operation before the next request arrives. The problem is similar to the Dogpile Effect or a "cache slam".
This requires some more thinking about how you are framing the problem: in other words the "solution" will be more philosophical and perhaps have less to do with code, so your results on StackOverflow may vary.
The "sleep" solution is no solution at all: there's no guarantee how long the database operation might take or how long you might wait before another duplicate request arrives. As a rule of thumb, any time "sleep" is deployed as a "solution" to problems of concurrency, it usually is the wrong choice.
Let me posit two possible ways of dealing with this:
Option 1: write-only: i.e. don't try to "solve" this by reading from the database before you write to it. Just keep the pipeline leading into the database as dumb as possible and keep writing. E.g. consider a "logging" table that just stores whatever the webhook throws at it -- don't try to read from it, just keep inserting (or upserting). If you get 100 ping-backs about a specific order, so be it: your table would log it all and if you end up with 100 rows for a single orderId, let some other downstream process worry about what to do with all that duplicated data. Presumably, Sequelize is smart enough (and your database supports whatever process locking) to queue up the operations and deal with write repetitions.
An upsert operation here would be helpful if you do want to have a unique constraint on the orderId (this seems sensible, but you may be aware of other considerations in your particular setup).
Option 2: use a queue. This is decidedly more complex, so weigh carefully wether or not your use-case justifies the extra work. Instead of writing data immediately to the database, throw the webhook data into a queue (e.g. a first-in-first-out FIFO queue). Ideally, you would want to choose a queue that supports de-duplication so that exiting messages are guaranteed to be unique, but that infers state, and that usually relies on a database of some sort, which is sort of the problem to begin with.
The most important thing a queue would do for you is it would serialize the messages so you can deal with them one at a time (instead of multiple database operations kicking off concurrently). You can upsert data into the database when you read a message out of the queue. If the webhook keeps firing and more messages enter the queue, that's fine because the queue forces them all to line up single-file and you can handle each insertion one at a time. You'll know that each database operation has completed before it moves on to the next message so you never "slam" the DB. In other words, putting a queue in front of the database will allow it to handle data when the database is ready instead of whenever a webhook comes calling.
The idea of a queue here is similar to what a semaphore accomplishes. Note that your database interface may already implement a kind of queue/pool under-the-hood, so weigh this option carefully: don't reinvent a wheel.
Hope those ideas are useful.

You saved my time #Everett and #april-henig. I found that saving directly into database read to records duplicates. If you store records into an object and deal with one record at time helped me a lot.
May be I would share my solution perhaps some may find it useful in future.
Create an empty object to save success request
export const queueAllSuccessCallBack = {};
Save POST request in object
if (status === 'success') { // I checked the request if is only successfully
const findKeyTransaction = queueAllSuccessCallBack[client_reference_id];
if (!findKeyTransaction) { // check if Id is not added to avoid any duplicates
queueAllSuccessCallBack[client_reference_id] = {
transFound,
body,
}; // save new request id as key and the value as data you want
}
}
Access the object to save into database
const keys = Object.keys(queueAllSuccessCallBack);
keys.forEach(async (key) => {
...
// Do extra checks if you want to do so
// Or save in database direct
});

Correct way to use MongoDB node.js `insertMany` so that it does not block

Within a node.js application, I wanted to use insertMany to insert a lot of documents (well, actually, around 10'000). I encountered the following issue: While insertMany (called with await) is running, the node.js process is not processing anything from the processing loop until the insertMany call has finished.
Is this expected behaviour? How would I do this "the right way", so that my service would still process requests in the meantime? I would have expected the await insertMany to automatically enable this, as it's async, but it seems this is not the case.
Code snippet:
exports.writeOrg = async (req, res, next) => {
logger.debug('orgs.writeOrg()');
// ...
try {
// ...
logger.debug('Starting processing of data.');
const newOrgDocs = await processLdapUsers(tenantId, ldapUsers);
logger.debug('Processing of data finished.');
const orgModel = getOrgModel(tenantId);
// Now delete the entire collection
logger.debug(`Delete entire org collection in tenant ${tenantId}`);
await orgModel.deleteMany({});
// And add the new org information; this replaces what was there before
logger.debug(`Inserting org structure to tenant ${tenantId}`);
// This is the call which seems to block: --->
await orgModel.insertMany(newOrgDocs);
// <---
logger.debug(`Finished inserting org structure to tenant ${tenantId}`);
// ...
} catch (err) {
// ...
// error handling
}
}
The writeOrg function is a regular express request handler; the payload is a JSON array with typically 1000-20000 records; in the test case I have 6000 records with a total JSON size of around 6 MB. Writing locally takes just around 1.5s, writing to MongoDB Atlas (cheapest tier for testing) takes around 20 seconds, which is when this problem occurs.
Workaround: If I split up the data into smaller chunks, e.g. 50 records at a time, the event loop processes some data from time to time for other requests. But still, as the insertMany function is an async function call, I wasn't expecting this to be necessary.

There are multiple issues which make this rather slow, and the most important one is actually something which is not mentioned in the question: I am using Mongoose as an "ORM" wrapper for Mongo DB. I wasn't aware that this could have such a substantial impact on the runtime.
What happens here, after checking out the actual runtime with the Chrome node.js debugging tools, is that Mongoose wraps and validates each single document in the array, and this takes substantial time.
The BSON conversion also takes time, but the document wrapping is what takes up the most time.
This means: Mongoose is not super suitable for fast inserting (or reading, FWIW); if you need speed, going directly for the native Mongo DB driver is the way to go. If your needs for pure speed are not that big, and you want the convenience of Mongoose, Mongoose can add value by doing validations and adding defaults and things like that.

Prevent concurrent processing in NodeJS

I need NodeJS to prevent concurrent operations for the same requests. From what I understand, if NodeJS receives multiple requests, this is what happens:
REQUEST1 ---> DATABASE_READ
REQUEST2 ---> DATABASE_READ
DATABASE_READ complete ---> EXPENSIVE_OP() --> REQUEST1_END
DATABASE_READ complete ---> EXPENSIVE_OP() --> REQUEST2_END
This results in two expensive operations running. What I need is something like this:
REQUEST1 ---> DATABASE_READ
DATABASE_READ complete ---> DATABASE_UPDATE
DATABASE_UPDATE complete ---> REQUEST2 ---> DATABASE_READ ––> REQUEST2_END
---> EXPENSIVE_OP() --> REQUEST1_END
This is what it looks like in code. The problem is the window between when the app starts reading the cache value and when it finishes writing to it. During this window, the concurrent requests don't know that there is already one request with the same itemID running.
app.post("/api", async function(req, res) {
const itemID = req.body.itemID
// See if itemID is processing
const processing = await DATABASE_READ(itemID)
// Due to how NodeJS works,
// from this point in time all requests
// to /api?itemID="xxx" will have processing = false
// and will conduct expensive operations
if (processing == true) {
// "Cheap" part
// Tell client to wait until itemID is processed
} else {
// "Expensive" part
DATABASE_UPDATE({[itemID]: true})
// All requests to /api at this point
// are still going here and conducting
// duplicate operations.
// Only after DATABASE_UPDATE finishes,
// all requests go to the "Cheap" part
DO_EXPENSIVE_THINGS();
}
}
Edit
Of course I can do something like this:
const lockedIDs = {}
app.post("/api", function(req, res) {
const itemID = req.body.itemID
const locked = lockedIDs[itemID] ? true : false // sync equivalent to async DATABASE_READ(itemID)
if (locked) {
// Tell client to wait until itemID is processed
// No need to do expensive operations
} else {
lockedIDs[itemID] = true // sync equivalent to async DATABASE_UPDATE({[itemID]: true})
// Do expensive operations
// itemID is now "locked", so subsequent request will not go here
}
}
lockedIDs here behaves like an in-memory synchronous key-value database. That is ok, if it is just one server. But what if there are multiple server instances? I need to have a separate cache storage, like Redis. And I can access Redis only asynchronously. So this will not work, unfortunately.

Ok, let me take a crack at this.
So, the problem I'm having with this question is that you've abstracted the problem so much that it's really hard to help you optimize. It's not clear what your "long running process" is doing, and what it is doing will affect how to solve the challenge of handling multiple concurrent requests. What's your API doing that you're worried about consuming resources?
From your code, at first I guessed that you're kicking off some kind of long-running job (e.g. file conversion or something), but then some of the edits and comments make me think that it might be just a complex query against the database which requires a lot of calculations to get right and so you want to cache the query results. But I could also see it being something else, like a query against a bunch of third party APIs that you're aggregating or something. Each scenario has some nuance that changes what's optimal.
That said, I'll explain the 'cache' scenario and you can tell me if you're more interested in one of the other solutions.
Basically, you're in the right ballpark for the cache already. If you haven't already, I'd recommend looking at cache-manager, which simplifies your boilerplate a little for these scenarios (and let's you set cache invalidation and even have multi-tier caching). The piece that you're missing is that you essentially should always respond with whatever you have in the cache, and populate the cache outside the scope of any given request. Using your code as a starting point, something like this (leaving off all the try..catches and error checking and such for simplicity):
// A GET is OK here, because no matter what we're firing back a response quickly,
// and semantically this is a query
app.get("/api", async function(req, res) {
const itemID = req.query.itemID
// In this case, I'm assuming you have a cache object that basically gets whatever
// is cached in your cache storage and can set new things there too.
let item = await cache.get(itemID)
// Item isn't in the cache at all, so this is the very first attempt.
if (!item) {
// go ahead and let the client know we'll get to it later. 202 Accepted should
// be fine, but pick your own status code to let them know it's in process.
// Other good options include [503 Service Unavailable with a retry-after
// header][2] and [420 Enhance Your Calm][2] (non-standard, but funny)
res.status(202).send({ id: itemID });
// put an empty object in there so we know it's working on it.
await cache.set(itemID, {});
// start the long-running process, which should update the cache when it's done
await populateCache(itemID);
return;
}
// Here we have an item in the cache, but it's not done processing. Maybe you
// could just check to see if it's an empty object or not, but I'm assuming
// that we've setup a boolean flag on the cached object for when it's done.
if (!item.processed) {
// The client should try again later like above. Exit early. You could
// alternatively send the partial item, an empty object, or a message.
return res.status(202).send({ id: itemID });
}
// if we get here, the item is in the cache and done processing.
return res.send(item);
}
Now, I don't know precisely what all your stuff does, but if it's me, populateCache from above is a pretty simple function that just calls whatever service we're using to do the long-running work and then puts it into the cache.
async function populateCache(itemId) {
const item = await service.createThisWorkOfArt(itemId);
await cache.set(itemId, item);
return;
}
Let me know if that's not clear or if your scenario is really different from what I'm guessing.
As mentioned in the comments, this approach will cover most normal issues you might have with your described scenario, but it will still allow two requests to both fire off the long-running process, if they come in faster than the write to your cache store (e.g. Redis). I judge the odds of that happening are pretty low, but if you're really concerned about that then the next more paranoid version of this would be to simply remove the long-running process code from your web API altogether. Instead, your API just records that someone requested that stuff to happen, and if there's nothing in the cache then respond as I did above, but completely remove the block that actually calls populateCache altogether.
Instead, you would have a separate worker process running that would periodically (how often depends on your business case) check the cache for unprocessed jobs and kick off the work for processing them. By doing it this way, even if you have 1000's of concurrent requests for the same item, you can ensure that you're only processing it one time. The downside of course is that you add whatever the periodicity of the check is to the delay in getting the fully processed data.

You could create a local Map object (in memory for synchronous access) that contains any itemID as a key that is being processed. You could make the value for that key be a promise that resolves with whatever the result is from anyone who has previously processed that key. I think of this like a gate keeper. It keeps track of which itemIDs are being processed.
This scheme tells future requests for the same itemID to wait and does not block other requests - I thought that was important rather than just using a global lock on all requests related to itemID processing.
Then, as part of your processing, you first check the local Map object. If that key is in there, then it's currently being processed. You can then just await the promise from the Map object to see when it's done being processed and get any result from prior processing.
If it's not in the Map object, then it's not being processed now and you can immediately put it in Map to mark it as "in process". If you set a promise as the value, then you can resolve that promise with whatever result you get from this processing of the object.
Any other requests that come along will end up just waiting on that promise and you will thus only process this ID once. The first one to start with that ID will process it and all other requests that come along while it's processing will use the same shared result (thus saving the duplication of your heavy computation).
I tried to code up an example, but did not really understand what your psuedo-code was trying to do well enough to offer a code example.
Systems like this have to have perfect error handling so that all possible error paths handle the Map and promise embedded in the Map properly.
Based on your fairly light pseudo-code example, here's a similar pseudo code example that illustrates the above concept:
const itemInProcessCache = new Map();
app.get("/api", async function(req, res) {
const itemID = req.query.itemID
let gate = itemInProcessCache.get(itemID);
if (gate) {
gate.then(val => {
// use cached result here from previous processing
}).catch(err => {
// decide what to do when previous processing had an error
});
} else {
let p = DATABASE_UPDATE({itemID: true}).then(result => {
// expensive processing done
// return final value so any others waiting on the gate can just use that value
// decide if you want to clear this item from itemInProcessCache or not
}).catch(err => {
// error on expensive processing
// remove from the gate cache because we didn't get a result
// expensive processing will have to be done by someone else
itemInProcessCache.delete(itemID);
});
// mark this item as being processed
itemInProcessCache.set(itemID, p);
}
});
Note: This relies on the single-threadedness of node.js. No other request can get started until the request handler here returns so that itemInProcessCache.set(itemID, p); gets called before any other requests for this itemID could get started.
Also, I don't know databases very well, but this seems very much like a feature that a good multi-user database might have built in or have supporting features that makes this easier since it's not an uncommon idea to not want to have multiple requests all trying to do the same database work (or worse yet, trouncing each other's work).

Patterns for asynchronous but sequential requests

I have been writing a lot of NodeJS recently and that has forced me to attack some problems from a different perspective. I was wondering what patterns had developed for the problem of processing chunks of data sequentially (rather than in parallel) in an asynchronous request-environment, but I haven't been able to find anything directly relevant.
So to summarize the problem:
I have a list of data stored in an array format that I need to process.
I have to send this data to a service asynchronously, but the service will only accept a few at a time.
The data must be processed sequentially to meet the restrictions on the service, meaning making a number of parallel asynchronous requests is not allowed
Working in this domain, the simplest pattern I've come up with is a recursive one. Something like
function processData(data, start, step, callback){
if(start < data.length){
var chunk = data.split(start, step);
queryService(chunk, start, step, function(e, d){
//Assume no errors
//Could possibly do some matching between d and 'data' here to
//Update data with anything that the service may have returned
processData(data, start+step, step, callback);
});
}
else{
callback(data);
}
}
Conceptually, this should step through each item, but it's intuitively complex. I feel like there should be a simpler way of doing this. Does anyone have a pattern they tend to follow when approaching this kind of problem?

My first thought process would be to rely on object encapsulation. Create an object that contains all of the information about what needs to be processed and all of the relevant data about what has been processed and is being processed and the callback function will just call the 'next' function for the object, which will in turn start processing on the next piece of data and update the object. Essentially working like a n asynchronous for-loop.

Streaming / Piping JSON.stringify output in Node.js / Express

I have a scenario where I need to return a very large object, converted to a JSON string, from my Node.js/Express RESTful API.
res.end(JSON.stringify(obj));
However, this does not appear to scale well. Specifically, it works great on my testing machine with 1-2 clients connecting, but I suspect that this operation may be killing the CPU & memory usage when many clients are requesting large JSON objects simultaneously.
I've poked around looking for an async JSON library, but the only one I found seems to have an issue (specifically, I get a [RangeError]). Not only that, but it returns the string in one big chunk (eg, the callback is called once with the entire string, meaning memory footprint is not decreased).
What I really want is a completely asynchronous piping/streaming version of the JSON.stringify function, such that it writes the data as it is packed directly into the stream... thus saving me both memory footprint, and also from consuming the CPU in a synchronous fashion.

Ideally, you should stream your data as you have it and not buffer everything into one large object. If you cant't change this, then you need to break stringify into smaller units and allow main event loop to process other events using setImmediate. Example code (I'll assume main object has lots of top level properties and use them to split work):
function sendObject(obj, stream) {
var keys = Object.keys(obj);
function sendSubObj() {
setImmediate(function(){
var key = keys.shift();
stream.write('"' + key + '":' + JSON.stringify(obj[key]));
if (keys.length > 0) {
stream.write(',');
sendSubObj();
} else {
stream.write('}');
}
});
})
stream.write('{');
sendSubObj();
}

It sounds like you want Dominic Tarr's JSONStream. Obviously, there is some assembly required to merge this with express.
However, if you are maxing out the CPU attempting to serialize (Stringify) an object, then splitting that work into chunks may not really solve the problem. Streaming may reduce the memory footprint, but won't reduce the total amount of "work" required.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string