Streaming / Piping JSON.stringify output in Node.js / Express

I have a scenario where I need to return a very large object, converted to a JSON string, from my Node.js/Express RESTful API.
res.end(JSON.stringify(obj));
However, this does not appear to scale well. Specifically, it works great on my testing machine with 1-2 clients connecting, but I suspect that this operation may be killing the CPU & memory usage when many clients are requesting large JSON objects simultaneously.
I've poked around looking for an async JSON library, but the only one I found seems to have an issue (specifically, I get a [RangeError]). Not only that, but it returns the string in one big chunk (e.g., the callback is called once with the entire string, meaning the memory footprint is not decreased).
What I really want is a completely asynchronous piping/streaming version of JSON.stringify, such that it writes the data directly into the stream as it is serialized... thus saving me both the memory footprint and the cost of hogging the CPU synchronously.

Ideally, you should stream your data as you have it and not buffer everything into one large object. If you can't change this, then you need to break stringify into smaller units and allow the main event loop to process other events using setImmediate. Example code (I'll assume the main object has lots of top-level properties and use them to split the work):
function sendObject(obj, stream) {
  var keys = Object.keys(obj);
  function sendSubObj() {
    setImmediate(function () {
      var key = keys.shift();
      // JSON.stringify(key) escapes the property name properly
      stream.write(JSON.stringify(key) + ':' + JSON.stringify(obj[key]));
      if (keys.length > 0) {
        stream.write(',');
        sendSubObj();
      } else {
        stream.write('}');
      }
    });
  }
  stream.write('{');
  sendSubObj();
}

It sounds like you want Dominic Tarr's JSONStream. Obviously, there is some assembly required to merge this with express.
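For illustration, a rough sketch of that assembly, assuming the JSONStream package, an Express app, and some array-like records as the data source:
const JSONStream = require('JSONStream');
const { Readable } = require('stream');

app.get('/big-data', (req, res) => {
  res.type('application/json');
  Readable.from(records)            // Readable.from requires Node 12+
    .pipe(JSONStream.stringify())   // emits '[', separators between items, and ']'
    .pipe(res);
});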
However, if you are maxing out the CPU attempting to serialize (Stringify) an object, then splitting that work into chunks may not really solve the problem. Streaming may reduce the memory footprint, but won't reduce the total amount of "work" required.

Related

When should I use worker-threads?

I am currently working on a backend which provides REST endpoints for my frontend with NestJS. In some endpoints I receive e.g. an array of elements which I need to process.
Concrete example:
I receive an array of 50 elements. For each element I need to make a SQL request, so I need to loop over the array and do work in SQL.
I always ask myself: at what number of elements should I use, for example, worker threads so as not to block the event loop?
Maybe I misunderstood the blocking of the event loop and someone can enlighten me.
I don't think that you'll need worker threads in this scenario. As long as the SQL queries are executed asynchronously, i.e. the SQL query calls do not block, you will be fine. You can use Promise.all to speed up the processing of the loop, as the queries will be executed in parallel, e.g.
const dbQueryPromises = [];
for (const entry of data) {
  dbQueryPromises.push(dbConnection.query(buildQuery(entry)));
}
await Promise.all(dbQueryPromises);
If, however, your code performs computation-heavy operations inside the loop, then you should consider worker threads, as the long-running operations on your call stack will block the event loop (see the sketch below).
Only use them if you need to do CPU-intensive tasks with large amounts of data. They allow you to avoid the serialization step of the data. 50 is not enough, I believe.
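For illustration, a minimal sketch of offloading a CPU-heavy computation to a worker thread (the file name and doHeavyComputation are hypothetical placeholders):
// main.js
const { Worker } = require('worker_threads');

function runHeavyTask(payload) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./heavy-task.js', { workerData: payload });
    worker.on('message', resolve);
    worker.on('error', reject);
    worker.on('exit', (code) => {
      if (code !== 0) reject(new Error('Worker stopped with exit code ' + code));
    });
  });
}

// heavy-task.js
const { parentPort, workerData } = require('worker_threads');
// doHeavyComputation is a placeholder for your CPU-bound work
parentPort.postMessage(doHeavyComputation(workerData));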

What's better readSync or createReadStream (with Symbol.asyncIterator)?

createReadStream (with Symbol.asyncIterator)
async function* readChunkIter(chunksAsync) {
  for await (const chunk of chunksAsync) {
    // magic
    yield chunk;
  }
}

const fileStream = fs.createReadStream(filePath, { highWaterMark: 1024 * 64 });
const readChunk = readChunkIter(fileStream);
readSync
function* readChunkIter(fd) {
  // loop
  // magic
  fs.readSync(fd, buffer, 0, chunkSize, bytesRead);
  yield buffer;
}

const fd = fs.openSync(filePath, 'r');
const readChunk = readChunkIter(fd);
What's better to use with a generator function and why?
upd: I'm not looking for a better way, I want to know the difference between using these features
To start with, you're comparing a synchronous file operation fs.readSync() with an asynchronous one in the stream (which uses fs.read() internally). So that's a bit like apples and oranges for server use.
If this is on a server, then NEVER use synchronous file I/O except at server startup time because when processing requests or any other server events, synchronous file I/O blocks the entire event loop during the file read operation which drastically reduces your server scalability. Only use asynchronous file I/O, which between your two cases would be the stream.
Otherwise, if this is not on a server or any process that cares about blocking the node.js event loop during a synchronous file operation, then it's entirely up to you on which interface you prefer.
Other comments:
It's also unclear why you wrap for await() in a generator. The caller can just use for await() themselves and avoid the wrapping in a generator.
Streams for reading files are usually used in an event driven manner by adding an event listener to the data event and responding to data as it arrives. If you're just going to asynchronously read chunks of data from the file, there's really no benefit to a stream. You may as well just use fs.read() or fs.promises.read().
We can't really comment on the best/better way to solve a problem without seeing the overall problem you're trying to code for. You've just shown one little snippet of reading data. The best way to structure that depends upon how the higher level code can most conveniently use/consume the data (which you don't show).
I really didn't ask the right question. I'm not looking for a better way, I want to know the difference between using these features.
Well, the main difference is that fs.readSync() is blocking and synchronous and thus blocks the event loop, ruining the scalability of a server and should never be used (except during startup code) in a server environment. Streams in node.js are asynchronous and do not block the event loop.
Other than that difference, streams are a higher level construct than just reading the file directly and should be used when you're actually using features of the streams and should probably not be used when you're just reading chunks from the file directly and aren't using any features of streams.
In particular, error handling is not always so clear with streams, particularly when trying to use await and promises with streams. This is probably because readstreams were originally designed to be an event driven object and that means communicating errors indirectly on an error event which complicates the error handling on straight read operations. If you're not using the event driven nature of readstreams or some transform feature or some other major feature of streams, I wouldn't use them - I'd use the more traditional fs.promises.readFile() to just read data.
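For illustration, a minimal sketch of reading a file in fixed-size chunks with the promise-based API instead of a stream (the path, chunk size, and generator shape are just placeholders):
const fsp = require('fs/promises');

async function* readChunks(filePath, chunkSize = 64 * 1024) {
  const handle = await fsp.open(filePath, 'r');
  try {
    const buffer = Buffer.alloc(chunkSize);
    let position = 0;
    while (true) {
      const { bytesRead } = await handle.read(buffer, 0, chunkSize, position);
      if (bytesRead === 0) break;            // end of file
      position += bytesRead;
      // note: the buffer is reused between iterations; copy it if you keep chunks around
      yield buffer.subarray(0, bytesRead);
    }
  } finally {
    await handle.close();
  }
}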

Does the .pipe() perform a memcpy in node.js?

This is a conceptual query regarding system level optimisation. My understanding by reading the NodeJS Documentation is that pipes are handy to perform flow control on streams.
Background: I have a microphone stream coming in and I wanted to avoid an extra copy operation to conserve overall system MIPS. I understand that for audio streams this is not a great deal of MIPS being spent even if there is a memcpy under the hood, but I also have an extension planned to stream in camera frames at 30fps and UHD resolution. Making multiple copies of UHD-resolution pixel data at 30fps is super inefficient, so I needed some advice around this.
Example Code:
var spawn = require('child_process').spawn;
var PassThrough = require('stream').PassThrough;

var ps = null;
//var audioStream = new PassThrough;
//var infoStream = new PassThrough;

var start = function() {
  if (ps == null) {
    ps = spawn('rec', ['-b', 16, '--endian', 'little', '-c', 1, '-r', 16000, '-e', 'signed-integer', '-t', 'raw', '-']);
    //ps.stdout.pipe(audioStream);
    //ps.stderr.pipe(infoStream);
    exports.audioStream = ps.stdout;
    exports.infoStream = ps.stderr;
  }
};

var stop = function() {
  if (ps) {
    ps.kill();
    ps = null;
  }
};

//exports.audioStream = audioStream;
//exports.infoStream = infoStream;
exports.startCapture = start;
exports.stopCapture = stop;
Here are the questions:
To be able to perform flow control, does the source.pipe(dest) perform a memcpy from the source memory to the destination memory under the hood OR would it pass the reference in memory to the destination?
The commented code contains a PassThrough class instantiation. I am currently assuming the PassThrough causes memcopies as well, and so am I saving one memcpy operation in the entire system by commenting it out as above?
If I had to create a pipe between a process and a spawned child process (using child_process.spawn() as shown in How to transfer/stream big data from/to child processes in node.js without using the blocking stdio?), I presume that definitely results in a memcpy? Is there any way to make that a reference rather than a copy?
Does this behaviour differ from OS to OS? I presume it should be OS agnostic, but asking this anyways.
Thanks in advance for your help. It will help my architecture a great deal.
Some URLs for reference: https://github.com/nodejs/node/
https://github.com/nodejs/node/blob/master/src/stream_wrap.cc
https://github.com/nodejs/node/blob/master/src/stream_base.cc
https://github.com/libuv/libuv/blob/v1.x/src/unix/stream.c
https://github.com/libuv/libuv/blob/v1.x/src/win/stream.c
I tried writing a complicated / huge explanation based on these and some other files, but I came to the conclusion it would be best to give you a summary of how my experience / reading tells me Node works internally.
pipe simply connects streams, making it appear as if .on("data", …) is called by .write(…) without anything bloated in between.
Now we need to separate the JS world from the C++ / C world.
When dealing with data in JS we use Buffers: https://github.com/nodejs/node/blob/master/src/node_buffer.cc
They simply represent allocated memory with some candy on top to operate with it.
If you connect stdout of a process to some .on("data", …) listener, it will copy the incoming chunk into a Buffer object for further usage inside the JS world.
Inside the JS world you have methods like .pause() etc. (as you can see in Node's stream API documentation) to prevent the process from eating memory in case incoming data flows in faster than it is processed.
Connecting stdout of a process and, for example, an outgoing TCP port through pipe will result in a connection similar to how nginx operates. It will connect these streams as if they talked directly to each other by copying incoming data straight to the outgoing stream.
As soon as you pause a stream, Node will use internal buffering in case it's unable to pause the incoming stream.
So for your scenario you should just do testing.
Try to receive data through an incoming stream in Node, pause the stream, and see what happens.
I'm not sure if Node will use internal buffering or if the process you try to run will just halt until it can continue to send data.
I expect the process to halt until you continue the stream.
For transferring huge images I recommend transferring them in chunks or piping them directly to an outgoing port.
The chunk approach would allow you to send the data to multiple clients at once and would keep the memory footprint pretty low.
PS: You should take a look at this gist that I just found: https://gist.github.com/joyrexus/10026630
It explains in depth how you can interact with streams.
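For illustration, a minimal sketch of the manual flow control that .pipe() wires up for you (source and destination are placeholder stream names):
source.on('data', (chunk) => {
  const ok = destination.write(chunk);   // false means the destination's internal buffer is full
  if (!ok) source.pause();
});
destination.on('drain', () => source.resume());
source.on('end', () => destination.end());

// .pipe() does essentially this wiring for you:
// source.pipe(destination);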

How to dump large data sets in mongodb from node.js

I'm trying to dump approximately 2.2 million objects into MongoDB (using mongoose). The problem is that when I save all the objects one by one, it gets stuck. I've kept a sample of the code below. If I run this code for 50,000 objects it works great, but if I increase the data size to approximately 500,000 it gets stuck. I want to know what is wrong with this approach and I want to find a better way to do this. I'm quite new to Node.js. I've tried loops and everything, with no help; finally I found this kind of solution. It works fine for 50k objects but gets stuck for 2.2 million objects, and after some time I get this:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)
var mongoose = require('mongoose');
var async = require('async');

var connection = mongoose.createConnection("mongodb://localhost/entity");

var entitySchema = new mongoose.Schema({
  name: String
  , date: Date
  , close: Number
  , volume: Number
  , adjClose: Number
});

var Entity = connection.model('entity', entitySchema);

var mongoobjs = [ /* 2.2 million objects populated here in code */ ]; // works completely fine till here

async.map(mongoobjs, function (object, next) {
  var obj = new Entity({
    name: object.name
    , date: object.date
    , close: object.close
    , volume: object.volume
    , adjClose: object.adjClose
  });
  obj.save(next);
}, function () { console.log("Saved"); });
Thanks cdbajorin.
This seems to be a much better and somewhat faster batch approach for doing this. What I learned was that in my earlier approach, "new Entity(....)" was taking time and causing the memory overflow. I'm still not sure why.
So, rather than using this block:
Obj = new Entity({
  name: object.name
  , date: object.date
  , close: object.close
  , volume: object.volume
  , adjClose: object.adjClose
});
I just created plain objects and stored them in an array:
stockObj = {
  name: object.name
  , date: object.date
  , close: object.close
  , volume: object.volume
  , adjClose: object.adjClose
};
mongoobjs.push(stockObj); // array of objs
and used this command... and voila, it worked!
Entity.collection.insert(mongoobjs, function () { console.log("Saved successfully"); });
Node.js uses V8, which has the unfortunate property, from the perspective of developers coming from other interpreted languages, of severely restricting the amount of memory you can use to something like 1.7 GB regardless of available system memory.
There is really only one way, afaik, to get around this - use streams. Precisely how you do this is up to you. For example, you can simply stream data in continuously, process it as it's coming in, and let the processed objects get garbage collected. This has the downside of being difficult to balance input to output.
The approach we've been favoring lately is to have an input stream bring work and save it to a queue (e.g. an array). In parallel you can write a function that is always trying to pull work off the queue. This makes it easy to separate logic and throttle the input stream in case work is coming in (or going out) too quickly.
Say for example, to avoid memory issues, you want to stay below 50k objects in the queue. Then your stream-in function could pause the stream or skip the get() call if the output queue has > 50k entries. Similarly, you might want to batch writes to improve server efficiency. So your output processor could avoid writing unless there are at least 500 objects in the queue or if it's been over 1 second since the last write.
This works because javascript uses an event loop which means that it will switch between asynchronous tasks automatically. Node will stream data in for some period of time then switch to another task. You can use setTimeout() or setInterval() to ensure that there is some delay between function calls, thereby allowing another asynchronous task to resume.
Specifically addressing your problem, it looks like you are individually saving each object. This will take a long time for 2.2 million objects. Instead, there must be a way to batch writes.
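For illustration, a minimal sketch of that queue/throttle idea (inputStream, the thresholds, and the use of insertMany on the native collection are assumptions, not a prescribed API):
const queue = [];
const MAX_QUEUE = 50000;   // pause input above this many pending objects
const BATCH_SIZE = 500;    // flush writes in batches of this size

inputStream.on('data', (doc) => {
  queue.push(doc);
  if (queue.length > MAX_QUEUE) inputStream.pause();
});

setInterval(async () => {
  if (queue.length === 0) return;
  const batch = queue.splice(0, BATCH_SIZE);
  await Entity.collection.insertMany(batch);   // one batched write instead of a save() per object
  if (queue.length < MAX_QUEUE) inputStream.resume();
}, 1000);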
As an addition to the answers provided in this thread, I was successful with:
Bulk insert (or batch insertion) of 20,000+ documents (or objects)
Using low memory (250 MB) available within cheap offerings of Heroku
Using one instance, without any parallel processing
The Bulk operation as specified in the MongoDB native driver was used, and the following is the code-ish that worked for me:
var counter = 0;
var entity = {}, entities = []; // Initialize entities from a source such as a file, external database, etc.
var bulk = Entity.collection.initializeOrderedBulkOp();
var size = MAX_ENTITIES; // or `entities.length`. Defined in config; mine was 20,000

// The while-and-decrement construct is deemed faster than other loops available in the JavaScript ecosystem
while (size--) {
  entity = entities[size];
  if (entity && entity.id) {
    // Add the `{upsert: true}` parameter to create the document if it doesn't exist
    bulk.find({ id: entity.id }).update({ $set: { value: entity.value } });
  }
  console.log('processing --- ', entity, size);
}

bulk.execute(function (error) {
  if (error) return next(error);
  return next(null, { message: 'Synced vector data' });
});
Entity is a mongoose model.
Older versions of MongoDB may not support this bulk operation, as it was made available from version 3+.
I hope this answer helps someone.
Thanks.

Patterns for asynchronous but sequential requests

I have been writing a lot of NodeJS recently and that has forced me to attack some problems from a different perspective. I was wondering what patterns had developed for the problem of processing chunks of data sequentially (rather than in parallel) in an asynchronous request-environment, but I haven't been able to find anything directly relevant.
So to summarize the problem:
I have a list of data stored in an array format that I need to process.
I have to send this data to a service asynchronously, but the service will only accept a few at a time.
The data must be processed sequentially to meet the restrictions of the service, meaning that making a number of parallel asynchronous requests is not allowed.
Working in this domain, the simplest pattern I've come up with is a recursive one. Something like
function processData(data, start, step, callback) {
  if (start < data.length) {
    var chunk = data.slice(start, start + step);   // slice, not split, for an array
    queryService(chunk, start, step, function (e, d) {
      // Assume no errors
      // Could possibly do some matching between d and 'data' here to
      // update data with anything that the service may have returned
      processData(data, start + step, step, callback);
    });
  } else {
    callback(data);
  }
}
Conceptually, this should step through each chunk, but it feels unintuitively complex. I feel like there should be a simpler way of doing this. Does anyone have a pattern they tend to follow when approaching this kind of problem?
My first thought would be to rely on object encapsulation. Create an object that contains all of the information about what needs to be processed, plus the relevant data about what has been processed and what is being processed, and have the completion callback simply call the object's 'next' function, which in turn starts processing the next piece of data and updates the object. Essentially it works like an asynchronous for-loop.
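For illustration, a minimal sketch of that idea (queryService, the chunk size, and the exact call signature are placeholders; adapt them to your service):
function createProcessor(data, step, queryService, done) {
  return {
    position: 0,
    results: [],
    next: function () {
      if (this.position >= data.length) return done(null, this.results);
      var chunk = data.slice(this.position, this.position + step);
      var self = this;
      queryService(chunk, function (err, result) {
        if (err) return done(err);
        self.results.push(result);
        self.position += step;
        self.next();   // advance only after the current chunk has finished
      });
    }
  };
}

// Usage:
// createProcessor(items, 10, queryService, function (err, results) { /* ... */ }).next();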
