Most efficiently downloading, unzipping, and analyzing many files in Node JS - node.js

I have to download a large number of compressed files onto my Node JS server from a third-party host, unzip them, analyze them, and store them. There are a little over 18,000 XML files, each between about 0.01 and 0.06 MB. The files are split into 8 compressed folders of greatly varying size.
Right now, this is my process:
Download the compressed files using the request library
request({ url: fileUrl, encoding: null }, function(err, resp, body) {...});
Write the downloaded files to a directory
fs.writeFile(output, body, function(err) {...});
Unzip the downloaded material using extract-zip and place in a new directory
unzip(output, { dir : directory }, function (err) {...});
Delete the downloaded zip file
fs.unlink('/a-directory/' + output, (err) => { if (err) console.log(err); });
Get the items in the directory
fs.readdir(fromDir, function(err, items) {...});
For each item (XML file), read it
fs.readFile(fromDir + '/' + item, 'utf8', function(err, xmlContents) {...});
For each read XML file, convert it to a JSON
let bill = xmlToJsonParser.toJson(xmlContents)
Will do some other stuff, but I haven't written that part yet
I can post more complete code if that would help anyone.
As you can see, there are a bunch of steps here, and I have a hunch that some of them can be removed or at least made more efficient.
What are your suggestions for improving the performance? Right now the process completes, but I hit 100% CPU every time, which I am fairly certain is bad.

Some general guidelines for scaling this type of work:
Steps that are entirely async I/O scale really well in node.js.
When doing lots of I/O operations, you will want to be able to control how many are in-flight at the same time to control memory usage and TCP resource usage. So, you would probably launch several hundred requests at a time, not 18,000 all at once; as one finishes, you launch the next one (see the sketch after this list).
Steps that use a meaningful amount of CPU should be in a process that you can run N of (often as many as you have CPUs). This lets the CPU-bound work scale.
Try to avoid keeping more in memory than you need to. If you can pipe something directly from network to disk, that can significantly reduce memory usage compared to buffering the entire file and then writing the whole thing to disk.
Figure out some way to manage a work queue of jobs waiting for the worker processes to run. You can either have your main app maintain a queue and use http to ask it for the next job or you can even work it all through the file system with lock files.
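As a rough illustration of the in-flight limit, here is a minimal sketch under the question's setup (the fileUrls array and the limit of 200 are hypothetical; tune the number to your memory and TCP budget):

const request = require('request');

const fileUrls = [/* ...the ~18,000 zip URLs... */];  // hypothetical list
const CONCURRENCY = 200;                              // arbitrary; measure and adjust

let inFlight = 0;
let nextIndex = 0;

function launchNext() {
  // keep launching until the limit is reached or the list is exhausted
  while (inFlight < CONCURRENCY && nextIndex < fileUrls.length) {
    const url = fileUrls[nextIndex++];
    inFlight++;
    request({ url: url, encoding: null }, function (err, resp, body) {
      inFlight--;
      if (err) console.error(err);
      // ...hand the body off to the next step here...
      launchNext();  // a slot freed up, start another download
    });
  }
}

launchNext();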
So, here are some more specifics based on these guidelines:
I'd say use your main server process for steps 1 and 2. Neither of the first two steps are CPU intensive so a single server process should be able to handle a zillion of those. All they are doing is async I/O. You will have to manage how many request() operations are in flight at the same time to avoid overloading your TCP stack or your memory usage, but other than that, this should scale just fine since it's only doing async I/O.
You can reduce memory usage in steps 1 and 2 by piping the response directly to the output file so as bytes arrive, they are immediately written to disk without holding the entire file in memory.
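A minimal sketch of that piping approach (fileUrl and output are the same placeholder names the question uses):

const fs = require('fs');
const request = require('request');

request(fileUrl)
  .on('error', function (err) { console.error(err); })
  .pipe(fs.createWriteStream(output))   // bytes are written as they arrive
  .on('finish', function () {
    // the zip is fully on disk; it can now be handed to a worker process
  });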
Then write another node.js app that carries out steps 3 - 8 (steps 3 and perhaps 7 are CPU intensive). If you write them in a way that they just "check out" a file from a known directory and work on it, you should be able to make it so that you can run as many of these processes as you have CPUs and thus gain scale while also keeping the CPU load away from your main process.
The check-out function can either be done via one central store (like a redis store or even just a simple server of your own that maintains a work queue) that keeps track of which files are available for work or you could even implement it entirely with file system logic using lock files.
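A minimal sketch of the lock-file variant, assuming each worker claims a file by atomically renaming it into its own directory (incomingDir and claimedDir are hypothetical names; rename is atomic on a single filesystem, so only one worker can win a given file):

const fs = require('fs');
const path = require('path');

function checkOutOne(incomingDir, claimedDir, callback) {
  fs.readdir(incomingDir, function (err, items) {
    if (err) return callback(err);
    if (items.length === 0) return callback(null, null);  // nothing left to claim
    const name = items[0];
    // only one process can successfully rename a given file
    fs.rename(path.join(incomingDir, name), path.join(claimedDir, name), function (err) {
      if (err) return checkOutOne(incomingDir, claimedDir, callback);  // lost the race, re-list and retry
      callback(null, path.join(claimedDir, name));
    });
  });
}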
right now the process completes, but I hit 100% CPU every time, which I am fairly certain is bad.
If you only have one process and it's at 100% CPU, then you can increase scale by getting more processes involved.
As you can see, there are a bunch of steps here, and I have a hunch that some of them can be removed or at least made more efficient.
Some ideas:
As mentioned before, pipe your request directly to the next operation rather than buffer the whole file.
If you have the right unzip tools, you could even pipe the request right to an unzipper which is piped directly to a file. If you did this, you'd have to scale the main process horizontally to get more CPUs involved, but this would save reading and writing the compressed file to/from disk entirely. You could conceivably combine steps 1-4 into one stream write with an unzip transform.
If you did the transform stream described in step 2, you would then have a separate set of processes that carry out steps 5-8.
Here are a couple of libraries that can be used to combine pipe and unzip (a sketch using the first one follows the list):
unzip-stream
node-unzip-2
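A rough sketch of the combined pipe with unzip-stream (the Extract call is per that library's README; verify the details against the version you install):

const request = require('request');
const unzip = require('unzip-stream');

request(fileUrl)                             // compressed bytes arrive from the host...
  .pipe(unzip.Extract({ path: directory }))  // ...and are inflated straight to disk
  .on('close', function () {
    // every entry of the zip is now extracted under `directory`,
    // without the intermediate .zip ever touching the disk
  });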

Related

What is the fastest way to read & write tiny but MANY files using Nodejs?

I have a node application which handles JSON files: it reads and parses files and writes new files. And sometimes, when necessary, the files become a massive swarm. The current reading speed looks reasonable to me, but the writing speed seems a little slow.
I'd like to improve this processing speed.
Before touching this program, I had tried multi-threading in a Python application of mine that does similar tasks but handles image files, and the threading successfully reduced its response time.
I wonder if I can use Node's worker_threads to get the same effect, because the Node documentation says:
They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be.
https://nodejs.org/api/worker_threads.html
The problem is that I don't know whether the current speed is the fastest the Node environment can achieve, or whether it can still be improved without worker_threads.
These are my attempts at improvement. My program reads and writes files one by one from a list of file paths, using the sync fs functions readFileSync() and writeFileSync(). First, I thought accessing many files synchronously was not Node-ish, so I promisified the fs functions (readFile(), writeFile()), pushed the results into a list of promise objects, and called await Promise.all(promisesList). But this didn't help at all. It was even slower.
For the second try, I gave up generating tons of promises and made a single promise. It kept watching the number of processed files and called resolve() when that number equalled the total number of files.
const waiter = new Promise<boolean>((resolve, rejects) => {
  const loop: () => void = () =>
    processedCount === fileLen ? resolve(true) : setTimeout(loop);
  loop();
});
I awaited only this promise, and this was the slowest.
Now I think this shows that "asynchronous" does not mean "parallel". So, am I misunderstanding the documentation's explanation? Should I use worker_threads to improve the file I/O speed in this case, or is there any better solution? Maybe the answer is not to use Node for this kind of processing; I'd love to, but today is Nov 25th, sadly...
The real bottleneck here will be the file system implementation. Running up multiple threads to read and / or write multiple files in parallel will give you some speedup, but you quickly run into the file system bottleneck.
As a general rule, typical file systems do not handle the use-case of "gazillions of tiny files" well. And it gets progressively worse if the files are on slow disks, a remote file system, a hierarchical storage system, etc.
The best solution is to redesign your application so that it doesn't organize its data like that. Better alternatives involve combinations of:
using an SQL or NOSQL database to store the data
using a flat-file database like SQLite or BDB (see the sketch after this list)
reading and writing TAR or ZIP archives
storing / buffering the data in memory.
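For example, a hedged sketch of the flat-file-database route using the third-party better-sqlite3 package (not something the question already uses; check the exact API against its docs) to keep every JSON document in a single file:

const Database = require('better-sqlite3');

const db = new Database('documents.db');
db.prepare('CREATE TABLE IF NOT EXISTS docs (name TEXT PRIMARY KEY, body TEXT)').run();

const upsert = db.prepare('INSERT OR REPLACE INTO docs (name, body) VALUES (?, ?)');
const writeMany = db.transaction(function (records) {
  for (const r of records) upsert.run(r.name, JSON.stringify(r.data));
});

// `records` is a hypothetical array of { name, data } objects;
// one transaction replaces thousands of tiny file writes
writeMany(records);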
If you are trying to get better performance, Python is not where you should look. For a start, a CPU bound multi-threaded application is effectively constrained to a single core ... due to the GIL. And your python code is typically not compiled to native code.
A language like C, C++ or even Java would be a better choice. But parallelizing an application's file I/O is difficult and the results tend to be disappointing. It is generally better to do it a different way; i.e. avoid application architectures that use lots of little files.
Have you tried the Node streams API? There is also the JSONStream npm package for parsing JSON stream data. Please have a look.
const fs = require('fs');

// stream the JSON file from source to destination instead of buffering it whole
let sourceFileStream = fs.createReadStream('./file1.json');
let destinationFileStream = fs.createWriteStream('./temp/file1.json');

sourceFileStream.pipe(destinationFileStream);

How do I perform operations like read/write to a heavy file in node.js?

I am quite new to node.js and I want to perform operations (like read, write, or store in a DB) on large files (typically 5 GB ~ 10 GB).
What are the possible ways to do this fast without affecting the main thread (UI)? Do I need to implement multithreading?
I think that since I/O operations are asynchronous, they will never affect the main thread. I tried to read a large file and write the contents to the HTTP response object like this:
var http = require('http'),
    fs = require('fs');

fs.readFile('largefile.txt', function(err, data) {
  if (err) {
    throw err;
  }
  http.createServer(function(request, response) {
    response.writeHead(200, {
      "Content-Type": "text/plain"
    });
    response.end(data);
  }).listen(8080);
  console.log("server started");
});
The size of largefile.txt here is only 0.25 GB, and it took almost 5 minutes for this program to run. In reality, I want the size to be (as I mentioned earlier) 5~10 GB, and the file type can be .csv or .xls. How should I do that? Please explain the approach, with examples if possible.
Reading from disk to working program memory is very slow. This is a hardware limitation.
If the file is CSV (Comma-separated values separated by newlines), you probably want to read it line by line, or search through for the right line and then read, instead of reading the whole thing into memory and then printing the whole thing out. If you read it line by line at least you're updating something as it's being read.
For a start, you can use fs.read instead of fs.readFile to read the file in chunks (or even character by character), looking for newline characters.
But a quick search for "nodejs read file line" shows there are many other ways to approach this with Node.
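For instance, the core readline module can walk a big CSV one line at a time without ever holding the whole file (a minimal sketch; 'largefile.csv' is just a placeholder name):

const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('largefile.csv'),
  crlfDelay: Infinity  // treat \r\n as a single line break
});

rl.on('line', function (line) {
  // process one row at a time; memory use stays flat
});

rl.on('close', function () {
  console.log('done');
});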
Edit:
I can't comment yet, but regarding child processes: as jfriend00 and SirDemon said, although NodeJS uses non-blocking I/O (reading from disk into memory doesn't block code) and is generally event-oriented/asynchronous in design (execution may swap between sections of code while it waits on things), the code itself runs single-threaded on a single CPU (code still blocks code). So a child process lets you make use of another CPU. It was all designed for dynamic servers, so you can have code running and files being read almost all the time, but without the overhead of maintaining a new thread/process for each file read (which servers typically use thread pools for). (I think that's correct?)
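As a small illustration of that child-process point, the core child_process module can push the heavy work onto another CPU while the main process keeps serving requests (a sketch with a hypothetical worker.js):

const { fork } = require('child_process');

const worker = fork('./worker.js');      // a separate Node process, free to use another CPU

worker.send({ file: 'largefile.csv' });  // hand the job over

worker.on('message', function (result) {
  // the summary comes back without the parsing ever blocking this event loop
  console.log('worker finished:', result);
});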

How to pipeline in node.js to redis?

I have lots of data to insert (SET / INCR) into a Redis DB, so I'm looking for pipelining / mass insertion through node.js.
I couldn't find any good example/API for doing so in node.js, so any help would be great!
Yes, I must agree that there is a lack of examples for this, but I managed to create a stream over which I sent several insert commands in a batch.
You should install the redis-stream module:
npm install redis-stream
And this is how you use the stream:
var redis = require('redis-stream'),
    client = new redis(6379, '127.0.0.1');

// Open stream
var stream = client.stream();

// Example of setting 10000 records
for (var record = 0; record < 10000; record++) {
  // Command is an array of arguments:
  var command = ['set', 'key' + record, 'value'];
  // Send command to stream, but parse it before
  stream.redis.write(redis.parse(command));
}

// Create event when stream is closed
stream.on('close', function () {
  console.log('Completed!');
  // Here you can create stream for reading results or similar
});

// Close the stream after batch insert
stream.end();
Also, you can create as many streams as you want and open/close them as you want at any time.
There are several examples of using a Redis stream in node.js on the redis-stream module's page.
In node_redis, all commands are pipelined:
https://github.com/mranney/node_redis/issues/539#issuecomment-32203325
You might want to look at batch() too. The reason why it'd be slower with multi() is because it's transactional. If something failed, nothing would be executed. That may be what you want, but you do have a choice for speed here.
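A short sketch of that choice with the older callback-style node_redis client (check the API of whichever client version you use): batch() queues the same commands as multi() but skips the MULTI/EXEC transaction:

const redis = require('redis');
const client = redis.createClient();

const batch = client.batch();  // like multi(), minus the transactional guarantee
for (let i = 0; i < 10000; i++) {
  batch.set('key' + i, 'value' + i);
}

batch.exec(function (err, replies) {
  if (err) return console.error(err);
  console.log(replies.length + ' commands executed');
});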
The redis-stream package doesn't seem to make use of Redis' mass-insert functionality, so it's also slower than the mass insert that Redis' site describes using redis-cli.
Another idea would be to use redis-cli and give it a file to stream from, which this NPM package does: https://github.com/almeida/redis-mass
Not keen on writing to a file on disk first? This repo: https://github.com/eugeneiiim/node-redis-pipe/blob/master/example.js
...also streams to Redis, but without writing to file. It streams to a spawned process and flushes the buffer every so often.
On Redis' site under mass insert (http://redis.io/topics/mass-insert) you can see a little Ruby example. The repo above basically ported that to Node.js and then streamed it directly to that redis-cli process that was spawned.
So in Node.js, we have:
var redisPipe = spawn('redis-cli', ['--pipe']);
spawn() returns a reference to a child process that you can pipe to with stdin. For example: redisPipe.stdin.write().
You can just keep writing to a buffer, streaming that to the child process, and then clearing it every so often. This then won't fill it up and will therefore be a bit better on memory than perhaps the node_redis package (that literally says in its docs that data is held in memory) though I haven't looked into it that deeply so I don't know what the memory footprint ends up being. It could be doing the same thing.
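Roughly, that looks like the sketch below: encode each command in the Redis protocol yourself (the same thing the Ruby example on the mass-insert page does) and stream it into the spawned redis-cli --pipe process:

const { spawn } = require('child_process');

// encode one command, e.g. ['SET', 'key', 'value'], in the Redis protocol (RESP)
function toRESP(args) {
  let out = '*' + args.length + '\r\n';
  for (const a of args) {
    out += '$' + Buffer.byteLength(a) + '\r\n' + a + '\r\n';
  }
  return out;
}

const redisPipe = spawn('redis-cli', ['--pipe']);
redisPipe.stdout.pipe(process.stdout);  // redis-cli prints a summary when its stdin closes

for (let i = 0; i < 100000; i++) {
  // a fuller version would pause when write() returns false and resume on 'drain'
  redisPipe.stdin.write(toRESP(['SET', 'key' + i, 'value' + i]));
}
redisPipe.stdin.end();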
Of course keep in mind that if something goes wrong, it all fails. That's what tools like fluentd were created for (and that's yet another option: http://www.fluentd.org/plugins/all - it has several Redis plugins)...But again, it means you're backing data on disk somewhere to some degree. I've personally used Embulk to do this too (which required a file on disk), but it did not support mass inserts, so it was slow. It took nearly 2 hours for 30,000 records.
One benefit to a streaming approach (not backed by disk) is if you're doing a huge insert from another data source. Assuming that data source returns a lot of data and your server doesn't have the hard disk space to support all of it - you can stream it instead. Again, you risk failures.
I find myself in this position as I'm building a Docker image that will run on a server with not enough disk space to accommodate large data sets. Of course it's a lot easier if you can fit everything on the server's hard disk...But if you can't, streaming to redis-cli may be your only option.
If you are really pushing a lot of data around on a regular basis, I would probably recommend fluentd to be honest. It comes with many great features for ensuring your data makes it to where it's going and if something fails, it can resume.
One problem with all of these Node.js approaches is that if something fails, you either lose it all or have to insert it all over again.
By default, node_redis, the Node.js library, sends commands in pipelines and automatically chooses how many commands will go into each pipeline (https://github.com/NodeRedis/node-redis/issues/539#issuecomment-32203325). Therefore, you don't need to worry about this. However, other Redis clients may not use pipelines by default; you will need to check the client documentation to see how to take advantage of pipelines.

Are callbacks for requests a bad practice in node.js?

Imagine you want to download an image or a file, this would be the first way the internet will teach you to go ahead:
request(url, function(err, res, body) {
  fs.writeFile(filename, body);
});
But doesn't this accumulate all data in body, filling the memory?
Would a pipe be totally more efficient?
request(url).pipe(fs.createWriteStream(filename));
Or is this handled internally in a similar matter, buffering the stream anyway, making this irrelevant?
Furthermore, if I want to use the callback but not the body (because you can still pipe), will this memory buffer still be filled?
I am asking because the first (callback) method allows me to chain downloads instead of launching them in parallel(*), but I don't want to fill a buffer I'm not going to use either. So I need the callback if I don't want to resort to something fancy like async just to use its queue to prevent this.
(*) Which is bad because if you just request too many files before they are complete, the async nature of request will cause node to choke to death in an overdose of events and memory loss. First you'll get these:
"possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit."
And when stretching it, 500 piped requests will fill your memory up and crash node. That's why you need the callback instead of the pipe, so you know when to start the next file.
But doesn't this accumulate all data in body, filling the memory?
Yes, many operations such as your first snippet buffer data into memory for processing. Yes this uses memory, but it is at least convenient and sometimes required depending on how you intend to process that data. If you want to load an HTTP response and parse the body as JSON, that is almost always done via buffering, although it's possible with a streaming parser, it is much more complicated and usually unnecessary. Most JSON data is not sufficiently large such that streaming is a big win.
Or is this handled internally in a similar matter, making this irrelevant?
No, APIs that provide you an entire piece of data as a string use buffering and not streaming.
However, multimedia data, yes, you cannot realistically buffer it to memory and thus streaming is more appropriate. Also that data tends to be opaque (you don't parse it or process it), which is also good for streaming.
Streaming is nice when circumstances permit it, but that doesn't mean there's anything necessarily wrong with buffering. The truth is buffering is how the vast majority of things work most of the time. In the big picture, streaming is just buffering 1 chunk at a time and capping them at some size limit that is well within the available resources. Some portion of the data needs to go through memory at some point if you are going to process it.
Because if you just request too many files one by one, the async nature of request will cause node to choke to death in an overdose of events and memory loss.
Not sure exactly what you are stating/asking here, but yes, writing effective programs requires thinking about resources and efficiency.
See also substack's rant on streaming/pooling in the hyperquest README.
I figured out a solution that renders the questions about memory irrelevant (although I'm still curious).
if I want to use the callback but not the body (because you can still pipe), will this memory buffer still be filled?
You don't need the callback from request() in order to know when the request is finished. The pipe() will close itself when the stream 'ends'. The close emits an event and can be listened for:
request(url).pipe(fs.createWriteStream(filename)).on('close', function() {
  next();
});
Now you can queue all your requests and download files one by one.
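Put together, the one-by-one queue is just a next() function driven by that 'close' handler (a minimal sketch; urls and filenameFor() are placeholders):

const fs = require('fs');
const request = require('request');

const urls = [/* ...file URLs... */];  // placeholder list
let i = 0;

function next() {
  if (i >= urls.length) return console.log('all downloads finished');
  const url = urls[i++];
  request(url)
    .pipe(fs.createWriteStream(filenameFor(url)))  // filenameFor() is a placeholder helper
    .on('close', next);                            // only one response is in flight at a time
}

next();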
Of course you can vacuum the internet using 8 parallel requests all the time with libraries such as async.queue, but if all you want to do is get some files with a simple script, async is probably overkill.
Besides, you're not gonna want to max out your system resources for a single trick on a multi-user system anyway.

Node.js async parallel - what consequences are?

There is code,
async.series(tasks, function (err) {
  return callback({ message: 'tasks execution error', error: err });
});
where, tasks is array of functions, each of it peforms HTTP request (using request module) and calling MongoDB API to store the data (to MongoHQ instance).
With my current input (~200 tasks to execute), it takes:
[normal mode] collection cycle: 1356.843 sec. (22.61405 mins.)
But simply changing from series to parallel gives a magnificent benefit. Almost the same set of tasks runs in ~30 secs instead of ~23 mins.
But, knowing that nothing is free, I'm trying to understand the consequences of that change. Can I assume that the number of open sockets will be much higher, memory consumption will grow, and the DB servers will be hit harder?
The machine I run the code on is an Ubuntu box with only 1 GB of RAM. I saw the app hang there once; could that be caused by a lack of resources?
Your intuition is correct that the parallelism doesn't come for free, but you certainly may be able to pay for it.
Using a load testing module (or collection of modules) like nodeload, you can quantify how this parallel operation is affecting your server to determine if it is acceptable.
Async.parallelLimit can be a good way of limiting server load if you need to, but first it is important to discover if limiting is necessary. Testing explicitly is the best way to discover the limits of your system (eachLimit has a different signature, but could be used as well).
Beyond this, common pitfalls using async.parallel include wanting more complicated control flow than that function offers (which, from your description doesn't seem to apply) and using parallel on too large of a collection naively (which, say, may cause you to bump into your system's file descriptor limit if you are writing many files). With your ~200 request and save operations on 1GB RAM, I would imagine you would be fine as long as you aren't doing much massaging in the event handlers, but if you are experiencing server hangs, parallelLimit could be a good way out.
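For reference, a minimal sketch of the parallelLimit variant, reusing the question's tasks array and callback (the limit of 20 is arbitrary, which is exactly why testing matters):

const async = require('async');

// at most 20 of the ~200 HTTP + MongoDB tasks run at any moment
async.parallelLimit(tasks, 20, function (err, results) {
  if (err) return callback({ message: 'tasks execution error', error: err });
  console.log('completed ' + results.length + ' tasks');
});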
Again, testing is the best way to figure these things out.
I would point out that async.parallel executes multiple functions concurrently, not (truly) in parallel. It is more like virtual parallelism.
Executing concurrently is like running different programs on a single CPU core via multitasking/scheduling. True parallel execution would be running a different program on each core of a multi-core CPU. This is important because node.js has a single-threaded architecture.
The best thing about node is that you don't have to worry about I/O. It handles I/O very efficiently.
In your case you are storing data in MongoDB, which is mostly I/O. So running the tasks in parallel will use up your network bandwidth, and if you are reading/writing from disk, disk bandwidth too. Your server will not hang because of CPU overload.
The consequence of this would be that if you overburden your server, your requests may fail. You may get an EMFILE error (too many open files). Each socket counts as a file. Usually connections are pooled, meaning that to establish a connection a socket is picked from the pool and returned to it when finished. You can increase the file descriptor limit with ulimit -n xxxx.
You may also get socket errors when overburdened, like ECONNRESET (Error: socket hang up), ECONNREFUSED, or ETIMEDOUT. Handle them properly, and also check the maximum number of simultaneous connections for the MongoDB server.
Finally, the server can hang because of garbage collection. Garbage collection kicks in after your memory increases to a certain point, then runs periodically. The max heap memory V8 can have is around 1.5 GB, so expect GC to run frequently if memory usage is high. Node will crash with "process out of memory" if it asks for more than that limit. So fix the memory leaks in your program. You can look at these tools.
The main downside you'll see here is a spike in database server load. That may or may not be okay depending on your setup.
If your database server is a shared resource then you will probably want to limit the parallel requests by using async.eachLimit instead.
You'll notice the difference if multiple users connect:
In this case the processor can handle multiple operations, and async tries to run the operations of multiple users roughly equally:
T = task
U = user
(T1.U1 = task 1 of user 1)
T1.U1 => T1.U2 => T2.U1 => T8.U3 => T2.U2 => etc
This is the opposite of atomicity (so maybe watch out for atomicity on special DB operations - but that's another topic).
So it may be faster to run:
T2.U1 before T1.U1
- this is no problem until
T2.U1 depends on T1.U1
- this is preventable by using callbacks / that is what callbacks are for
...hope this is what you wanted to know... it's a bit late here
