Streaming Large Files in Node.js

Streaming Large Files in Node.js - node.js

I have a situation where I'm going to need to stream for example 100 files concurrently, but I don't know weather I need subprocess or not because i'm not sure if have one really large file will block the other ones from streaming. Can anyone help clear up what should be done in this situation. Am I going to need to spawn a subprocess? Or can i just stream them all at the same time in a single process?

Node is async, read file or send data will not block your process, so you do not need to spawn a sub process.
Make sure that your app do not call any sync functions such as fs.readFileSync fs.readdirSync, etc.

Related

Node.js pipe console error to another program (make it async)

From Expressjs documentation:
To keep your app purely asynchronous, you’d still want to pipe
console.err() to another program
Qestions:
Is it enough to run my node app with stdout and stderr redirect to not block event loop? Like this: node app 2>&1 | tee logFile ?
If ad.1 answer is true, then how to achieve non-blocking logging while using Winston or Bunyan? They have some built in mechanism to achieve this or they just save data to specific file wasting cpu time of current Node.js process? Or maybe to achieve trully async logging they should pipe data to child process that performs "save to file" (is it still performance positive?) ? Can anyone explain or correct me if my way of thinking is just wrong?
Edited part: I can assume that piping data from processes A, B, ...etc to process L is cheaper for this specific processes (A, B, ...) than writing it to file (or sending over network).
To the point:
I am designing logger for application that uses nodejs cluster.
Briefly - one of processes (L) will handle data streams from others, (A, B, ...).
Process L will queue messages (for example line by line or some other special separator) and log it one by one into file, db or anywhere else.
Advantage of this approach is reducing load of processes that can spent more time on doing their job.
One more thing - assumption is to simplify usage of this library so user will only include this logger without any additional interaction (stream redirection) via shell.
Do you think this solution makes sense? Maybe you know a library that already doing this?

Let's set up some ground level first...
Writing to a terminal screen (console.log() etc.), writing to a file (fs.writeFile(), fs.writeFileSync() etc.) or sending data to a stream process.stdout.write(data) etc.) will always "block the event loop". Why? Because some part of those functions is always written in JavaScript. The minimum amount of work needed by these functions would be to take the input and hand it over to some native code, but some JS will always be executed.
And since JS is involved, it will inevitably "block" the event loop because JavaScript code is always executed on a single thread no matter what.
Is this a bad thing...?
No. The amount of time required to process some log data and send it over to a file or a stream is quite low and does not have significant impact on performance.
When would this be a bad thing, then...?
You can hurt your application by doing something generally called a "synchronous" I/O operation - that is, writing to a file and actually not executing any other JavaScript code until that write has finished. When you do this, you hand all the data to the underlying native code and while theoretically being able to continue doing other work in JS space, you intentionally decide to wait until the native code responds back to you with the results. And that will "block" your event loop, because these I/O operations can take much much longer than executing regular code (disks/networks tend to be the slowest part of a computer).
Now, let's get back to writing to stdout/stderr.
From Node.js' docs:
process.stdout and process.stderr differ from other Node.js streams in important ways:
They are used internally by console.log() and console.error(), respectively.
They cannot be closed (end() will throw).
They will never emit the 'finish' event.
Writes may be synchronous depending on what the stream is connected to and whether the system is Windows or POSIX:
Files: synchronous on Windows and POSIX
TTYs (Terminals): asynchronous on Windows, synchronous on POSIX
Pipes (and sockets): synchronous on Windows, asynchronous on POSIX
I am assuming we are working with POSIX systems below.
In practice, this means that when your Node.js' output streams are not piped and are sent directly to the TTY, writing something to the console will block the event loop until the whole chunk of data is sent to the screen. However, if we redirect the output streams to something else (a process, a file etc.) now when we write something to the console Node.js will not wait for the completion of the operation and continue executing other JavaScript code while it writes the data to that output stream.
In practice, we get to execute more JavaScript in the same time period.
With this information you should be able to answer all your questions yourself now:
You do not need to redirect the stdout/stderr of your Node.js process if you do not write anything to the console, or you can redirect only one of the streams if you do not write anything to the other one. You may redirect them anyway, but if you do not use them you will not gain any performance benefit.
If you configure your logger to write the log data to a stream then it will not block your event loop too much (unless some heavy processing is involved).
If you care this much about your app's performance, do not use Winston or Bunyan for logging - they are extremely slow. Use pino instead - see the benchmarks in their readme.

To answer (1) we can dive into the Express documentation, you will see a link to the Node.js documentation for Console, which links to the Node documentation on the process I/O. There it describes how process.stdout and process.stderr behaves:
process.stdout and process.stderr differ from other Node.js streams in important ways:
They are used internally by console.log() and console.error(), respectively.
They cannot be closed (end() will throw).
They will never emit the 'finish' event.
Writes may be synchronous depending on what the stream is connected to and whether the system is Windows or POSIX:
Files: synchronous on Windows and POSIX
TTYs (Terminals): asynchronous on Windows, synchronous on POSIX
Pipes (and sockets): synchronous on Windows, asynchronous on POSIX
With that we can try to understand what will happen with node app 2>&1 | tee logFile:
Stdout and stderr is piped to a process tee
tee writes to both the terminal and the file logFile.
The important part here is that stdout and stderr is piped to a process, which means that it should be asynchronous.
Regarding (2) it would depend on how you configured Bunyan or Winston:
Winston has the concept of Transports, which essentially allows you to configure where the log will go. If you want asynchronous logs, you should use any logger other than the Console Transport. Using the File Transport should be ok, as it should create a file stream object for this and that is asynchronous, and won't block the Node process.
Bunyan has a similar configuration option: Streams. According to their doc, it can accept any stream interface. As long as you avoid using the process.stdout and process.stderr streams here you should be ok.

atomically appending to a log file in nodejs

NodeJS is asynchronous, so for example if running an Express server, one might be in the middle of servicing one request, log it, and then start servicing another request and try to log it before the first has finished.
Since these are log files, it's not a simple write. Even if a write was atomic, maybe another process actually winds up writing at the offset the original process is about to and it winds up overwriting.
There is a synchronous append function (fs.appendFile) but this would require us to delay servicing a request to wait for a log file write to complete and I'm still not sure if that guarantees an atomic append. What is the best practice for writing to log files in NodeJS while ensuring atomicitiy?

one might be in the middle of servicing one request, log it, and then start servicing another request and try to log it before the first has finished.
The individual write calls will be atomic, so as long as you make a single log write call per request, you won't have any corruption of log messages. It is normal, however if you log multiple messages while processing a request, for those to be interleaved between many different concurrent requests. Each message is intact, but they are in the log file in chronological order, not grouped by request. That is fine. You can filter on a request UUID if you want to follow a single request in isolation.
Even if a write was atomic, maybe another process actually winds up writing at the offset the original process is about to and it winds up overwriting.
Don't allow multiple processes to write to the same file or log. Use process.stdout and all will be fine. Or if you really want to log directly to the filesystem, use an exclusive lock mechanism.
What is the best practice for writing to log files in NodeJS while ensuring atomicitiy?
process.stdout, one write call per coherent log message. You can let your process supervisor (systemd or upstart) write your logs for you, or use a log manager such as multilog, sysvlogd and pipe your stdout to them and let them handle writing to disk.

fs.createWriteStream over several processes

How can I implement a system where multiple Node.js processes write to the same file with fs.createWriteStream, such that they don't overwrite data? It looks like the default setup for fs.createWriteStream is that the file is cleared out when that method is called. My goal is to clear out the file once, and then have all other subsequent writers only append data.
Should I use fs.createWriteStream and then fs.appendFile? Or is there a way to open up a stream for each process, not just for the first process to open the file?

Should I use fs.createWriteStream and then fs.appendFile?
you can use either.
with fs.createWriteStream you have to change the flag like this:
fs.createWriteStream('your_file',{
flags: 'a+', // default is 'w' (just 'a' might be enough here, i'm not sure)
})
this should create the file if it doesn't exist or open it with write access if it exists and set the pointer to end. (append mode)
How to use fs.appendFile should be clear and it does pretty much the same.
Now the problem with multiple processes accessing the same file. Obviously only one process can open the same file with write access at the same time.
Therefore you need to wait for the file to be released if another process has the write access. You will probably need a library for that.
this one for example: https://www.npmjs.com/package/lockup
or this one: https://github.com/Perennials/mutex-node
you can also find alot more here: https://www.npmjs.com/browse/keyword/lock
or here: https://www.npmjs.com/browse/keyword/mutex
I have not tried any of those libraries but the one I posted and several others on the list should do exactly what you need.

Writing on a single file from multiple processes, ensuring data integrity, it is a fairly complex operation that you can orchestrate using File locking.
However, you have two simpler approaches:
Writing on a temporary file for each process, and then concatenate
the files at the end of the operations.
Transmitting what you need to write to a dedicated, single process and delegate the writing execution to it. Keep in mind that sending messages among processes can be expensive.

What's the right way to write to file from Node.js to avoid bottlenecking?

I'm curious what the correct methodology is to write to a log file from a process that might be called dozens (or maybe even thousands) of times simultaneously.
I have a node process which is called via http and I wish to log from it, but I don't want it to bottleneck as it attempts to open/write/close the same file from all the various simultaneous requests.
I've read that stderr might be the answer to this problem, but am curious what makes that approach any less bottlenecky. At the end of the day, if stderr is going to some central location, isn't it going to have the exact same problem?

Best practice for node (e.g. http://12factor.net/) is to write to stdout or stderr. The expectation is that the OS will handle the file management / throughput that you want, or else you can have a custom-written log collector that can do it the way you want and redirect stdout or stderr to it.

Difference between fs.writeFile and fs.writeStream

I'm a little confused between the 2 methods, hope somebody could enlighten
me on the difference between fs.open->fs.write, fs.writeFile, fs.writeStream.

fs.open and fs.write are for low-level access, similar to what you get when you code in C. fs.open opens a file and fs.write writes to it.
A fs.WriteStream is a stream that opens the file in the background and queues writes until the file is ready. Also, as it implements the stream API, you can use it in a more generic way, just like a network stream or so. You'll e.g. want this when a user uploads a file to your server - take the incoming HTTP POST stream, pipe() it to the WriteStream. Very easy.
fs.writeFile is a high-level method for writing a bunch of data you have in RAM to a file. It doesn't support streaming or so, so it's a bad idea for large files or performance-critical stuff. You'll want this if you write out small JSON files or so in your code.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string