stdout collisions on node


If multiple processes are simultaneously writing to stdout then there is nothing to stop the streams from interleaving. This is what I mean by a collision.
According to the comments in the node source, it should be possible to avoid collisions in process.stdout. I tried this and it helps, but I still get collisions: without the writing flag they happen every time, and with the flag the rate drops to about 40%, which is still very significant.
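page.on('onConsoleMessage', function log(message) {
    // Treat stdout as busy if a write is in flight or data is still buffered.
    var writing = process.stdout._writableState.writing
        || process.stdout._writableState.bufferProcessing
        || process.stdout.bufferSize;
    if (writing)
        // Retry later. The closure keeps the original message, because
        // process.nextTick passes no arguments to its callback.
        process.nextTick(() => log(message));
    else
        process.stdout.write('-> ' + message + '\n');
});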
What is the best way to avoid collisions on process.stdout?
The above routine is competing with Winston for stdout.
node v5.12.0
Windows 10
This problem only happens when using the Run console in WebStorm. The output is not mixed up when running node in PowerShell or from cmd. I raised a ticket for this at JetBrains.

If you have multiple writers to the same stream, I don't think that you'll get interleaving. Even if a single writer is logging multiple lines in succession, those lines will be buffered (if there is buffering going on) in the correct order, and when another writer is logging its lines, they will be appended to the buffer, behind the previously logged lines.
Interleaving can occur when writers are writing to different streams, such as one writing to stdout and the other to stderr. At some point, when the output buffer of one stream fills up, it gets flushed to the console, regardless of any other streams that may also be writing to the console.
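One way to picture the "single buffer, in order" behaviour described above is to funnel every message through one queue, so each line reaches the stream as a single write() call. This is only a minimal sketch: createLineWriter is a hypothetical helper name, and the in-memory Writable stands in for process.stdout. It does not address interleaving caused by a console that merges several streams.

```javascript
const { Writable } = require('stream');

// Hypothetical helper: every message goes through one queue, and the next
// line is written only after the previous write's callback has fired.
function createLineWriter(stream) {
  const queue = [];
  let busy = false;

  function flush() {
    if (busy || queue.length === 0) return;
    busy = true;
    const line = queue.shift();
    stream.write(line, () => {
      busy = false;
      flush();
    });
  }

  return function writeLine(prefix, message) {
    queue.push(prefix + message + '\n');
    flush();
  };
}

// Demo with an in-memory stream standing in for process.stdout.
const chunks = [];
const sink = new Writable({
  write(chunk, encoding, callback) {
    chunks.push(chunk.toString());
    callback();
  },
});

const writeLine = createLineWriter(sink);
writeLine('-> ', 'page message');
writeLine('winston: ', 'app message');
```

In a real program the same writeLine function would be shared by every part of the code that logs, for example by passing process.stdout instead of sink.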

Related

How to write to stdout without being blocking under linux?

I've written a log-to-stdout program that produces logs, and another executable, read-from-stdin (for example filebeat), collects the logs from stdin. My problem is that log-to-stdout may produce a short burst of output faster than read-from-stdin can accept it, which blocks the log-to-stdout process. Is there a Linux API to tell whether the stdout file descriptor can be written to (up to N bytes) without blocking?
I've found some comments in nodejs process.stdout
In the case they refer to pipes:
They are blocking in Linux/Unix.
They are non-blocking like other streams in Windows.
Does that mean that under Linux it's impossible to do a non-blocking write on stdout? Some documents reference a non-blocking file operation mode (https://www.linuxtoday.com/blog/blocking-and-non-blocking-i-0/); does it apply to stdout too? Because I'm using a third-party logging library (which expects stdout to work in blocking mode), can I check whether stdout is writable in non-blocking mode (before calling the logging library) and then switch stdout back to blocking mode, so that from the logging library's perspective the stdout fd still works as before? (If I can tell that stdout would block, I'll throw the output away, since not blocking is more important than complete logs in my usage.)
(Or is there an auto-drop-pipe command that automatically drops lines if the pipeline would block, so that I can call
log-to-stdout | auto-drop-pipe --max-lines=100 --drop-head-if-full | read-from-stdin)
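The auto-drop-pipe in the question is hypothetical, but its drop-head behaviour could be sketched as a small Node.js filter. DropHeadQueue and pump are made-up names, and the sketch assumes the output stream reports backpressure through write() returning false and the 'drain' event.

```javascript
const readline = require('readline');

// Bounded FIFO of lines: when full, the oldest line is dropped
// (the "--drop-head-if-full" behaviour from the question).
class DropHeadQueue {
  constructor(maxLines) {
    this.maxLines = maxLines;
    this.lines = [];
    this.dropped = 0;
  }

  push(line) {
    if (this.lines.length >= this.maxLines) {
      this.lines.shift();
      this.dropped += 1;
    }
    this.lines.push(line);
  }

  shift() {
    return this.lines.shift();
  }

  get length() {
    return this.lines.length;
  }
}

// Reads lines from `input` as they arrive and forwards them to `output`
// only while it accepts data; lines beyond `maxLines` are dropped.
function pump(input, output, maxLines) {
  const queue = new DropHeadQueue(maxLines);
  let waitingForDrain = false;

  function flush() {
    while (!waitingForDrain && queue.length > 0) {
      // write() returns false once the stream's internal buffer is full.
      if (!output.write(queue.shift() + '\n')) {
        waitingForDrain = true;
        output.once('drain', () => {
          waitingForDrain = false;
          flush();
        });
      }
    }
  }

  const rl = readline.createInterface({ input });
  rl.on('line', (line) => {
    queue.push(line);
    flush();
  });
  rl.on('close', flush);
  return queue;
}

// Usage (not run here): pump(process.stdin, process.stdout, 100);
```

The producer is never slowed down by the filter's queue; the cost is that old lines are lost when the consumer falls behind, which matches the trade-off stated in the question.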

Node.js pipe console error to another program (make it async)

From Expressjs documentation:
To keep your app purely asynchronous, you’d still want to pipe
console.err() to another program
Questions:
1. Is it enough to run my node app with stdout and stderr redirected to not block the event loop? Like this: node app 2>&1 | tee logFile ?
2. If the answer to 1 is yes, then how do I achieve non-blocking logging while using Winston or Bunyan? Do they have some built-in mechanism for this, or do they just save data to a specific file, wasting CPU time of the current Node.js process? Or, to achieve truly async logging, should they pipe data to a child process that performs the "save to file" (is that still a performance gain?)? Can anyone explain, or correct me if my way of thinking is just wrong?
Edited part: I can assume that piping data from processes A, B, etc. to process L is cheaper for these specific processes (A, B, ...) than writing it to a file (or sending it over the network).
To the point:
I am designing a logger for an application that uses the nodejs cluster module.
Briefly: one of the processes (L) will handle data streams from the others (A, B, ...).
Process L will queue messages (for example line by line, or split on some other special separator) and log them one by one into a file, a db or anywhere else.
The advantage of this approach is reducing the load on the worker processes, which can spend more time doing their job.
One more thing: the aim is to simplify usage of this library, so the user only includes this logger without any additional interaction (stream redirection) via the shell.
Do you think this solution makes sense? Maybe you know a library that already does this?

Let's set up some ground level first...
Writing to a terminal screen (console.log() etc.), writing to a file (fs.writeFile(), fs.writeFileSync() etc.) or sending data to a stream (process.stdout.write(data) etc.) will always "block the event loop". Why? Because some part of those functions is always written in JavaScript. The minimum amount of work needed by these functions would be to take the input and hand it over to some native code, but some JS will always be executed.
And since JS is involved, it will inevitably "block" the event loop because JavaScript code is always executed on a single thread no matter what.
Is this a bad thing...?
No. The amount of time required to process some log data and send it over to a file or a stream is quite low and does not have significant impact on performance.
When would this be a bad thing, then...?
You can hurt your application by doing something generally called a "synchronous" I/O operation - that is, writing to a file and actually not executing any other JavaScript code until that write has finished. When you do this, you hand all the data to the underlying native code and while theoretically being able to continue doing other work in JS space, you intentionally decide to wait until the native code responds back to you with the results. And that will "block" your event loop, because these I/O operations can take much much longer than executing regular code (disks/networks tend to be the slowest part of a computer).
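The difference can be seen in a small sketch. The file name and the order array are only for illustration; fs.writeFileSync and fs.appendFile are the standard Node.js calls.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const file = path.join(os.tmpdir(), 'sync-vs-async-demo.log');
const order = [];

// Synchronous: no other JavaScript runs until the data has been written.
fs.writeFileSync(file, 'first\n');
order.push('sync write done');

// Asynchronous: the write is handed to native code and the callback
// runs later, after the current JavaScript has finished.
fs.appendFile(file, 'second\n', (err) => {
  if (err) throw err;
  order.push('async write done');
});
order.push('code after async call');
```

Right after this code runs, order holds 'sync write done' followed by 'code after async call'; 'async write done' is appended only once the event loop gets to the callback.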
Now, let's get back to writing to stdout/stderr.
From Node.js' docs:
process.stdout and process.stderr differ from other Node.js streams in important ways:
They are used internally by console.log() and console.error(), respectively.
They cannot be closed (end() will throw).
They will never emit the 'finish' event.
Writes may be synchronous depending on what the stream is connected to and whether the system is Windows or POSIX:
Files: synchronous on Windows and POSIX
TTYs (Terminals): asynchronous on Windows, synchronous on POSIX
Pipes (and sockets): synchronous on Windows, asynchronous on POSIX
I am assuming we are working with POSIX systems below.
In practice, this means that when your Node.js output streams are not piped and are sent directly to the TTY, writing something to the console will block the event loop until the whole chunk of data is sent to the screen. However, if we pipe the output streams to another process, then when we write something to the console Node.js will not wait for the completion of the operation and will continue executing other JavaScript code while it writes the data to that output stream. (Per the list above, output redirected to a file is still written synchronously.)
In practice, we get to execute more JavaScript in the same time period.
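To find out which case from the docs applies to a running process, one option is to look at what file descriptor 1 is attached to. A minimal sketch, assuming a POSIX system; classifyStdout is a hypothetical helper name, while tty.isatty and fs.fstatSync are standard Node.js calls.

```javascript
const fs = require('fs');
const tty = require('tty');

// Hypothetical helper: reports what a file descriptor (stdout by default)
// is connected to, using the categories from the docs quoted above.
function classifyStdout(fd = 1) {
  if (tty.isatty(fd)) return 'tty';
  const stats = fs.fstatSync(fd);
  if (stats.isFIFO() || stats.isSocket()) return 'pipe';
  if (stats.isFile()) return 'file';
  return 'other';
}

// Usage (not run here): console.error('stdout is a ' + classifyStdout());
```

With this helper, a script started directly in a terminal should report 'tty', one run as node app.js | cat should report 'pipe', and one run as node app.js > out.txt should report 'file'.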
With this information you should be able to answer all your questions yourself now:
You do not need to redirect the stdout/stderr of your Node.js process if you do not write anything to the console, or you can redirect only one of the streams if you do not write anything to the other one. You may redirect them anyway, but if you do not use them you will not gain any performance benefit.
If you configure your logger to write the log data to a stream then it will not block your event loop too much (unless some heavy processing is involved).
If you care this much about your app's performance, do not use Winston or Bunyan for logging - they are extremely slow. Use pino instead - see the benchmarks in their readme.

To answer (1), we can dive into the Express documentation. There you will see a link to the Node.js documentation for Console, which links to the Node.js documentation on process I/O. It describes how process.stdout and process.stderr behave:
process.stdout and process.stderr differ from other Node.js streams in important ways:
They are used internally by console.log() and console.error(), respectively.
They cannot be closed (end() will throw).
They will never emit the 'finish' event.
Writes may be synchronous depending on what the stream is connected to and whether the system is Windows or POSIX:
Files: synchronous on Windows and POSIX
TTYs (Terminals): asynchronous on Windows, synchronous on POSIX
Pipes (and sockets): synchronous on Windows, asynchronous on POSIX
With that we can try to understand what will happen with node app 2>&1 | tee logFile:
Stdout and stderr are piped to the tee process.
tee writes to both the terminal and the file logFile.
The important part here is that stdout and stderr are piped to a process, which means that the writes should be asynchronous.
Regarding (2) it would depend on how you configured Bunyan or Winston:
Winston has the concept of Transports, which essentially allows you to configure where the log will go. If you want asynchronous logs, you should use any transport other than the Console Transport. Using the File Transport should be ok, as it should create a file stream object for this, which is asynchronous and won't block the Node process.
Bunyan has a similar configuration option: Streams. According to their doc, it can accept any stream interface. As long as you avoid using the process.stdout and process.stderr streams here you should be ok.

How do I perform operations like read/write to a heavy file in node.js?

I am quite new to node.js and I want to perform operations (like read, write or store in a DB) on large files (typically 5 GB to 10 GB).
What are the possible ways to do this fast and without affecting the main thread (UI)? Do I need to implement multithreading?
I think that since I/O operations are asynchronous, they will never affect the main thread. I tried to read a large file and write its contents to the response object of HTTP like this:
var http = require('http'),
    fs = require('fs');
fs.readFile('largefile.txt', function(err, data) {
    if (err) {
        throw err;
    }
    http.createServer(function(request, response) {
        response.writeHead(200, {
            "Content-Type": "text/plain"
        });
        response.end(data);
    }).listen(8080);
    console.log("server started");
});
The size of largefile.txt here is only 0.25 GB, and it took almost 5 minutes for this program to run. In practice, I want the size to be (as I mentioned earlier) 5 to 10 GB, and the file type can be .csv or .xls. How should I do that? Please describe the approach with examples (if possible).

Reading from disk to working program memory is very slow. This is a hardware limitation.
If the file is CSV (Comma-separated values separated by newlines), you probably want to read it line by line, or search through for the right line and then read, instead of reading the whole thing into memory and then printing the whole thing out. If you read it line by line at least you're updating something as it's being read.
For a start, you can use fs.read instead of fs.readFile to read the file character by character, looking for a newline character.
But a quick search for "nodejs read file line" shows there are many other ways to approach this with Node.
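One of those other ways, as a minimal sketch: instead of scanning for newlines by hand with fs.read, the built-in readline module can split a read stream into lines. forEachLine is a hypothetical helper name, and the sketch assumes a Node.js version where readline interfaces support for await.

```javascript
const fs = require('fs');
const readline = require('readline');

// Hypothetical helper: calls onLine for each line of the file without
// loading the whole file into memory. Resolves with the line count.
async function forEachLine(file, onLine) {
  const rl = readline.createInterface({
    input: fs.createReadStream(file),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  let count = 0;
  for await (const line of rl) {
    onLine(line, count);
    count += 1;
  }
  return count;
}

// Usage (not run here):
// forEachLine('largefile.csv', (line) => { /* parse one CSV row */ });
```

Only one chunk of the file is held in memory at a time, so the approach should scale to files much larger than the available RAM.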
Edit:
I can't comment yet, but regarding child processes, as jfriend00 and SirDemon said, although NodeJS uses non-blocking IO (reading disk to memory doesn't block code) and it's generally event-oriented/asynchronous in design (execution may swap between sections of code while it's waiting on things) the code is only run single-threaded on a single CPU (code still blocks code). So a child process allows you to make use of another CPU. It was all designed for dynamic servers, so you could have code running and files being read almost all the time, but without the overhead of maintaining a new thread/process for each file read (which servers typically use thread pools for). (I think that's correct?)
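For the HTTP case in the question, a stream can be piped to the response instead of reading the whole file first. This is a rough sketch rather than a tuned implementation; createFileServer is a hypothetical helper name, and the file name and port in the usage comment are taken from the question.

```javascript
const fs = require('fs');
const http = require('http');

// Returns a server that streams `file` to each client in chunks
// instead of buffering the whole file in memory before responding.
function createFileServer(file) {
  return http.createServer((request, response) => {
    const source = fs.createReadStream(file);

    // Send the headers only once the file has been opened successfully.
    source.once('open', () => {
      response.writeHead(200, { 'Content-Type': 'text/plain' });
      source.pipe(response);
    });

    // If the file cannot be opened or read, end the response.
    source.once('error', () => {
      if (!response.headersSent) {
        response.writeHead(500, { 'Content-Type': 'text/plain' });
      }
      response.end();
    });
  });
}

// Usage (not run here): createFileServer('largefile.txt').listen(8080);
```

pipe() pauses reading from disk whenever the client is slower than the disk, so memory use stays bounded regardless of the file size.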

Logging to a non blocking named pipe?

I have a question, and I couldn't find help anywhere on Stack Overflow or the web.
I have a program (celery distributed task queue) and I have multiple instances (workers) each having a logfile (celery_worker1.log, celery_worker2.log).
The important errors are stored to a database, but I like to tail these logs from time to time when running new operations to make sure everything is ok (the loglevel is lower).
My problem: these logs are taking a lot of disk space.
What I would like to do: be able to "watch" the logs (tail -f) only when I need it, without them taking a lot of space.
My ideas until now:
outputting logs to stdout, not to a file: not possible here since I have many workers outputting to different files, but I want to tail them all at once (tail -f celery_worker*.log)
using logrotate: it is an "OK" solution for me. I don't want this to be a daily task, but I would rather not add a per-minute crontab for this. Moreover, the server is not mine, so that would mean some work on the sysadmin side
using named pipes: it looked good at first sight, but I didn't know that named pipes (Linux FIFOs) were blocking. Hence, when I don't tail -f ALL of the pipes at the same time, or when I just quit my tail, the writing operations from the logger are blocked.
Is there a way to have a non-blocking named pipe, which would just throw to stdout when tailed, and throw to /dev/null when not?
Or are there technical difficulties to such a type of pipe? If there are, what are they?
Thank you for your answers!

Have each worker log to stdout, but connect each stdout to a utility that automatically spools and rotates logs based on size or time. multilog and svlogd are examples of such. For those programs, you'd merely tail the "current" log file.
You're right that logrotate is not quite the right solution for the problem you have.
Named pipes won't work as you want. At best, your writers could fill up their pipes and then discard subsequent logs, which is the inverse of the behavior you want.

You could try a shared memory device (see man shm_overview), or perhaps a number of them. You need to organise them as circular buffers so that they store the last N kB of your log, and whenever you read them with a reader it will output everything to your console. This approach is adopted by busybox's syslog/logread suite (see logread.c).
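The circular-buffer part of this idea can be illustrated in-process. This sketch does not use shared memory, so it only shows the "keep the last N bytes" logic; RingBuffer is a hypothetical class name, and reading may cut a multi-byte character at the oldest end.

```javascript
// Fixed-size byte ring: new data overwrites the oldest data once full,
// so the buffer always holds the most recent `capacity` bytes.
class RingBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.buf = Buffer.alloc(capacity);
    this.start = 0; // index of the oldest byte
    this.length = 0; // number of valid bytes
  }

  write(data) {
    for (const byte of Buffer.from(data)) {
      const end = (this.start + this.length) % this.capacity;
      this.buf[end] = byte;
      if (this.length < this.capacity) {
        this.length += 1;
      } else {
        // Buffer was full: the oldest byte was just overwritten.
        this.start = (this.start + 1) % this.capacity;
      }
    }
  }

  // Returns the stored bytes, oldest first, as a string.
  read() {
    const out = Buffer.alloc(this.length);
    for (let i = 0; i < this.length; i += 1) {
      out[i] = this.buf[(this.start + i) % this.capacity];
    }
    return out.toString();
  }
}

const log = new RingBuffer(8);
log.write('abcdef');
log.write('ghij');
// log.read() now returns 'cdefghij' (the last 8 bytes written)
```

A logger would call write() for every log line and a reader would call read() only when someone wants to look at the recent output, so nothing accumulates on disk.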

Nonblocking/asynchronous fifo/named pipe in shell/filesystem?

Is there a way to create a non-blocking/asynchronous named pipe or something similar in the shell? Programs could place lines in it, those lines would stay in RAM, and some program could read some lines from the pipe while leaving what it did not read in the FIFO. It is also very probable that programs will be writing to and reading from this FIFO at the same time. At first I thought maybe this could be done using files, but after searching the web for a bit it seems nothing good can come from a file being read and written at the same time. Named pipes would almost work, but there are two problems. First, they block reads/writes if there is no one at the other end. Second, even if I let the writers block and set two processes to write to the pipe while no one is reading, each trying to write one line, and then try head -n 1 <fifo>, I get just one line as I need, but both writing processes terminate and the second line is lost. Any suggestions?
Edit: maybe some intermediate program could be used to help with this, acting like mediator between writers and readers?

You can use a special program for this purpose: buffer. Buffer is designed to try to keep the writer side continuously busy so that it can stream when writing to tape drives, but you can use it for other purposes. Internally, buffer is a pair of processes communicating via a large circular queue held in shared memory, so your processes will work asynchronously. Buffer's reader process blocks when the queue is full, and its writer process blocks when the queue is empty. Example:
bzcat archive.bz2 | buffer -m 16000000 -b 100000 | processing_script | bzip2 > archive_processed.bz2
http://linux.die.net/man/1/buffer
