Transparent fork-server on linux - linux

Suppose I have application A that takes some time to load (opens a couple of libraries). A processes stdin into some stdout.
I want to serve A on a network over a socket (instead of stdin and stdout).
The simplest way of doing that efficiently that I can think of is by hacking at the code and adding a forking server loop, replacing stdin and stdout with socket input and output.
The performance improvement compared to having an independent server application that spawns A (fork+exec) on each connection comes at a cost however. The latter is much easier to write and I don't need to have access to the source code of A or know the language it's written in.
I want my cake and eat it too. Is there a mechanism that would extract that forking loop?
What I want is something like fast_spawnp("A", "/tmp/A.pid", stdin_fd, stdout_fd, stderr_fd) (start process A unless it's already running, clone A from outside and make sure the standard streams of the child point to the argument-supplied file descriptors).

Related

Node.js pipe console error to another program (make it async)

From Expressjs documentation:
To keep your app purely asynchronous, you’d still want to pipe
console.err() to another program
Qestions:
Is it enough to run my node app with stdout and stderr redirect to not block event loop? Like this: node app 2>&1 | tee logFile ?
If ad.1 answer is true, then how to achieve non-blocking logging while using Winston or Bunyan? They have some built in mechanism to achieve this or they just save data to specific file wasting cpu time of current Node.js process? Or maybe to achieve trully async logging they should pipe data to child process that performs "save to file" (is it still performance positive?) ? Can anyone explain or correct me if my way of thinking is just wrong?
Edited part: I can assume that piping data from processes A, B, ...etc to process L is cheaper for this specific processes (A, B, ...) than writing it to file (or sending over network).
To the point:
I am designing logger for application that uses nodejs cluster.
Briefly - one of processes (L) will handle data streams from others, (A, B, ...).
Process L will queue messages (for example line by line or some other special separator) and log it one by one into file, db or anywhere else.
Advantage of this approach is reducing load of processes that can spent more time on doing their job.
One more thing - assumption is to simplify usage of this library so user will only include this logger without any additional interaction (stream redirection) via shell.
Do you think this solution makes sense? Maybe you know a library that already doing this?
Let's set up some ground level first...
Writing to a terminal screen (console.log() etc.), writing to a file (fs.writeFile(), fs.writeFileSync() etc.) or sending data to a stream process.stdout.write(data) etc.) will always "block the event loop". Why? Because some part of those functions is always written in JavaScript. The minimum amount of work needed by these functions would be to take the input and hand it over to some native code, but some JS will always be executed.
And since JS is involved, it will inevitably "block" the event loop because JavaScript code is always executed on a single thread no matter what.
Is this a bad thing...?
No. The amount of time required to process some log data and send it over to a file or a stream is quite low and does not have significant impact on performance.
When would this be a bad thing, then...?
You can hurt your application by doing something generally called a "synchronous" I/O operation - that is, writing to a file and actually not executing any other JavaScript code until that write has finished. When you do this, you hand all the data to the underlying native code and while theoretically being able to continue doing other work in JS space, you intentionally decide to wait until the native code responds back to you with the results. And that will "block" your event loop, because these I/O operations can take much much longer than executing regular code (disks/networks tend to be the slowest part of a computer).
Now, let's get back to writing to stdout/stderr.
From Node.js' docs:
process.stdout and process.stderr differ from other Node.js streams in important ways:
They are used internally by console.log() and console.error(), respectively.
They cannot be closed (end() will throw).
They will never emit the 'finish' event.
Writes may be synchronous depending on what the stream is connected to and whether the system is Windows or POSIX:
Files: synchronous on Windows and POSIX
TTYs (Terminals): asynchronous on Windows, synchronous on POSIX
Pipes (and sockets): synchronous on Windows, asynchronous on POSIX
I am assuming we are working with POSIX systems below.
In practice, this means that when your Node.js' output streams are not piped and are sent directly to the TTY, writing something to the console will block the event loop until the whole chunk of data is sent to the screen. However, if we redirect the output streams to something else (a process, a file etc.) now when we write something to the console Node.js will not wait for the completion of the operation and continue executing other JavaScript code while it writes the data to that output stream.
In practice, we get to execute more JavaScript in the same time period.
With this information you should be able to answer all your questions yourself now:
You do not need to redirect the stdout/stderr of your Node.js process if you do not write anything to the console, or you can redirect only one of the streams if you do not write anything to the other one. You may redirect them anyway, but if you do not use them you will not gain any performance benefit.
If you configure your logger to write the log data to a stream then it will not block your event loop too much (unless some heavy processing is involved).
If you care this much about your app's performance, do not use Winston or Bunyan for logging - they are extremely slow. Use pino instead - see the benchmarks in their readme.
To answer (1) we can dive into the Express documentation, you will see a link to the Node.js documentation for Console, which links to the Node documentation on the process I/O. There it describes how process.stdout and process.stderr behaves:
process.stdout and process.stderr differ from other Node.js streams in important ways:
They are used internally by console.log() and console.error(), respectively.
They cannot be closed (end() will throw).
They will never emit the 'finish' event.
Writes may be synchronous depending on what the stream is connected to and whether the system is Windows or POSIX:
Files: synchronous on Windows and POSIX
TTYs (Terminals): asynchronous on Windows, synchronous on POSIX
Pipes (and sockets): synchronous on Windows, asynchronous on POSIX
With that we can try to understand what will happen with node app 2>&1 | tee logFile:
Stdout and stderr is piped to a process tee
tee writes to both the terminal and the file logFile.
The important part here is that stdout and stderr is piped to a process, which means that it should be asynchronous.
Regarding (2) it would depend on how you configured Bunyan or Winston:
Winston has the concept of Transports, which essentially allows you to configure where the log will go. If you want asynchronous logs, you should use any logger other than the Console Transport. Using the File Transport should be ok, as it should create a file stream object for this and that is asynchronous, and won't block the Node process.
Bunyan has a similar configuration option: Streams. According to their doc, it can accept any stream interface. As long as you avoid using the process.stdout and process.stderr streams here you should be ok.

Is there a way to make a bash script process messages that have been sent to it using the write command

Is there a way to make a bash script process messages that have been sent to it using the "write" command? So for example, if a user wants to activate a feature in my script, could I make it so that they can send the script a command using the write command?
One possible method I thought of was to configure logging for a screen session and then have the bash script parse text through there, but I'm not sure if there would be a simpler or more efficient way to tackle this
EDIT: I was thinking as an alternative solution I could use a named pipe. I'm worried that it would break though if the tmp partition gets filled up completely (not sure if this would impact write as well?). I'm going to be running this script on a shared box, and every once in a while someone will completely fill up the /tmp partition and then just leave it like that until people start complaining
Hmm, you are trying to really circumvent a poor unix command to ask it something it was not specified for. From the man page (emphasize mine):
The write utility allows you to communicate with other users, by copying
lines from your terminal to theirs
That means that write is intended to copy line directly on terminals. As soon as you say, I will dump terminal output with screen, and then parse the dump file, you loose the simplicity of write (and also need disk space, with the problem of removing old lines from a sequencial file)
Worse, as your script lives on its own, it could (should?) be a daemon script attached to no terminal
So if I have correctly understood your question, your requirements are:
a script that does some tasks and should be able to respond to asynchronous requests - common usages are named pipes or network or unix domain sockets, less common are files in a dedicated folder with a optional signal to have immediate processing, adding lines to a sequential file while being possible is uncommon, because of a synchonization of access problem
a simple and convivial way for users to pass requests. Ok write is nice for that part, but much too hard to interface IMHO
If you do not want to waste time on that part by using standard tools, I would recommend the mail system. It is trivial to alias a mail address to a program that will be called with the mail message as input. But I am not sure it is worth it, because the user could directly call the program with the request as input or command line parameter.
So the client part could be simply a program that:
create a temporary file in a dedicated folder (mkstemp is your friend in C or C++, or mktemp in shell - but beware of race conditions)
write the request to that file
optionaly send a signal to a pid - provided the script write its own PID on startup to a dedicated file

Can I override a system function before calling fork?

I'd like to be able to intercept filenames with a certain prefix from any of my child processes that I launch. This would be names like "pipe://pipe_name". I think wrapping the open() system call would be a good way to do this for my application, but I'd like to do it without having to compile a separate shared library and hooking it with the LD_PRELOAD trick (or using FUSE and having to have a mounted directory)
I'll be forking the processes myself, is there a way to redirect open() to my own function before forking and have it persist in the child after an exec()?
Edit: The thought behind this is that I want to implement multi-reader pipes by having an intermediate process tee() the data from one pipe into all the others. I'd like this to be transparent to my child processes, so that they can take a filename and open() it, and, if it's a pipe, I'll return the file descriptor for it, while if it's a normal file, I'll just pass that to the regular open() function. Any alternative way to do this that makes it transparent to the child processes would interesting to hear. I'd like to not have to compile a separate library that has to be pre-linked though.
I believe the answer here is no, it's not possible. To my knowledge, there's only three ways to achieve this:
LD_PRELOAD trick, compile a .so that's pre-loaded to override the system call
Implement a FUSE filesystem and pass a path into it to the client program, intercept calls.
Use PTRACE to intercept system calls, and fix them up as needed.
2 and 3 will be very slow as they'll have to intercept every I/O call and swap back to user space to handle it. Probably 500% slower than normal. 1 requires building and maintaining an external shared library to link in.
For my application, where I'm just wanting to be able to pass in a path to a pipe, I realized I can open both ends in my process (via pipe()), then go grab the path to the read end in /proc//fd and pass that path to the client program, which gives me everything I need I think.

Communication between processes

What's the simplest way to send a simple string to a process (I know the pid of the process and I don't need to do any check of any sort) in python 3-4 ?
Should I just go for a communication via socket ?
If you want only two processes communicate, you can use a pipe (it can be named using mkfifo or "anonymous" using pipe & dup syscalls).
If you want one server and some clients, then you can use sockets. TCP socket is usable over network, but on unix/linux exists also so-called "unix socket", that look alike named pipe.
Next way to communicate between applicaitons are realtime signals and/or shared memory.
More information about unix socket: http://beej.us/guide/bgipc/output/html/multipage/unixsock.html
About pipes, google will tell you.
But, to correctly answer your question (which do not have only one "right answer"), I think that simplest is using named pipe, because it is used as writing to file on disk.

What's the right way to write to file from Node.js to avoid bottlenecking?

I'm curious what the correct methodology is to write to a log file from a process that might be called dozens (or maybe even thousands) of times simultaneously.
I have a node process which is called via http and I wish to log from it, but I don't want it to bottleneck as it attempts to open/write/close the same file from all the various simultaneous requests.
I've read that stderr might be the answer to this problem, but am curious what makes that approach any less bottlenecky. At the end of the day, if stderr is going to some central location, isn't it going to have the exact same problem?
Best practice for node (e.g. http://12factor.net/) is to write to stdout or stderr. The expectation is that the OS will handle the file management / throughput that you want, or else you can have a custom-written log collector that can do it the way you want and redirect stdout or stderr to it.

Resources