I am developing fault tolerance mechanisms for a distributed application in Rust. I need to simulate the failure of one node (and eventually more). The kind of failure to simulate is a node crash. I want the application to exit completely with an error in a controlled manner. I want to choose which node fails and when it does (as much as possible).
The different nodes of the application communicate with each other peer-to-peer. Each node executes two threads, and it would be best if both were terminated.
In my testing environment I have each node running on a thread (and this thread creates a second one) on my laptop, with a network port assigned to each.
A preliminary idea would be to randomly exit a thread with a given probability. This idea does not give me the control I need to exit only one node, at the exact moment of the application where I want to test my fault tolerance mechanisms. Also, this would leave the second thread of a node executing (as far as I know).
I am looking for a way to simulate the node crash in a way I can control and reproduce the same crash whenever I need.
Googling for it results in many “how to persist data in a node app” hits, but I’m looking for a way to store the program counter, memory status, event loop, call stack etc. in persistent storage, and resume it later.
Benefits: if you see the runtime (a server, container, serverless function) is about to terminate, instead of using business logic to pause and resume (custom work), use the same way operating systems handle multiple processes / threads. Store everything, then resume it later from a different infrastructure (but with identical specs).
I’m sure there is something like this, but simply can’t find the right search term probably.
P.S. This might be an OS feature that I’m looking for and not Node specific, but if this can be done from within Node’s API (e.g. V8 internals) I can basically get an unlimited / long running lambda ;) (which is a bad idea but I want to know if it’s possible).
(V8 developer here.)
V8 definitely doesn't support this.
What V8 does support is taking a heap snapshot, and deserializing that on renewed process startup (and I believe Node is making use of this functionality). That's quite different from freezing an entire running process though.
I'm not sure what you mean by "the same way operating systems handle multiple processes / threads". Operating systems don't usually let you snapshot a process and transfer it to a different machine.
On the same machine, you could literally just let the OS do it: pause the process (e.g. press Ctrl+Z if you started it at a Linux command line, or use equivalent Task Manager functionality if your OS provides it, or similar), and resume it later. If the process itself doesn't fire any repeated tasks/timers, then that's almost equivalent to simply doing nothing: a process that executes no work won't get scheduled by the kernel anyway; a server that isn't serving any requests can just sit around waiting.
If you actually need to transfer a running process to another machine, your best bet may be a VM which you can snapshot, transfer, resume.
I have several applications working with Node on the back-end and React on the front-end, and it works great. I make axios GET and POST requests from React to Express to pass data back and forth, and in production I use pm2 to get everything up and running.
My question is: when two users access the same application at the same time, how does Node treat this, as two separate instances or just one?
I am considering using socket.io to be able to notify the front-end on changes that are happening on Node, and I wonder if those notifications will be emitted from the back-end no matter what another user might be doing or not.
Thanks.
As you have probably heard, Node.js is often described as a "single-threaded" runtime. This is only partially true. Your JavaScript runs on a single thread, but Node hands some work (file system access, DNS lookups, crypto, compression) to a libuv thread pool, which has 4 threads by default, while network I/O is handled asynchronously by the operating system.
If you want to know more about this, you might want to look into the Node event loop, which describes the steps Node goes through on each "tick".
So Node can have several operations in flight at once even though your JavaScript runs on one thread, and both users are served by the same process. But there is more: to solve the performance issues that might occur in big applications, you can run Node in cluster mode. This starts multiple Node processes, each with its own event loop and thread pool, so you can handle high demand efficiently.
One note on your socket.io question: as you see, under high demand tasks get queued until they are handled in the Node event loop, so sometimes you need to wait. Fortunately we are in a race of big tech to create the fastest JS runtime, so this thing is pretty fast.
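To make the "one instance or two" point concrete, here is a minimal sketch, assuming a single Node process (one pm2 instance, not cluster mode). The names handleRequest, alice and bob are hypothetical, and a plain EventEmitter stands in for socket.io purely for illustration. It shows that module-level state is created once and is shared by every request, and that a notification reaches every subscriber no matter which user triggered it.

```javascript
const { EventEmitter } = require('node:events');

// Module-level state: created once per Node process, shared by every request.
let requestCount = 0;
const notifications = new EventEmitter();

// Hypothetical request handler; in Express this would be the body of a route.
function handleRequest(user) {
  requestCount += 1;
  // Every subscriber is notified, whichever user triggered the change.
  notifications.emit('change', { by: user, total: requestCount });
  return requestCount;
}

// Two hypothetical connected clients subscribing to changes.
const seenByAlice = [];
const seenByBob = [];
notifications.on('change', (event) => seenByAlice.push(event.total));
notifications.on('change', (event) => seenByBob.push(event.total));

// Two users hit the same process one after the other.
const first = handleRequest('alice');
const second = handleRequest('bob');

console.log(first, second, seenByAlice, seenByBob);
```

If the app runs in pm2 cluster mode instead, each worker process gets its own copy of this state, so anything that must be shared between workers has to live outside the process.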
In Node.js cluster mode, if multiple jobs exist in the event loop for one process and the current job crashes the process, what happens to the remaining jobs?
I'm assuming the remaining jobs in the event loop would go unfulfilled or return a server error. My question is, why is this an acceptable risk? Why would someone opt to use Node.js cluster mode in production then, rather than use something like PHP in production, where there is no risk of this, because PHP handles each request in its own process.
Edit:
Obviously this doesn't just apply to Node.js cluster mode. It can happen on a single instance, in which case obviously the end user would just get a server error. Cluster mode just happens to be my personal use case.
I'm looking for a way to pick a job in the queue back up should a previous job cause the process to exit before the subsequent job gets a chance to be fulfilled. I am currently reading about how you can use a tool like RabbitMQ to handle your job queue outside of the node.js cluster, with each cluster instance just pulling jobs from the RabbitMQ queue. If anyone has any input on that, that would also be greatly appreciated.
If multiple jobs exist in the event loop for one process, what happens to the remaining jobs if the current job crashes the process?
If a node.js process crashes, the same thing happens to it that happens to any other process. All open sockets get automatically disconnected and the client will receive an immediate close on their socket (socket connection dropped essentially).
If you were using a Java server that was in the middle of handling 10 requests (perhaps in threads) and it crashed, the consequences would be the same. All 10 socket connections would get dropped.
If process isolation from one request to another is your #1 criterion for selecting a server environment, then I guess you wouldn't pick any environment that ever serves multiple requests from the same process. But you would give up a lot to get that. One of the reasons for the node.js design is that it scales really, really well for a high number of concurrent connections that are all doing mostly I/O things (disk, networking, database stuff, etc...), which happens to describe most web servers. A design that fires up a new process for every incoming connection, on the other hand, does not scale as well for a large number of concurrent connections, because a process is a much more heavy-weight thing in the eyes of the operating system (memory usage, other system resource usage, task switching overhead, etc...) than the way node.js does things.
And, there are obviously hundreds of other considerations too when choosing a server environment. So, you kind of have to look at the whole picture of what you're designing for and make the best set of tradeoffs.
In general, I wouldn't put this issue anywhere on the radar for why you should choose one over the other unless you expect to be running risky code (perhaps out of your control) that crashes a lot and this issue is therefore more important in your deployment than all the other differences. And, if that was the case, I'd probably isolate the risky code to its own process (even when using nodejs) to alleviate any pain from that crash. You could have a process pool waiting to process risky things. For example, if you were running code submitted by a user, I might run that code in its own isolated VM.
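As a rough illustration of isolating risky code in its own process, here is a minimal sketch using Node's child_process module. The helper name runIsolated and the two code snippets are hypothetical, and a real deployment would more likely keep a pool of long-lived workers than spawn a process per task. The point is only that a crash in the child shows up as a non-zero exit status in the parent, which keeps running.

```javascript
const { spawnSync } = require('node:child_process');

// Run a snippet of (hypothetical) risky code in a separate Node process.
// A crash in the child cannot take the parent process down with it.
function runIsolated(source) {
  const result = spawnSync(process.execPath, ['-e', source], {
    encoding: 'utf8',
    timeout: 10000, // give up on a child that hangs
  });
  return {
    ok: result.status === 0,
    status: result.status,
    stdout: (result.stdout || '').trim(),
  };
}

// One well-behaved job and one that throws an uncaught exception.
const good = runIsolated('console.log(6 * 7)');
const bad = runIsolated('throw new Error("boom")');

console.log(good, bad.ok);
```

spawnSync blocks the event loop while the child runs, so in a server you would use the asynchronous spawn or fork instead; the synchronous form just keeps the sketch short.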
If you're just worried about your own code crashing a lot, then you probably have bigger problems and need more extensive unit testing, more robust error handling, and other tools such as a linter and other code analysis tools to find potential problem areas. With proper design, implementation and error handling, you should be able to keep a single incoming request from harming anything other than itself. That's certainly the philosophy that every server environment that serves multiple requests from the same process advises and that the people/companies deploying those servers follow.
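A minimal sketch of what "keep a single incoming request from harming anything other than itself" can look like, assuming synchronous handlers for brevity. The wrapper name contain and the request shape are hypothetical, not part of any framework. Each handler call is wrapped so that a thrown error becomes an error response for that one request while the others complete normally.

```javascript
// Wrap a handler so that a thrown error becomes a 500 response
// for that request only, instead of crashing the process.
function contain(handler) {
  return (request) => {
    try {
      return { status: 200, body: handler(request) };
    } catch (err) {
      // In a real server you would also log err here.
      return { status: 500, body: 'Internal Server Error' };
    }
  };
}

// Hypothetical handler with a bug on one route.
const handler = contain((request) => {
  if (request.path === '/bad') {
    throw new Error('bug in this route');
  }
  return `ok: ${request.path}`;
});

// Three requests; only the middle one hits the bug.
const responses = ['/a', '/bad', '/b'].map((path) => handler({ path }));
const statuses = responses.map((response) => response.status);

console.log(statuses);
```

Asynchronous handlers need the same treatment for rejected promises (a try/catch around await, or a .catch on the returned promise).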
We use clustering with our express apps on multi cpu boxes. Works well, we get the maximum use out of AWS linux servers.
We inherited an app we are fixing up. It's unusual in that it has two processes. It has an Express API portion, to take incoming requests. But the process that acts on those requests can run for several minutes, so it was built as a separate background process, node calling python and maya.
Originally the two were tightly coupled, with the python script called by the request to upload the data. But this of course was suboptimal, as it would leave the client waiting for a response for the time it took to run, so it was rewritten as a background process that runs in a loop, checking for new uploads, and processing them sequentially.
So my question is this: if we have this separate node process running in the background, and we run clusters which start up a process for each CPU, how is that going to work? Are we not going to get two node processes competing for the same CPU? We were getting a bit of weird behaviour and crashing yesterday, without a lot of error messages (god I love node), so it's a bit concerning. I'm assuming Linux will just swap the processes in and out as they are being used. But I wonder if it will be problematic, and I also wonder about someone getting their web session swapped out for several minutes while the longer running process runs.
The smart thing to do would be to rewrite this to run on two different servers, but the files that maya uses/creates are on the server's file system, and we were not given the budget to rebuild the way we should. So, we're stuck with this architecture for now.
Any thoughts on possible problems and how to avoid them would be appreciated.
From an overall architecture perspective, spawning 1 nodejs process per core is a great way to go. You have a lot of interdependencies though: the nodejs processes are calling maya, which may use multiple threads (keep that in mind).
The part that is concerning to me is your random crashes and your "process that runs in a loop". If that process is just checking the file system you probably have a race condition where the nodejs processes are competing to work on the same input/output files.
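If a race like that is the cause, one common way to avoid it is to have each worker claim a file atomically before touching it. The sketch below is only an illustration under assumptions: the helper name tryClaim, the file name scene-001.dat and the worker ids are all hypothetical, and it relies on rename being atomic within a single POSIX filesystem, so only one claimant can succeed.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Try to claim an upload by renaming it to a worker-specific name.
// Whoever loses the race finds the original name already gone.
function tryClaim(uploadPath, workerId) {
  const claimedPath = `${uploadPath}.claimed-by-${workerId}`;
  try {
    fs.renameSync(uploadPath, claimedPath);
    return claimedPath;
  } catch (err) {
    if (err.code === 'ENOENT') {
      return null; // another worker already took this file
    }
    throw err;
  }
}

// Demo in a throwaway directory with one hypothetical upload.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'uploads-'));
const upload = path.join(dir, 'scene-001.dat');
fs.writeFileSync(upload, 'hypothetical upload');

const firstClaim = tryClaim(upload, 'worker-1');
const secondClaim = tryClaim(upload, 'worker-2');

fs.rmSync(dir, { recursive: true, force: true });
console.log(path.basename(firstClaim), secondClaim);
```

Only the worker that gets a path back goes on to process the file; the other simply moves to the next upload.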
In theory, 1 nodejs process per core will work great and should help you use all of your CPU capacity. Linux always swaps the processes in and out, so that is not an issue. You could start multiple nodejs processes per core and still not have an issue.
One last note: be sure to keep an eye on your memory usage. Several Linux distributions on EC2 do not have a swap file enabled by default, and running out of memory can be another silent app killer, so it is best to add a swap file in case you run into memory issues.
I have an app in NodeJS.
Recently we have been getting a lot more traffic (this is a new experience for me) and so I have been running into the "EMFILE: too many open files" error, which is raised when a single process tries to open more files than the operating system's per-process limit allows.
I have increased this limit, so we are good for now. However I'm not sure how long this solution will last...
I am wondering: What are other commonly used options for scaling a Node Application that is getting increasing amounts of traffic? (specifically with a mind to the open files limit problem.)
The PM2 process manager, which allows clustering, catches my eye (am I correct in understanding that every instance of the application requires its own core, i.e. you can't run 4 instances on a single core?). Are there any other techniques that are regularly used?
Thanks (in advance)
PM2 is a simple solution when you want to run more than one instance of Node. Another common alternative is the built-in cluster module: http://nodejs.org/api/cluster.html. With the cluster module (which PM2's cluster mode uses) the workers share one listening port, so no extra proxy is required for that. If you instead run independent Node processes on separate ports, you will need to configure another HTTP server such as Nginx to reverse proxy your user requests to your Node processes.
You can run any number of Node processes, regardless of the number of cores. But since each Node process runs its JavaScript on a single thread, and each core can execute a single thread at a time, the optimal configuration is when the number of Node processes matches the number of cores. If the number of Node processes is greater than the number of cores, under load you will experience reduced performance due to the extra context switches your processor will have to perform.
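As a sketch of the "one process per core" setup with the built-in cluster module: the helper names pickWorkerCount and startCluster are hypothetical, and capping the worker count at the number of cores is just one reasonable policy, not a requirement. startCluster is defined but deliberately not called here; you would call it from your entry point.

```javascript
const cluster = require('node:cluster');
const http = require('node:http');
const os = require('node:os');

// One worker per core by default; never more than the cores, never fewer than one.
function pickWorkerCount(requested, cores = os.cpus().length) {
  const wanted = requested ?? cores;
  return Math.max(1, Math.min(wanted, cores));
}

// Not invoked here; call it from the application's entry point.
function startCluster(port, requestedWorkers) {
  if (cluster.isPrimary) {
    const count = pickWorkerCount(requestedWorkers);
    for (let i = 0; i < count; i += 1) {
      cluster.fork();
    }
    // Replace a worker that dies so capacity stays constant.
    cluster.on('exit', () => cluster.fork());
  } else {
    // All workers listen on the same port; connections are spread across them.
    http
      .createServer((req, res) => res.end(`handled by ${process.pid}\n`))
      .listen(port);
  }
}

// Example: asking for 8 workers on a 4-core machine yields 4.
console.log(pickWorkerCount(8, 4), pickWorkerCount(2, 4));
```

PM2 can do the equivalent from its configuration, so writing this by hand is mostly useful when you want control over how workers are restarted.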