Node-forever restartall and child processes - node.js

We have a main node process that spawns a number of child processes with child_process.fork(), each of which spawns a helper child process of its own. We are using node-forever to manage the lifetime of the main node process, and we often use forever restartall to restart it.
One problem we see occasionally is that the grandchild processes fail to terminate, and we end up with duplicated child processes running. That is, what should be this:
Main Process
  Child Process 1
    Grandchild Process 1
  Child Process 2
    Grandchild Process 2
Ends up like this after restartall:
Main Process
  Child Process 1
    Grandchild Process 1
  Child Process 2
    Grandchild Process 2
Grandchild Process 1
Grandchild Process 2
Unsurprisingly this causes lots of weird problems and we usually have to restart the whole server (or kill processes manually, if we can establish which are the old ones).
As I understand it, forever sends SIGTERM to the process when it does a restartall. I believe this signal should cascade down to the child and grandchild processes (but please correct me if I've made a false assumption there). Since this problem only occurs maybe once in 100 restarts, perhaps it's something timing-related?
What circumstances could cause the grandchild processes to fail to terminate? How can we mitigate this?
OS is Debian Squeeze.
EDIT: My initial description was a bit oversimplified. I've updated it to include all the details.
EDIT 2: We don't use forever anymore. I recommend PM2.

Related

Does Node kill spawned child processes automatically?

In the documentation for Node's Child Processes, there is this sentence in the section on child_process.spawn():
"On Windows, setting options.detached to true makes it possible for the child process to continue running after the parent exits."
That makes it sound like (at least on Windows) when you leave options.detached at the default value of false, spawn()'d processes will automatically be killed. That's actually the behavior I want in my application. In fact, I was calling myChildProcess.kill( "SIGINT" ) in my code, but when I commented it out, the child processes still went away when my app quit. So that's great, but:
(1) My understanding is that it's necessary to do some tricky stuff with "job objects" as discussed here in order to make this work on Windows. Do you know if Node is doing something tricky like that to make child processes go away? Or perhaps it's more simple than that and Node just keeps a list of the spawned process IDs and kills any of them that are still around when shutting down? Which leads to the closely related question...
(2) If Node is indeed doing something special to kill child processes, do you know if there are cases (e.g., some kind of app crash) that would defeat what it's doing and leave the child processes running?
UPDATE: To clarify, the child processes I'm launching in my case are Python web server processes, not other Node processes. I don't know if there's a difference in behavior between a Node child process and some other child process for the purpose of this question.
A Node instance will quit once there is nothing left in the event queue (and no async code pending), so as long as you aren't leaving anything open, a Node process will naturally quit when it's done.
In terms of the process hanging on a crash: unless you are explicitly handling uncaught exceptions, the process will exit immediately.
If you want a child process to be long-running and to survive the termination of the node process itself, then, as you know, you set options.detached = true.
This business of stopping a child process when a parent process stops is operating-system behavior. A parent process (running any programming language system, not just node) owns a non-detached child process. The OS cleans up child processes upon the termination of their parent.
Detaching a process tells the OS to make it no longer a child process, so the OS won't clean it up automatically.
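As a rough illustration of what "detached" means at the OS level on POSIX systems: Node's documentation states that on non-Windows platforms, detached: true makes the child the leader of a new process group and session. The minimal sketch below uses Python's subprocess module only as a stand-in (start_new_session=True makes the child call setsid() before exec); it is not Node's implementation. It shows that an ordinary child shares its parent's process group, while a detached one gets a group of its own, so group-wide signals aimed at the parent's group (such as Ctrl-C from a terminal) no longer reach it.

```python
import os
import subprocess
import sys

# A child that just sleeps, so we can inspect it while it runs.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]

def spawn(detached: bool) -> subprocess.Popen:
    # start_new_session=True makes the child call setsid() before exec, so it
    # leads a new session and a new process group (roughly what Node's
    # detached: true does on POSIX).
    return subprocess.Popen(SLEEPER, start_new_session=detached)

attached_child = spawn(detached=False)
detached_child = spawn(detached=True)

parent_pgid = os.getpgrp()
attached_pgid = os.getpgid(attached_child.pid)   # inherited from the parent
detached_pgid = os.getpgid(detached_child.pid)   # its own PID: a new group
print("parent:", parent_pgid, "attached:", attached_pgid, "detached:", detached_pgid)

# Being in another process group does not stop the parent from signalling
# the child by PID, so clean both up explicitly.
for proc in (attached_child, detached_child):
    proc.terminate()
    proc.wait()
```

Detaching changes which group and session the child belongs to; it does not by itself decide whether anything kills the child later.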
A good practice for node child processes: whenever possible, have them do their assigned task and then exit. In other words, in most cases you should not need to rely on this child / detached behavior.

How can I kill all processes in a process group when there is any signal or interrupt to any of the processes in the group?

I have two processes, A and B (the design should scale to more). A and B can have sub-processes (including processes replaced via execl).
If a signal is delivered to any of the sub-processes, the program should terminate gracefully, terminating all of the sub-processes of A and B, including A and B themselves.
The challenge is that there must not be any defunct processes, and since some processes may be execl'd, their address space gets replaced (so any signal handlers installed before the exec are reset to their defaults).
I first created a child process and made its PID the process group ID (pgid), while the parent just waits. I then created processes A and B inside this child process, so that I could terminate all of the processes through the process group.
But as I mentioned, when I run kill on the PID of an execl'd process, it becomes a defunct process while all the other processes keep running.
Also, my requirements are such that I can't use multithreading (at least for the intended work).
Please suggest the solution.
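One common approach, sketched here in Python (os.execv standing in for execl()) under the assumption that a small supervisor process may stay outside the group: put every worker in a single process group, wait for the first worker to exit for any reason (including being killed by a signal), then signal the whole group with killpg() and reap every member with waitpid(). Reaping is what prevents defunct entries, because a zombie remains until the process that forked it calls wait(). The names start_worker() and supervise() are hypothetical, and the sketch assumes the workers are the supervisor's only children.

```python
import os
import signal
import sys

def start_worker(argv, pgid):
    """Fork a worker, place it in process group `pgid` (0 means: start a new
    group led by this worker), then exec argv, replacing its address space
    the way execl() does. Returns the worker's PID."""
    pid = os.fork()
    if pid == 0:
        try:
            os.setpgid(0, pgid)
            os.execv(argv[0], argv)
        finally:
            os._exit(127)                  # only reached if setpgid/exec failed
    try:
        # Also set the group from the parent, so it is in place whichever
        # process runs first (the usual setpgid race fix).
        os.setpgid(pid, pgid or pid)
    except (PermissionError, ProcessLookupError):
        pass                               # child already exec'd or exited
    return pid

def supervise(commands):
    """Run every command in one process group. When the first worker exits
    (normally or because a signal killed it), SIGTERM the rest of the group
    and reap every worker, so no defunct entries remain. Assumes the workers
    are this process's only children, since os.wait() collects any child."""
    leader = start_worker(commands[0], 0)
    remaining = {leader}
    for argv in commands[1:]:
        remaining.add(start_worker(argv, leader))

    first, status = os.wait()              # blocks until any worker ends
    statuses = {first: status}
    remaining.discard(first)

    try:
        os.killpg(leader, signal.SIGTERM)  # every member still in the group
    except ProcessLookupError:
        pass                               # group already empty
    for pid in remaining:
        _, statuses[pid] = os.waitpid(pid, 0)  # reap, so no zombie is left
    return first, statuses

PY = sys.executable
first, statuses = supervise([
    [PY, "-c", "import time; time.sleep(30)"],  # long-running worker (leader)
    [PY, "-c", "import sys; sys.exit(3)"],      # worker that ends early
])
for pid, status in statuses.items():
    print(pid, "first" if pid == first else "stopped", os.waitstatus_to_exitcode(status))
```

Because the supervisor is not a member of the group, killpg() stops the workers without stopping the process that still has to reap them.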

What would cause a SIGTERM to not propagate to child processes?

I have a process on Linux that starts up 20 child processes via fork. When I kill the parent process, it often kills all of the child processes, but sometimes it doesn't kill all of them, and I'm left with some orphaned processes. This isn't a race condition on startup; it happens after the processes have been active for several minutes.
What sort of things could cause SIGTERM to not propagate to some child processes properly?
There is no automatic propagation of signals (SIGTERM or otherwise) to children in the process tree.
Inasmuch as killing a parent process can be observed to cause some children to exit, this is due to ancillary effects, such as a child receiving SIGPIPE when it writes to a pipe whose other end was held by the dead parent (or reading EOF from such a pipe and deciding to quit).
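A small Python sketch (Unix, using os.fork) that reproduces the point above: the "parent" is sent SIGTERM and dies, but its child, which never received any signal, keeps running as an orphan. The function names are hypothetical, and the pipe exists only so the top-level process learns the child's PID.

```python
import os
import signal
import time

def fork_parent_and_child():
    """Fork a 'parent' that forks one 'child'; neither forwards signals.
    Returns (parent_pid, child_pid)."""
    r, w = os.pipe()
    parent = os.fork()
    if parent == 0:
        os.close(r)
        signal.signal(signal.SIGTERM, signal.SIG_DFL)  # default: just die
        child = os.fork()
        if child == 0:
            time.sleep(30)                 # child: stay busy for a while
            os._exit(0)
        os.write(w, str(child).encode())  # tell the top level the child's PID
        time.sleep(30)
        os._exit(0)
    os.close(w)
    child = int(os.read(r, 64))
    os.close(r)
    return parent, child

def is_alive(pid):
    try:
        os.kill(pid, 0)                    # signal 0: existence check only
    except ProcessLookupError:
        return False
    return True

parent, child = fork_parent_and_child()
os.kill(parent, signal.SIGTERM)            # terminate the parent only
os.waitpid(parent, 0)                      # reap it
time.sleep(0.5)
child_survived = is_alive(child)           # nothing forwarded the SIGTERM
print("child still running after parent's SIGTERM:", child_survived)
os.kill(child, signal.SIGKILL)             # clean up the orphan by hand
```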
If you want to ensure that children are cleaned up when your process receives a SIGTERM, install a signal handler and do it yourself.
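A minimal sketch of such a handler in Python, assuming the program records the children it starts in a list (children and start_child are hypothetical names): on SIGTERM it forwards the signal to every child still running, waits for each one (escalating to SIGKILL after an arbitrary 5-second grace period), then exits with status 128 + signal number, a common shell convention. In the Node setup from the first question, the same idea would be a process.on('SIGTERM', ...) handler at each level of the tree that stops its own children before exiting.

```python
import os
import signal
import subprocess
import sys
import time

children = []   # hypothetical registry of every child this program starts

def start_child(argv):
    proc = subprocess.Popen(argv)
    children.append(proc)
    return proc

def forward_and_exit(signum, frame):
    """SIGTERM handler: pass the signal on to every running child, reap
    them all, then exit with the conventional 128 + signal number."""
    for proc in children:
        if proc.poll() is None:
            proc.send_signal(signum)
    for proc in children:
        try:
            proc.wait(timeout=5)           # arbitrary grace period
        except subprocess.TimeoutExpired:
            proc.kill()                    # escalate to SIGKILL
            proc.wait()
    sys.exit(128 + signum)

signal.signal(signal.SIGTERM, forward_and_exit)

# Demo: start one long-running child, then simulate a process manager
# (forever, PM2, ...) sending this process SIGTERM.
demo_child = start_child([sys.executable, "-c", "import time; time.sleep(30)"])
exit_code = None
try:
    os.kill(os.getpid(), signal.SIGTERM)
    time.sleep(5)                          # the handler interrupts this
except SystemExit as exc:
    exit_code = exc.code
print("exit code:", exit_code, "child returncode:", demo_child.returncode)
```

In real use the signal comes from outside and the SystemExit simply ends the program; the try/except exists only so the demo can report what happened.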
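If you send the signal to a process group ID (pgid) rather than to a single PID, it is delivered to every process in that group. By default that includes the parent and all the children it forked, because a forked child inherits its parent's pgid. In a shell this is kill -TERM -- -<pgid>.
To find the pgid, use ps a -o pgid,command.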
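The same thing can be done from code with os.killpg(), which signals every process in a group at once. In this Python sketch (Unix; start_group is a hypothetical name), a leader starts a new group with setpgid(0, 0), its forked child inherits that group, and a single killpg() call ends both; the pipe is only a way to observe that every member has exited. A child that calls setpgid() or setsid() itself, as a detached Node child does on POSIX, leaves the group and will not receive the signal.

```python
import os
import select
import signal
import time

def start_group():
    """Fork a leader into a new process group; the leader forks one child,
    which inherits that group. Returns (leader_pid, child_pid, read_fd).
    Both keep the pipe's write end open, so read_fd only reports EOF once
    both have exited."""
    r, w = os.pipe()
    leader = os.fork()
    if leader == 0:
        os.close(r)
        signal.signal(signal.SIGTERM, signal.SIG_DFL)  # default: just die
        os.setpgid(0, 0)                   # new group whose pgid == our PID
        child = os.fork()                  # inherits the pgid and the pipe
        if child == 0:
            time.sleep(300)
            os._exit(0)
        os.write(w, str(child).encode())
        time.sleep(300)
        os._exit(0)
    os.close(w)
    child = int(os.read(r, 64))
    return leader, child, r

leader, child, r = start_group()
child_pgid = os.getpgid(child)             # equals the leader's PID
os.killpg(leader, signal.SIGTERM)          # shell: kill -TERM -- -<pgid>
_, leader_status = os.waitpid(leader, 0)
ready, _, _ = select.select([r], [], [], 5)
all_exited = bool(ready) and os.read(r, 64) == b""
os.close(r)
print("child pgid:", child_pgid, "leader:", leader, "all exited:", all_exited)
```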

Why is a zombie process necessary?

Wikipedia gives basically all the information about zombie processes that I need to know, but only a single line on how they might be useful: a PID conflict cannot arise if the parent process creates another child process.
How is this actually "useful"? Wouldn't the PID become available if the zombie process were removed instead of being kept around?
Or are there other reasons why zombie processes should exist?
Zombie processes are actually really important and definitely need to exist. First, it's important to understand how process creation works in Unix/Linux. The only way to create a new process is for an existing process to create a new child process via fork(). In this way, all of the processes on the system are arranged in an orderly tree hierarchy. Try running ps -Hu <your username> on a Linux system to see the hierarchy of processes that you own.
In many programs it is critically important for a parent process to be able to obtain basic information about its child processes that have exited. This basic information includes the exit status and resource usage of the child. When the parent is ready to get information about a dead child process it calls one of the wait() functions to wait for a child to exit and obtain exit status and resource usage info.
But what happens if a child process exits before the parent waits for it? This is where zombie processes become necessary. The operating system can't just discard the child process; the operation of the parent process may depend on knowing the exit status or resource usage of the child. For example, the parent process might need to know that the child exited abnormally, or it might be collecting CPU usage statistics for its children. So the only choice is to save that information and make it available to the parent when it finally calls wait(). This saved information is what a zombie process is, and it's a critical part of how process management works on Unix/Linux. Zombie processes guarantee that the parent can retrieve the exit status, accounting information, and process ID of its child processes, regardless of whether the parent calls wait() before or after the child exits.
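This behavior can be observed directly. In the Python sketch below (the /proc check is Linux-specific, and proc_state is a hypothetical helper), the child exits immediately while the parent deliberately delays its wait(); in the meantime the child shows up in state Z, the later waitpid() still returns its exit status, and after that the entry is gone.

```python
import os
import time

def proc_state(pid):
    """Hypothetical helper: the one-letter state from /proc/<pid>/stat
    (Linux only; 'Z' means zombie), or None if there is no such entry."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return None
    # The command name is in parentheses and may contain spaces, so take
    # the first field after the last ')'.
    return stat.rsplit(")", 1)[1].split()[0]

pid = os.fork()
if pid == 0:
    os._exit(7)                            # child exits at once

# The parent has not called wait() yet: poll until the child is a zombie.
deadline = time.monotonic() + 5
state = proc_state(pid)
while state not in ("Z", None) and time.monotonic() < deadline:
    time.sleep(0.05)
    state = proc_state(pid)
print("state before wait():", state)

# The zombie kept the exit status, so a late wait() still receives it.
_, status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(status)
after = proc_state(pid)                    # entry is gone once reaped
print("exit status:", exit_code, "state after wait():", after)
```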
This is why a zombie process is necessary.
Footnote: If the parent process never calls wait(), then the child process is reparented to the init process when the parent process dies, and init will wait() for the child.
The answer is on Wikipedia as well:
"This entry is still needed to allow the parent process to read its child's exit status."
Zombie processes are useful.
Zombie processes allow the parent to be guaranteed to be able to retrieve exit status, accounting information, and process id of the child processes.
A process that doesn't clean up its child zombies isn't programmed properly.

Difference between SIGKILL SIGTERM considering process tree

What is the difference between SIGTERM and SIGKILL when it comes to the process tree?
When a root thread receives SIGKILL, does it get killed cleanly, or does it leave its child threads as zombies?
Is there any signal that can be sent to a root thread to make it exit cleanly without leaving any zombie threads?
Thanks.
If you kill the root (parent) process, this creates orphaned children, not zombie children. Orphans are created when you kill a process's parent, and the kernel makes init the new parent of the orphans. init waits for each orphan to exit and then reaps it with wait().
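A Python sketch of the reparenting step (Unix): a middle process forks a child and exits without waiting, and the child reports which PID adopted it. On Linux the new parent is PID 1 unless an ancestor has marked itself a child subreaper with prctl(PR_SET_CHILD_SUBREAPER), in which case that ancestor adopts it, so the sketch only checks that the parent changed.

```python
import os
import time

r, w = os.pipe()
middle = os.fork()
if middle == 0:
    # The "parent" from the answer: fork a child, then exit without waiting.
    os.close(r)
    my_pid = os.getpid()
    if os.fork() == 0:
        # The soon-to-be orphan: wait until its original parent is gone...
        while os.getppid() == my_pid:
            time.sleep(0.01)
        # ...then report which process adopted it.
        os.write(w, str(os.getppid()).encode())
        os._exit(0)
    os._exit(0)

os.close(w)
os.waitpid(middle, 0)                      # reap the middle process
new_parent = int(os.read(r, 64))
os.close(r)
print("original parent:", middle, "adopted by:", new_parent)
```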
Zombie children are created when a process (not its parent) ends and its parent does not collect its exit status from the process table.
It sounds to me like you are worried about leaving orphans, because when you kill a zombie's parent process, the zombie is reparented to init, which reaps it.
To kill your orphans, use kill -9, which sends SIGKILL.
Here is a more in depth tutorial for killing stuff on linux:
http://riccomini.name/posts/linux/2012-09-25-kill-subprocesses-linux-bash/
You can't control that by signal; only its parent process can control that, by calling waitpid() or setting signal handlers for SIGCHLD. See SIGCHLD and SA_NOCLDWAIT in the sigaction(2) manpage for details.
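A minimal Python sketch of the SIGCHLD approach: the handler loops over waitpid(-1, WNOHANG) until no exited children remain, because standard signals do not queue and several exits can arrive as a single SIGCHLD. Python's signal module does not expose SA_NOCLDWAIT; setting SIGCHLD to signal.SIG_IGN is the closest equivalent and also stops zombies from being created on Linux.

```python
import os
import signal
import time

reaped = {}   # pid -> exit code, filled in by the handler

def reap_children(signum, frame):
    """SIGCHLD handler: collect every child that has already exited.
    Loop, because several exits can arrive as a single SIGCHLD."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:          # no children left at all
            return
        if pid == 0:                       # children exist, none exited yet
            return
        reaped[pid] = os.waitstatus_to_exitcode(status)

signal.signal(signal.SIGCHLD, reap_children)

pid = os.fork()
if pid == 0:
    os._exit(3)                            # child exits straight away

# The parent carries on with its own work; the handler reaps in between.
deadline = time.monotonic() + 5
while pid not in reaped and time.monotonic() < deadline:
    time.sleep(0.05)
print("reaped:", reaped)

signal.signal(signal.SIGCHLD, signal.SIG_DFL)  # restore the default
```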
Also, what happens to child threads depends on the Linux kernel version. With 2.6's POSIX threads, killing the main thread should cause the other threads to exit cleanly. With 2.4 LinuxThreads, each thread is actually a separate process and SIGKILL doesn't give the root thread a chance to tell the others to shut down, whereas SIGTERM does.
