I am working on a project in which I need to loosely recreate supervisord (a job control system) in D. I am using spawnShell() as opposed to spawnProcess() for ease of configuring arguments etc. This has the effect of running sh -c "command". However, it returns the PID of sh, NOT of the child process (for obvious reasons). This becomes a problem because my program needs to be able to send a SIGKILL to the process if it doesn't respond to a SIGTERM after a certain period of time. I am able to send a SIGTERM no problem (presumably because sh catches the SIGTERM and passes it to its child process/processes before exiting). However, for again obvious reasons, SIGKILL stops sh before it gets a chance to send a signal to the child process, and the child is left orphaned. Which brings me to my questions:
A: Can I safely assume that the PID of the spawned process will always be one higher than the PID of sh? It has behaved as such in all my testing so far.
B: If not, then is there a more elegant way (a system call or such) to get a child process's PID knowing only the parent process's PID, rather than having my program just execute pgrep -P <sh PID>?
You just need:
sh -c 'exec command'
The shell replaces itself with your command and gets out of the way, so there is no intermediate process.
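A quick way to see the difference from any shell (sleep is just a stand-in for your real command; note that some shells already exec a lone simple command, which is why the first example uses a compound command to force an intermediate sh):
# without exec: $! is the PID of sh, and the real work runs in a child of it
sh -c 'sleep 100; echo done' &
ps -o pid,ppid,args -p "$!"
# with exec: the shell replaces itself, so $! is the PID of the real process
sh -c 'exec sleep 100' &
ps -o pid,ppid,args -p "$!"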
No, you cannot assume PIDs will differ by one.
Can I safely assume that the PID of the spawned process will always be one higher than the PID of sh? It has behaved as such in all my testing so far.
No. Linux is a multitasking OS. While rare, other processes could start in between. Don't rely on a race condition.
If not, then is there a more elegant way (system call or such) to get a child process's PID knowing only the parent process's PID than having my program just execute pgrep -P <sh PID>?
Not really. Trying to navigate the process tree is a sign that your approach is wrong.
You're solving the wrong problem. Get rid of the shell middleman.
Related
I'm working with parallel processing and rather than dealing with cvars and locks I've found it's much easier to run a few commands in a shell script in sequence to avoid race conditions in one place. The new problem is that one of these commands calls another program, which the OS has decided to put into a new process. I need to kill this process from the parent program, but the parent program only knows the pid of the parent (shell script), so this process keeps executing on its own.
Is there a way in bash to set a subprocess to die when the parent dies? I've tried to figure out how to execute it as a daemon because I read daemons exit when the parent dies, but it's tricky and I can't quite get it right. Thanks!
Found the problem, and this fixed it (except for some pesky messages that somehow cannot be redirected to /dev/null).
trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT
From what I gather, programs that run as PID 1 may need to take special precautions such as capturing certain signals.
It's not altogether clear how to correctly write a PID 1 process. I'd rather not use runit or supervisor in my case. For example, supervisor is written in Python, and if you install that, it'll result in a much larger container. I'm not a fan of runit.
Looking at the source code for runit is interesting, but as usual the comments are virtually non-existent and don't explain what's being done for what reason.
There is a good discussion here:
When the process with PID 1 dies for any reason, all other processes are killed with the KILL signal
When any process having children dies for any reason, its children are reparented to the process with PID 1
Many signals which have a default action of Term do not have one for PID 1.
The relevant part for your question:
you can't stop a process by sending SIGTERM or SIGINT if the process has not installed a signal handler
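A common consequence is that a container entrypoint running as PID 1 installs its own handlers and forwards signals to its child. A minimal sketch (my-service is a hypothetical command; this does not reap arbitrary orphans the way tini or dumb-init do):
#!/bin/sh
# Minimal PID-1 sketch: start the real service, forward TERM/INT to it, reap it.
my-service &
child=$!
trap 'kill -TERM "$child" 2>/dev/null' TERM INT
wait "$child"   # returns early when a trapped signal arrives...
wait "$child"   # ...so wait once more for the child to actually exit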
Restarting a service is often implemented via a PID file, i.e. the process ID is written to some file, and based on that number the stop command will kill the process (or does so before a restart).
When you think about it (or if you don't like this, then search), you'll find that this is problematic, as every PID can be reused. Imagine a complete server restart where you call './your-script.sh start' at startup (e.g. @reboot in crontab). Now your-script.sh will kill an arbitrary PID because it has stored a PID from before the restart.
One workaround I can imagine is to store an additional piece of information, so that you could do 'ps -p <pid> | grep <info>' and only kill the process if that returns something. Or are there better options in terms of reliability and/or simplicity?
#!/bin/bash
PID_FILE=/var/run/somejar.pid   # location of the PID file (adjust as needed)

function start() {
    nohup java -jar somejar.jar >> file.log 2>&1 &
    PID=$!
    # one could even store the full "ps -p $PID -o args=" output, but that makes
    # the kill check too specific, e.g. if some arguments get added later
    echo "$PID somejar.jar" > "$PID_FILE"
}

function stop() {
    if [[ -f "$PID_FILE" ]]; then
        PID=$(cut -f1 -d' ' "$PID_FILE")
        # second field: the stored name, used to grep the process list
        PID_INFO=$(cut -f2 -d' ' "$PID_FILE")
        RES=$(ps -p "$PID" -o args= | grep "$PID_INFO")
        if [[ -n "$RES" ]]; then
            kill "$PID"
        fi
    fi
}
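For completeness, such a script is typically finished off with a small dispatcher like the following (hypothetical here, not part of the original snippet):
case "$1" in
  start)   start ;;
  stop)    stop ;;
  restart) stop; start ;;
  *)       echo "usage: $0 {start|stop|restart}" >&2; exit 1 ;;
esac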
The problem with PID files is multifold, not just limited to recycling and reboot.
The bigger issue is the fact that there is an unavoidable disconnect/race between the information in the PID file and the state of the process.
This is the flow of using PID files:
1. You fork & exec a process. The "parent" process knows the PID of the fork and has guarantees that this PID is reserved exclusively for his fork.
2. Your parent writes the PID of the fork to a file.
3. Your parent dies, along with it the guarantee about PID exclusivity.
4. A different process reads the number in the PID file.
5. The different process checks whether there is a process on the system with the same PID as the one he read.
6. The different process sends a signal to the process with the PID he read.
In (1) everything is fine and dandy. We have a PID and we are guaranteed by the kernel that the number is reserved for our intended process.
In (2) you are yielding control of the PID to other processes that do not have this guarantee. In itself not an issue, but such an act is rarely if ever without fault.
In (3) your parent process dies. It alone had the kernel guarantee on PID exclusivity. It may or may not have done a wait(2) on the PID. The true status of the intended process is lost, all we have left is an identifier in the PID file which may or may not refer to the intended process.
In (4) a process without any guarantees reads the PID file, any use of this number has only arbitrary success.
In (5) a process without any guarantees actually uses the identifier for something, this is the first point where we're actually doing something bad: we're querying the kernel using a process identifier that may or may not refer to the intended process. The answer we'll get back will be on the state of the process with that PID, not necessarily of our intended process at all.
In (6) we make the worst mistake: we're actually performing a mutating action, intended to impact our initially started process but by no means guaranteeing that intent. We could be signalling any random system process instead.
Why is this? What kind of stuff can happen to mess with the PID?
Anywhere after (1), the real process may die. So long as the parent retains his guarantee on the PID's exclusivity, the kernel will not recycle the PID. It will still exist and refer to what used to be your process (we call this a "zombie" process, your real process died but the PID is still reserved for it alone). No other process can use this PID and signalling it will not reach any process at all.
As soon as the parent releases his guarantee or after (3), the kernel recycles the PID of the dead process. The zombie is gone and the PID now free to be used by any other new process that is forked. Say you're compiling something, thousands of small processes get spawned. The kernel picks random or sequential (depending on its configuration) new PIDs for each. You're done, now you restart apache. The kernel reuses the freed PID of your dead process for something important.
The PID file still contains the PID, though. Any process that reads the PID file (4) is assuming that this number refers to your long dead process.
Any action (5) (6) you take with the number you read will target the new process, not the old one.
Not only that, but you cannot perform any check prior to your action since there is an unavoidable race between any check you can perform and any action you can perform. If you first look at ps to see what the "name" of your process is (not that this is a really awesome guarantee of anything, please don't do this), and then signal it, the time between your ps check and your signal could still have seen the process die, and/or get recycled by a new process. The root of all of these problems is that the kernel is not giving you any exclusive use guarantees on the PID, since you are not its parent.
Moral of the story: Do NOT give the PID of your children to anyone else. The parent and only the parent should use it, because he is the only one on the system (save the kernel) with any guarantees on its existence and identity.
This usually means keeping the parent alive and instead of signalling something to terminate the process, talking to the parent instead; by means of sockets or the like. See http://smarden.org/runit/ et al.
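A bare-bones sketch of that idea in shell: the supervisor stays alive, alone owns the child's PID, and the outside world signals the supervisor instead of reading a PID from a file (my-service is a hypothetical command):
#!/bin/bash
# Supervisor sketch: only this parent ever touches the child's PID.
my-service &
child=$!

terminate() {
  kill -TERM "$child" 2>/dev/null   # ask the child to stop
  wait "$child"                     # reap it; the PID stays ours until this point
  exit 0
}
trap terminate TERM INT

wait "$child"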
As an alternative to runit there is the daemon command from the libslack library that can automatically respawn the client program when it terminates - without using a PID file.
Using a named daemon with the daemon command allows you to manually restart the client program; this, however, will create a PID file which may lead to race conditions as already pointed out by lhunath.
# daemon example without PID file
daemon --respawn --acceptable=10 --delay=10 -- bash -c 'sleep 30'
# from: man daemon
# "If started with the --respawn option, the client process
# will be restarted after it is killed by the SIGTERM signal."
#
# (Problem would be to reliably get e.g. the bash pid in the daemon example above.)
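The named variant alluded to above might look roughly like this (option names taken from man daemon; treat them as an assumption and check the man page on your system):
# named daemon: does create a PID file, but lets you restart/stop by name
daemon --name=myjob --respawn --acceptable=10 --delay=10 -- bash -c 'sleep 30'
daemon --name=myjob --restart   # manually restart the client program
daemon --name=myjob --stop      # stop it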
I have a scenario in which, after the fork, the child uses execle() to run a Linux system command which executes a small shell script. The parent only does a wait() after that. So my question is: does the parent's wait() still wait for the child after the execle() that the child process performs?
Thanks
Smita
I'm not too sure what you're asking, but if the parent is in a wait() system call, it will wait there until any child exits. There are other things, like signals, that will take it out of the wait too.
You do have to be careful in the child process that you don't accidentally fall through into the parent code on error.
This (a child process doing some execve after its parent fork-ed, and the parent wait- or waitpid-ing it) is a very common scenario; most shells work this way. You could e.g. strace -f an interactive bash shell to learn more, or study the source code of simple shells like sash.
Notice that after a fork(2) syscall, the parent and the child processes may run simultaneously (i.e. at the same time, especially on multi-core machines).
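For example, something along these lines makes the fork/exec/wait dance visible as the shell runs commands (a sketch; type a command like ls once the traced shell starts):
# follow children and show only the relevant syscalls
strace -f -e trace=execve,clone,fork,vfork,wait4 bash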
Someone told me that when you killed a parent process in linux, the child would die.
But I doubt it. So I wrote two bash scripts, where father.sh would invoke child.sh.
Here is my script:
Now I run bash father.sh, and you can check it with ps -alf.
Then I killed father.sh with kill -9 24588, and I expected the child process to be terminated, but unfortunately I was wrong.
Could anyone explain why?
thx
No, when you kill a process alone, it will not kill the children.
You have to send the signal to the process group if you want all processes in a given group to receive the signal.
For example, if your parent process ID is 1234, you have to pass it to kill negated, i.e. a minus sign followed by the parent process ID:
kill -9 -1234
Otherwise, orphans will be linked to init, as shown by your third screenshot (PPID of the child has become 1).
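If you only know a PID and not its process group, you can look the group up first and then signal it (a sketch; 1234 is the hypothetical parent PID from above):
pgid=$(ps -o pgid= -p 1234 | tr -d ' ')   # look up the process group of the parent
kill -9 -- "-$pgid"                       # signal every process in that group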
-bash: kill: (-123) - No such process
In an interactive Terminal.app session the foreground process group id number and background process group id number are different by design when job control/monitor mode is enabled. In other words, if you background a command in a job-control enabled Terminal.app session, the $! pid of the backgrounded process is in fact a new process group id number (pgid).
In a script having no job control enabled, however, this may not be the case! The pid of the backgrounded process may not be a new pgid but a normal pid! And this is, what causes the error message -bash: kill: (-123) - No such process, trying to kill a process group but only specifying a normal pid (instead of a pgid) to the kill command.
# the following code works in Terminal.app because $! == $pgid
{
sleep 100 &
IFS=" " read -r pgid <<EOF
$(ps -p $! -o pgid=)
EOF
echo $$ $! $pgid
sleep 10
kill -HUP -- -$!
#kill -HUP -- -${pgid} # use in script
}
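Another way to make this work in a script (an assumption worth verifying on your shell) is to enable monitor mode before backgrounding, so that $! really is the pgid of a fresh process group:
#!/bin/bash
set -m              # enable job control: each background job gets its own process group
sleep 100 &
kill -HUP -- -$!    # $! is now also the pgid of that job, so the group kill works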
pkill -TERM -P <ProcessID>
This sends the signal to the children of <ProcessID> (processes whose parent PID matches); kill the parent itself separately, or signal the process group, if you want both gone.
Generally killing the parent also kills the child.
The reason you are seeing the child still alive after killing the father is that the child will only die after it "chooses" (the kernel chooses) to handle the SIGKILL event. It doesn't have to handle it right away. Your script is running a sleep() command (i.e. in the kernel), which will not wake up to handle any events whatsoever until the sleep is completed.
Why is PPID #1? The parent has died and is no longer in the process table. child.sh isn't linked inexplicably to init now. It simply has no running parent. Saying it is linked to init creates the impression that if we somehow leave init, init has control over shutting down the process. It also creates the impression that killing a parent will make the grandparent the owner of a child. Both are not true. That child process still exists in the process table and is running, but no new events based upon its process ID will be handled until it handles SIGKILL. Which means that the child is a pre-zombie, walking dead, in danger of being labeled <defunct>.
Killing in the process group is different, and is used to kill the siblings, and the parent by the process group #. It's probably also important to note that "killing a process" is not "killing" per se, in the human way, where you expect the process to be destroyed and all memory returned as though it never was. It just sends a particular event, among many, to the process for it to handle. If the process does not handle it properly, then after a while the OS will come along and "clean it up" forcibly.
It (killing) doesn't happen right away because the child (or even the parent) could have written something to disk and be waiting for I/O to complete or doing some other critical task that could compromise system stability or file integrity.