Bash: Is it possible to stop a PID from being reused? - linux

Is it possible to stop a PID from being reused?
For example, if I run a job myjob in the background with myjob &, and get its PID with PID=$!, is it possible to prevent the Linux system from re-using that PID until I have checked that the PID no longer exists (i.e. the process has finished)?
In other words I want to do something like:
myjob &
PID=$!
do_not_use_this_pid $PID
wait $PID
allow_use_of_this_pid $PID
The reasons for wanting to do this do not make much sense in the example given above, but consider launching multiple background jobs in series and then waiting for them all to finish.
Some programmer dude rightly points out that no two processes may share the same PID at the same time. That is correct, but it is not what I am asking here. I am asking for a method of preventing a PID from being re-used after a process has been launched with that PID, and then also a method of re-enabling its use later, once I have finished using it to check whether my original process has finished.
Since it has been asked for, here is a use case:
launch multiple background jobs
get the PIDs of the background jobs
prevent those PIDs from being re-used by another process after a background job terminates
check for the PIDs of the background jobs, i.e. to ensure the background jobs finish
[note: if PID re-use were disabled for the PIDs of the background jobs, those PIDs could not be taken by a new process launched after a background process terminated]*
re-enable re-use of the background jobs' PIDs
repeat
*Further explanation:
Assume 10 jobs are launched
Job 5 exits
A new process is started by another user, for example, they log in on a tty
The new process gets the same PID as Job 5!
Now our script checks for Job 5's termination, but sees the PID in use by the tty session!

You can't "block" a PID from being reused by the kernel. However, I am inclined to think this isn't really a problem for you.
but consider launching multiple background jobs in series and then waiting for them all to finish.
A simple wait (without arguments) would wait for all the child processes to complete. So, you don't need to worry about the PIDs being reused.
When you launch several background processes, it's indeed possible that PIDs may be reused by other processes.
But that's not a problem, because you can't wait on a process unless it's your child process.
Otherwise, checking whether one of the background jobs you started has completed by any means other than wait is always going to be unreliable.
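A minimal sketch of that approach (sleep stands in for the real jobs): because the shell that launched the jobs is their parent, wait with a stored PID is always safe here, since the shell resolves the PID against its own job bookkeeping rather than against whatever process currently holds that number system-wide.

#!/bin/bash
# launch several background jobs and remember their PIDs
pids=()
for n in 1 2 3; do
    sleep "$n" &              # stand-in for the real background job
    pids+=("$!")
done

# wait for each job individually so its exit status can be collected;
# wait consults the shell's own records, so PID reuse elsewhere on the
# system does not matter here
for pid in "${pids[@]}"; do
    wait "$pid"
    echo "job with PID $pid exited with status $?"
done

# or simply:  wait          # with no arguments, waits for all children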

Unless you've retrieved the return value of the child process, it will continue to exist in the kernel (as a zombie). That also means that its PID stays bound to it and can't be re-used during that time.

A further suggestion to work around this: if you suspect that a PID assigned to one of your background jobs has been reassigned, check it in ps to see whether it is still your process, running your executable, and whether its parent PID (PPID) is still that of your script.
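A hedged sketch of that check (myjob is a placeholder for your executable's name; note that ps -o comm= truncates long names, so matching on it is only a heuristic):

myjob &
PID=$!
# ... later, before trusting $PID, confirm it still looks like our job
if [ "$(ps -o comm= -p "$PID" 2>/dev/null)" = "myjob" ] &&
   [ "$(ps -o ppid= -p "$PID" 2>/dev/null | tr -d ' ')" = "$$" ]; then
    echo "$PID still appears to be our background job"
else
    echo "$PID has exited, or the number now belongs to another process"
fi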

If you are worried about PIDs being reused (which won't happen if you wait, as the other answers explain), you can raise the kernel's PID limit with
echo 4194303 > /proc/sys/kernel/pid_max
to decrease your fear ;-) (The default pid_max is usually 32768, so a larger range makes reuse far less frequent, although it can never rule it out.)

Related

Linux fork, execve - no wait zombies

In Linux & C, will not waiting (waitpid) for a fork-execve launched process create zombies?
What is the correct way to launch a new program (many times) without waiting and without resource leaks?
It would also be launched from a 2nd worker thread.
Can the first program terminate first cleanly if launched programs have not completed?
Additional: In my case I have several threads that can fork-execve processes at any time, possibly at the same time:
1) Some I need to wait on for completion, and I want to report any error codes with waitpid
2) Some I do not want to block the thread on, but I would still like to report errors
3) Some I don't want to wait for at all and don't care about the outcome; they could keep running after the program terminates
For #2, do I have to create an additional thread to call waitpid?
For #3, should I do a fork-fork-execve, and would exiting the first fork cause the second process to be cleaned up (no zombie) separately via init?
Additional: I've read briefly (and am not sure I understand it all) about using nohup, double fork, setpgid(0,0), and signal(SIGCHLD, SIG_IGN).
Doesn't a global signal(SIGCHLD, SIG_IGN) have too many side effects, like being inherited (or maybe not) and preventing you from monitoring other processes you do want to wait for?
Wouldn't relying on init to clean up leak resources while the program continues to run (for weeks in my case)?
In Linux & C, will not waiting (waitpid) for a fork-execve launched process create zombies?
Yes, they become zombies after death.
What is the correct way to launch a new program (many times) without waiting and without resource leaks? It would also be launched from a 2nd worker thread.
Set SIGCHLD to SIG_IGN.
Can the first program terminate first cleanly if launched programs have not completed?
Yes, orphaned processes will be adopted by init.
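For case #3 (fire and forget), the shell-level analog of the double fork is to launch the program from a short-lived intermediate subshell, so that it is reparented to init (or the nearest subreaper) and reaped there; myprog is a placeholder, and this is only a sketch of the idea, not a drop-in replacement for the threaded C code:

# the inner subshell exits immediately after backgrounding myprog, so
# myprog is orphaned, adopted by init, and reaped by init when it exits;
# it never becomes a zombie in our own process table
( myprog arg1 arg2 & )

# optionally also detach it from our session and terminal
( setsid myprog arg1 arg2 </dev/null >/dev/null 2>&1 & )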
I ended up keeping an array of just the fork-exec'd PIDs I did not wait for (the other fork-exec'd PIDs do get waited on) and periodically scanned the list using
waitpid( pids[xx], &status, WNOHANG ) != 0
which gives me a chance to report the outcome and avoid zombies.
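For comparison, in a shell script the equivalent bookkeeping is simpler, because bash records the exit status of its background children; a rough counterpart (blocking only until the next child finishes, rather than polling with WNOHANG) could look like this, assuming bash 5.1 or later for wait -n -p and with some_command as a placeholder:

pids=()
for n in 1 2 3; do
    some_command "$n" &        # placeholder for the real fork-exec'd programs
    pids+=("$!")
done

for (( remaining = ${#pids[@]}; remaining > 0; remaining-- )); do
    wait -n -p finished_pid    # returns as soon as any background child exits
    echo "pid $finished_pid finished with status $?"
done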
I avoided using global things like signal handlers that might affect other code elsewhere.
It seemed a bit messy.
I suppose a fork-fork-exec would be an alternative, with the first fork asynchronously monitoring the other program's completion, but then the first fork itself needs cleanup.
In Windows, you just keep a handle to the process open if you want to check status without worrying about PID reuse, or close the handle if you don't care what the other process does.
(In Linux, there seems to be no way for multiple threads or processes to monitor the status of the same process safely; only the parent process/thread can, but that is not my issue here.)

Problems with killing jobs

I would like to kill a serial job, which runs several calculations one after another. With the command 'kill PID', where PID refers to the process ID, the currently running calculation is cancelled, but the process is not stopped. Instead, the next calculation starts, whereas I would like to kill the entire job, the entire process.
kill -9 <pid> should do the job. Unfortunately this might not always work; if the program is poorly written, you might be unable to kill it this way.
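One way to take down the whole job rather than only the current calculation, assuming the calculations are started as children of one job script and share its process group (a sketch, not a guaranteed fix for every batch setup):

# $PID is the PID of the job script, as used with 'kill PID' above;
# the leading minus tells kill to signal the whole process group
PGID=$(ps -o pgid= -p "$PID" | tr -d ' ')
kill -TERM -- "-$PGID"        # or kill -9 -- "-$PGID" as a last resort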

What special precautions must I make for docker apps running as pid 1?

From what I gather, programs that run as pid 1 may need to take special precautions such as capturing certain signals.
It's not altogether clear how to correctly write a pid 1. I'd rather not use runit or supervisor in my case. For example, supervisor is written in python and if you install that, it'll result in a much larger container. I'm not a fan of runit.
Looking at the source code for runit is interesting, but as usual the comments are virtually non-existent and don't explain what is being done for what reason.
There is a good discussion here:
When the process with PID 1 dies for any reason, all other processes are killed with the KILL signal
When any process that has children dies for any reason, its children are reparented to the process with PID 1
Many signals which have a default action of Term have no effect on PID 1 unless it has installed a handler for them.
The relevant part for your question:
you can't stop the process by sending SIGTERM or SIGINT if the process has not installed a signal handler
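A minimal sketch of a shell entrypoint that deals with exactly that, assuming myservice is a placeholder for the real application (purpose-built minimal inits such as tini or dumb-init handle more corner cases, e.g. continuously reaping orphans that get reparented to PID 1):

#!/bin/bash
# minimal PID-1 style entrypoint: install a handler so SIGTERM/SIGINT
# actually stop the container, forward the signal to the service, and
# propagate the service's exit status

child=0
forward_signal() {
    [ "$child" -ne 0 ] && kill -TERM "$child" 2>/dev/null
}
trap forward_signal TERM INT

myservice &                  # placeholder for the real application
child=$!

# wait returns early when a trapped signal arrives, so keep waiting
# until the child has really gone
wait "$child"; status=$?
while kill -0 "$child" 2>/dev/null; do
    wait "$child"; status=$?
done

exit "$status"

If the application itself already handles SIGTERM/SIGINT properly, an even simpler option is to exec it from the entrypoint so that it becomes PID 1 directly.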

Are PID-files still flawed when doing it 'right'?

Restarting a service is often implemented via a PID file, i.e. the process ID is written to some file, and based on that number the stop command (or a restart) will kill the process.
When you think about it (or if you don't like this, then search), you'll find that this is problematic, as every PID can be reused. Imagine a complete server restart where you call './your-script.sh start' at startup (e.g. @reboot in crontab). Now your-script.sh will kill an arbitrary PID, because it has stored a PID from before the restart.
One workaround I can imagine is to store some additional information, so that you could do something like 'ps -<pid> | grep <info>' and only kill if this returns something. Or are there better options in terms of reliability and/or simplicity?
#!/bin/bash
PID_FILE=/var/run/somejar.pid   # example location; pick one that suits your setup

function start() {
    nohup java -jar somejar.jar >> file.log 2>&1 &
    PID=$!
    # one could even store the "ps -p $PID" information, but this makes the
    # killing too specific, e.g. if some arguments will be added or similar
    echo "$PID somejar.jar" > "$PID_FILE"
}

function stop() {
    if [[ -f "$PID_FILE" ]]; then
        PID=$(cut -f1 -d' ' "$PID_FILE")
        # now get the second field and grep the process list with it
        PID_INFO=$(cut -f2 -d' ' "$PID_FILE")
        RES=$(ps -p "$PID" -o args= | grep "$PID_INFO")
        if [[ -n "$RES" ]]; then
            kill "$PID"
        fi
    fi
}
The problems with PID files are manifold, not just limited to PID recycling and reboots.
The bigger issue is the fact that there is an unavoidable disconnect/race between the information in the PID file and the state of the process.
This is the flow of using PID files:
1) You fork & exec a process. The "parent" process knows the PID of the fork and has the guarantee that this PID is reserved exclusively for its fork.
2) Your parent writes the PID of the fork to a file.
3) Your parent dies, and with it the guarantee about PID exclusivity.
4) A different process reads the number in the PID file.
5) The different process checks whether there is a process on the system with the same PID as the one it read.
6) The different process sends a signal to the process with the PID it read.
In (1) everything is fine and dandy. We have a PID and we are guaranteed by the kernel that the number is reserved for our intended process.
In (2) you are yielding control of the PID to other processes that do not have this guarantee. In itself not an issue, but such an act is rarely if ever without fault.
In (3) your parent process dies. It alone had the kernel guarantee on PID exclusivity. It may or may not have done a wait(2) on the PID. The true status of the intended process is lost, all we have left is an identifier in the PID file which may or may not refer to the intended process.
In (4) a process without any guarantees reads the PID file, any use of this number has only arbitrary success.
In (5) a process without any guarantees actually uses the identifier for something, this is the first point where we're actually doing something bad: we're querying the kernel using a process identifier that may or may not refer to the intended process. The answer we'll get back will be on the state of the process with that PID, not necessarily of our intended process at all.
In (6) we make the worst mistake: we're actually performing a mutating action, intended to impact our initially started process but by no means guaranteeing that intent. We could be signalling any random system process instead.
Why is this? What kind of stuff can happen to mess with the PID?
Anywhere after (1), the real process may die. As long as the parent retains its guarantee on the PID's exclusivity, the kernel will not recycle the PID. It will still exist and refer to what used to be your process (we call this a "zombie" process: your real process died, but the PID is still reserved for it alone). No other process can use this PID, and signalling it will not reach any process at all.
As soon as the parent releases its guarantee (by reaping the child, or by dying itself as in (3), after which init reaps it), the kernel recycles the PID of the dead process. The zombie is gone and the PID is now free to be used by any other new process that is forked. Say you're compiling something: thousands of small processes get spawned, and the kernel picks random or sequential (depending on its configuration) new PIDs for each. You're done; now you restart Apache. The kernel reuses the freed PID of your dead process for something important.
The PID file still contains the PID, though. Any process that reads the PID file (4) is assuming that this number refers to your long dead process.
Any action (5) (6) you take with the number you read will target the new process, not the old one.
Not only that, but you cannot perform any check prior to your action since there is an unavoidable race between any check you can perform and any action you can perform. If you first look at ps to see what the "name" of your process is (not that this is a really awesome guarantee of anything, please don't do this), and then signal it, the time between your ps check and your signal could still have seen the process die, and/or get recycled by a new process. The root of all of these problems is that the kernel is not giving you any exclusive use guarantees on the PID, since you are not its parent.
Moral of the story: do NOT give the PID of your children to anyone else. The parent and only the parent should use it, because it is the only one on the system (save the kernel) with any guarantee about the PID's existence and identity.
This usually means keeping the parent alive and, instead of signalling the process directly to terminate it, talking to the parent instead, by means of sockets or the like. See http://smarden.org/runit/ et al.
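A minimal sketch of that pattern in shell, with a FIFO standing in for the socket (all names here are made up for illustration); a fuller supervisor would also notice the child exiting on its own, for instance with wait -n, instead of assuming it is still running:

#!/bin/bash
# supervise.sh -- keep the parent alive; outsiders ask the parent to stop
# the service instead of signalling a PID they read from a file
CTRL=/tmp/myservice.ctl        # control FIFO (example path)
mkfifo -m 600 "$CTRL" 2>/dev/null

myservice &                    # placeholder for the real service
child=$!

# each iteration blocks until someone writes a command to the FIFO
while read -r cmd < "$CTRL"; do
    case "$cmd" in
        stop)
            kill -TERM "$child"   # only this parent ever signals its child
            wait "$child"
            break
            ;;
    esac
done
rm -f "$CTRL"

# to stop the service from elsewhere:   echo stop > /tmp/myservice.ctl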
As an alternative to runit there is the daemon command from the libslack library that can automatically respawn the client program when it terminates - without using a PID file.
Using a named daemon with the daemon command allows you to manually restart the client program; this, however, will create a PID file which may lead to race conditions as already pointed out by lhunath.
# daemon example without PID file
daemon --respawn --acceptable=10 --delay=10 bash -- -c 'sleep 30'
# from: man daemon
# "If started with the --respawn option, the client process
# will be restarted after it is killed by the SIGTERM signal."
#
# (Problem would be to reliably get e.g. the bash pid in the daemon example above.)

Why child process still alive after parent process was killed in Linux?

Someone told me that when you kill a parent process in Linux, the child will die too.
But I doubted it, so I wrote two bash scripts, where father.sh would invoke child.sh.
Here is my script:
Now I run bash father.sh; you can check it with ps -alf.
Then I killed father.sh with kill -9 24588, and I expected the child process to be terminated, but unfortunately I was wrong.
Could anyone explain why?
Thanks
No, when you kill a process alone, it will not kill the children.
You have to send the signal to the process group if you want all processes in a given group to receive the signal.
For example, if your parent's process group ID is 1234, you specify it by prefixing the ID with a minus sign:
kill -9 -1234
Otherwise, orphans will be linked to init, as shown by your third screenshot (PPID of the child has become 1).
-bash: kill: (-123) - No such process
In an interactive Terminal.app session, the foreground process group ID and the background process group ID are different by design when job control/monitor mode is enabled. In other words, if you background a command in a job-control-enabled Terminal.app session, the $! PID of the backgrounded process is in fact also a new process group ID (pgid).
In a script with no job control enabled, however, this may not be the case! The PID of the backgrounded process may not be a new pgid but a normal PID, and this is what causes the error message -bash: kill: (-123) - No such process: you are trying to kill a process group but passing only a normal PID (instead of a pgid) to the kill command.
# the following code works in Terminal.app because $! == $pgid
{
sleep 100 &
IFS=" " read -r pgid <<EOF
$(ps -p $! -o pgid=)
EOF
echo $$ $! $pgid
sleep 10
kill -HUP -- -$!
#kill -HUP -- -${pgid} # use in script
}
pkill -TERM -P <ParentProcessID>
This kills the children of <ParentProcessID> (pkill -P matches on the parent PID); to take down the parent as well, kill it separately or signal the whole process group as shown above.
Killing the parent does not automatically kill the child.
The reason you are seeing the child still alive after killing the father is that the kill signal was delivered only to the parent; the child never received any signal at all. Your child script is sitting in a sleep() call and simply keeps running until it is either signalled itself or the sleep completes.
Why is the PPID now #1? The parent has died and is no longer in the process table, so the kernel reparents the orphaned child to init (PID 1), exactly as the ps output shows. The child still exists in the process table and keeps running; when it eventually exits, init will reap it, so it does not linger as a <defunct> (zombie) entry.
Killing the process group is different: it is used to signal the parent and all of its children at once, by the process group number. It is probably also important to note that "killing a process" is not "killing" per se, in the human sense, where you expect the process to be destroyed and all memory returned as though it never existed. It just sends a particular signal, among many, to the process for it to handle; if the process ignores or mishandles a terminating signal, it keeps running, and only SIGKILL (which cannot be caught or ignored) lets the kernel tear it down unconditionally.
Termination may also not appear to be instantaneous, because the child (or even the parent) could be blocked in uninterruptible I/O (for example, waiting on a disk operation), and the kernel only removes it once that completes.

Resources