Using appropriate POSIX signals - linux

I am currently working on a project which has a daemon process that looks at a queue of tasks, runs those tasks, and then collects information about those tasks. In some cases, the daemon must "kill" a task if it has taken too long to run.
The explanation for SIGTERM is "termination signal" but that's not very informative. I would like to use the most appropriate signal for this.
What is the most appropriate POSIX signal number to use for telling a process "you took too much time to run so you need to stop now"?

If you're in control of the child processes, you can pretty much do as you please, but SIGTERM is the self-documenting signal for this. It asks a process to terminate, politely: the process chooses how to handle the signal and may perform cleanup actions before actually exiting (or may ignore the signal).
The standard way to kill a process, then, is to first send a SIGTERM; then wait for it to terminate with a grace period of, say, five seconds (longer if termination can take a long time, e.g. because of massive disk I/O). If the grace period has expired, send a SIGKILL. That's the "hard" version of SIGTERM and cannot be ignored, but also leaves the process no chance of neatly cleaning up after itself. Having to send a SIGKILL should be considered an issue with the child process and reported as such.
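The SIGTERM, grace period, SIGKILL sequence described above can be sketched in Python as follows. This is a minimal sketch, assuming the daemon started the task itself with `subprocess.Popen`; the function name `stop_with_grace` and its return strings are made up for illustration.

```python
import subprocess


def stop_with_grace(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Ask proc to exit with SIGTERM and escalate to SIGKILL after the grace period.

    Returns "already-exited", "terminated" (exited within the grace period)
    or "killed" (SIGKILL was needed).
    """
    if proc.poll() is not None:
        return "already-exited"
    proc.terminate()  # sends SIGTERM on POSIX systems
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed"
```

A "killed" result is the case the answer suggests reporting as a problem with the child process.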

Usually you'll first send SIGTERM to a process. When the process receives this signal, it can clean up and then terminate itself:
kill -15 PID_OF_PROCESS # 15 means SIGTERM
You can check whether the process is still running by sending signal 0 to its PID. Signal 0 delivers nothing; kill only reports whether the process exists:
if kill -0 PID_OF_PROCESS 2>/dev/null ; then
    echo "the process is still running"
fi
However, you'll need a grace period to let the process clean up. If the process hasn't terminated after the grace period, kill it with SIGKILL. This signal can't be handled by the process, and the OS will terminate the process immediately.
kill -9 PID_OF_PROCESS # 9 means SIGKILL, means DIE!

Related

Long Running Python Script in VSCode Exits with 'Polite quit request'

I have a long running Python script which is running in Visual Studio Code.
After a while the script stops running, there are no errors just this statement:
"fish: “/usr/bin/python3 /home/ubuntu/.…” terminated by signal SIGTERM (Polite quit request)"
What is happening here?
If a process receives SIGTERM, some other process sent that signal. That is what happened in your case.
The SIGTERM signal is sent to a process to request its termination. Unlike the SIGKILL signal, it can be caught and interpreted or ignored by the process. This allows the process to perform nice termination releasing resources and saving state if appropriate. SIGINT is nearly identical to SIGTERM.
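As an illustration of "caught and interpreted", here is a minimal Python sketch of a long-running loop that notes SIGTERM in a handler and cleans up before returning. The names `ShutdownFlag` and `run_until_sigterm` are hypothetical, and the sketch assumes it runs in the main thread, which is where Python delivers signal handlers.

```python
import signal


class ShutdownFlag:
    """Records that a termination request arrived; usable as a signal handler."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        # Keep the handler tiny: only note the request, clean up in the main loop.
        self.requested = True


def run_until_sigterm(work_step, cleanup):
    """Call work_step() repeatedly until SIGTERM arrives, then call cleanup()."""
    flag = ShutdownFlag()
    previous = signal.signal(signal.SIGTERM, flag)
    try:
        while not flag.requested:
            work_step()
        cleanup()  # release resources, save state, and so on
    finally:
        if previous is not None:
            # Restore whatever disposition was in place before this function ran.
            signal.signal(signal.SIGTERM, previous)
```

Doing the real cleanup outside the handler keeps the handler itself trivial; the loop finishes its current step before shutting down.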
SIGTERM is not sent automatically by the system. There are a few signals that are sent automatically like SIGHUP when a terminal goes away, SIGSEGV/SIGBUS/SIGILL when a process does things it shouldn't be doing, SIGPIPE when it writes to a broken pipe/socket, etc.
SIGTERM is the signal that is typically used to administratively terminate a process.
That's not a signal that the kernel would send, but that's the signal a process would typically send to terminate (gracefully) another process. It is sent by default by the kill, pkill, killall, fuser -k commands.
Possible reasons why your process received this signal are:
execution of the process takes too long
insufficient memory or system resources to continue the execution of the process
But these are only possibilities. In your case, the root of the issue might be something different. You can avoid being terminated by SIGTERM by telling the process to ignore the signal, but doing so is not recommended.

when will /proc/<pid> be removed?

Process A opens and mmaps thousands of files while running. Then kill -9 <pid of process A> is issued. I have a question about the order of the two events below.
a) /proc/<pid of process A> cannot be accessed.
b) all files opened by process A are closed.
More background about the question:
Process A is a multi-thread background service. It is started by cmd ./process_A args1 arg2 arg3.
There is also a watchdog process that periodically (every 1 second) checks whether process A is still alive. If process A is dead, the watchdog restarts it. The watchdog checks process A as follows.
1) It collects all numeric subdirectories under /proc/.
2) It compares each /proc/<pid>/cmdline with the cmdline of process A. If some /proc/<some-pid>/cmdline matches, process A is alive and nothing is done; otherwise process A is restarted.
Process A does the following during initialization.
1) open fileA
2) flock fileA
3) mmap fileA into memory
4) close fileA
Process A mmaps thousands of files after initialization.
After several minutes, kill -9 <pid of process A> is issued.
The watchdog detects the death of process A and restarts it. But sometimes the new process A gets stuck at step 2, flock fileA. After some debugging, we found that fileA is unlocked when process A is killed, but sometimes this happens only after the new process has reached step 2, flock fileA.
So we guess that checking whether the process is alive by monitoring /proc/<pid of process A> is not correct.
then kill -9 is issued
This is a bad habit. You had better send a SIGTERM first, because well-behaved processes and well-designed programs can catch it (and exit nicely and properly when getting a SIGTERM). In some cases, I even recommend: send SIGTERM; wait two or three seconds; send SIGQUIT; wait two seconds; and only then send SIGKILL (for those bad programs that have not been written properly or are misbehaving). Read signal(7) and signal-safety(7). In multi-threaded, Linux-specific programs, you might use signalfd(2) or the self-pipe trick with pipe(7) (well explained in the Qt documentation, but not Qt specific).
If your Linux system is systemd based, you could imagine your program-A is started with systemd facilities. Then you'll use systemd facilities to "communicate" with it. In some ways (I don't know the details), systemd is making signals almost obsolete. Notice that signals are not multi-thread friendly and have been designed, in the previous century, for single-thread processes.
we guess the way to check process alive by monitor /proc/ is not correct.
The usual (and faster, and "atomic" enough) way to detect the existence of a process (on which you have enough privileges, e.g. which runs with your uid/gid) is to use kill(2) with a signal number (the second argument to kill) of 0. To quote that manpage:
If sig is 0, then no signal is sent, but existence and permission
checks are still performed; this can be used to check for the
existence of a process ID or process group ID that the caller is
permitted to signal.
Of course, because Linux has preemptive scheduling, that other process can still terminate before any further interaction with it.
Your watchdog process should use kill(pid-of-process-A, 0) to check the existence and liveness of process A. Using /proc/pid-of-process-A/ is not the correct way to do that.
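In Python the same check is available through `os.kill`, which wraps kill(2). The sketch below is a minimal illustration; the helper name `pid_exists` is made up, and treating "permission denied" as "exists" is an assumption that matches the manpage text quoted above.

```python
import os


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists (running or zombie)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)  # signal 0: existence and permission checks only
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but belongs to someone else
    return True
```

Note that a child that has exited but has not yet been waited for still counts as existing, so a parent watchdog should prefer wait() for its own children.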
And whatever you code, that process-A could disappear asynchronously (in particular, if it has some bug that gives a segmentation fault). When a process terminates (even with a segmentation fault) the kernel is acting on its file locks (and "releases" them).
Don't scan /proc/PID to find out if a specific process has terminated. There are lots of better ways to do that, such as having your watchdog program actually launch the server program and wait for it to terminate.
Or, have the watchdog listen on a TCP socket, and have the server process connect to that and send its PID. If either end dies, the other can notice that the connection was closed (hint: send a heartbeat packet every so often, to detect a frozen peer). If the watchdog receives a connection from another server while the first is still running, it can decide to allow it or tell one of the instances to shut down (via TCP or kill()).
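The first suggestion, a watchdog that launches the server itself and waits for it, might look like the following Python sketch. The function name `supervise`, the restart limit and the backoff are assumptions for illustration; a real watchdog would probably restart indefinitely and log each exit.

```python
import subprocess
import time


def supervise(argv, max_restarts=3, backoff_seconds=1.0):
    """Run argv and restart it after each abnormal exit, up to max_restarts times.

    Returns the list of exit statuses observed; a negative value -N means the
    child was terminated by signal N.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        proc = subprocess.Popen(argv)
        code = proc.wait()  # blocks until this exact child has exited, and reaps it
        exit_codes.append(code)
        if code == 0 or attempt == max_restarts:
            break
        time.sleep(backoff_seconds)  # avoid a tight restart loop
    return exit_codes
```

Because the watchdog is the parent, wait() reports the death of that specific child directly, with no scanning of /proc and no ambiguity about which process a PID refers to.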

Bash: Is it possible to stop a PID from being reused?

Is it possible to stop a PID from being reused?
For example if I run a job myjob in the background with myjob &, and get the PID using PID=$!, is it possible to prevent the linux system from re-using that PID until I have checked that the PID no longer exists (the process has finished)?
In other words I want to do something like:
myjob &
PID=$!
do_not_use_this_pid $PID
wait $PID
allow_use_of_this_pid $PID
The reasons for wanting to do this do not make much sense in the example given above, but consider launching multiple background jobs in series and then waiting for them all to finish.
Some programmer dude rightly points out that no 2 processes may share the same PID. That is correct, but not what I am asking here. I am asking for a method of preventing a PID from being re-used after a process has been launched with a particular PID. And then also a method of re-enabling its use later after I have finished using it to check whether my original process finished.
Since it has been asked for, here is a use case:
launch multiple background jobs
get PID's of background jobs
prevent PID's from being re-used by another process after background job terminates
check for PID's of "background jobs" - ie, to ensure background jobs finish
[note if disabled PID re-use for the PID's of the background jobs those PIDs could not be used by a new process which was launched after a background process terminated]*
re-enable PID of background jobs
repeat
*Further explanation:
Assume 10 jobs launched
Job 5 exits
New process started by another user, for example, they login to a tty
New process has same PID as Job 5!
Now our script checks for Job 5 termination, but sees PID in use by tty!
You can't "block" a PID from being reused by the kernel. However, I am inclined to think this isn't really a problem for you.
but consider launching multiple background jobs in series and then waiting for them all to finish.
A simple wait (without arguments) would wait for all the child processes to complete. So, you don't need to worry about the PIDs being reused.
When you launch several background processes, it's indeed possible that PIDs may be reused by other processes.
But it's not a problem because you can't wait on a process unless it's your child process.
Otherwise, checking whether one of the background jobs you started is completed by any means other than wait is always going to be unreliable.
Until you've retrieved the return value of the child process, it will still exist in the kernel. That also means that its PID is bound to it and can't be reused during that time.
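The same idea, expressed in Python rather than Bash, as a minimal sketch: start every job first, then wait on each one. The helper name `run_jobs` is hypothetical. Each child keeps its PID (as a running process or as a zombie) until the parent collects its status, so the PIDs recorded here cannot have been handed to unrelated processes in the meantime.

```python
import subprocess


def run_jobs(commands):
    """Start every command in the background, then wait for all of them.

    Returns a dict mapping each child's PID to its exit status.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]  # all jobs run concurrently
    # wait() collects the exit status; only after that may the kernel reuse the PID.
    return {proc.pid: proc.wait() for proc in procs}
```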
Further suggestion to work around this - if you suspect that a PID assigned to one of your background jobs is reassigned, check it in ps to see if it still is your process with your executable and has PPID (parent PID) 1.
If you are afraid of reusing PID's, which won't happen if you wait as other answers explain, you can use
echo 4194303 > /proc/sys/kernel/pid_max
to decrease your fear ;-)

Are there suspend/resume signals in Linux?

My application needs to react to hibernation so that it can perform some actions on suspend and others on resume. I've found some distribution-specific ways to achieve this (UPower + DBus) but didn't find anything universal. Is there a way to do it?
Thanks!
A simple solution to this is to use a self-pipe. Open up a pipe and periodically write timestamps to it. select on this pipe to read the timestamps and compare them to the current time. When there is a big gap, that means you have just woken up from system suspension or hibernate mode.
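The gap-detection idea can be sketched without the pipe: sleep for a fixed interval and compare wall-clock timestamps. This is a simplified variant, not the self-pipe setup described above, and the names `watch_for_resume`, `tolerance` and `max_ticks` are made up for illustration. It assumes the wall clock (`time.time`) jumps forward across a suspend; on Linux, Python's `time.monotonic` typically does not count suspended time, so it would not show the gap. A manual or NTP clock step would also trigger the callback.

```python
import time


def watch_for_resume(on_resume, interval=1.0, tolerance=5.0,
                     clock=time.time, sleep=time.sleep, max_ticks=None):
    """Call on_resume(gap_seconds) whenever far more time passed than we slept.

    clock and sleep are injectable for testing; max_ticks=None loops forever.
    """
    last = clock()
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        sleep(interval)
        now = clock()
        if now - last > interval + tolerance:
            on_resume(now - last)  # likely woke up from suspend or hibernation
        last = now
        ticks += 1
```

This only detects a resume after the fact; it cannot run anything before the machine suspends.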
As for the other way around, there is not much time when the lid is closed and it flips the switch.
If you really need to act on suspend, then you will need to set powersave hooks like this https://help.ubuntu.com/community/PowerManagement/ReducedPower in pm-utils. It could be as simple as
kill -1 `cat mypid` ; sleep 1
Your process would then trap SIGHUP and do what needs to be done to prepare for suspension. The sleep delays the process long enough for your program to react to the signal.
I believe you are looking for SIGSTOP and SIGCONT signals. You can send these to a running process like so:
kill -STOP pid
sleep 60
kill -CONT pid

How does SIGINT relate to the other termination signals such as SIGTERM, SIGQUIT and SIGKILL?

On POSIX systems, termination signals usually have the following order (according to many MAN pages and the POSIX Spec):
SIGTERM - politely ask a process to terminate. It shall terminate gracefully, cleaning up all resources (files, sockets, child processes, etc.), deleting temporary files and so on.
SIGQUIT - more forceful request. It shall terminate ungraceful, still cleaning up resources that absolutely need cleanup, but maybe not delete temporary files, maybe write debug information somewhere; on some system also a core dump will be written (regardless if the signal is caught by the app or not).
SIGKILL - most forceful request. The process is not even asked to do anything, but the system will clean up the process, whether it like that or not. Most likely a core dump is written.
How does SIGINT fit into that picture? A CLI process is usually terminated by SIGINT when the user hits CTRL+C; however, a background process can also be terminated by SIGINT using the kill utility. What I cannot see in the specs or the header files is whether SIGINT is more or less forceful than SIGTERM, or whether there is any difference between SIGINT and SIGTERM at all.
UPDATE:
The best description of termination signals I found so far is in the GNU LibC Documentation. It explains very well that there is an intended difference between SIGTERM and SIGQUIT.
It says about SIGTERM:
It is the normal way to politely ask a program to terminate.
And it says about SIGQUIT:
[...] and produces a core dump when it terminates the process, just like a program error signal.
You can think of this as a program error condition “detected” by the user. [...]
Certain kinds of cleanups are best omitted in handling SIGQUIT. For example, if the program
creates temporary files, it should handle the other termination requests by deleting the temporary
files. But it is better for SIGQUIT not to delete them, so that the user can examine them in
conjunction with the core dump.
And SIGHUP is also explained well enough. SIGHUP is not really a termination signal, it just means the "connection" to the user has been lost, so the app cannot expect the user to read any further output (e.g. stdout/stderr output) and there is no input to expect from the user any longer. For most apps that means they had better quit. In theory an app could also decide that it goes into daemon mode when a SIGHUP is received and now runs as a background process, writing output to a configured log file. For most daemons already running in the background, SIGHUP usually means that they shall reexamine their configuration files, so you send it to background processes after editing config files.
However, there is no useful explanation of SIGINT on this page, other than that it is sent by CTRL+C. Is there any reason why one would handle SIGINT in a different way than SIGTERM? If so, what reason would this be and how would the handling be different?
SIGTERM and SIGKILL are intended for general purpose "terminate this process" requests. SIGTERM (by default) and SIGKILL (always) will cause process termination. SIGTERM may be caught by the process (e.g. so that it can do its own cleanup if it wants to), or even ignored completely; but SIGKILL cannot be caught or ignored.
SIGINT and SIGQUIT are intended specifically for requests from the terminal: particular input characters can be assigned to generate these signals (depending on the terminal control settings). The default action for SIGINT is the same sort of process termination as the default action for SIGTERM and the unchangeable action for SIGKILL; the default action for SIGQUIT is also process termination, but additional implementation-defined actions may occur, such as the generation of a core dump. Either can be caught or ignored by the process if required.
SIGHUP, as you say, is intended to indicate that the terminal connection has been lost, rather than to be a termination signal as such. But, again, the default action for SIGHUP (if the process does not catch or ignore it) is to terminate the process in the same way as SIGTERM etc.
There is a table in the POSIX definitions for signal.h which lists the various signals and their default actions and purposes, and the General Terminal Interface chapter includes a lot more detail on the terminal-related signals.
man 7 signal
This is the convenient non-normative manpage of the Linux man-pages project that you often want to look at for Linux signal information.
Version 3.22 mentions interesting things such as:
The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.
and contains the table:
Signal     Value     Action   Comment
----------------------------------------------------------------------
SIGHUP        1       Term    Hangup detected on controlling terminal
                              or death of controlling process
SIGINT        2       Term    Interrupt from keyboard
SIGQUIT       3       Core    Quit from keyboard
SIGILL        4       Core    Illegal Instruction
SIGABRT       6       Core    Abort signal from abort(3)
SIGFPE        8       Core    Floating point exception
SIGKILL       9       Term    Kill signal
SIGSEGV      11       Core    Invalid memory reference
SIGPIPE      13       Term    Broken pipe: write to pipe with no
                              readers
SIGALRM      14       Term    Timer signal from alarm(2)
SIGTERM      15       Term    Termination signal
SIGUSR1   30,10,16    Term    User-defined signal 1
SIGUSR2   31,12,17    Term    User-defined signal 2
SIGCHLD   20,17,18    Ign     Child stopped or terminated
SIGCONT   19,18,25    Cont    Continue if stopped
SIGSTOP   17,19,23    Stop    Stop process
SIGTSTP   18,20,24    Stop    Stop typed at tty
SIGTTIN   21,21,26    Stop    tty input for background process
SIGTTOU   22,22,27    Stop    tty output for background process
which summarizes the default Action of each signal. This distinguishes e.g. SIGQUIT from SIGINT, since SIGQUIT has action Core and SIGINT has action Term.
The actions are documented in the same document:
The entries in the "Action" column of the tables below specify the default disposition for each signal, as follows:
Term   Default action is to terminate the process.
Ign    Default action is to ignore the signal.
Core   Default action is to terminate the process and dump core (see core(5)).
Stop   Default action is to stop the process.
Cont   Default action is to continue the process if it is currently stopped.
I cannot see any difference between SIGTERM and SIGINT from the point of view of the kernel since both have action Term and both can be caught. It seems that is just a "common usage convention distinction":
SIGINT is what happens when you do CTRL-C from the terminal
SIGTERM is the default signal sent by kill
Some signals are ANSI C and others not
A considerable difference is that:
SIGINT and SIGTERM are ANSI C, thus more portable
SIGQUIT and SIGKILL are not
They are described on section "7.14 Signal handling " of the C99 draft N1256:
SIGINT receipt of an interactive attention signal
SIGTERM a termination request sent to the program
which makes SIGINT a good candidate for an interactive Ctrl + C.
POSIX 7
POSIX 7 documents the signals with the signal.h header: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html
This page also has the following table of interest which mentions some of the things we had already seen in man 7 signal:
Signal    Default Action   Description
SIGABRT   A                Process abort signal.
SIGALRM   T                Alarm clock.
SIGBUS    A                Access to an undefined portion of a memory object.
SIGCHLD   I                Child process terminated, stopped, or continued.
SIGCONT   C                Continue executing, if stopped.
SIGFPE    A                Erroneous arithmetic operation.
SIGHUP    T                Hangup.
SIGILL    A                Illegal instruction.
SIGINT    T                Terminal interrupt signal.
SIGKILL   T                Kill (cannot be caught or ignored).
SIGPIPE   T                Write on a pipe with no one to read it.
SIGQUIT   A                Terminal quit signal.
SIGSEGV   A                Invalid memory reference.
SIGSTOP   S                Stop executing (cannot be caught or ignored).
SIGTERM   T                Termination signal.
SIGTSTP   S                Terminal stop signal.
SIGTTIN   S                Background process attempting read.
SIGTTOU   S                Background process attempting write.
SIGUSR1   T                User-defined signal 1.
SIGUSR2   T                User-defined signal 2.
SIGTRAP   A                Trace/breakpoint trap.
SIGURG    I                High bandwidth data is available at a socket.
SIGXCPU   A                CPU time limit exceeded.
SIGXFSZ   A                File size limit exceeded.
BusyBox init
BusyBox's 1.29.2 default reboot command sends a SIGTERM to processes, sleeps for a second, and then sends SIGKILL. This seems to be a common convention across different distros.
When you shut down a BusyBox system with:
reboot
it sends a signal to the init process.
Then, the init signal handler ends up calling:
static void run_shutdown_and_kill_processes(void)
{
    /* Run everything to be run at "shutdown".  This is done _prior_
     * to killing everything, in case people wish to use scripts to
     * shut things down gracefully... */
    run_actions(SHUTDOWN);

    message(L_CONSOLE | L_LOG, "The system is going down NOW!");

    /* Send signals to every process _except_ pid 1 */
    kill(-1, SIGTERM);
    message(L_CONSOLE, "Sent SIG%s to all processes", "TERM");
    sync();
    sleep(1);

    kill(-1, SIGKILL);
    message(L_CONSOLE, "Sent SIG%s to all processes", "KILL");
    sync();
    /*sleep(1); - callers take care about making a pause */
}
which prints to the terminal:
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Signals sent by the kernel
SIGKILL:
OOM killer: What is RSS and VSZ in Linux memory management
As DarkDust noted many signals have the same results, but processes can attach different actions to them by distinguishing how each signal is generated. Looking at the FreeBSD kernel source code (kern_sig.c) I see that the two signals are handled in the same way, they terminate the process and are delivered to any thread.
SA_KILL|SA_PROC, /* SIGINT */
SA_KILL|SA_PROC, /* SIGTERM */
After a quick Google search for sigint vs sigterm, it looks like the only intended difference between the two is whether it was initiated by a keyboard shortcut or by an explicit call to kill.
As a result, you could, for example, intercept sigint and do something special with it, knowing that it was likely sent by a keyboard shortcut. Perhaps refresh the screen or something, instead of dying (not recommended, as people expect ^C to kill the program, just an example).
I also learned that ^\ should send sigquit, which I may start using myself. Looks very useful.
Using kill (both the system call and the utility) you can send almost any signal to any process, given you've got the permission. A process cannot distinguish how a signal came to life and who has sent it.
That being said, SIGINT really is meant to signal the Ctrl-C interruption, while SIGTERM is the general termination signal. There is no concept of a signal being "more forceful", with the only exception that there are signals that cannot be blocked or handled (SIGKILL and SIGSTOP, according to the man page).
A signal can only be "more forceful" than another signal with respect to how a receiving process handles the signal (and what the default action for that signal is). For example, by default, both SIGTERM and SIGINT lead to termination. But if you ignore SIGTERM then it will not terminate your process, while SIGINT still will.
With the exception of a few signals, signal handlers can catch the various signals, or the default behavior upon receipt of a signal can be modified. See the signal(7) man page for details.
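The point about ignoring SIGTERM while SIGINT keeps its default action can be observed with a small Python experiment. This is a sketch under stated assumptions: the helper names are made up, the child is the external `sleep` command, and the dispositions are set in the child between fork() and exec() via `preexec_fn` (ignored signals stay ignored across exec).

```python
import signal
import subprocess


def _ignore_sigterm_only():
    # Runs in the child between fork() and exec().
    signal.signal(signal.SIGTERM, signal.SIG_IGN)  # ignore SIGTERM
    signal.signal(signal.SIGINT, signal.SIG_DFL)   # keep the default for SIGINT


def send_and_observe(child, sig, wait_seconds=1.0):
    """Send sig to child; return its exit status, or None if it is still alive."""
    child.send_signal(sig)
    try:
        return child.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        return None


def demo():
    child = subprocess.Popen(["sleep", "60"], preexec_fn=_ignore_sigterm_only)
    after_term = send_and_observe(child, signal.SIGTERM, 0.5)  # ignored by the child
    after_int = send_and_observe(child, signal.SIGINT, 5.0)    # default action: terminate
    return after_term, after_int
```

Calling `demo()` should show the child surviving SIGTERM and then being terminated by SIGINT, which `subprocess` reports as a negative exit status equal to the signal number.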
