Why does this program hang on exit? (interaction between signals and sudo)

I am debugging a legacy program (on Linux). To synchronise it with another process I tried naively adding a raise(SIGSTOP). However, when run under sudo I get a defunct (zombie) process and a hung terminal. Can someone explain what is happening here and how it can be avoided?
I've reduced the problem to the following simple C program (selfstop.c):
#include <signal.h>
#include <stdio.h>
int main(void)
{
    printf("about to stop\n");
    (void)raise(SIGSTOP);
    printf("resumed\n");
    return 0;
}
If run as normal it displays "about to stop" and halts itself with SIGSTOP.
kill -18 <pid> causes it to display "resumed" and exit as desired.
However, if I run it under sudo i.e.
sudo ./selfstop
in another terminal:
sudo kill -18 <pid>
It displays "resumed" and returns control to the terminal but I am left with a defunct process:
>ps aux | grep [s]elf
root 7619 0.0 0.0 215476 4136 pts/4 T 18:16 0:00 sudo ./selfstop
root 7623 0.0 0.0 0 0 pts/4 Z 18:16 0:00 [selfstop] <defunct>
Things get worse if the program is run in a script (runselfstop):
#!/bin/sh
sudo ./selfstop
Now when the process exits it hangs the terminal.
In both cases normal service is resumed by killing the sudo process (in this case 7619, "sudo ./selfstop"):
sudo kill -9 7619
My question is: why do we get the zombie, and how do we avoid it?
Note: The reason for using sudo is irrelevant here. It relates to the legacy application.

sudo will suspend itself if the command it's running suspends itself. This allows you to, for example, run sudo -s to start a shell, then type suspend in that shell to get back to your top-level shell. If you have the source code for sudo, you can look at the suspend_parent function to see how this is done.
When sudo (or any process) has been suspended, the only way to resume it is to send it a SIGCONT signal. Sending SIGCONT to the selfstop process won't do that.
>ps aux | grep [s]elf
root 7619 0.0 0.0 215476 4136 pts/4 T 18:16 0:00 sudo ./selfstop
root 7623 0.0 0.0 0 0 pts/4 Z 18:16 0:00 [selfstop] <defunct>
That indicates that selfstop has exited but hasn't yet been waited for by its parent. It will remain a zombie until sudo is either resumed or killed.
How can you work around this? sudo and selfstop will be in the same process group (unless selfstop does something to change that). So you could send SIGCONT to sudo's process group, which will resume both processes, by doing kill -CONT -the-pid-of-sudo (note the minus sign before the pid to denote a pgrp).
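The workaround above can be sketched as a small shell helper. It is only an illustration: resume_group is a hypothetical name, and a procps-style ps that accepts -o pgid= and -p is assumed. It looks up the process group of the given PID first, because sudo may not be the group leader itself (for example when it was started from a script such as runselfstop).

```shell
#!/bin/sh
# resume_group PID: send SIGCONT to every process in PID's process group.
# Hypothetical helper; assumes a procps-style ps (-o pgid=, -p).
resume_group() {
    # ps pads the PGID with spaces; tr removes the padding.
    pgid=$(ps -o pgid= -p "$1" | tr -d ' ')
    [ -n "$pgid" ] || return 1
    # A negative target addresses a process group. The "--" keeps kill
    # from treating the leading minus sign as an option.
    kill -s CONT -- "-$pgid"
}
```

For the processes in the question, which belong to root, this would have to be run as root, for example resume_group 7619 from a root shell.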

Related

How to launch a process outside a systemd control group

I have a server process (launched from systemd) which can launch an update process. The update process daemonizes itself and then (in theory) kills the server with SIGTERM. My problem is that the SIGTERM propagates to the update process and its children.
For debugging purposes, the update process just sleeps, and I send the kill by hand.
Sample PS output before the kill:
1 1869 1869 1869 ? -1 Ss 0 0:00 /usr/local/bin/state_controller --start
1869 1873 1869 1869 ? -1 Sl 0 0:00 \_ ProcessWebController --start
1869 1886 1869 1869 ? -1 Z 0 0:00 \_ [UpdateSystem] <defunct>
1 1900 1900 1900 ? -1 Ss 0 0:00 /bin/bash /usr/local/bin/UpdateSystem refork /var/ttm/update.bin
1900 1905 1900 1900 ? -1 S 0 0:00 \_ sleep 10000
Note that UpdateSystem is in a separate PGID and TPGID. (The <defunct> process is a result of the daemonization, and is not (I think) a problem.)
UpdateSystem is a bash script (although I can easily make it a C program if that will help). After the daemonization code taken from https://stackoverflow.com/a/29107686/771073, the interesting bit is:
#############################################
trap "echo Ignoring SIGTERM" SIGTERM
sleep 10000
echo Awoken from sleep - presumably by the SIGTERM
exit 0
When I kill 1869 (which sends SIGTERM to the state_controller server process), my logfile contains:
Terminating
Ignoring SIGTERM
Awoken from sleep - presumably by the SIGTERM
I really want to prevent SIGTERM being sent to the sleep process.
(Actually, I really want to stop it being sent to apt-get upgrade which is stopping the system via the moral equivalent of systemctl stop ttm.service and the ExecStop is specified as /bin/kill $MAINPID - just in case that changes anyone's answer.)
This question is similar, but the accepted answer (use KillMode=process) doesn't work well for me - I want to kill some of the child processes, just not the update process:
Can't detach child process when main process is started from systemd

A completely different approach is for the upgrade process to remove itself from the service group by updating the /sys/fs/cgroup/systemd filesystem. Specifically in bash:
echo $$ > /sys/fs/cgroup/systemd/tasks
A process belongs to exactly one control group. Writing its PID to the root tasks file adds it to the root control group, and so removes it from the service control group.
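One way to check that the move took effect is to read the process's entry under /proc before and after the write. The sketch below assumes a Linux /proc filesystem; cgroup_of is a hypothetical helper name. The /sys/fs/cgroup/systemd/tasks path used in this answer is the legacy (cgroup v1) layout; on a host using the unified cgroup v2 hierarchy the layout differs and the per-group file is called cgroup.procs.

```shell
#!/bin/sh
# cgroup_of [PID]: print the control-group membership of PID
# (default: the current shell). Linux only; hypothetical helper.
# Each output line has the form hierarchy-ID:controllers:path.
cgroup_of() {
    cat "/proc/${1:-$$}/cgroup"
}

# Possible use around the write shown above (needs root, cgroup v1 layout):
#   cgroup_of "$$"                               # before: the service's group
#   echo "$$" > /sys/fs/cgroup/systemd/tasks
#   cgroup_of "$$"                               # after: expected to show "/"
```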

We were having exactly the same problem. What we ended up doing is launching the update process in a transient scope unit (with its own cgroup) using systemd-run:
systemd-run --unit=my_system_upgrade --scope --slice=my_system_upgrade_slice -E setsid nohup start-the-upgrade &> /tmp/some-logs.log &
That way, the update process will run in a different cgroup and will not be terminated. Additionally, we use setsid + nohup to make sure the process has its own group and session and that the parent process is the init process.

The approach we have decided to take is to launch the update process in a separate (single-shot) service. As such, it automatically belongs to a separate control group, so killing the main service doesn't kill it.
There is a wrinkle to this though. The package installs ttm.service and ttm.template.update.service. To run the updater, we copy ttm.template.update.service to ttm.update.service, run systemctl daemon-reload, and then run systemctl start ttm.update.service. Why the copy? Because when the updater installs a new version of ttm.template.update.service, it will forcibly terminate any processes running as that service. KillMode=None appears to offer a way round that, but although it appears to work, a subsequent call to apt-get yields a nasty error about dpkg having been interrupted.

Are you sure it is not systemd sending the TERM signal to the child process?
Depending on the service type, if your main process dies, systemd will do a cleanup and terminate all the child processes under the same cgroup.
This is defined by KillMode= property which is by default set to control-group. You could set it to "none" or "process". https://www.freedesktop.org/software/systemd/man/systemd.kill.html

I have the same situation as you.
The upgrade process is a child of a parent process, and the parent process is started by a service.
The main point is not the cgroup but MAINPID.
If you use PIDFile= to specify the main PID and set the service type to forking, the problem is solved.
[Service]
Type=forking
PIDFile=/run/test.pid

In Unix I used the kill command with a PPID (kill -9 ppid) and it closed the terminal. Why?

I run the following in one terminal:
sleep 5000
In a second terminal I run:
ps -ef | grep sleep
I then kill the process from the second terminal using the PPID. This closes the first terminal, where I ran the sleep command, instead of leaving sleep running as an orphan.
$ ps -ef | grep sleep
trainee 4887 4864 0 17:05 pts/0 00:00:00 sleep 5000
trainee 4889 4264 0 17:05 pts/1 00:00:00 grep --color=auto sleep
kill -9 4864
Why?

Presumably the parent of the sleep is your shell. When you kill that, your login session is terminated and your terminal closes.
The Wikipedia article on Orphan process reads (in part),
An orphan process is a computer process whose parent process has finished or terminated, though it remains running itself.
and
A process can be orphaned unintentionally, such as when the parent process terminates or crashes. The process group mechanism in most Unix-like operating systems can be used to help protect against accidental orphaning, where in coordination with the user's shell will try to terminate all the child processes with the SIGHUP process signal, rather than letting them continue to run as orphans.
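In the ps -ef listing from the question, the second column is the PID and the third is the PPID, so 4864 is the shell that started sleep, not sleep itself. A small sketch of checking this before sending a signal follows; parent_of is a hypothetical helper name, and a procps-style ps is assumed.

```shell
#!/bin/sh
# parent_of PID: print the parent PID of PID, or fail if PID does not exist.
# Hypothetical helper; assumes a procps-style ps (-o ppid=, -p).
parent_of() {
    ppid=$(ps -o ppid= -p "$1" | tr -d ' ')
    [ -n "$ppid" ] && echo "$ppid"
}

# With the numbers from the question:
#   parent_of 4887    would print 4864, the shell running on pts/0
#   kill 4887         targets only sleep, not the shell that owns the terminal
```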

Why can SIGINT stop bash in a terminal but not via kill -INT?

I noticed this when running a long-lived process from a bash script like this:
foo.sh:
sleep 999
If I run it from the command line and press Ctrl+C:
./foo.sh
^C
The sleep will be interrupted. However, when I try to kill it with SIGINT
ps aux | grep foo
kill -INT 12345 # the /bin/bash ./foo.sh process
Then it looks like bash and sleep ignore the SIGINT and keep running. This surprises me. I thought Ctrl+C actually sends SIGINT to the foreground process, so why does it behave differently for Ctrl+C in the terminal and for kill -INT?

Ctrl+C actually sends SIGINT to the foreground process group (which consists of a bash process and a sleep process). To do the same with a kill command, send the signal to the process group, e.g.:
kill -INT -12345
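Before signalling a whole group it can help to see which processes are in it. The sketch below lists every process that shares a process group with a given PID; group_members is a hypothetical helper name, and a procps-style ps plus awk are assumed.

```shell
#!/bin/sh
# group_members PID: print "PID COMMAND" for each process in PID's process group.
# Hypothetical helper; assumes a procps-style ps and awk.
group_members() {
    pgid=$(ps -o pgid= -p "$1" | tr -d ' ')
    [ -n "$pgid" ] || return 1
    # List all processes and keep those whose PGID column matches.
    ps -e -o pid= -o pgid= -o comm= |
        awk -v g="$pgid" '$2 == g { print $1, $3 }'
}

# If foo.sh was started from an interactive shell, group_members 12345
# should list both bash and sleep, which is the set of processes that
# "kill -INT -- -PGID" reaches.
```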

Your script is executing "sleep 999", and when you hit CTRL-C the shell that is running the sleep command will send the SIGINT to its foreground process, sleep. However, when you tried to kill the shell script from another window with kill, you didn't target the "sleep" process; you targeted the parent shell process, which is catching SIGINT. Instead, find the process ID for "sleep 999" and send it kill -2; it should exit.
In short, you are killing 2 different processes in your test cases, and comparing apples to oranges.
root 27979 27977 0 03:33 pts/0 00:00:00 -bash <-- CTRL-C is interpreted by shell
root 28001 27999 0 03:33 pts/1 00:00:00 -bash
root 28078 27979 0 03:49 pts/0 00:00:00 /bin/bash ./foo.sh
root 28079 28078 0 03:49 pts/0 00:00:00 sleep 100 <-- this is what you should try killing

Who does the daemonizing?

There are various tricks to daemonize a Linux process, i.e. to keep a command running after the terminal is closed.
nohup is used for this purpose, and a fork()/setsid() combination can be used in a C program to turn itself into a daemon process.
The above was my knowledge about linux daemon, but today I noticed that exiting the terminal doesn't really terminate processes started with & at the end of the command.
$ while :; do echo "hi" >> temp.log ; done &
[1] 11108
$ ps -ef | grep 11108
username 11108 11076 83 15:25 pts/0 00:00:05 /bin/sh
username 11116 11076 0 15:25 pts/0 00:00:00 grep 11108
$ exit
(after reconnecting)
$ ps -ef | grep 11108
username 11108 1 91 15:25 pts/0 00:00:17 /bin/sh
username 11130 11540 0 15:25 pts/0 00:00:00 grep 11108
So apparently, the process's PPID changed to 1, meaning that it got daemonized somehow.
This contradicts my understanding that & is not enough and that one must use nohup or some other trick to make a process a 'daemon'.
Does anyone know who is doing this daemonizing?
I'm using a CentOS 6.3 host and putty/cygwin/sshclient produced the same result.

A process effectively becomes a daemon if it does not respond to the SIGHUP signal.
When a bash shell is terminated while it is running background tasks, it sends SIGHUP (the hangup signal) to all of them. However, bash won't wait until the child processes have completely terminated. If a child process doesn't respond to SIGHUP, it becomes an orphan process (its parent PID is changed to 1, the init process, so that it does not end up as a useless zombie process).
Subshell executions generally do not respond to SIGHUP, so your command will still be running after you log out of the first shell.
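The effect of ignoring SIGHUP can be observed directly. The sketch below starts a command in the background, sends it SIGHUP, and reports whether it was still alive afterwards; survives_hup is a hypothetical helper name, and the one-second pauses are an arbitrary allowance for process start-up.

```shell
#!/bin/sh
# survives_hup COMMAND...: run COMMAND in the background, send it SIGHUP,
# then print "terminated" if SIGHUP ended it or "survived" otherwise.
# Hypothetical helper for experimentation only.
survives_hup() {
    "$@" >/dev/null 2>&1 &
    pid=$!
    sleep 1                                   # let the command start up
    kill -s HUP "$pid" 2>/dev/null || true
    sleep 1
    kill -s KILL "$pid" 2>/dev/null || true   # clean up; harmless if already dead
    status=0
    wait "$pid" || status=$?
    # bash and dash report 128+N for death by signal N (SIGHUP is 1).
    if [ "$status" -eq 129 ]; then
        echo terminated
    else
        echo survived
    fi
}
```

On a typical system, survives_hup sleep 30 reports terminated, because the default action for SIGHUP is to end the process, while survives_hup nohup sleep 30 reports survived, because nohup marks SIGHUP as ignored before running the command.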

Running a php script in the background via shell - script never executes on mac os x

I have a php script which is responsible for sending emails based on a queue contained in a database.
The script works when it is executed from my shell as such:
/usr/bin/php -f /folder/email.php
However, when I execute it to run in the background:
/usr/bin/php -f /folder/email.php > /dev/null &
It never completes, and the process just sits in the process queue:
clickonce: ps T
PID TT STAT TIME COMMAND
1246 s000 Ss 0:00.03 login -pf
1247 s000 S 0:00.03 -bash
1587 s000 T 0:00.05 /usr/bin/php -f /folder/email.php
1589 s000 R+ 0:00.00 ps T
So my question is how can I run this as a background process and have it actually execute? Do I need to configure my OS? Do I need to change the way I execute the command?

"T" in the "STAT" column indicates a stopped process. I would guess that your script is attempting to read input from stdin and is getting stopped because it is not the foreground process and thus is not allowed to read.
You should check if the script does indeed read something while executing.
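If the script does turn out to read from standard input, one common remedy is to give the background job a stdin that is not the terminal. The sketch below redirects stdin from /dev/null so that any read returns end-of-file immediately; run_in_background is a hypothetical helper name, and whether the PHP script behaves sensibly on an empty stdin is an assumption that would need checking.

```shell
#!/bin/sh
# run_in_background COMMAND...: start COMMAND as a background job whose
# stdin, stdout and stderr are all /dev/null, and record its PID in bg_pid.
# Hypothetical helper.
run_in_background() {
    # With stdin on /dev/null the job never tries to read from the
    # terminal, so it cannot be stopped for terminal input (SIGTTIN).
    "$@" < /dev/null > /dev/null 2>&1 &
    bg_pid=$!
}

# With the command from the question:
#   run_in_background /usr/bin/php -f /folder/email.php
```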
