How to launch a process outside a systemd control group - linux

I have a server process (launched from systemd) which can launch an update process. The update process daemonizes itself and then (in theory) kills the server with SIGTERM. My problem is that the SIGTERM propagates to the update process and its children.
For debugging purposes, the update process just sleeps, and I send the kill by hand.
Sample PS output before the kill:
1 1869 1869 1869 ? -1 Ss 0 0:00 /usr/local/bin/state_controller --start
1869 1873 1869 1869 ? -1 Sl 0 0:00 \_ ProcessWebController --start
1869 1886 1869 1869 ? -1 Z 0 0:00 \_ [UpdateSystem] <defunct>
1 1900 1900 1900 ? -1 Ss 0 0:00 /bin/bash /usr/local/bin/UpdateSystem refork /var/ttm/update.bin
1900 1905 1900 1900 ? -1 S 0 0:00 \_ sleep 10000
Note that UpdateSystem is in a separate PGID and TPGID. (The <defunct> process is a result of the daemonization, and is not (I think) a problem.)
UpdateSystem is a bash script (although I can easily make it a C program if that will help). After the daemonization code taken from https://stackoverflow.com/a/29107686/771073, the interesting bit is:
#############################################
trap "echo Ignoring SIGTERM" SIGTERM
sleep 10000
echo Awoken from sleep - presumably by the SIGTERM
exit 0
When I kill 1869 (which sends SIGTERM to the state_controller server process), my logfile contains:
Terminating
Ignoring SIGTERM
Awoken from sleep - presumably by the SIGTERM
I really want to prevent SIGTERM being sent to the sleep process.
(Actually, I really want to stop it being sent to apt-get upgrade which is stopping the system via the moral equivalent of systemctl stop ttm.service and the ExecStop is specified as /bin/kill $MAINPID - just in case that changes anyone's answer.)
This question is similar, but the accepted answer (use KillMode=process) doesn't work well for me - I want to kill some of the child processes, just not the update process:
Can't detach child process when main process is started from systemd

A completely different approach is for the upgrade process to remove itself from the service group by updating the /sys/fs/cgroup/systemd filesystem. Specifically in bash:
echo $$ > /sys/fs/cgroup/systemd/tasks
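A process belongs to exactly one control group in a given hierarchy. Writing its PID to the root tasks file moves it into the root control group, which removes it from the service's control group. (The tasks file is the cgroup v1 interface; on a unified cgroup v2 hierarchy the equivalent file is cgroup.procs.)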
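One way to confirm that the move took effect is to look at /proc/<pid>/cgroup before and after the write. Below is a minimal sketch, assuming a Linux system with /proc mounted. The helper name cgroup_path_from is made up for illustration, and it only looks at the name=systemd line (cgroup v1) or the 0:: line (cgroup v2).

```shell
# Print the systemd cgroup path found in a /proc/<pid>/cgroup-style file.
# cgroup_path_from is a hypothetical helper, not part of systemd.
cgroup_path_from() {
    # cgroup v1 line: "<id>:name=systemd:<path>"; cgroup v2 line: "0::<path>".
    sed -n -e 's/^[0-9]*:name=systemd://p' -e 's/^0:://p' "$1" | head -n 1
}

# Show where the current shell lives. After the echo into the root tasks
# file this should print "/" instead of the service's path.
if [ -r "/proc/$$/cgroup" ]; then
    cgroup_path_from "/proc/$$/cgroup"
fi
```

Run it once at the top of the update script and once after the echo line to see the path change.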

We were having exactly the same problem. What we ended up doing is launching the update process in a transient cgroup (a scope unit) with systemd-run:
systemd-run --unit=my_system_upgrade --scope --slice=my_system_upgrade_slice -E setsid nohup start-the-upgrade &> /tmp/some-logs.log &
That way, the update process will run in a different cgroup and will not be terminated. Additionally, we use setsid + nohup to make sure the process has its own group and session and that the parent process is the init process.
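To see what the setsid + nohup part contributes on its own, here is a small sketch that compares session IDs. It assumes Linux with /proc mounted, uses sleep 30 as a stand-in for the real upgrade command, and the helper sid_of is hypothetical. Note that setsid and nohup only change the session and process group; it is systemd-run --scope that moves the process to another cgroup.

```shell
# Print the session ID of PID (field 6 of /proc/PID/stat; Linux only).
# sid_of is a hypothetical helper for this illustration.
sid_of() {
    # Skip past the "(comm)" field, then state, ppid and pgrp.
    sed 's/^.*) [A-Za-z] [0-9-]* [0-9]* \([0-9]*\) .*$/\1/' "/proc/$1/stat"
}

# Launch a stand-in for the upgrade command in a new session.
setsid nohup sleep 30 >/dev/null 2>&1 &
upgrade_pid=$!
sleep 1   # give setsid time to create the new session

echo "this shell: session $(sid_of $$)"
echo "detached:   session $(sid_of "$upgrade_pid")"
```

The detached process should report a session ID equal to its own PID, which differs from the session of the launching shell.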

The approach we have decided to take is to launch the update process in a separate (single-shot) service. As such, it automatically belongs to a separate control group, so killing the main service doesn't kill it.
There is a wrinkle to this though. The package installs ttm.service and ttm.template.update.service. To run the updater, we copy ttm.template.update.service to ttm.update.service, run systemctl daemon-reload, and then run systemctl start ttm.update.service. Why the copy? Because when the updater installs a new version of ttm.template.update.service, it will forcibly terminate any processes running as that service. KillMode=None appears to offer a way round that, but although it appears to work, a subsequent call to apt-get yields a nasty error about dpkg having been interrupted.
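The copy, reload, and start sequence can be sketched roughly as follows. The directory paths are assumptions (the post does not say where the package installs its unit files), and the run wrapper is a hypothetical dry-run helper so the commands can be inspected without touching systemd.

```shell
# Hypothetical locations; adjust to wherever the package installs its units.
TEMPLATE=/lib/systemd/system/ttm.template.update.service
UNIT=/etc/systemd/system/ttm.update.service

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

start_updater() {
    run cp "$TEMPLATE" "$UNIT"                # private copy the upgrade will not replace
    run systemctl daemon-reload               # make systemd pick up the new unit file
    run systemctl start ttm.update.service    # updater runs in its own control group
}

start_updater
```

With DRY_RUN=0 in the environment the same three commands are executed for real (as root).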

Are you sure it is not systemd sending the TERM signal to the child process?
Depending on the service type, if your main process dies, systemd will do a cleanup and terminate all the child processes under the same cgroup.
This is defined by the KillMode= property, which is set to control-group by default. You could set it to "none" or "process". https://www.freedesktop.org/software/systemd/man/systemd.kill.html

I had the same situation as you.
The upgrade process is a child of the parent process, and the parent process is started by a service.
The main point is not the cgroup but MAINPID.
If you use PIDFile= to specify the MAINPID and set the service type to forking, the problem is solved.
[Service]
Type=forking
PIDFile=/run/test.pid

Related

kill -s SIGTERM kills parent process and one level child process only

I've been doing some experimenting with writing a command to kill a parent and all its children recursively. I have the scripts below.
parent.sh:
#!/bin/bash
/home/oracle/child.sh &
sleep infinity
child.sh:
#!/bin/bash
sleep infinity
Started command using
su oracle -c parent.sh &
I see a process tree like below
[root@source ~]# ps -ef | grep "/home/oracle"
root 14129 1171 0 12:39 pts/1 00:00:00 su oracle -c /home/oracle/parent.sh
oracle 14130 14129 0 12:39 ? 00:00:00 /bin/bash /home/oracle/parent.sh
oracle 14131 14130 0 12:39 ? 00:00:00 /bin/bash /home/oracle/child.sh
When I send sigterm to 14129 using kill -s SIGTERM 14129 it appears to kill 14129 and then 14130 goes down as well immediately; but 14131 stays up for a very long time. The last level child appears to have been reparented and has become a zombie.
oracle 14131 1 0 12:39 ? 00:00:00 /bin/bash /home/oracle/child.sh
If kill doesn't terminate any child processes why did 14130 get killed when I sent a SIGTERM to 14129?
If kill can kill child processes, why does it go only one level down? Is the behavior here guaranteed?
The relevant part of what pilcrow provided is this:
SIGNALS
Upon receiving either SIGINT, SIGQUIT or SIGTERM, su terminates
its child and afterwards terminates itself with the received
signal.
>> The child is terminated by SIGTERM,
>> (then) after unsuccessful attempt (to kill with SIGTERM) and
>> (after) 2 seconds of delay (,) the child is (then) killed
>> by SIGKILL [a second, harsher method].
That harsher method, SIGKILL, prevents the child process from attempting to kill its own children, so the grandchild keeps running and is reparented to PID 1 (strictly speaking it is an orphan, not a zombie).
I haven't used it myself, but it seems that something like
killall --process-group parent.sh
would kill all processes tied to the process group associated with the "parent.sh" script. BUT ... not sure if "--wait" will serve you well, if the method used in the attempt to terminate is not being accepted.
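The same process-group idea can also be tried with plain kill and a negative PID, which addresses a whole process group. This is only a sketch under assumptions: the tree is started through setsid so that it gets its own group, and the sleep commands stand in for parent.sh and child.sh.

```shell
# Start a two-level process tree in its own process group. Because setsid
# makes the new shell a group leader, the group's PGID equals that shell's PID.
setsid sh -c 'sleep 300 & sleep 300 & wait' &
pgid=$!
sleep 1   # give setsid time to create the new group

# A negative PID addresses the whole process group; "--" ends option parsing.
kill -s TERM -- "-$pgid"

# Reap the group leader and record how it ended.
status=0
wait "$pgid" || status=$?
echo "group leader exit status: $status"
```

Since every member of the group receives the signal directly, it does not matter whether the intermediate processes forward it to their children.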

What happens to other processes when a Docker container's PID1 exits?

Consider the following, which runs sleep 60 in the background and then exits:
$ cat run.sh
sleep 60&
ps
echo Goodbye!!!
$ docker run --rm -v $(pwd)/run.sh:/run.sh ubuntu:16.04 bash /run.sh
PID TTY TIME CMD
1 ? 00:00:00 bash
5 ? 00:00:00 sleep
6 ? 00:00:00 ps
Goodbye!!!
This will start a Docker container, with bash as PID1. It then fork/execs a sleep process, and then bash exits. When the Docker container dies, the sleep process somehow dies too.
My question is: what is the mechanism by which the sleep process is killed? I tried trapping SIGTERM in a child process, and that appears to not get tripped. My presumption is that something (either Docker or the Linux kernel) is sending SIGKILL when shutting down the cgroup the container is using, but I've found no documentation anywhere clarifying this.
EDIT The closest I've come to an explanation is the following quote from baseimage-docker:
If your init process is your app, then it'll probably only shut down itself, not all the other processes in the container. The kernel will then forcefully kill those other processes, not giving them a chance to gracefully shut down, potentially resulting in file corruption, stale temporary files, etc. You really want to shut down all your processes gracefully.
So at least according to this, the implication is that when the container exits, the kernel will send a SIGKILL to all remaining processes. But I'd still like clarity on how it decides to do that (i.e., is it a feature of cgroups?), and ideally a more authoritative source would be nice.
OK, I seem to have come up with some more solid evidence that this is, in fact, the Linux kernel doing the terminating. In the clone(2) man page, there's this useful section:
CLONE_NEWPID (since Linux 2.6.24)
The first process created in a new namespace (i.e., the process
created using the CLONE_NEWPID flag) has the PID 1, and is the
"init" process for the namespace. Children that are orphaned
within the namespace will be reparented to this process rather than
init(8). Unlike the traditional init process, the "init" process of a
PID namespace can terminate, and if it does, all of the processes in
the namespace are terminated.
Unfortunately this is still vague on how exactly the processes in the namespace are terminated, but perhaps that's because, unlike a normal process exit, no entry is left in the process table. Whatever the case is, it seems clear that:
The kernel itself is killing the other processes
They are not killed in a way that allows them any chance to do cleanup, making it (almost?) identical to a SIGKILL

In Unix, I used the kill command with a PPID and it closed the terminal. Why? (kill -9 ppid)

I'm running this in one terminal:
sleep 5000
and in a second terminal I'm running:
ps -ef | grep sleep
Then I kill the process from the second terminal using its PPID. That closes the first terminal, where I ran the sleep command, instead of leaving sleep behind as an orphan.
$ ps -ef | grep sleep
trainee 4887 4864 0 17:05 pts/0 00:00:00 sleep 5000
trainee 4889 4264 0 17:05 pts/1 00:00:00 grep --color=auto sleep
kill -9 4864
Why?
Presumably the parent of the sleep is your shell. When you kill that your login is terminated and your terminal closes.
The Wikipedia article on Orphan process reads (in part),
An orphan process is a computer process whose parent process has finished or terminated, though it remains running itself.
and
A process can be orphaned unintentionally, such as when the parent process terminates or crashes. The process group mechanism in most Unix-like operating systems can be used to help protect against accidental orphaning, where in coordination with the user's shell will try to terminate all the child processes with the SIGHUP process signal, rather than letting them continue to run as orphans.

Simulating a process stuck in a blocking system call

I'm trying to test a behaviour which is hard to reproduce in a controlled environment.
Use case:
Linux system; usually Redhat EL 5 or 6 (we're just starting with RHEL 7 and systemd, so it's currently out of scope).
There're situations where I need to restart a service. The script we use for stopping the service usually works quite well; it sends a SIGTERM to the process, which is designed to handle it; if the process doesn't handle the SIGTERM within a timeout (usually a couple of minutes) the script sends a SIGKILL, then waits a couple minutes more.
The problem is: in some (rare) situations, the process doesn't exit after a SIGKILL; this usually happens when it's badly stuck on a system call, possibly because of a kernel-level issue (corrupt filesystem, or not-working NFS filesystem, or something equally bad requiring manual intervention).
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old was still running; we're fixing this with a stronger locking system (so that at least the new process doesn't start if the old is running), but I find it difficult to test the whole thing because I haven't found a way to simulate a hard-stuck process.
So, the question is:
How can I manually simulate a process that doesn't exit when sending a SIGKILL to it, even as a privileged user?
If your processes are stuck doing I/O, you can simulate your situation in this way:
lvcreate -n lvtest -L 2G vgtest
mkfs.ext3 -m0 /dev/vgtest/lvtest
mount /dev/vgtest/lvtest /mnt
dmsetup suspend /dev/vgtest/lvtest && dd if=/dev/zero of=/mnt/file.img bs=1M count=2048 &
This way the dd process gets stuck waiting for I/O and ignores every signal. As far as I know, recent kernels no longer ignore signals when a process is waiting for I/O on an NFS filesystem.
Well... How about just not sending SIGKILL? So your env will behave like it was sent, but the process didn't quit.
Once a process is in the "D" state (TASK_UNINTERRUPTIBLE), it is in a kernel code path where execution cannot be interrupted, which means that sending signals to the process has no effect while it stays there.
This can be caused by a device driver receiving too many interrupts from the hardware, by too many incoming network packets, by data from NIC firmware, or by a task blocked on a hard disk performing I/O. Normally this passes very quickly and threads remain in this state for only a very short time.
Therefore, what you need to do is look at the syslog and sar reports for the time when the process was stuck in the D state. If you find stack traces in the log, try searching bugzilla.kernel.org for similar issues or seek support from your Linux vendor.
I would code the opposite way. Have your server process write its pid in e.g. /var/run/yourserver.pid (this is common practice). Have the starting script read that file and test that the process does not exist e.g. with kill of signal 0, or with
yourserver_pid=$(cat /var/run/yourserver.pid)
if [ -f "/proc/$yourserver_pid/exe" ]; then
    echo "yourserver (PID $yourserver_pid) is still running" >&2
fi
You could improve that by readlink /proc/$yourserver_pid/exe and comparing that to /usr/bin/yourserver
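A rough sketch of that readlink comparison is below. It assumes Linux with /proc mounted; the function name pid_runs_exe is hypothetical, and the pid file and binary paths in the usage comment are the placeholders from the text above. One caveat worth knowing: if the binary has been replaced on disk, readlink reports the old path with " (deleted)" appended, so the comparison fails for a process that is still running the old file.

```shell
# Succeed if PID is alive and its executable is exactly EXPECTED_EXE.
# pid_runs_exe PID EXPECTED_EXE   (hypothetical helper)
pid_runs_exe() {
    [ -n "$1" ] || return 1
    # /proc/PID/exe is a symlink to the running binary; it is absent
    # (or unreadable) when the process does not exist.
    actual=$(readlink "/proc/$1/exe" 2>/dev/null) || return 1
    [ "$actual" = "$2" ]
}

# Usage with the placeholder paths from the text:
# if pid_runs_exe "$(cat /var/run/yourserver.pid)" /usr/bin/yourserver; then
#     echo "yourserver is still running" >&2
# fi
```

Compared with a bare existence test, this also guards against the PID having been reused by an unrelated process.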
BTW, having a process still alive a few seconds after a SIGKILL is a serious situation (the common case when it could happen is if the process is stuck in a D state, waiting for some NFS server), and you probably should detect and syslog it (e.g. with logger in your script).
I also would try to first send SIGTERM, wait a few seconds, send SIGQUIT, wait a few seconds, and at last send SIGKILL and only a few seconds later test that the server process has gone
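The escalation could look something like the sketch below. The function names and the default grace period are my own choices, and it assumes Linux with /proc so that a zombie (which still answers kill -0) is not mistaken for a live process. A process stuck in the D state would still count as running here, which is the case you want to detect.

```shell
# Succeed if PID exists and is not a zombie (assumes Linux with /proc).
is_running() {
    kill -0 "$1" 2>/dev/null || return 1
    # A zombie still answers kill -0, so treat state Z as "gone".
    ! grep -q '^State:[[:space:]]*Z' "/proc/$1/status" 2>/dev/null
}

# stop_with_escalation PID [GRACE_SECONDS]
# Send TERM, then QUIT, then KILL, waiting up to GRACE_SECONDS after each.
# Returns 0 once the process is gone, 1 if it survives even SIGKILL.
stop_with_escalation() {
    pid=$1
    grace=${2:-5}
    for sig in TERM QUIT KILL; do
        kill -s "$sig" "$pid" 2>/dev/null || true
        waited=0
        while [ "$waited" -lt "$grace" ]; do
            is_running "$pid" || return 0
            sleep 1
            waited=$((waited + 1))
        done
    done
    if is_running "$pid"; then
        return 1
    fi
    return 0
}
```

A stop script would call stop_with_escalation with the PID from the pid file and, on a non-zero return, log the problem (e.g. with logger) and refuse to start a new instance.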
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old was still running;
This is a bug at the OS/kernel level, not in your service script. The situation is rare and hard to simulate because the OS is supposed to kill the process when a SIGKILL is delivered. So I guess your goal is to make your script work well under a buggy kernel. Is that correct?
You can attach gdb to the process. SIGKILL won't remove such a process from the process list, but it will flag it as a zombie, which might still be acceptable for your purpose.
void@tahr:~$ ping 8.8.8.8 > /tmp/ping.log &
[1] 3770
void@tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 S 0:00 ping 8.8.8.8
void@tahr:~$ sudo gdb -p 3770
...
(gdb)
Other terminal
void@tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 t 0:00 ping 8.8.8.8
sudo kill -9 3770
...
void@tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 Z 0:00 [ping] <defunct>
First terminal again
(gdb) quit

Upstart tracking wrong PID of Bluepill

I have bluepill set up to monitor my delayed_job processes.
Using Ubuntu 12.04.
I am starting and monitoring the bluepill service itself using Ubuntu's upstart. My upstart config is below (/etc/init/bluepill.conf).
description "Start up the bluepill service"
start on runlevel [2]
stop on runlevel [016]
expect fork
exec sudo /home/deploy/.rvm/wrappers/<app_name>/bluepill load /home/deploy/websites/<app_name>/current/config/server/staging/delayed_job.bluepill
# Restart the process if it dies with a signal
# or exit code not given by the 'normal exit' stanza.
respawn
I have also tried with expect daemon instead of expect fork. I have also tried removing the expect... line completely.
When the machine boots, bluepill starts up fine.
$ ps aux | grep blue
root 1154 0.6 0.8 206416 17372 ? Sl 21:19 0:00 bluepilld: <app_name>
The PID of the bluepill process is 1154 here. But upstart seems to be tracking the wrong PID.
$ initctl status bluepill
bluepill start/running, process 990
This is preventing the bluepill process from getting respawned if I forcefully kill bluepill using kill -9.
Moreover, I think because of the wrong PID being tracked, reboot / shutdown just hangs and I have to hard reset the machine every time.
What could be the issue here?
Clearly, upstart tracks the wrong PID. From looking at the bluepill source code, it uses the daemons gem to daemonize, which in turn forks twice. So expect daemon in the upstart config should track the correct PID -- but you've already tried that.
If it is possible for you, you should run bluepill in the foreground, and not use any expect stanza at all in your upstart config.
From the bluepill documentation:
Bluepill.application("app_name", :foreground => true) do |app|
# ...
end
will run bluepill in the foreground.
