Can't detach child process when main process is started from systemd - linux

I want to spawn long-running child processes that survive when the main process restarts/dies. This works fine when running from the terminal:
$ cat exectest.go
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	if len(os.Args) == 2 && os.Args[1] == "child" {
		for {
			time.Sleep(time.Second)
		}
	} else {
		cmd := exec.Command(os.Args[0], "child")
		cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
		log.Printf("child exited: %v", cmd.Run())
	}
}
$ go build
$ ./exectest
^Z
[1]+ Stopped ./exectest
$ bg
[1]+ ./exectest &
$ ps -ef | grep exectest | grep -v grep | grep -v vim
snowm 7914 5650 0 23:44 pts/7 00:00:00 ./exectest
snowm 7916 7914 0 23:44 ? 00:00:00 ./exectest child
$ kill -INT 7914 # kill parent process
[1]+ Exit 2 ./exectest
$ ps -ef | grep exectest | grep -v grep | grep -v vim
snowm 7916 1 0 23:44 ? 00:00:00 ./exectest child
Note that the child process is still alive after the parent process was killed. However, if I start the main process from systemd like this...
[snowm@localhost exectest]$ cat /etc/systemd/system/exectest.service
[Unit]
Description=ExecTest
[Service]
Type=simple
ExecStart=/home/snowm/src/exectest/exectest
User=snowm
[Install]
WantedBy=multi-user.target
$ sudo systemctl enable exectest
ln -s '/etc/systemd/system/exectest.service' '/etc/systemd/system/multi-user.target.wants/exectest.service'
$ sudo systemctl start exectest
... then the child also dies when I kill the main process:
$ ps -ef | grep exectest | grep -v grep | grep -v vim
snowm 8132 1 0 23:55 ? 00:00:00 /home/snowm/src/exectest/exectest
snowm 8134 8132 0 23:55 ? 00:00:00 /home/snowm/src/exectest/exectest child
$ kill -INT 8132
$ ps -ef | grep exectest | grep -v grep | grep -v vim
$
How can I make the child survive?
Running go version go1.4.2 linux/amd64 under CentOS Linux release 7.1.1503 (Core).

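The solution is to add
KillMode=process
to the [Service] section of the unit file. The default value is control-group, which means systemd kills every process remaining in the unit's control group, including child processes, when the unit stops.
From man systemd.kill: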
KillMode= Specifies how processes of this unit shall be killed. One of
control-group, process, mixed, none.
If set to control-group, all remaining processes in the control group
of this unit will be killed on unit stop (for services: after the stop
command is executed, as configured with ExecStop=). If set to process,
only the main process itself is killed. If set to mixed, the SIGTERM
signal (see below) is sent to the main process while the subsequent
SIGKILL signal (see below) is sent to all remaining processes of the
unit's control group. If set to none, no process is killed. In this
case, only the stop command will be executed on unit stop, but no
process be killed otherwise. Processes remaining alive after stop are
left in their control group and the control group continues to exist
after stop unless it is empty.

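The only way I know to solve this is to launch the child process through systemd-run with the --scope argument, which runs it in its own scope unit rather than in the service's control group: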
systemd-run --user --scope firefox
KillMode has also been mentioned here, but changing the KillMode also means that if your main process crashes, systemd won't restart it while any child process is still running.
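Applied to the Go program from the question, the systemd-run approach could look roughly like the sketch below. The helper name scopedCommand is hypothetical, and the flags are simply the ones from the command above. Whether --user works when the parent runs as a system service depends on the setup, so treat that as an assumption.

```go
package main

import (
	"fmt"
	"os/exec"
)

// scopedCommand (hypothetical helper) wraps a command in
// "systemd-run --user --scope" so that it is started in its own transient
// scope unit instead of the caller's control group.
func scopedCommand(name string, args ...string) *exec.Cmd {
	wrapped := append([]string{"--user", "--scope", name}, args...)
	return exec.Command("systemd-run", wrapped...)
}

func main() {
	cmd := scopedCommand("/home/snowm/src/exectest/exectest", "child")
	// Only print the command line here; call cmd.Start() to launch it.
	fmt.Println(cmd.Args)
}
```

Building the command separately from starting it keeps the wrapper easy to inspect before it is used inside a real service.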

If, like me, you cannot change the KillMode of the service for some reason, you could try the at command (see man at).
You can schedule your command to run one minute from now. For example:
# this removes all .tmp files from "/path/" one minute from now (the task runs once)
echo rm /path/*.tmp | at now + 1 minute

Related

Why does SIGHUP not work on busybox sh in an Alpine Docker container?

Sending SIGHUP with
kill -HUP <pid>
to a busybox sh process on my native system works as expected and the shell hangs up. However, if I use docker kill to send the signal to a container with
docker kill -s HUP <container>
it doesn't do anything. The Alpine container is still running:
$ CONTAINER=$(docker run -dt alpine:latest)
$ docker ps -a --filter "id=$CONTAINER" --format "{{.Status}}"
Up 1 second
$ docker kill -s HUP $CONTAINER
4fea4f2dabe0f8a717b0e1272528af1a97050bcec51babbe0ed801e75fb15f1b
$ docker ps -a --filter "id=$CONTAINER" --format "{{.Status}}"
Up 7 seconds
By the way, with a Debian container (which runs bash) it does work as expected:
$ CONTAINER=$(docker run -dt debian:latest)
$ docker ps -a --filter "id=$CONTAINER" --format "{{.Status}}"
Up 1 second
$ docker kill -s HUP $CONTAINER
9a4aff456716397527cd87492066230e5088fbbb2a1bb6fc80f04f01b3368986
$ docker ps -a --filter "id=$CONTAINER" --format "{{.Status}}"
Exited (129) 1 second ago
Sending SIGKILL does work, but I'd rather find out why SIGHUP does not.
Update: I'll add another example. Here you can see that busybox sh generally does hang up on SIGHUP successfully:
$ busybox sh -c 'while true; do sleep 10; done' &
[1] 28276
$ PID=$!
$ ps -e | grep busybox
28276 pts/5 00:00:00 busybox
$ kill -HUP $PID
$
[1]+ Hangup busybox sh -c 'while true; do sleep 10; done'
$ ps -e | grep busybox
$
However, running the same infinite sleep loop inside the docker container doesn't quit. As you can see, the container is still running after SIGHUP and only exits after SIGKILL:
$ CONTAINER=$(docker run -dt alpine:latest busybox sh -c 'while true; do sleep 10; done')
$ docker ps -a --filter "id=$CONTAINER" --format "{{.Status}}"
Up 14 seconds
$ docker kill -s HUP $CONTAINER
31574ba7c0eb0505b776c459b55ffc8137042e1ce0562a3cf9aac80bfe8f65a0
$ docker ps -a --filter "id=$CONTAINER" --format "{{.Status}}"
Up 28 seconds
$ docker kill -s KILL $CONTAINER
31574ba7c0eb0505b776c459b55ffc8137042e1ce0562a3cf9aac80bfe8f65a0
$ docker ps -a --filter "id=$CONTAINER" --format "{{.Status}}"
Exited (137) 2 seconds ago
$
(I don't have a Docker environment at hand to try this, so I'm just guessing.)
In your case, docker run must be running busybox/sh or bash as PID 1.
According to the Docker documentation:
Note: A process running as PID 1 inside a container is treated specially by Linux: it ignores any signal with the default action. So, the process will not terminate on SIGINT or SIGTERM unless it is coded to do so.
As for the difference between busybox/sh and bash regarding SIGHUP:
On my system (Debian 9.6, x86_64), the signal masks for busybox/sh and bash are as follows:
busybox/sh:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 82817 0.0 0.0 6952 1904 pts/2 S+ 10:23 0:00 busybox sh
PENDING (0000000000000000):
BLOCKED (0000000000000000):
IGNORED (0000000000284004):
3 QUIT
15 TERM
20 TSTP
22 TTOU
CAUGHT (0000000008000002):
2 INT
28 WINCH
bash:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 4871 0.0 0.1 21752 6176 pts/16 Ss 2019 0:00 /usr/local/bin/bash
PENDING (0000000000000000):
BLOCKED (0000000000000000):
IGNORED (0000000000380004):
3 QUIT
20 TSTP
21 TTIN
22 TTOU
CAUGHT (000000004b817efb):
1 HUP
2 INT
4 ILL
5 TRAP
6 ABRT
7 BUS
8 FPE
10 USR1
11 SEGV
12 USR2
13 PIPE
14 ALRM
15 TERM
17 CHLD
24 XCPU
25 XFSZ
26 VTALRM
28 WINCH
31 SYS
As we can see, busybox/sh does not install a handler for SIGHUP, so when it runs as PID 1 the signal is ignored. Bash catches SIGHUP, so docker kill can deliver the signal to Bash, and Bash then terminates because, according to its manual, "the shell exits by default upon receipt of a SIGHUP".
UPDATE 2020-03-07 #1:
Did a quick test and my previous analysis is basically correct. You can verify like this:
[STEP 104] # docker run -dt debian busybox sh -c \
'trap exit HUP; while true; do sleep 1; done'
331380090c59018dae4dbc17dd5af9d355260057fdbd2f2ce9fc6548a39df1db
[STEP 105] # docker ps
CONTAINER ID IMAGE COMMAND CREATED
331380090c59 debian "busybox sh -c 'trap…" 11 seconds ago
[STEP 106] # docker kill -s HUP 331380090c59
331380090c59
[STEP 107] # docker ps
CONTAINER ID IMAGE COMMAND CREATED
[STEP 108] #
As I showed earlier, by default busybox/sh does not catch SIGHUP, so the signal is ignored. But once busybox/sh explicitly traps SIGHUP, the signal is delivered to it.
I also tried SIGKILL and yes it'll always terminate the running container. This is reasonable since SIGKILL cannot be caught by any process so the signal will always be delivered to the container and kill it.
UPDATE 2020-03-07 #2:
You can also verify it this way (much simpler):
[STEP 110] # docker run -ti alpine
/ # ps
PID USER TIME COMMAND
1 root 0:00 /bin/sh
7 root 0:00 ps
/ # kill -HUP 1 <-- this does not kill it because linux ignored the signal
/ #
/ # trap 'echo received SIGHUP' HUP
/ # kill -HUP 1
received SIGHUP <-- this indicates it can receive SIGHUP now
/ #
/ # trap exit HUP
/ # kill -HUP 1 <-- this terminates it because the action changed to `exit`
[STEP 111] #
As the other answer already points out, the docs for docker run contain the following note:
Note: A process running as PID 1 inside a container is treated specially by Linux: it ignores any signal with the default action. So, the process will not terminate on SIGINT or SIGTERM unless it is coded to do so.
This is the reason why SIGHUP doesn't work on busybox sh inside the container. However, if I run busybox sh on my native system, it won't have PID 1 and therefore SIGHUP works.
There are various solutions:
1. Use --init to specify an init process which should be used as PID 1.
   You can use the --init flag to indicate that an init process should be used as the PID 1 in the container. Specifying an init process ensures the usual responsibilities of an init system, such as reaping zombie processes, are performed inside the created container.
   The default init process used is the first docker-init executable found in the system path of the Docker daemon process. This docker-init binary, included in the default installation, is backed by tini.
2. Trap SIGHUP and call exit yourself:
   docker run -dt alpine busybox sh -c 'trap exit HUP ; while true ; do sleep 60 & wait $! ; done'
3. Use another shell such as bash, which exits on SIGHUP by default whether or not it runs as PID 1.

In Unix, killing a process by its PPID with kill -9 closes the terminal. Why?

I'm running
sleep 5000
in one terminal, and in a second terminal I'm running:
ps -ef | grep sleep
Then I kill a process from the second terminal, using the PPID shown for sleep. This closes the first terminal, where I ran the sleep command, instead of leaving sleep running as an orphan.
$ ps -ef | grep sleep
trainee 4887 4864 0 17:05 pts/0 00:00:00 sleep 5000
trainee 4889 4264 0 17:05 pts/1 00:00:00 grep --color=auto sleep
kill -9 4864
Why?
Presumably the parent of the sleep is your shell. When you kill it, your login session is terminated and your terminal closes.
The Wikipedia article on Orphan process reads (in part),
An orphan process is a computer process whose parent process has finished or terminated, though it remains running itself.
and
A process can be orphaned unintentionally, such as when the parent process terminates or crashes. The process group mechanism in most Unix-like operation systems can be used to help protect against accidental orphaning, where in coordination with the user's shell will try to terminate all the child processes with the SIGHUP process signal, rather than letting them continue to run as orphans.
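To make the column layout concrete: in ps -ef output the second column is the PID and the third is the PPID, so 4864 in the question is the parent of sleep, not sleep itself. A small sketch that pulls both numbers out of such a line (the function name parsePsLine is made up for illustration):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePsLine extracts the PID (column 2) and PPID (column 3) from one
// line of "ps -ef" output.
func parsePsLine(line string) (pid, ppid int, err error) {
	fields := strings.Fields(line)
	if len(fields) < 3 {
		return 0, 0, fmt.Errorf("too few columns in %q", line)
	}
	if pid, err = strconv.Atoi(fields[1]); err != nil {
		return 0, 0, err
	}
	if ppid, err = strconv.Atoi(fields[2]); err != nil {
		return 0, 0, err
	}
	return pid, ppid, nil
}

func main() {
	// The sleep line from the question.
	line := "trainee 4887 4864 0 17:05 pts/0 00:00:00 sleep 5000"
	pid, ppid, err := parsePsLine(line)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("sleep has PID %d; kill -9 %d targets its parent\n", pid, ppid)
}
```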

jobs -l after nohup

How can I monitor a job that is still running (I guess detached?) after I started it with nohup, exited the server and logged back in? Normally, I use jobs -l to see what's running, but this is showing blank.
You need to understand the difference between a process and a job. Jobs are managed by the shell, so when you end your terminal session and start a new one, you are now in a new instance of Bash with its own jobs table. You can't access jobs from the original shell but as the other answers have noted, you can still find and manipulate the processes that were started. For example:
$ nohup sleep 60 &
[1] 27767
# Our job is in the jobs table
$ jobs
[1]+ Running nohup sleep 60 &
# And this is the process we started
$ ps -p 27767
PID TTY TIME CMD
27767 pts/1 00:00:00 sleep
$ exit # and start a new session
# Now jobs returns nothing because the jobs table is empty
$ jobs
# But our process is still alive and kicking...
$ ps -p 27767
PID TTY TIME CMD
27767 pts/1 00:00:00 sleep
# Until we decide to kill it
$ kill 27767
# Now the process is gone
$ ps -p 27767
PID TTY TIME CMD
You can check whether the process is still running using
ps -p <pid>, where <pid> is the ID of the process, which you get after starting the nohup command.
If you see a valid entry, your process is probably alive.
You can also list the processes running under the current user with ps -u "$USER" or ps -u "$(whoami)".
Try this :
ps -ef | grep <pid>

How to get the process ID to kill a nohup process?

I'm running a nohup process on the server. When I try to kill it, my PuTTY console closes instead.
This is how I try to find the process ID:
ps -ef |grep nohup
This is the command I use to kill it:
kill -9 1787 787
When using nohup and you put the task in the background, the background operator (&) will give you the PID at the command prompt. If your plan is to manually manage the process, you can save that PID and use it later to kill the process if needed, via kill PID or kill -9 PID (if you need to force kill). Alternatively, you can find the PID later on by ps -ef | grep "command name" and locate the PID from there. Note that nohup keyword/command itself does not appear in the ps output for the command in question.
If you use a script, you could do something like this in the script:
nohup my_command > my.log 2>&1 &
echo $! > save_pid.txt
This will run my_command, saving all output into my.log ($! expands to the PID of the most recently started background command). The 2 is the file descriptor for standard error (stderr), and 2>&1 tells the shell to route standard error to standard output (file descriptor 1). The & in &1 is required so that the shell knows 1 is a file descriptor in that context rather than a file named 1. The 2>&1 is needed so that error messages normally written to standard error also end up in my.log, where standard output is already being written. See I/O Redirection for more details on handling I/O redirection with the shell.
If the command sends output on a regular basis, you can check the output occasionally with tail my.log, or if you want to follow it "live" you can use tail -f my.log. Finally, if you need to kill the process, you can do it via:
kill -9 `cat save_pid.txt`
rm save_pid.txt
I am using Red Hat Linux on a VPS (via SSH with PuTTY), and the following worked for me:
First, you list all the running processes:
ps -ef
Then in the first column you find your user name; I found it the following three times:
One was the SSH connection
The second was an FTP connection
The last one was the nohup process
Then in the second column you can find the PID of the nohup process and you only type:
kill PID
(replacing the PID with the nohup process's PID of course)
And that is it!
I hope this answer will be useful for someone. I'm also very new to bash and SSH, but found 95% of the knowledge I need here :)
Suppose I am running a Ruby script in the background with the command below:
nohup ruby script.rb &
Then I can get the PID of that background process by searching for the command name. In my case the command is ruby:
ps -ef | grep ruby
Output:
ubuntu 25938 25742 0 05:16 pts/0 00:00:00 ruby test.rb
Now you can easily kill the process using the kill command:
kill 25938
jobs -l should give you the pid for the list of nohup processes.
kill (-9) them gently.
;)
You could try
kill -9 `pgrep [command name]`
Suppose you are executing a Java program with nohup. You can get the java process ID with
ps aux | grep java
Output:
xxxxx 9643 0.0 0.0 14232 968 pts/2
Then you can kill the process by typing
sudo kill 9643
Or, if you need to kill all the java processes, just use
sudo killall java
This command kills all the java processes. It works with any process: just give the process name at the end of the command.
sudo killall {processName}
If your application always uses the same port, you can kill all the processes using that port like this:
kill -9 $(lsof -t -i:8080)
This works in Ubuntu
Type this to find out the PID
ps aux | grep java
All running processes related to java will be shown.
In my case it is
johnjoe 3315 9.1 4.0 1465240 335728 ? Sl 09:42 3:19 java -jar batch.jar
Now kill it:
kill -9 3315
The zombie process finally stopped.
When you start a job with nohup in the background, the shell tells you the process ID:
nohup sh test.sh &
The output will show you the process ID, like
25013
You can then kill it:
kill 25013
I started a Django server with the following command:
nohup manage.py runserver <localhost:port>
This works on CentOS: find the PID of the process listening on that port with netstat, then kill it:
:~ ns$netstat -ntlp
:~ ns$kill -9 PID
This works fine for me on a Mac:
kill -9 `ps -ef | awk '/nohup/{ print \$2 }'`
I often do it this way:
ps aux | grep script_Name
Here, script_Name could be any script/file run by nohup.
This command gets you a process ID. Then use this command below to kill the script running on nohup.
kill -9 1787 787
Here, 1787 and 787 are Process ID as mentioned in the question as an example.
This should do what was intended in the question.
If you do not know the PID, first find it using the top command:
top -U userid
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
You will get the PID using top, then perform the kill operation.
$ kill -9 <PID>
Today I met the same problem. Since it was a long time ago, I had totally forgotten which command I used and when. I tried three methods:
1. Use the STIME column shown by ps -ef. It shows the time the process was started, and it is very likely that you ran the nohup command just before you closed the SSH session (this depends on you). Unfortunately, I don't think the latest command was the one I ran with nohup, so this didn't work for me.
2. Use the PPID column, also shown by ps -ef. It is the parent process ID, the ID of the process that created the process. On Ubuntu the PPID is 1 for a process that was started with nohup (once the shell that launched it has exited). So you can use ps --ppid "1" to get the list, and check TIME (the total CPU time the process has used) or CMD to find the process's PID.
3. Use lsof -i:port if the process occupies a port, and you will get the command. Then, as in the answer above, use ps -ef | grep command and you will get the PID.
Once you have found the PID of the process, you can use kill pid to terminate it.
About losing your putty: often the ps ... | awk/grep/perl/... process gets matched, too! So the old school trick is like this
ps -ef | grep -i [n]ohup
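That way the pattern doesn't match the grep process itself: the regex [n]ohup matches the text nohup, but grep's own command line contains the literal text [n]ohup, which the pattern does not match.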
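Why the bracket trick works can be checked with any regex engine. The sketch below uses ruby as the search term, since, as noted in an earlier answer, nohup itself usually does not show up in ps output. The command lines are hypothetical stand-ins for ps output, and matchesPattern is a made-up helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesPattern reports whether a command line, as ps would show it,
// matches the given grep-style regular expression.
func matchesPattern(pattern, cmdline string) bool {
	return regexp.MustCompile(pattern).MatchString(cmdline)
}

func main() {
	// Hypothetical command lines: the real job plus the two grep variants.
	cmdlines := []string{"ruby script.rb", "grep ruby", "grep [r]uby"}
	for _, pattern := range []string{"ruby", "[r]uby"} {
		for _, c := range cmdlines {
			fmt.Printf("pattern %-7q cmdline %-16q match=%v\n",
				pattern, c, matchesPattern(pattern, c))
		}
	}
}
```

The plain pattern matches its own grep process, while the bracketed one only matches the real job, so a following kill cannot hit the wrong PID by accident.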
If you are on a remote server, check memory usage with top and find your process and its ID. After that, just execute kill [your process ID].

Who does the daemonizing?

There are various tricks to daemonize a Linux process, i.e. to keep a command running after the terminal is closed.
nohup is used for this purpose, and fork()/setsid() combination can be used in a C program to make itself a daemon process.
The above was my knowledge about linux daemon, but today I noticed that exiting the terminal doesn't really terminate processes started with & at the end of the command.
$ while :; do echo "hi" >> temp.log ; done &
[1] 11108
$ ps -ef | grep 11108
username 11108 11076 83 15:25 pts/0 00:00:05 /bin/sh
username 11116 11076 0 15:25 pts/0 00:00:00 grep 11108
$ exit
(after reconnecting)
$ ps -ef | grep 11108
username 11108 1 91 15:25 pts/0 00:00:17 /bin/sh
username 11130 11540 0 15:25 pts/0 00:00:00 grep 11108
So apparently, the process's PPID changed to 1, meaning that it got daemonized somehow.
This contradicts my knowledge that & is not enough and that one must use nohup or some other trick to make a process a 'daemon'.
Does anyone know who is doing this daemonizing?
I'm using a CentOS 6.3 host and putty/cygwin/sshclient produced the same result.
A process ends up daemonized like this when SIGHUP never terminates it.
When an interactive bash shell receives SIGHUP (for example because the terminal connection was lost), it resends SIGHUP (the hangup signal) to its jobs before exiting, but it does not wait until the child processes have actually terminated. On a plain exit, bash sends SIGHUP to its jobs only if the huponexit shell option is set.
A child process that is not terminated by SIGHUP, or never receives it, becomes an orphan process: its parent PID is changed to 1, the init process, which reaps it when it eventually exits so that it does not linger as a zombie.
That is why your background loop, which runs in a subshell, is still running after you log out from the first shell.

Resources