How to kill this immortal nginx worker? - linux

I have started nginx and when I stop like root
/etc/init.d/nginx stop
after that I type
ps aux | grep nginx
and get response like tcp LISTEN 2124 nginx WORKER
kill -9 2124 # tried with kill -QUIT 2124, kill -KILL 2124
and after I type again
ps aux | grep nginx
and get response like tcp LISTEN 2125 nginx WORKER
and so on.
How to kill this immortal Chuck Norris worker ?

After kill -9 there's nothing more to do to the process - it's dead (or doomed to die). The reason it sticks around is because either (a) it's parent process hasn't waited for it yet, so the kernel holds the process table entry to keep it's status until the parent does so, or (b) the process is stuck on a system call into the kernel that is not finishing (which usually means a buggy driver and/or hardware).
If the first case, getting the parent to wait for the child, or terminating the parent, should work. Most programs don't have a clear way to make them "wait for a child", so that may not be an option.
In the second case, the most likely solution is to reboot. There may be tools that could clear such a condition, but that's not common. Depending on just what that kernel processing is doing, it may be possible to get it to unblock by other means - but that requires knowledge of that processing. For example, if the process is blocked on a kernel lock that some other process is somehow holding indefinitely, terminating that other process could aleviate the problem.
Note that the ps command can distinguish these two states as well. These show up in the 'Z' state. See the ps man page for more info: http://linux.die.net/man/1/ps. They may also show up with the text "defunct".

I had the same issue.
In my case gitlab was the responsible to bring the nginx workers.
when i completelly removed gitlab from my server i got able to kill the nginx workers.
ps -aux | grep "nginx"
Search for the workers and check at the first column who is bringing them up.
kill or unistall the responsible and kill the workers again, they will stop spawning ;D

I was having similar issue.
Check if you are using any auto-healer like Monit or Supervisor which runs the worker whenever you try to stop them. If Yes Disable them.
My workers were being spawned due to changes I forget i made in update-rc.d in Ubuntu.
So I installed sysv-rc-conf which gives a clean interface control of what processes are on reboot, you can disable from there and I assure you no Chuck Noris Resurrection :D

Related

spawn daemon process with make

I'm fiddling with libfuse and I find useful the rules make mount which executes the userspace fuse daemon and make umount to unmount the directory. Unfortunately if I start the daemon in the make mount rule, this gets killed as soon as make exits (when the rule is completed).
Is it possible to spawn a daemon from a make rule such that the daemon persists the exit of make?
Make is the wrong tool for the job here. It shouldn't be used as a supervisor for other processes and anything it starts should end when it does.
That being said you can easily unhitch processes so that kill signals are not propagated when processes terminate. Running your fuse daemon prefixed by nohup … should stop the signals from reaching the child process and it will go on it's merry way.

What happens to other processes when a Docker container's PID1 exits?

Consider the following, which runs sleep 60 in the background and then exits:
$ cat run.sh
sleep 60&
ps
echo Goodbye!!!
$ docker run --rm -v $(pwd)/run.sh:/run.sh ubuntu:16.04 bash /run.sh
PID TTY TIME CMD
1 ? 00:00:00 bash
5 ? 00:00:00 sleep
6 ? 00:00:00 ps
Goodbye!!!
This will start a Docker container, with bash as PID1. It then fork/execs a sleep process, and then bash exits. When the Docker container dies, the sleep process somehow dies too.
My question is: what is the mechanism by which the sleep process is killed? I tried trapping SIGTERM in a child process, and that appears to not get tripped. My presumption is that something (either Docker or the Linux kernel) is sending SIGKILL when shutting down the cgroup the container is using, but I've found no documentation anywhere clarifying this.
EDIT The closest I've come to an explanation is the following quote from baseimage-docker:
If your init process is your app, then it'll probably only shut down itself, not all the other processes in the container. The kernel will then forcefully kill those other processes, not giving them a chance to gracefully shut down, potentially resulting in file corruption, stale temporary files, etc. You really want to shut down all your processes gracefully.
So at least according to this, the implication is that when the container exits, the kernel will sending a SIGKILL to all remaining processes. But I'd still like clarity on how it decides to do that (i.e., is it a feature of cgroups?), and ideally a more authoritative source would be nice.
OK, I seem to have come up with some more solid evidence that this is, in fact, the Linux kernel doing the terminating. In the clone(2) man page, there's this useful section:
CLONE_NEWPID (since Linux 2.6.24)
The first process created in a new namespace (i.e., the process
created using the CLONE_NEWPID flag) has the PID 1, and is the
"init" process for the namespace. Children that are orphaned
within the namespace will be reparented to this process rather than
init(8). Unlike the traditional init process, the "init" process of a
PID namespace can terminate, and if it does, all of the processes in
the namespace are terminated.
Unfortunately this is still vague on how exactly the processes in the namespace are terminated, but perhaps that's because, unlike a normal process exit, no entry is left in the process table. Whatever the case is, it seems clear that:
The kernel itself is killing the other processes
They are not killed in a way that allows them any chance to do cleanup, making it (almost?) identical to a SIGKILL

Simulating a process stuck in a blocking system call

I'm trying to test a behaviour which is hard to reproduce in a controlled environment.
Use case:
Linux system; usually Redhat EL 5 or 6 (we're just starting with RHEL 7 and systemd, so it's currently out of scope).
There're situations where I need to restart a service. The script we use for stopping the service usually works quite well; it sends a SIGTERM to the process, which is designed to handle it; if the process doesn't handle the SIGTERM within a timeout (usually a couple of minutes) the script sends a SIGKILL, then waits a couple minutes more.
The problem is: in some (rare) situations, the process doesn't exit after a SIGKILL; this usually happens when it's badly stuck on a system call, possibly because of a kernel-level issue (corrupt filesystem, or not-working NFS filesystem, or something equally bad requiring manual intervention).
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old was still running; we're fixing this with a stronger locking system (so that at least the new process doesn't start if the old is running), but I find it difficult to test the whole thing because I haven't found a way to simulate an hard-stuck process.
So, the question is:
How can I manually simulate a process that doesn't exit when sending a SIGKILL to it, even as a privileged user?
If your process are stuck doing I/O, You can simulate your situation in this way:
lvcreate -n lvtest -L 2G vgtest
mkfs.ext3 -m0 /dev/vgtest/lvtest
mount /dev/vgtest/lvtest /mnt
dmsetup suspend /dev/vgtest/lvtest && dd if=/dev/zero of=/mnt/file.img bs=1M count=2048 &
In this way the dd process will stuck waiting for IO and will ignore every signal, I know the signals aren't ignore in the latest kernel when processes are waiting for IO on nfs filesystem.
Well... How about just not sending SIGKILL? So your env will behave like it was sent, but the process didn't quit.
Once a proces is in "D" state (or TASK_UNINTERRUPTIBLE) in a kernel code path where the execution can not be interrupted while a task is processed, which means sending any signals to the process would not be useful and would be ignored.
This can be caused due to device driver getting too many interrupts from the hardware, getting too many incoming network packets, data from NIC firmware or blocked on a HDD performing I/O. Normally if this happens very quickly and threads remain in this state for very short span of time.
Therefore what you need to be doing is look at the syslog and sar reports during the time when the process was stuck in D-state. If you find stack traces in the log, try to search kernel.bugzilla.org for similar issues or seek support from the Linux vendor.
I would code the opposite way. Have your server process write its pid in e.g. /var/run/yourserver.pid (this is common practice). Have the starting script read that file and test that the process does not exist e.g. with kill of signal 0, or with
yourserver_pid=$(cat /var/run/yourserver.pid)
if [ -f /proc/$yourserver_pid/exe ]; then
You could improve that by readlink /proc/$yourserver_pid/exe and comparing that to /usr/bin/yourserver
BTW, having a process still alive a few seconds after a SIGKILL is a serious situation (the common case when it could happen is if the process is stuck in a D state, waiting for some NFS server), and you probably should detect and syslog it (e.g. with logger in your script).
I also would try to first send SIGTERM, wait a few seconds, send SIGQUIT, wait a few seconds, and at last send SIGKILL and only a few seconds later test that the server process has gone
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old was still running;
This is the bug in the OS/kernel level, not in your service script. The situation is rare and is hard to simulate because the OS is supposed to kill the process when SIGKILL signal happens. So I guess your goal is to let your script work well under a buggy kernel. Is that correct?
You can attach gdb to the process, SIGKILL won't remove such process from processlist but it will flag it as zombie, which might still be acceptable for your purpose.
void#tahr:~$ ping 8.8.8.8 > /tmp/ping.log &
[1] 3770
void#tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 S 0:00 ping 8.8.8.8
void#tahr:~$ sudo gdb -p 3770
...
(gdb)
Other terminal
void#tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 t 0:00 ping 8.8.8.8
sudo kill -9 3770
...
void#tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 Z 0:00 [ping] <defunct>
First terminal again
(gdb) quit

How to terminate screen while running a .sh?

I search everyplace but didn't find a solution to my question, please help!
My situation:
I need run a huge .sh in my AWS (amazon web service), it will take about 4-5 hours to finish the job, I don't want to sit down just look those logs, so I create a screen to run it (screen 1), but while I configure the installation, I make a stupid mistake to create another screen and config and execute (screen 2).
The question is:
Screen 2 finish the job and I 'exit' the screen(terminated), but I can't terminate screen 1, because when I enter 'exit', it become a parameter of configuration, CTRL+A+K also din't work, please tell me how can I kill this screen, thanks.
KILL -9 <pid> does the trick. If you want it to run in the background do it for the parent process.
logon to another session.
ps -ef | grep yourusername
will show you the processes running that you own. The leftmost number is the pid of the process.
Issue a kill command on the process you want to stop.
kill [pid]
If that fails try
kill -9 [pid]

Automating Killall then Killall level 9

Sometimes I want to killall of a certain process, but running killall doesn't work. So when I try to start the process again, it fails because the previous session is still running. Then I have to tediously run killall -9 on it. So to simplify my life, I created a realkill script and it goes like this:
PIDS=$(ps aux | grep -i "$#" | awk '{ print $2 }') # Get matching pid's.
kill $PIDS 2> /dev/null # Try to kill all pid's.
sleep 3
kill -9 $PIDS 2> /dev/null # Force quit any remaining pid's.
So, Is this the best way to be doing this? In what ways can I improve this script?
Avoid killall if you can since there is not a consistent implementation across all UNIX platforms. Proctools' pkill and pgrep are preferable:
for procname; do
pkill "$procname"
done
sleep 3
for procname; do
# Why check if the process exists if you're just going to `SIGKILL` it?
pkill -9 "$procname"
done
(Edit) If you have processes that are supposed to restart after being killed, you may not want to blindly kill them, so you can gather the PIDs first:
pids=()
for procname; do
pids+=($(pgrep "$procname"))
done
# then proceed with `kill`
That said, you should really try to avoid using SIGKILL if you can. It does not give software a chance to clean up after itself. If a program won't quit shortly after receiving a SIGTERM it is probably waiting for something. Find out what it's waiting for (hardware interrupt? open file?) and fix that, and you can let it close cleanly.
Without understanding what exactly the process does, I would say it probably isn't ideal cos you may have a situation where the processes you are killing are really doing some useful shutdown/cleanup work. Forcing it down with kill -9 may short-circuit that work and could cause corruption if your process is in fact writing data.
Assuming there is no danger of data corruption and it's ok to short-circuit the shutdown, can you just kill -9 the process the first time and be done with it. Do you have access to the developers of the process you are killing to understand what is going on that might prevent the shutdown from happening? The process might have blocked the INT and TERM for good reason.
It is unlikely, but it is possible that in that 3 second wait, a new process could have taken over that PID and the second kill would kill it.

Resources