varnish - why two processes

For Varnish, I see two processes running, one a child of the other:
nobody 10499 23634 0 22:25 ? 00:00:00 varnishd -f /etc/varnish/default.vcl -s malloc,1G -T 127.0.0.1:2000 -a 0.0.0.0:80
root 23634 1 0 19:33 ? 00:00:00 varnishd -f /etc/varnish/default.vcl -s malloc,1G -T 127.0.0.1:2000 -a 0.0.0.0:80
How does it actually work?

Varnish has two main processes: the management process and the child process. The management
process applies configuration changes (VCL and parameters), compiles VCL, monitors Varnish, initializes
Varnish, and provides a command-line interface, accessible either directly on the terminal or through a
management interface.
The management process polls the child process every few seconds to see if it's still there. If it doesn't get
a reply within a reasonable time, the management process will kill the child and start it back up again. The
same happens if the child unexpectedly exits, for example from a segmentation fault or assert error.
This ensures that even if Varnish does contain a critical bug, it will be brought back up again fast, usually
within a few seconds, depending on the conditions.
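You can also talk to the management process yourself: the -T address above is its CLI port, and varnishadm's status command asks it about the child's state (the output line below is illustrative):
varnishadm -T 127.0.0.1:2000 status
Child in state running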

Related

How to launch a process outside a systemd control group

I have a server process (launched from systemd) which can launch an update process. The update process daemonizes itself and then (in theory) kills the server with SIGTERM. My problem is that the SIGTERM propagates to the update process and its children.
For debugging purposes, the update process just sleeps, and I send the kill by hand.
Sample PS output before the kill:
1 1869 1869 1869 ? -1 Ss 0 0:00 /usr/local/bin/state_controller --start
1869 1873 1869 1869 ? -1 Sl 0 0:00 \_ ProcessWebController --start
1869 1886 1869 1869 ? -1 Z 0 0:00 \_ [UpdateSystem] <defunct>
1 1900 1900 1900 ? -1 Ss 0 0:00 /bin/bash /usr/local/bin/UpdateSystem refork /var/ttm/update.bin
1900 1905 1900 1900 ? -1 S 0 0:00 \_ sleep 10000
Note that UpdateSystem is in a separate PGID and TPGID. (The <defunct> process is a result of the daemonization, and is not (I think) a problem.)
UpdateSystem is a bash script (although I can easily make it a C program if that will help). After the daemonization code taken from https://stackoverflow.com/a/29107686/771073, the interesting bit is:
#############################################
trap "echo Ignoring SIGTERM" SIGTERM
sleep 10000
echo Awoken from sleep - presumably by the SIGTERM
exit 0
When I kill 1869 (which sends SIGTERM to the state_controller server process), my logfile contains:
Terminating
Ignoring SIGTERM
Awoken from sleep - presumably by the SIGTERM
I really want to prevent SIGTERM being sent to the sleep process.
(Actually, I really want to stop it being sent to apt-get upgrade, which stops the service via the moral equivalent of systemctl stop ttm.service, where ExecStop is specified as /bin/kill $MAINPID - just in case that changes anyone's answer.)
This question is similar, but the accepted answer (use KillMode=process) doesn't work well for me - I want to kill some of the child processes, just not the update process:
Can't detach child process when main process is started from systemd
A completely different approach is for the upgrade process to remove itself from the service group by updating the /sys/fs/cgroup/systemd filesystem. Specifically in bash:
echo $$ > /sys/fs/cgroup/systemd/tasks
A process belongs to exactly one control group. Writing its PID to the root tasks file adds it to the root control group and removes it from the service's control group.
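A minimal way to see the effect from a shell, assuming the legacy cgroup-v1 systemd hierarchy mounted at the usual path:
cat /proc/self/cgroup                  # shows the shell's current control group(s)
echo $$ > /sys/fs/cgroup/systemd/tasks # move the shell to the root group
cat /proc/self/cgroup                  # the systemd hierarchy entry now reads /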
We were having exactly the same problem. What we ended up doing is launching the update process in a transient cgroup with systemd-run:
systemd-run --unit=my_system_upgrade --scope --slice=my_system_upgrade_slice -E setsid nohup start-the-upgrade &> /tmp/some-logs.log &
That way, the update process will run in a different cgroup and will not be terminated. Additionally, we use setsid + nohup to make sure the process gets its own process group and session and that its parent becomes the init process.
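To confirm that the updater landed outside the service's cgroup, the transient unit can be inspected (the unit name comes from the command above):
systemctl status my_system_upgrade.scope   # the transient scope is its own unit
systemd-cgls                               # shows the whole cgroup tree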
The approach we have decided to take is to launch the update process in a separate (single-shot) service. As such, it automatically belongs to a separate control group, so killing the main service doesn't kill it.
There is a wrinkle to this, though. The package installs ttm.service and ttm.template.update.service. To run the updater, we copy ttm.template.update.service to ttm.update.service, run systemctl daemon-reload, and then run systemctl start ttm.update.service. Why the copy? Because when the updater installs a new version of ttm.template.update.service, it will forcibly terminate any processes running as that service. KillMode=none appears to offer a way around that, but although it appears to work, a subsequent call to apt-get yields a nasty error about dpkg having been interrupted.
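A sketch of that sequence (the unit file paths are assumptions):
cp /lib/systemd/system/ttm.template.update.service /etc/systemd/system/ttm.update.service
systemctl daemon-reload
systemctl start ttm.update.service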
Are you sure it is not systemd sending the TERM signal to the child process?
Depending on the service type, if your main process dies, systemd will do a cleanup and terminate all the child processes under the same cgroup.
This is controlled by the KillMode= property, which defaults to control-group. You could set it to "none" or "process". https://www.freedesktop.org/software/systemd/man/systemd.kill.html
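For example, as a hypothetical drop-in for the service in question:
# /etc/systemd/system/ttm.service.d/override.conf (hypothetical path)
[Service]
KillMode=process
With KillMode=process, only the main process of the unit is signalled on stop; the rest of the cgroup is left alone.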
I had the same situation as you.
The upgrade process is a child of the parent process, and the parent process is started by a service.
The main point is not the cgroup, it is MAINPID.
If you use PIDFile= to specify the MAINPID, with the service type set to forking, then the situation is solved:
[Service]
Type=forking
PIDFile=/run/test.pid
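For this to work, the program started by ExecStart= must write its PID to the file after forking; a minimal sketch, with the --pidfile flag being hypothetical:
[Service]
Type=forking
PIDFile=/run/test.pid
ExecStart=/usr/local/bin/state_controller --start --pidfile /run/test.pid
# systemd takes MAINPID from PIDFile=, so ExecStop=/bin/kill $MAINPID
# signals only that process and not the re-parented updater.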

Simulating a process stuck in a blocking system call

I'm trying to test a behaviour which is hard to reproduce in a controlled environment.
Use case:
Linux system; usually Redhat EL 5 or 6 (we're just starting with RHEL 7 and systemd, so it's currently out of scope).
There are situations where I need to restart a service. The script we use for stopping the service usually works quite well; it sends a SIGTERM to the process, which is designed to handle it; if the process doesn't handle the SIGTERM within a timeout (usually a couple of minutes), the script sends a SIGKILL, then waits a couple of minutes more.
The problem is: in some (rare) situations, the process doesn't exit after a SIGKILL; this usually happens when it's badly stuck on a system call, possibly because of a kernel-level issue (a corrupt filesystem, a non-working NFS filesystem, or something equally bad requiring manual intervention).
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old one was still running; we're fixing this with a stronger locking system (so that at least the new process doesn't start if the old one is running), but I find it difficult to test the whole thing because I haven't found a way to simulate a hard-stuck process.
So, the question is:
How can I manually simulate a process that doesn't exit when sending a SIGKILL to it, even as a privileged user?
If your process is stuck doing I/O, you can simulate the situation this way:
lvcreate -n lvtest -L 2G vgtest
mkfs.ext3 -m0 /dev/vgtest/lvtest
mount /dev/vgtest/lvtest /mnt
dmsetup suspend /dev/vgtest/lvtest && dd if=/dev/zero of=/mnt/file.img bs=1M count=2048 &
This way the dd process will get stuck waiting for I/O and will ignore every signal. Note that on recent kernels, signals are no longer ignored when processes are waiting for I/O on an NFS filesystem.
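To release the stuck process afterwards, resume the suspended device; the queued I/O then completes and dd leaves the D state:
dmsetup resume /dev/vgtest/lvtest   # blocked I/O completes, dd becomes killable
umount /mnt                         # optional cleanup
lvremove -f vgtest/lvtest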
Well... how about just not sending SIGKILL? Your environment will then behave as if it had been sent but the process didn't quit.
Once a process is in the "D" state (TASK_UNINTERRUPTIBLE), it is in a kernel code path where execution cannot be interrupted while the task is processed; any signal sent to the process is useless and will be ignored.
This can be caused by a device driver getting too many interrupts from the hardware, too many incoming network packets, data from NIC firmware, or being blocked on a hard disk performing I/O. Normally this resolves very quickly, and threads remain in this state for only a very short span of time.
Therefore, what you need to do is look at the syslog and sar reports from the time when the process was stuck in the D state. If you find stack traces in the log, try to search bugzilla.kernel.org for similar issues or seek support from your Linux vendor.
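A quick way to spot processes stuck in the D state, along with the kernel function they are blocked in, is something like:
# list uninterruptible-sleep processes with their kernel wait channel
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'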
I would code it the other way around. Have your server process write its pid to e.g. /var/run/yourserver.pid (this is common practice). Have the starting script read that file and test that the process does not exist, e.g. with kill -0, or with:
yourserver_pid=$(cat /var/run/yourserver.pid)
if [ -f /proc/$yourserver_pid/exe ]; then
    echo "old server (pid $yourserver_pid) still running" >&2; exit 1
fi
You could improve on that by running readlink /proc/$yourserver_pid/exe and comparing the result to /usr/bin/yourserver.
BTW, having a process still alive a few seconds after a SIGKILL is a serious situation (the common case when it could happen is if the process is stuck in a D state, waiting for some NFS server), and you probably should detect and syslog it (e.g. with logger in your script).
I also would first send SIGTERM, wait a few seconds, send SIGQUIT, wait a few seconds, and at last send SIGKILL, and only a few seconds later test that the server process has gone.
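A minimal sketch of that escalation, reusing $yourserver_pid from above (the delays are illustrative):
for sig in TERM QUIT KILL; do
    kill -$sig "$yourserver_pid" 2>/dev/null || break  # already gone?
    sleep 3
done
# surviving SIGKILL for seconds usually means a D-state process; log it
if kill -0 "$yourserver_pid" 2>/dev/null; then
    logger -t yourserver "pid $yourserver_pid survived SIGKILL"
fi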
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old was still running;
This is a bug at the OS/kernel level, not in your service script. The situation is rare and hard to simulate because the OS is supposed to kill the process when a SIGKILL arrives. So I guess your goal is to make your script work well on a buggy kernel. Is that correct?
You can attach gdb to the process; SIGKILL won't remove such a process from the process list, but it will flag it as a zombie, which might still be acceptable for your purpose.
void@tahr:~$ ping 8.8.8.8 > /tmp/ping.log &
[1] 3770
void@tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 S 0:00 ping 8.8.8.8
void@tahr:~$ sudo gdb -p 3770
...
(gdb)
Other terminal:
void@tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 t 0:00 ping 8.8.8.8
void@tahr:~$ sudo kill -9 3770
...
void@tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 Z 0:00 [ping] <defunct>
First terminal again:
(gdb) quit

Terminating TFTPD after file transfer

I am using inetutils tftpd which is started via inetd using the following entry in inetd.conf:
tftp dgram udp wait root /bin/tftpd -p -u root -s /home
(ignore the use of root account and /home directory, it's just for testing purposes, it will be changed later).
inetd version is inetd (GNU inetutils) 1.7
tftpd version is tftp-hpa 5.2, with remap, with tcpwrappers
Everything works fine, but the problem is that I don't have any information about the file transfer status. Having in mind that I have more than 10 scripts that rely on tftpd, I need to either:
terminate tftpd after the file transfer or on error (because it keeps running in the background doing nothing), or
make it display the file transfer status in a way that I can grep or sed, or at least check via $?.
Is this possible, and if not, what other tftpd server should I use?
From the man page for tftpd:
--timeout timeout, -t timeout
When run from inetd this specifies how long, in seconds, to wait for a second connection before terminating the server. inetd will then respawn the server when another request comes in. The default is 900 (15 minutes).
Try changing your inetd.conf like so:
tftp dgram udp wait root /bin/tftpd -t 5 -p -u root -s /home
Then restart inetd and test.
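With a short timeout like that, a calling script can simply wait for the server to exit; a hedged sketch (the process name matches the inetd.conf entry above):
# wait until no tftpd instance remains, then continue
while pgrep -x tftpd > /dev/null; do
    sleep 1
done
echo "tftpd exited: transfer finished or timed out"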

How to kill this immortal nginx worker?

I have started nginx, and I stop it as root with
/etc/init.d/nginx stop
After that I type
ps aux | grep nginx
and get a response like tcp LISTEN 2124 nginx WORKER. Then I run
kill -9 2124 # tried with kill -QUIT 2124, kill -KILL 2124
and afterwards I type again
ps aux | grep nginx
and get a response like tcp LISTEN 2125 nginx WORKER,
and so on.
How do I kill this immortal Chuck Norris worker?
After kill -9 there's nothing more to do to the process - it's dead (or doomed to die). The reason it sticks around is that either (a) its parent process hasn't waited for it yet, so the kernel holds the process table entry to keep its status until the parent does so, or (b) the process is stuck on a system call into the kernel that is not finishing (which usually means a buggy driver and/or hardware).
In the first case, getting the parent to wait for the child, or terminating the parent, should work. Most programs don't have a clear way to make them "wait for a child", so that may not be an option.
In the second case, the most likely solution is to reboot. There may be tools that could clear such a condition, but that's not common. Depending on just what that kernel code path is doing, it may be possible to get it to unblock by other means - but that requires knowledge of that processing. For example, if the process is blocked on a kernel lock that some other process is somehow holding indefinitely, terminating that other process could alleviate the problem.
Note that ps can distinguish these two cases: processes in case (a) show up in the 'Z' state, and may also show up with the text "defunct". See the ps man page for more info: http://linux.die.net/man/1/ps.
I had the same issue.
In my case GitLab was responsible for bringing up the nginx workers.
When I completely removed GitLab from my server, I was able to kill the nginx workers.
ps aux | grep "nginx"
Search for the workers and check in the first column who is bringing them up.
Kill or uninstall the responsible party and kill the workers again; they will stop spawning ;D
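A hedged way to find out who keeps respawning a worker (the PID below is illustrative):
ps -o ppid= -p 2125                 # parent PID of the worker
ps -fp "$(ps -o ppid= -p 2125)"     # inspect that parent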
I was having a similar issue.
Check if you are using an auto-healer like Monit or Supervisor which restarts the workers whenever you try to stop them. If yes, disable it.
My workers were being spawned due to changes I had forgotten I made with update-rc.d on Ubuntu.
So I installed sysv-rc-conf, which gives a clean interface for controlling which services run at boot; you can disable them from there, and I assure you no Chuck Norris resurrection :D

Currently running applications in Linux

I'm working on a project similar to what we call the system monitor in Linux. I'm using openSUSE 11.4 GNOME. I was wondering if there's any command (except ps) that lists all currently running applications on the system. I'm developing it for a multi-core environment.
For example, suppose I'm browsing the web with Firefox and Google Chrome simultaneously, and I'm also editing text in a text file. In this scenario, when I open my project, I want the list of all applications currently running [in my scenario, the names gedit, Google Chrome and Firefox (but not the processes these three apps generated) must be displayed as a list].
The output I want is the same as what we get in the Applications tab of the task manager in Windows.
If anyone has a solution, please let me know; it'll be highly appreciated. I'm using NetBeans to implement the project. Thanks!!!
I don't think there is an easy way of getting this done. In Linux an application may create several processes on startup - for example, take postfix:
localhost:~ # ps -ef|grep postfix
root 3708 1 0 Apr24 ? 00:00:35 /usr/lib/postfix/master
postfix 3748 3708 0 Apr24 ? 00:00:01 qmgr -l -t fifo -u
postfix 3749 3708 0 Apr24 ? 00:00:00 pickup -l -t fifo -u -c
postfix 13504 3708 0 16:05 ? 00:00:00 cleanup -z -t unix -u -c
postfix 15458 3708 0 17:45 ? 00:00:00 cleanup -z -t unix -u -c
postfix 19907 3708 0 19:25 ? 00:00:00 cleanup -z -t unix -u -c
the processes "master", "qmgr", "pickup" and "cleanup" all belong to the application postfix. You can see that those processes each belong to one parent process "master" by looking at the third column which tells you the parent process who has startet this process. In my example all processes have been startet by process with id 3708. Another example is the Apache Webserver, which creates several httpd processes on startup - here the process names are all the same only the count varies depending on the configuration.
To come back to the problem you would like to solve: From my point of view there are two ways you could try:
Build up a database which contains associations of process names to applications and use it to create your list of applications with ps (see the sketch after this list).
Restrict your application to list only applications which display a graphical user interface. This list should be easy to build using some X11 functions or the like...
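A rough sketch of the first approach, using nothing but ps (the grouping logic is deliberately simplistic):
# list every process with the command name of its parent, so children
# can be grouped under the application that started them
ps -eo ppid=,pid=,comm= | while read ppid pid comm; do
    parent=$(ps -o comm= -p "$ppid" 2>/dev/null)
    echo "${parent:-?} -> $comm ($pid)"
done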
hope this helps...
Have you tried pstree yet? It shows you a tree of the processes that are running on the system.
htop is what I usually use for a multi-core environment, because it shows resource utilization and you can see where your processes are pinned by adding columns. htop is more user-friendly than top and has more options. When you run it, just hit 't' and it will sort the processes by their parents.
I don't know any other tools, but your other option is to go through /proc and write your own script to extract the information you need.
I hope it helps.
EDIT: I forgot to mention that processes are forked in Linux, so there is a parent process which starts a couple of other processes/threads. From your question, it seems that you are trying to find the parent process for each running process; my answer is based on that assumption.
Check out top (a Linux command).
And this article will help you a lot:
http://www.cyberciti.biz/tips/top-linux-monitoring-tools.html
You may want to start from xlsclients.
It probably does not have all the functionality you need, but then Linux has no well-defined notion of application.
The next thing you might find useful is xprop (look for _NET_WM_PID) but this is not guaranteed to work for all programs.
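A hedged example of combining the two, assuming a running X session (the window id is illustrative, and not every application sets _NET_WM_PID):
xlsclients -l                      # list X clients with their window ids
xprop -id 0x1400001 _NET_WM_PID    # ask a window for its owning PID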
