How to automatically restart a systemd service that is killed due to OOM

How do I automatically restart a systemd service that is killed due to OOM?
I have added a Restart= setting, but I am not sure whether this also covers OOM kills. I cannot reproduce the OOM on my local dev box, so knowing that this works would be helpful.
[Service]
Restart=on-failure
RestartSec=1s
Error:
Main process exited, code=killed, status=9/KILL
Reading the docs (https://www.freedesktop.org/software/systemd/man/systemd.service.html), it looks like the restart happens on an unclean exit, and I think status 9 would fall under that, but can someone please validate my thinking?

When a process terminates, the nature of its termination is made available to its parent process of record. For services started by systemd, the parent is systemd.
The available alternatives that can be reported are termination because of a signal (and which) or normal termination (and the accompanying exit status). By "normal" I mean the complement of "killed by a signal", not necessarily "clean" or "successful".
The system interfaces for process management do not provide any other options, but systemd itself also supports applying a timeout or a watchdog timer to the services it manages, and the expiry of either of those can lead to service termination (as systemd accounts it).
The systemd documentation of the behavior of the various Restart settings provides pretty good detail on which termination circumstances lead to restart with which Restart settings. Termination because of a SIGKILL is what the message presented in the question shows, and this would fall into the "unclean signal" category, as systemd defines that. Thus, following the docs, configuring a service's Restart property to be any of always, on-failure, on-abnormal, or on-abort would result in systemd automatically restarting that service if it terminates because of a SIGKILL.
Most of those options will also produce automatic restarts under other circumstances, but on-abort will yield automatic restarts only in the event of termination by an unclean signal. (Note that although systemd considers SIGKILL unclean, it considers SIGTERM clean.)
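One way to validate this without reproducing an actual OOM is to let systemd deliver the same SIGKILL to the service's processes and watch whether it comes back. This is only a proxy for a kernel OOM kill, but it exercises the same "unclean signal" restart path; the unit name below is a placeholder:
sudo systemctl kill --signal=SIGKILL myservice.service   # deliver SIGKILL, like the OOM killer would
systemctl status myservice.service                       # should show the unit running again after RestartSec
systemctl show myservice.service -p NRestarts            # restart counter should have increased (newer systemd)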

Related

Systemd get reason for watchdog timeout

I have to debug an application that always gets killed via SIGABRT signal due to some mysterious watchdog timeout in systemd after exactly 3 minutes. Is there any logging etc. that helps me find out which of the many systemd parameters triggers the abort?
The application needs to send watchdog notifications to systemd. There are several ways of doing this.
The watchdog interval is set in the systemd service file, and the line looks like
WatchdogSec=4s
3 minutes seems like a long time, so it looks like the app is not feeding the watchdog.
See https://www.freedesktop.org/software/systemd/man/sd_notify.html for documentation on how to feed the watchdog.
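To see which timeout is actually configured for the unit, you can ask systemd directly; the unit name below is a placeholder, and a WatchdogUSec of 3min would match the behaviour described in the question:
systemctl show myapp.service -p WatchdogUSec -p TimeoutStartUSec -p TimeoutStopUSec -p RuntimeMaxUSec
systemctl cat myapp.service     # prints the unit file and drop-ins where WatchdogSec= and friends are set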

Start services killed using killall

Recently I was working on an update and I had to kill a few Java processes before that.
killall -9 java
So I used the above command, which killed all the Java processes. But now I'm stuck without knowing how to restart those Java services.
Is there a command to start all java services killed using killall?
Using kill
First of all: kill -9 should be the last resort for stopping a process.
A process stopped with SIGKILL has no chance to shut down properly. Some services or daemons have complex and important shutdown procedures, like databases, which take care to close open database files in a consistent state and write cached data to disk.
Before stopping processes with kill or similar, you should try the stop procedure that comes with the init system of your Unix/Linux operating system.
When you do have to use kill, send a TERM signal to the process first (just use kill without -9) and wait a moment to see if it shuts down. Use -9 only if there is no other option!
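A minimal sketch of that escalation, with 1234 standing in for the real PID:
kill 1234                                   # sends SIGTERM by default, asking for a clean shutdown
sleep 10                                    # give the process time to flush and exit
kill -0 1234 2>/dev/null && kill -9 1234    # only SIGKILL it if it is still alive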
Starting and stopping services
Starting and stopping services should be handled by the init service which comes with your unix/linux operating system.
SysV init or systemd is common. Check the manual of your operating system to see which one is used. If set up properly, you can check which services are missing (stopped, but should be running) and start them again; for systemd, see the commands sketched after the links below.
Here are some manual examples:
FreeBSD:
https://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/configtuning-rcd.html
Debian:
https://www.debian.org/doc/manuals/debian-handbook/unix-services.de.html#sect.system-boot
Fedora: https://docs.fedoraproject.org/f28/system-administrators-guide/infrastructure-services/Services_and_Daemons.html
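If the Java services were managed as systemd units, a quick way to spot the ones that died and bring them back might look like this (the unit name is a placeholder):
systemctl list-units --type=service --state=failed   # services that exited abnormally or were killed
sudo systemctl restart myapp.service                 # start a specific unit again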
As far as I know, no. There is no record (by default) of what you have killed, as you can see in strace killall java.
More about process management, including why SIGKILL is a bad idea almost all of the time.

How to restart a service if its dependent service is restarted

A service (say bar.service) depends on another service (say foo.service), as below.
bar's service file:
[Unit]
After=foo.service
Requires=foo.service
...
If foo.service is restarted (either manually or due to a bug), how can bar.service be automatically restarted as well?
You can use PartOf.
[Unit]
After=foo.service
Requires=foo.service
PartOf=foo.service
From the systemd.unit man page:
PartOf=
Configures dependencies similar to Requires=, but limited to stopping and restarting of units. When systemd stops or restarts the units listed here, the action is propagated to this unit. Note that this is a one-way dependency — changes to this unit do not affect the listed units.
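After adding PartOf= you can confirm the propagation directly (using the unit names from the question):
sudo systemctl daemon-reload        # pick up the edited unit files
sudo systemctl restart foo.service
systemctl status bar.service        # should show a fresh start time, i.e. bar.service was restarted too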
Another solution might be to use the ExecStartPost= option to restart bar.service (if it is currently running) whenever foo.service has (re)started:
# foo.service
[Service]
ExecStartPost=/bin/systemctl try-restart bar.service
Restart=on-failure
RestartSec=30s
The additional Restart and RestartSec options ensure that foo.service will automatically be restarted after a crash, and thus (via the ExecStartPost line) bar.service as well.
A second extension may be to add the same options to bar.service and ensure that bar.service starts after foo.service:
# bar.service
[Unit]
After=foo.service
[Service]
Restart=on-failure
RestartSec=30s
This should start both services automatically after a crash, and bar.service will be restarted whenever foo.service restarts (whether due to an error or triggered manually).
I think the option you need is BindsTo=; it handles the misbehaviour cases too.
[Unit]
Requires=postgresql.service
After=postgresql.service
BindsTo=postgresql.service
BindsTo=
Configures requirement dependencies, very similar in style to Requires=. However, this dependency type is stronger: in addition to the effect of Requires= it declares that if the unit bound to is stopped, this unit will be stopped too. This means a unit bound to another unit that suddenly enters inactive state will be stopped too. Units can suddenly, unexpectedly enter inactive state for different reasons: the main process of a service unit might terminate on its own choice, the backing device of a device unit might be unplugged or the mount point of a mount unit might be unmounted without involvement of the system and service manager.
When used in conjunction with After= on the same unit the behaviour of BindsTo= is even stronger. In this case, the unit bound to strictly has to be in active state for this unit to also be in active state. This not only means a unit bound to another unit that suddenly enters inactive state, but also one that is bound to another unit that gets skipped due to a failed condition check (such as ConditionPathExists=, ConditionPathIsSymbolicLink=, … — see below) will be stopped, should it be running. Hence, in many cases it is best to combine BindsTo= with After=.
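With a snippet like the one above in place, the stop propagation can be checked directly; here the unit carrying BindsTo= is assumed to be called bar.service:
sudo systemctl stop postgresql.service
systemctl is-active bar.service     # reports inactive, because the unit it is bound to was stopped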

What are the ways application may die?

I develop a Linux daemon that works with some complex hardware, and I need to know the ways an application may exit (normally or abnormally) so that I can write proper cleanup functions. As I read from the docs, an application may die via:
1. Receiving a signal - sigwait, sigaction, etc.
2. exit
3. kill
4. tkill
Are there any other ways an application may exit or die?
In your comments you wrote that you're concerned about "abnormal ways" the application may die.
There's only one solution for that -- code outside the application. In particular, all handles held by the application at termination (normal or abnormal) are cleanly closed by the kernel.
If you have a driver for your special hardware, do the cleanup when the driver receives notification that the device fd has been closed. If you don't already have a custom driver, you can use a second user-mode process as a watchdog. Just connect the watchdog to the main process via a pipe: its read on the pipe returns end-of-file when the main application exits, however it dies.
In addition to things the programmer has some degree of control over, such as wild-pointer bugs causing a segmentation fault, there's always the OOM killer, which can take out even a bug-free process. For this reason the application should also detect unexpected loss of its watchdog and spawn a new one.
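A minimal shell sketch of the pipe-based watchdog idea; the daemon path, FIFO path and cleanup helper are placeholders, and it assumes the daemon does not pass the extra fd on to long-lived children:
# When the daemon dies for ANY reason (exit, crash, SIGKILL, OOM killer),
# the kernel closes its end of the pipe and the watchdog's read returns EOF.
mkfifo /run/mydaemon.watchdog
( cat /run/mydaemon.watchdog >/dev/null     # blocks until every writer has gone away
  /usr/local/bin/hw-cleanup ) &             # hypothetical cleanup helper
./mydaemon 3>/run/mydaemon.watchdog         # the daemon holds fd 3 open for its whole lifetime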
Your app should finish by itself when the system or the user no longer needs it.
Using an external command like kill -9 PID can cause bugs in your application, because you don't know what your application is doing at that moment.
Try to implement a subsystem around your app to control its status, like a real daemon, so that you can do something like this:
yourapp service status or /etc/init.d/yourapp status
yourapp service start or /etc/init.d/yourapp start
yourapp service stop or /etc/init.d/yourapp stop
That way your app can finish normally every time and users can control it easily.
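On a systemd-based system, a minimal unit file gives you exactly that control interface; the name and path below are placeholders:
# /etc/systemd/system/yourapp.service
[Unit]
Description=yourapp daemon

[Service]
ExecStart=/usr/local/bin/yourapp
Restart=on-failure

[Install]
WantedBy=multi-user.target
After a systemctl daemon-reload you can then use systemctl start yourapp, systemctl status yourapp and systemctl stop yourapp instead of hand-written init scripts.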

what is the advantage of using supervisord over monit

We have a custom setup with several daemons (web apps + background tasks) running. I am looking for a service that helps us monitor those daemons and restart them if their resource consumption exceeds a given level.
I would appreciate any insight on when one is better than the other. As I understand it, monit spawns a new process while supervisord starts a subprocess. What are the pros and cons of this approach?
I will also be using upstart to monitor monit or supervisord itself. The webapp deployment will be done using capistrano.
I haven't used monit but there are some significant flaws with supervisord.
Programs should run in the foreground
This means you can't just execute /etc/init.d/apache2 start. Most times you can just write a one-liner, e.g. "source /etc/apache2/envvars && exec /usr/sbin/apache2 -DFOREGROUND", but sometimes you need your own wrapper script. The problem with wrapper scripts is that you end up with two processes, a parent and a child. See the next flaw...
supervisord does not manage child processes
If your program starts child processes, supervisord won't detect this. If the parent process dies (or is restarted using supervisorctl), the child processes keep running but will be "adopted" by the init process and stay running. This might prevent future invocations of your program from running or consume additional resources. The more recent config options stopasgroup and killasgroup are supposed to fix this, but didn't work for me.
supervisord has no dependency management - see #122
I recently set up squid with qlproxy. qlproxyd needs to start first, otherwise squid can fail. Even though both programs were managed with supervisord, there was no way to ensure this. I needed to write a start script for squid that made it wait for the qlproxyd process. Adding the start script resulted in the orphaned-process problem described in flaw 2.
supervisord doesn't allow you to control the delay between startretries
Sometimes when a process fails to start (or crashes), it's because it can't get access to another resource, possibly due to a network wobble. Supervisor can be set to restart the process a number of times. Between restarts the process enters a "BACKOFF" state, but there is no documentation of, or control over, the duration of the backoff.
In its defence, supervisord does meet our needs 80% of the time. The configuration is sensible and the documentation is pretty good.
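A sketch of a program section reflecting the points above; the command is the apache2 one-liner from flaw 1, and the other values are illustrative:
[program:apache2]
; the command must run in the foreground (flaw 1)
command=/bin/bash -c "source /etc/apache2/envvars && exec /usr/sbin/apache2 -DFOREGROUND"
autorestart=true
; the retry count is configurable, but the BACKOFF delay between retries is not (flaw 4)
startretries=3
; try to take the whole process group down with the parent (flaw 2)
stopasgroup=true
killasgroup=true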
If you additionally want to monitor resources, you should go with monit. Beyond just checking whether a process is running (availability), monit can also perform checks of resource usage (performance, capacity), load levels, and even basic security checks (md5sum of a binary file, a config file, etc.). It has a rule-based config which is quite easy to comprehend. There are also a lot of ready-to-use configs: http://mmonit.com/wiki/Monit/ConfigurationExamples
Monit requires processes to create PID files, which can be a flaw: if a process does not create a PID file, you have to write some wrapper around it. See http://mmonit.com/wiki/Monit/FAQ#pidfile
Supervisord, on the other hand, is more tightly bound to the process: it spawns it by itself. It cannot make resource-based checks the way monit can. It has a nice CLI (supervisorctl) and a web GUI, though.
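For comparison, a monit rule set for one of the daemons from the question might look roughly like this; the name, pidfile path and thresholds are placeholders:
check process mywebapp with pidfile /var/run/mywebapp.pid
    start program = "/etc/init.d/mywebapp start"
    stop program  = "/etc/init.d/mywebapp stop"
    if cpu > 80% for 5 cycles then restart
    if totalmem > 500 MB for 5 cycles then restart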
