A service (say bar.service) depends on another service (say foo.service), as below.
bar's service file:
[Unit]
After=foo.service
Requires=foo.service
...
If foo.service is restarted (either manually or due to a bug), how can bar.service be automatically restarted as well?
You can use PartOf.
[Unit]
After=foo.service
Requires=foo.service
PartOf=foo.service
From the systemd.unit man page:
PartOf=
Configures dependencies similar to Requires=, but limited to stopping and restarting of units. When systemd stops or restarts the units listed here, the action is propagated to this unit. Note that this is a one-way dependency — changes to this unit do not affect the listed units.
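To sanity-check the propagation, one option is to restart foo.service by hand and look at bar.service afterwards (assuming both units are installed and running):
# restarting foo.service should now take bar.service down and back up with it
systemctl restart foo.service
# bar.service should report a fresh start time and a new main PID
systemctl status bar.service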
Another solution might be to use the ExecStartPost= option to restart bar.service (only if it is already running, hence try-restart) when foo.service has (re)started:
# foo.service
[Service]
ExecStartPost=/bin/systemctl try-restart bar.service
Restart=on-failure
RestartSec=30s
The additional Restart= and RestartSec= options ensure that foo.service is automatically restarted after a crash, and therefore bar.service is as well.
A second extension would be to add the same options to bar.service and to ensure that bar.service starts after foo.service:
# bar.service
[Unit]
After=foo.service
[Service]
Restart=on-failure
RestartSec=30s
This should start both services automatically after a crash, and bar.service will be restarted whenever foo.service restarts (whether due to an error or triggered manually).
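Putting the two pieces together, a minimal sketch of both unit files might look like this (the Description= values and ExecStart= paths are placeholders, not taken from the question):
# foo.service
[Unit]
Description=foo
[Service]
# placeholder path for the real daemon
ExecStart=/usr/local/bin/foo
# restart bar.service (only if it is already running) whenever foo.service has started
ExecStartPost=/bin/systemctl try-restart bar.service
Restart=on-failure
RestartSec=30s

# bar.service
[Unit]
Description=bar
After=foo.service
[Service]
# placeholder path for the real daemon
ExecStart=/usr/local/bin/bar
Restart=on-failure
RestartSec=30s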
I think the option you need is BindsTo=; it handles the misbehaviour cases too.
[Unit]
Requires=postgresql.service
After=postgresql.service
BindsTo=postgresql.service
BindsTo=
Configures requirement dependencies, very similar in style to Requires=. However, this dependency type is stronger: in addition to the effect of Requires= it declares that if the unit bound to is stopped, this unit will be stopped too. This means a unit bound to another unit that suddenly enters inactive state will be stopped too. Units can suddenly, unexpectedly enter inactive state for different reasons: the main process of a service unit might terminate on its own choice, the backing device of a device unit might be unplugged or the mount point of a mount unit might be unmounted without involvement of the system and service manager.
When used in conjunction with After= on the same unit the behaviour of BindsTo= is even stronger. In this case, the unit bound to strictly has to be in active state for this unit to also be in active state. This not only means a unit bound to another unit that suddenly enters inactive state, but also one that is bound to another unit that gets skipped due to a failed condition check (such as ConditionPathExists=, ConditionPathIsSymbolicLink=, … — see below) will be stopped, should it be running. Hence, in many cases it is best to combine BindsTo= with After=.
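For illustration, a complete dependent unit combining these directives might look like the sketch below (myapp and its ExecStart= path are made-up placeholders; postgresql.service is kept as the bound-to unit from the answer's snippet):
# myapp.service
[Unit]
Description=App that should only run while PostgreSQL is active
Requires=postgresql.service
After=postgresql.service
BindsTo=postgresql.service
[Service]
# placeholder path
ExecStart=/usr/local/bin/myapp
[Install]
WantedBy=multi-user.target
With this, whenever postgresql.service stops or unexpectedly goes inactive, myapp.service is stopped as well.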
How do I automatically restart a systemd service that is killed due to OOM?
I have added a restart policy, but I am not sure whether it would work for OOM kills. I cannot reproduce the OOM on my local dev box, so knowing whether this works would be helpful.
[Service]
Restart=on-failure
RestartSec=1s
Error:
Main process exited, code=killed, status=9/KILL
Reading the docs at https://www.freedesktop.org/software/systemd/man/systemd.service.html, it looks like the restart happens on an unclean exit code, and I think status 9 would come under that, but can someone please validate my thinking?
When a process terminates, the nature of its termination is made available to its parent process of record. For services started by systemd, the parent is systemd.
The available alternatives that can be reported are termination because of a signal (and which) or normal termination (and the accompanying exit status). By "normal" I mean the complement of "killed by a signal", not necessarily "clean" or "successful".
The system interfaces for process management do not provide any other options, but systemd itself also provides for applying a timeout or a watchdog timer to the services it manages, which can lead to service termination when one of those expires (as systemd accounts it).
The systemd documentation of the behavior of the various Restart settings provides pretty good detail on which termination circumstances lead to restart with which Restart settings. Termination because of a SIGKILL is what the message presented in the question shows, and this would fall into the "unclean signal" category, as systemd defines that. Thus, following the docs, configuring a service's Restart property to be any of always, on-failure, on-abnormal, or on-abort would result in systemd automatically restarting that service if it terminates because of a SIGKILL.
Most of those options will also produce automatic restarts under other circumstances, but on-abort will yield automatic restarts only in the event of termination by an unclean signal. (Note that although systemd considers SIGKILL unclean, it considers SIGTERM clean.)
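One way to validate this without reproducing the OOM itself is to deliver a SIGKILL to the service by hand and watch whether systemd brings it back (the unit name below is a placeholder; the OOM killer also terminates the main process with SIGKILL, which is exactly the code=killed, status=9/KILL result shown above):
# kill only the main process, mimicking the OOM killer
systemctl kill --kill-who=main --signal=SIGKILL example.service
# after RestartSec the unit should be active again with a new main PID
systemctl status example.service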
I've made a program that communicates with hardware over CAN bus. When I start my program via the CLI, everything seems to run fine, but starting the process via a systemd service leads to paused traffic.
I'm making a system that communicates with hardware over CAN bus. When I start my program via the CLI, everything seems to run fine; I'll quantify this in a second.
Then I created systemd services, like below, to autostart the process on system power up.
By plotting log timestamps, we noticed that there are periodic pauses in the CAN traffic, anywhere from 250 ms to a few seconds, roughly every 5 minutes (not at a regular rate), within a 30-minute window. If we switch back to starting via the CLI, we might get one 100 ms drop over a 3-hour period, which is essentially no issue.
Technically, we can tolerate pauses like this in the traffic, but the issue is that we don't understand the cause of these dropped messages (run via systemd vs starting up manually via command line).
Does anyone have an inkling what's going on here?
Other notes:
- We don't use any environment variables or parameters (everything is read in via a config file).
- We've watched CAN traffic with nothing running and saw no drops, so we're pretty confident it's not our hardware or the SocketCAN driver.
- We've tried starting via services on an Arch laptop and didn't see this pausing behavior.
[Unit]
Description=Simple service to start CAN C2 process
[Service]
Type=simple
User=dzyne
WorkingDirectory=/home/thisguy/canProg/build/bin
ExecStart=/home/thisguy/canProg/build/bin/piccolo
Restart=on-failure
# or always, on-abort, etc
RestartSec=5
[Install]
WantedBy=multi-user.target
I'd expect no pauses between messages larger than about 20-100 ms (our tolerance) when run via a systemd service.
I have a simple StatelessService and I want to know when it is being closed down so I can perform some quick clean-up, but it never seems to call OnCloseAsync.
When the service is running and I use the 'Restart' command on the running node via the Service Fabric Explorer, it removes the services and restarts the node. But it never calls the OnCloseAsync override, even though it is knowingly being closed down.
Nor does it signal the cancellationToken that is passed into the RunAsync method. So there is no indication that the service is being shutdown. Are there any circumstances when it does call OnCloseAsync, because I cannot see much point in it at the moment.
I wonder about the reasoning behind issuing the restart command; what behavior do you expect?
It does, however, explain the behavior you see. From the docs (keep in mind that a restart is just a combined stop and start):
Stopping a node puts it into a stopped state where it is not a member of the cluster and cannot host services, thus simulating a down node. This is useful for injecting faults into the system to test your application.
Now, if we take a look at the lifecycle we read this:
After CloseAsync() finishes on each listener and RunAsync() also finishes, the service's StatelessService.OnCloseAsync() method is called, if present. OnCloseAsync is called when the stateless service instance is going to be gracefully shut down.
So the basic problem is that your service is not gracefully shut down. The restart command kills the process, and no cancellation will be issued.
Context
I've added configuration validation to some of the modules that compose my Node.js application. When they are starting, each one checks whether it is properly configured and has access to the resources it needs (e.g. can write to a directory). If it detects that something is wrong, it sends a SIGINT to itself (process.pid) so the application is gracefully shut down (I close the HTTP server, close possible connections to Redis, and so on). I want the operator to realize there is a configuration and/or environment problem and fix it before starting the application.
I use pm2 to start/stop/reload the application, and I like the fact that pm2 will automatically restart it in case it crashes later on, but I don't want it to restart my application in the above scenario, because the root cause won't be eliminated by simply restarting the app, so pm2 will keep restarting it up to max_restarts (defaults to 10 in pm2).
Question
How can I prevent pm2 from keeping restarting my application when it is aborted during startup?
I know pm2 has the --wait-ready option, but given we are talking about multiple modules with asynchronous startup logic, I find it very hard to determine where/when to call process.send('ready').
Possible solution
I'm considering making all my modules emit an internal "ready" event and wiring the whole thing up by chaining the "ready" events so that I can finally send the "ready" signal to pm2, but I would like to ask first whether that would be a bit of over-engineering.
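For reference, the --wait-ready wiring mentioned above boils down to starting the app roughly like this (app.js and the timeout value are just examples) and calling process.send('ready') once every module has emitted its internal "ready" event:
# pm2 only marks the process as online once it receives the 'ready' message,
# or gives up after the listen timeout
pm2 start app.js --wait-ready --listen-timeout 10000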
Thanks,
Roger
We have a custom setup with several daemons (web apps + background tasks) running. I am looking at using a service which helps us monitor those daemons and restart them if their resource consumption exceeds a certain level.
I would appreciate any insight on when one is better than the other. As I understand it, monit spins up a new process, while supervisord starts a subprocess. What are the pros and cons of each approach?
I will also be using upstart to monitor monit or supervisord itself. The webapp deployment will be done using capistrano.
Thanks
I haven't used monit but there are some significant flaws with supervisord.
Programs should run in the foreground
This means you can't just execute /etc/init.d/apache2 start. Most of the time you can just write a one-liner, e.g. "source /etc/apache2/envvars && exec /usr/sbin/apache2 -DFOREGROUND", but sometimes you need your own wrapper script. The problem with wrapper scripts is that you end up with two processes, a parent and a child. See the next flaw...
supervisord does not manage child processes
If your program starts child processes, supervisord won't detect this. If the parent process dies (or is restarted using supervisorctl), the child processes keep running but will be "adopted" by the init process and stay running. This might prevent future invocations of your program from running or consume additional resources. The recent config options stopasgroup and killasgroup are supposed to fix this, but didn't work for me.
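For reference, these options are set per program in supervisord's INI config; a minimal sketch reusing the apache2 one-liner from the first flaw:
[program:apache2]
command=/bin/bash -c "source /etc/apache2/envvars && exec /usr/sbin/apache2 -DFOREGROUND"
; send stop/kill signals to the whole process group, not just the parent
stopasgroup=true
killasgroup=true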
supervisord has no dependency management - see #122
I recently set up squid with qlproxy. qlproxyd needs to start first, otherwise squid can fail. Even though both programs were managed with supervisord, there was no way to ensure this. I needed to write a start script for squid that made it wait for the qlproxyd process. Adding the start script resulted in the orphaned-process problem described in flaw 2.
supervisord doesn't allow you to control the delay between startretries
Sometimes when a process fails to start (or crashes), it's because it can't get access to another resource, possibly due to a network wobble. Supervisor can be set to restart the process a number of times. Between restarts the process will enter a "BACKOFF" state but there's no documentation or control over the duration of the backoff.
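The number of retries is at least configurable, even though the backoff interval is not; a minimal sketch (the program name and command are hypothetical):
[program:myapp]
command=/usr/local/bin/myapp
; how long the process must stay up to count as successfully started
startsecs=10
; how many start attempts before supervisord gives up and marks it FATAL
startretries=5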
In its defence, supervisord does meet our needs 80% of the time. The configuration is sensible and the documentation pretty good.
If you additionally want to monitor resources, you should go for monit. Besides just checking whether a process is running (availability), monit can also perform checks of resource usage (performance, capacity), load levels, and even basic security checks (the md5sum of a binary file, a config file, etc.). It has a rule-based config which is quite easy to comprehend. There are also a lot of ready-to-use configs: http://mmonit.com/wiki/Monit/ConfigurationExamples
Monit requires processes to create PID files, which can be a drawback: if a process does not create a PID file, you have to write some wrapper around it. See http://mmonit.com/wiki/Monit/FAQ#pidfile
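A minimal monit check for a daemon that writes a PID file could look like the following sketch (the name, PID file path, init script and thresholds are all made up):
# restart myapp if it dies or starts hogging CPU or memory
check process myapp with pidfile /var/run/myapp.pid
  start program = "/etc/init.d/myapp start"
  stop program = "/etc/init.d/myapp stop"
  if cpu > 95% for 10 cycles then restart
  if totalmem > 500 MB for 5 cycles then restart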
Supervisord, on the other hand, is more tightly bound to the process, since it spawns it itself. It cannot perform resource-based checks the way monit can. It does have a nice CLI (supervisorctl) and a web GUI, though.
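For day-to-day control, that CLI is used roughly like this:
# list managed programs and their states
supervisorctl status
# restart a single program (the name comes from its [program:x] section)
supervisorctl restart myapp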