Systemd get reason for watchdog timeout - linux

I have to debug an application that always gets killed via SIGABRT signal due to some mysterious watchdog timeout in systemd after exactly 3 minutes. Is there any logging etc. that helps me find out which of the many systemd parameters triggers the abort?

The application needs to send watchdog keep-alive notifications to systemd. There are several ways of doing this.
The watchdog interval is set in the systemd service file, and the line looks like this:
WatchdogSec=4s
3 minutes seems like a long time, so it looks like the app is not feeding the watchdog.
See https://www.freedesktop.org/software/systemd/man/sd_notify.html for documentation on how to feed the watchdog.
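If WatchdogSec= is in effect, one common way to feed the watchdog from a C program is sd_notify() from libsystemd. A minimal sketch, assuming the service is linked against libsystemd; the work loop and pacing are placeholders for whatever the real application does:

#include <stdint.h>
#include <time.h>
#include <systemd/sd-daemon.h>   /* link with -lsystemd */

int main(void)
{
    uint64_t usec = 0;

    /* Returns > 0 when WatchdogSec= is set for this service; usec receives the timeout. */
    int have_watchdog = sd_watchdog_enabled(0, &usec) > 0;

    sd_notify(0, "READY=1");   /* startup complete (relevant for Type=notify) */

    for (;;) {
        /* ... one iteration of real work ... */

        if (have_watchdog) {
            /* Feed the watchdog well before the timeout expires; feeding at
             * half the interval is the usual recommendation. */
            sd_notify(0, "WATCHDOG=1");
        }

        struct timespec pace = { .tv_sec = 1 };   /* placeholder loop pacing */
        nanosleep(&pace, NULL);
    }
}

If the notifications stop arriving, the journal for the unit normally shows an explicit watchdog timeout message before the SIGABRT, which is the quickest way to confirm that WatchdogSec= (rather than some other timeout such as TimeoutStartSec=) is what is killing the process.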

Related

How to automatically restart systemd service that is killed due to OOM

How do I automatically restart a systemd service that is killed due to OOM?
I have added a restart directive, but I am not sure whether it covers OOM kills. I cannot reproduce the OOM on my local dev box, so knowing that this works would be helpful.
[Service]
Restart=on-failure
RestartSec=1s
Error:
Main process exited, code=killed, status=9/KILL
Reading the docs at https://www.freedesktop.org/software/systemd/man/systemd.service.html, it looks like the restart happens on an unclean exit, and I think status 9 would come under that, but can someone please validate my thinking?
When a process terminates, the nature of its termination is made available to its parent process of record. For services started by systemd, the parent is systemd.
The available alternatives that can be reported are termination because of a signal (and which) or normal termination (and the accompanying exit status). By "normal" I mean the complement of "killed by a signal", not necessarily "clean" or "successful".
The system interfaces for process management do not provide any other options, but systemd itself additionally provides for applying a timeout, or a watchdog timer, to services it manages, which can lead to service termination when one of those expires (as systemd accounts it).
The systemd documentation of the behavior of the various Restart settings provides pretty good detail on which termination circumstances lead to restart with which Restart settings. Termination because of a SIGKILL is what the message presented in the question shows, and this would fall into the "unclean signal" category, as systemd defines that. Thus, following the docs, configuring a service's Restart property to be any of always, on-failure, on-abnormal, or on-abort would result in systemd automatically restarting that service if it terminates because of a SIGKILL.
Most of those options will also produce automatic restarts under other circumstances as well, but on-abort will yield automatic restarts only in the event of termination because of an unclean signal. (Note that although systemd considers SIGKILL unclean, it considers SIGTERM clean.)
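For reference, a minimal sketch of how a parent process tells these termination cases apart with waitpid(); systemd does the equivalent internally (the child command here is only an illustration):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        execlp("sleep", "sleep", "60", (char *)NULL);   /* stand-in child process */
        _exit(127);                                     /* exec failed */
    }

    int status;
    waitpid(pid, &status, 0);

    if (WIFSIGNALED(status))        /* killed by a signal, e.g. 9 for SIGKILL */
        printf("killed by signal %d\n", WTERMSIG(status));
    else if (WIFEXITED(status))     /* "normal" termination, clean or not */
        printf("exited with status %d\n", WEXITSTATUS(status));
    return 0;
}

An OOM kill lands in the first branch, which is exactly what the code=killed, status=9/KILL line in the question reports.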

Systemd service interrupting CAN bus process

I've made a program that communicates with hardware over CAN bus. When I start my program via the CLI, everything seems to run fine, but starting the process via a systemd service leads to paused traffic.
I'm making a system that communicates with hardware over CAN bus. When I start my program via the CLI, everything seems to run fine; I'll quantify this in a second.
Then I created systemd services, like below, to autostart the process on system power up.
By plotting log timestamps, we noticed that there are periodic pauses in the CAN traffic, anywhere from 250ms to a few seconds, every 5 or so minutes (not at a regular rate), within a 30-minute window. If we switch back to starting up via the CLI, we might get one 100ms drop over a 3-hour period, essentially no issue.
Technically, we can tolerate pauses like this in the traffic, but the issue is that we don't understand the cause of these dropped messages (run via systemd vs starting up manually via command line).
Does anyone have an inkling what's going on here?
Other notes:
- We don't use any environment variables or parameters (read in via config file).
- We've watched CAN traffic with nothing running and saw no drops, so we're pretty confident it's not our hardware or the SocketCAN driver.
- We've tried starting via services on an Arch laptop and didn't see this pausing behavior.
[Unit]
Description=Simple service to start CAN C2 process
[Service]
Type=simple
User=dzyne
WorkingDirectory=/home/thisguy/canProg/build/bin
ExecStart=/home/thisguy/canProg/build/bin/piccolo
Restart=on-failure
# or always, on-abort, etc
RestartSec=5
[Install]
WantedBy=multi-user.target
I'd expect no pauses between messages larger than about ~20-100ms (our tolerance) when run via a systemd service.

What are the ways application may die?

I'm developing a Linux daemon that works with some complex hardware, and I need to know the ways an application may exit (normally or abnormally) so that I can write proper cleanup functions. As I read from the docs, an application may die via:
1. Receiving a signal - sigwait, sigaction, etc.
2. exit
3. kill
4. tkill
Are there other ways an application may exit or die?
In your comments you wrote that you're concerned about "abnormal ways" the application may die.
There's only one solution for that -- code outside the application. In particular, all handles held by the application at termination (normal or abnormal) are cleanly closed by the kernel.
If you have a driver for your special hardware, do cleanup when the driver receives notification that the device fd has been closed. If you don't already have a custom driver, you can use a second user-mode process as a watchdog. Just connect the watchdog to the main process via a pipe... it will see end-of-file on the pipe when the main application exits.
In addition to things the programmer has some degree of control over, such as wild pointer bugs causing segmentation fault, there's always the oom-killer, which can take out even a bug-free process. For this reason the application should also detect unexpected loss of its watchdog and spawn a new one.
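A minimal sketch of that pipe-based watchdog arrangement, assuming the cleanup is something a separate process can do on the application's behalf:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) == -1) {
        perror("pipe");
        return 1;
    }

    if (fork() == 0) {
        /* Watchdog child: close the write end and block until EOF,
         * which arrives when the parent exits for any reason. */
        close(fds[1]);
        char buf;
        while (read(fds[0], &buf, 1) > 0)
            ;                       /* parent still alive */
        /* ... perform external cleanup here ... */
        _exit(0);
    }

    /* Main application: keep the write end open for its whole lifetime. */
    close(fds[0]);
    /* ... real work ... */
    return 0;   /* the write end closes implicitly; the watchdog wakes up */
}

Because the read end only sees end-of-file once every copy of the write end is gone, the watchdog fires no matter how the main process died: clean exit, crash, or SIGKILL.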
Your app should finish by itself when the system or the user no longer needs it.
Killing it with an external command like kill -9 PROCESS can cause problems, because you don't know what your application is doing at that moment.
Try implementing a control interface for your application, like a real daemon, so you can do something like this:
service yourapp status or /etc/init.d/yourapp status
service yourapp start or /etc/init.d/yourapp start
service yourapp stop or /etc/init.d/yourapp stop
That way your app can always finish normally and users can control it easily.
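To make the stop path clean, the daemon can catch SIGTERM (what a normal service stop sends) and shut down in an orderly way; a minimal sketch, with the actual cleanup left as a placeholder:

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t stop_requested = 0;

static void handle_term(int sig)
{
    (void)sig;
    stop_requested = 1;     /* only set a flag; do the real work in the main loop */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handle_term;
    sigaction(SIGTERM, &sa, NULL);
    sigaction(SIGINT, &sa, NULL);

    while (!stop_requested) {
        /* ... real work ... */
        sleep(1);
    }

    /* ... orderly cleanup: flush buffers, release the hardware, remove pidfile ... */
    return EXIT_SUCCESS;
}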

Maintaining a long-running task on Linux

My system includes a task which opens a network socket, receives pushed data from the network, processes it, and writes it out to disk or pings other machines depending on the messages. This task is intended to run forever, and the service is designed to have this task always running. But sometimes it crashes.
What's the best practice for keeping a task like this alive? Assume it's okay for the task to be dead for up to 30 seconds before we restart it.
Some obvious ideas include having a watchdog process that checks to make sure the process is still running. The watchdog could be triggered by cron. But how does it know whether the process is alive or not? Write a pidfile? Touch a heartbeat file? An ideal solution wouldn't continuously spin up more processes if the machine gets bogged down to the point where the watchdog is running faster than the heartbeat.
Are there standard linux tools for this? I can imagine a solution that uses a message queue, but I'm not sure if that's a good idea or not.
Depending on the nature of the task that you wish to monitor, one method is to write a simple wrapper to start up your task in a fork().
The wrapper task can then do a waitpid() on the child and restart it if it is terminated.
This does depend on modifying the source for the task that you wish to run.
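A minimal sketch of such a wrapper; this variant execs an existing binary (the path /usr/local/bin/mytask is a placeholder) rather than calling into the task's own code:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {
            execl("/usr/local/bin/mytask", "mytask", (char *)NULL);  /* placeholder path */
            _exit(127);                    /* exec failed */
        }

        int status;
        waitpid(pid, &status, 0);          /* blocks until the task terminates */
        fprintf(stderr, "task terminated (status %d), restarting in 30s\n", status);
        sleep(30);                         /* stays within the allowed 30s of downtime */
    }
}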
sysvinit will restart processes that die, if added to inittab.
If you're worried about the process freezing without crashing and ending the process, you can use a heartbeat and hard kill the active instance, letting init restart it.
You could use monit along with daemonize. There are lots of tools for this in the *nix world.
Supervisor was designed precisely for this task. From the project website:
Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.
It runs as a daemon (supervisord) controlled by a command line tool, supervisorctl. The configuration file contains a list of programs it is supposed to monitor, among other settings.
The number of options is quite extensive -- have a look at the docs for a complete list. In your case, the relevant configuration section might be something like this:
[program:my-network-task]
command=/bin/my-network-task # where your binary lives
autostart=true # start when supervisor starts?
autorestart=true # restart automatically when stopped?
startsecs=10 # consider start successful after how many secs?
startretries=3 # try starting how many times?
I have used Supervisor myself and it worked really well once everything was set up. It requires Python, which should not be a big deal in most environments but might be.

Debugging utilities for Linux process hang issues?

I have a daemon process that does configuration management; all the other processes interact with this daemon for their functioning. But when I execute a large action, after a few hours the daemon process becomes unresponsive for 2 to 3 hours, and after that it works normally again.
How can I find out at what point the Linux process hangs?
- strace can show the last system calls and their results.
- lsof can show open files.
- The system log can be very effective when log messages are written to track progress; it lets you box the problem into smaller areas. Also correlate log messages with messages from other systems, which often turns up interesting results.
- wireshark, if the apps use sockets, makes the wire chatter visible.
- ps ax and top can show whether your app is in a busy loop (i.e. running all the time), sleeping, or blocked in IO, and how much CPU and memory it is consuming.
Each of these may give a little bit of information; together they build up a picture of the issue.
When using gdb, it might be useful to trigger a core dump when the app is blocked. Then you have a static snapshot which you can analyze using post-mortem debugging at your leisure. You can have these triggered by a script; then you quickly build up a set of snapshots which can be used to test your theories.
One option is to use gdb with its attach command to attach to the running process. You will need to load a file containing the symbols of the executable in question (using the file command).
There are a number of different ways to do this:
Listening on a UNIX domain socket, to handle status requests. An external application can then inquire as to whether the application is still ok. If it gets no response within some timeout period, then it can be assumed that the application being queried has deadlocked or is dead.
Periodically touching a file with a preselected path. An external application can look at the timestamp of the file, and if it is stale, it can assume that the application is dead or deadlocked.
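A minimal sketch of the heartbeat-file idea; the path /tmp/myapp.heartbeat and the 5-second interval are placeholders:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

/* Call this from the application's main loop. */
static void touch_heartbeat(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd == -1) {
        perror("open heartbeat");
        return;
    }
    futimens(fd, NULL);   /* update both timestamps to "now" */
    close(fd);
}

int main(void)
{
    for (;;) {
        /* ... one iteration of real work ... */
        touch_heartbeat("/tmp/myapp.heartbeat");   /* placeholder path */
        sleep(5);
    }
}

The external checker only has to compare the file's mtime with the current time and treat anything older than a few intervals as a hang.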
You can use the alarm syscall repeatedly, having the signal terminate the process (use sigaction accordingly). As long as you keep calling alarm (i.e. as long as your program is running) it will keep running. Once you don't, the signal will fire.
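A minimal sketch of that alarm()-based self-watchdog; SIGALRM is left at its default, terminating disposition, and the main loop re-arms the timer on every iteration:

#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    /* SIGALRM's default action already terminates the process; setting it
     * explicitly just documents the intent. */
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_DFL;
    sigaction(SIGALRM, &sa, NULL);

    for (;;) {
        alarm(10);          /* if an iteration hangs for 10s, SIGALRM kills us */

        /* ... one iteration of real work; must finish within 10 seconds ... */
        struct timespec work = { .tv_sec = 1 };   /* stand-in for real work */
        nanosleep(&work, NULL);
    }
}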
You can seamlessly restart your process when it dies with fork and waitpid, as described in this answer. It does not cost any significant resources, since the OS will share the memory pages.
