Watchdog timer > 60s in linux embedded - linux

I am using the Linux watchdog driver /dev/watchdog on an embedded Linux system with BusyBox as user space tools. I want to trigger the watchdog from C/C++ code, which works fine for timeouts up to 60s:
watchdogFD = open( "/dev/watchdog", O_WRONLY );
int timeout = 60;
ioctl( watchdogFD, WDIOC_SETTIMEOUT, &timeout );
However, for larger intervals the timeout is accepted, but the watchdog still triggers after 60s.
The Linux watchdog daemon offers a --force parameter to set timeouts larger than 60s (see https://linux.die.net/man/8/watchdog). However, the BusyBox watchdog daemon does not offer this (see https://git.busybox.net/busybox/tree/miscutils/watchdog.c?id=1572f520ccfe9e45d4cb9b18bb7b728eb2bbb571).
Does anyone have a suggestion for how to get the effect of the --force option when controlling the watchdog through ioctl? Thanks :)

It seems the busybox watchdog daemon you link to is very simple compared to the usual Linux one from here:
https://sourceforge.net/p/watchdog/code/ci/master/tree/
The --force option for the Linux daemon (above) is to override the sanity checks on the polling interval versus the hardware time-out used. It will not change any limits a specific hardware driver/timer has to offer.
Typically the choice of hardware time-out is in the 10-60 second range, depending on how long you can tolerate a major fault (like a kernel panic) persisting. The watchdog daemon that feeds the timer then has to poll at an interval that is at least a few seconds shorter, so nothing times out unexpectedly. Between polls it uses nanosleep() and so gives up its CPU time; the system load for the daemon is therefore proportional to the polling rate and the type of tests that are run.
Without any tests all you protect against is a major fault killing either the daemon or kernel, so usually you should be checking for something else that is essential for normal operations (e.g. a specific process being alive, files being updated, test script can be run, etc) to get the most benefit.
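To see which limit the driver itself imposes, one option is to read the timeout back after setting it. The following is a minimal sketch, assuming a driver built on the kernel watchdog core, where WDIOC_SETTIMEOUT writes the value actually applied back into its argument and WDIOC_GETTIMEOUT reports it again. The helper names set_and_confirm_timeout and feed_interval are hypothetical, and feeding at half the timeout is only an assumed safety margin, not something the daemon prescribes.

```c
#include <linux/watchdog.h>
#include <sys/ioctl.h>

/* Ask the driver for `requested` seconds and report what it actually applied.
 * Returns 0 on success, -1 if either ioctl fails (errno is set by ioctl). */
int set_and_confirm_timeout(int fd, int requested, int *effective)
{
    int value = requested;

    if (ioctl(fd, WDIOC_SETTIMEOUT, &value) < 0)
        return -1;
    if (ioctl(fd, WDIOC_GETTIMEOUT, &value) < 0)
        return -1;
    *effective = value;
    return 0;
}

/* Assumed policy: feed at half the effective timeout, never less than 1 s. */
int feed_interval(int effective_timeout)
{
    int interval = effective_timeout / 2;
    return interval < 1 ? 1 : interval;
}
```

If effective comes back as 60 after asking for 120, the limit sits in the driver or the hardware, and no user-space option such as --force will lift it.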

Related

Linux software watchdog configuration

I need to configure linux software watchdog (enabled in kernel configuration - CONFIG_SOFT_WATCHDOG=y, which gives me a new device /dev/watchdog1) such that if enabled and if a watchdog timeout occurs, it can launch a script/binary, instead of rebooting the system. My platform uses systemd and not init and I do not see a watchdog.conf file in /etc
Could not find a solution in how to use linux software watchdog. However, one comment says that " it is very possible to restart single or multiple processes after the watchdog signals that the systems is hanging - you can even ABORT the reboot or do a SOFT-reboot, one is able to configure "test" and "repair"-scripts / binaries which do whatever you want them to do."
How/Where can I configure /dev/watchdog1 so that it launches a script/binary instead of rebooting the system?
Eventually resorting to looking at the kernel source for watchdog drivers helped clear things for me. There is no way to configure /dev/watchdog1 or a kernel watchdog driver (hardware or software(softdog)), to be precise, to launch a script/binary instead of causing a system reboot. For this purpose, if feasible, you will have to write your own watchdog driver. The "launching script/binary" configuration that I was led to chase is associated with application space "watchdog daemon" (and has nothing to do with kernel's watchdog driver's configuration/behavior) which can launch a custom script to test your system health and try to fix things before a system reboot is necessary.
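The split described above, where user space decides and the kernel driver only resets, could be sketched roughly as follows. This is an illustration only, not how the watchdog daemon is implemented: the function names are hypothetical, the test and repair commands are whatever scripts you supply, and the sketch relies on system() from the C library.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Run a shell command; return 1 if it exited with status 0, otherwise 0. */
int command_succeeds(const char *cmd)
{
    int status = system(cmd);
    return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Decide whether the watchdog device should be fed this cycle.
 * Healthy: feed. Unhealthy: run the repair command once, then test again.
 * Returning 0 means "stop feeding", which lets the kernel driver reset. */
int should_feed(const char *test_cmd, const char *repair_cmd)
{
    if (command_succeeds(test_cmd))
        return 1;
    if (repair_cmd && command_succeeds(repair_cmd))
        return command_succeeds(test_cmd);
    return 0;
}
```

A feeding loop would call should_feed() before each write to the watchdog device, so the script runs while the system is still alive, and the reboot only happens once the loop stops feeding.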

Writing a compatible watchdog kernel module

I am developing a custom watchdog driver for the BeagleBone Black SBC. An external device is connected to the BBB; it resets the board if it does not receive a GPIO state change from the BBB within a certain time, which is settable through I2C. As I understand it so far, from the Linux software point of view the /dev/watchdog device should be written to in order to refresh the watchdog peripheral; that much is clear. This could be done by the watchdog daemon: https://www.systutorials.com/docs/linux/man/8-watchdog/
The problem here is that it seems that the refresh interval is hard-coded to 60 seconds. For my application the interval is a lot shorter (about 5 seconds typically) and is settable (from 1 to 10 seconds). In this case I think I would not be able to use the watchdog daemon for the custom wdg driver.
Is there a way around this? Or is my take on this case not even correct?
Typically, if you want to use the kernel watchdog framework, you can simply write some C code that pets the /dev/watchdog file at your own "watchdog frequency". There is no need to use the watchdog daemon if it does not fit your requirements.
Also, the kernel watchdog framework is hooked into a real hardware watchdog that can detect a lockup and generate an event on expiry. If your hardware watchdog's timeout (expiration interval) is tunable, you can change it, so you are not tied to 60 seconds.
Normally, nobody modifies the watchdog process provided by BusyBox or some other Linux package; most people use it as it is. Also, as far as I remember, its interval is 1 second.
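A minimal sketch of such petting code is shown below. It assumes the driver follows the standard Linux watchdog API, where any write to the device counts as a keepalive and, on drivers that support "magic close", writing 'V' just before close() stops the timer. The function names are hypothetical, and a kernel built with the nowayout option will ignore the disarm step.

```c
#include <unistd.h>

/* Feed the watchdog once: any write to the device counts as a keepalive.
 * Returns 0 on success, -1 on error. */
int pet_watchdog(int fd)
{
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/* Disarm on a clean exit: with "magic close", the timer stops when the
 * last byte written before close() is 'V'. */
int disarm_watchdog(int fd)
{
    int ok = write(fd, "V", 1) == 1;
    return close(fd) == 0 && ok ? 0 : -1;
}

/* Feed every `interval_s` seconds, `count` times, then disarm. */
int feed_loop(int fd, unsigned interval_s, int count)
{
    for (int i = 0; i < count; i++) {
        if (pet_watchdog(fd) < 0)
            return -1;
        sleep(interval_s);
    }
    return disarm_watchdog(fd);
}
```

In real use the descriptor would come from open("/dev/watchdog", O_WRONLY), and the interval would be chosen comfortably below the timeout of the external device, for example 2 seconds for a 5 second timeout.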

System lock or infinite loop is able to cause reboot?

My question is related to knowledge on embedded Linux.
I just observed a strange reboot on my embedded project, which is very easy to reproduce.
When a certain condition is triggered, the system appears to freeze, as if it had hit an infinite loop or a deadlock. After several seconds, the system quietly reboots, without even a core dump!
I don't have much of a clue about the cause. In general, can a lock or an infinite loop really trigger a Linux reboot? Or is there anything else that can freeze the system and cause a reboot without a core dump?
It is common on embedded systems to have a hardware watchdog; a timer implemented in hardware that resets the processor if it is allowed to expire.
Typically some software monitoring task continuously verifies the integrity of the system and restarts the hardware watchdog timer. If the monitoring task fails to run and the watchdog timer expires, the watchdog triggers a processor reset directly.
Your question is a bit hard to understand, but yes: an "infinite loop" (that is the proper term) in an application on any platform, including Linux, can crash a system. This can happen when the loop keeps taking up memory and other resources until none are left. You mentioned you are doing embedded development, which can mean many different things but usually means you are developing low-level applications that run close to the operating system; these are more prone to crashing an OS than your average programming venture.

QNX system hangs while shutting down using phshutdown

While shutting down QNX Neutrino using phshutdown (either reboot or shutdown), the system hangs while killing message queues (mqueue). The message displayed on screen is:
Shutting down service providers(mqueue)
What could be the reason for this ?
This happens from time to time when you issue shutdown from the command line as well.
Some of the reasons I've seen on the web are:
Hardware issue
Driver issue
Kernel told to shut down when it didn't want to
From what I've cobbled together (and this is by no means definitive, but seems to be plausible), basically, any program that is waiting for the hardware or OS to reply has a chance of hanging the shutdown if the thing it is waiting on gets killed before it does.
A possible mitigation is to slay all your apps/servers (especially those touching hardware devices or shared memory queues) prior to issuing a shutdown, wait for a second or two, then go ahead with your shutdown.

Difference between OS scheduling and RTOS scheduling

Consider the function/process,
void task_fun(void)
{
    while (1) { }   /* spin forever without yielding */
}
If this process were to run on a normal PC OS, it would happily run forever. But on a mobile phone, it would surely crash the entire phone in a matter of minutes as the HW watchdog expires and resets the system.
On a PC, this process, after it expires its stipulated time slice would be scheduled out and a new runnable process would be scheduled to run.
My doubt is: why can't we apply the same strategy on an RTOS? What performance limitation is involved if such a scheduling policy is implemented on an RTOS?
One more doubt: I checked the schedule() function of both my PC OS (Ubuntu) and my phone, which also runs the Linux kernel, and found them to be almost the same. Where is the watchdog handling done on my phone? My assumption is that the scheduler is what starts the watchdog before letting a process run. Can someone point me to where in the code this is done?
The phone "crashing" is an issue with the phone design or the specific OS, not embedded OSes or RTOSes in general. It would 'starve' lower priority tasks (possibly including the watchdog service), which is probably what is happening here.
In most embedded RTOSes it is intended that all processes are defined at deployment by the system designer and the design is for all processes to be scheduled as required. Placing user defined or third party code on such a system can compromise its scheduling scheme as in your example. I would suggest that all such processes should run at the same low priority as all others so that the round-robin scheduler will service user application equally without compromising system services.
Phone operating systems are usually RTOSes, but user processes should not run at a higher priority than system processes. It may be intentional that such processes run at a higher priority than the watchdog service, precisely to protect the system from "misbehaving" applications such as the one yours simulates.
Most RTOSes use a pre-emptive priority based scheduler (highest priority ready task runs until it terminates, yields, or is pre-empted by a higher priority task or interrupt). Some also schedule round-robin for tasks at the same priority level (task runs until it terminates, yields or consumes its time-slice and other tasks of the same priority are ready to run).
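The selection rule just described can be sketched as a small function. This is a simplified model under stated assumptions, not the scheduler of any particular RTOS: struct task, pick_next, and the convention that a larger number means higher priority are all made up for illustration.

```c
#include <stddef.h>

struct task {
    int priority;   /* larger value = higher priority (assumed convention) */
    int ready;      /* non-zero if the task can run */
};

/* Return the index of the task to run next, or -1 if none is ready.
 * `last` is the index of the task that ran most recently, or -1.
 * The scan starts just after `last` and only a strictly higher priority
 * replaces the current candidate, so ready tasks that share the top
 * priority are served in round-robin order. */
int pick_next(const struct task *tasks, size_t n, int last)
{
    int best = -1;

    if (n == 0)
        return -1;

    size_t start = (size_t)(last + 1) % n;
    for (size_t k = 0; k < n; k++) {
        size_t i = (start + k) % n;
        if (!tasks[i].ready)
            continue;
        if (best < 0 || tasks[i].priority > tasks[best].priority)
            best = (int)i;
    }
    return best;
}
```

In this model, a task that spins in while (1) stays ready forever, so if it holds the highest priority it is picked every time and everything below it, possibly including a watchdog service, never runs.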
There are several ways a watchdog can be implemented, none of which is imposed by Linux:
A process or thread runs periodically to test that vital operations are being performed. If they are not, corrective action is taken, such as rebooting the machine or resetting a troublesome component.
A process or thread runs continuously to soak up extra CPU time and reset a timer. If the task is not able to run, a timer expires and takes corrective action.
A hardware component resets the system if it is not periodically massaged; that is, a hardware timer expires.
There is nothing here that can't be done on either an RTOS or any other multitasking operating system.
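The first variant, a periodic check that vital operations are still happening, might look roughly like this. It is a sketch under assumptions: struct heartbeat and all_alive are hypothetical names, each monitored task is assumed to update its own last_seen timestamp, and what "corrective action" means is left to the caller.

```c
#include <stddef.h>
#include <time.h>

struct heartbeat {
    time_t last_seen;   /* updated by the monitored task when it makes progress */
    time_t max_age;     /* longest silence, in seconds, still considered healthy */
};

/* Return 1 if every task has checked in recently enough, otherwise 0. */
int all_alive(const struct heartbeat *hb, size_t n, time_t now)
{
    for (size_t i = 0; i < n; i++) {
        if (now - hb[i].last_seen > hb[i].max_age)
            return 0;
    }
    return 1;
}
```

A monitor thread would call all_alive() periodically and take corrective action, or simply stop feeding a hardware timer, when it returns 0. A monotonic clock is usually a safer time source than wall-clock time for this purpose, since it does not jump when the system time is adjusted.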
Linux, on a desktop computer or on a mobile phone, is not an RTOS. Its scheduling policy is time-driven.
On an RTOS, scheduling is triggered by events, either from the environment through an ISR or from the software itself through system calls (send message, wait for mutex, ...).
In a normal OS, we have two types of processes: user processes and kernel processes. Kernel processes have time constraints. However, user processes do not have time constraints.
In an RTOS, all processes are kernel processes, and hence time constraints should be strictly followed. All processes/tasks (the terms can be used interchangeably) are scheduled by priority, and time constraints are important for the system to run correctly.
So, if your code void task_fun(void) { while (1) { } } runs forever, other tasks will starve. Hence, the watchdog will reset the system to signal to the developer that the time constraints of other tasks are not being met.
For example, the GSM Scheduler needs to run every 4.6ms. If your task runs for longer than that, the time constraints of the GSM Scheduler task cannot be satisfied, so the system has to reboot, as its purpose is defeated.
Hope this helps :)
