pm-utils hook script fails on pkill signal (BeagleBone) - linux

I want my application to be notified (signaled) when my BBB board is
going to suspend or resume.
So I added a hook script at /etc/pm/sleep.d/15_myapp:
#!/bin/sh
# /etc/pm/sleep.d/15_myapp
case "$1" in
    suspend)
        pkill -SIGUSR1 myapp > /dev/null 2>&1
        ;;
    resume)
        pkill -SIGALRM myapp > /dev/null 2>&1
        ;;
    *)
        ;;
esac
exit $?
So far so good. But every time I try to suspend the board with
ajava#debainBBB:~# pm-suspend
I immediately get a message on the same console describing my pkill signal:
User defined signal 1
and the suspend process is then aborted.
So I checked /var/log/pm-suspend.log:
Running hook /usr/lib/pm-utils/sleep.d/000kernel-change suspend suspend:
/usr/lib/pm-utils/sleep.d/000kernel-change suspend suspend: success.
Running hook /usr/lib/pm-utils/sleep.d/00logging suspend suspend:
Linux jetMaster 3.12.19-rt30+ #29 PREEMPT RT Wed Jun 25 15:02:55 CEST 2014 armv7l GNU/Linux
Module Size Used by
rfcomm 35643 0
bluetooth 238755 3 rfcomm
usb_f_acm 7016 2
u_serial 11485 1 usb_f_acm
usb_f_mass_storage 45500 2
libcomposite 42382 12 usb_f_acm,usb_f_mass_storage
musb_dsps 7540 0
at25 4594 0
lm75 4802 0
rtc_ds1307 8243 0
musb_am335x 1680 0
total used free shared buffers cached
Mem: 506180 66696 439484 0 7104 32204
-/+ buffers/cache: 27388 478792
Swap: 0 0 0
/usr/lib/pm-utils/sleep.d/00logging suspend suspend: success.
Running hook /usr/lib/pm-utils/sleep.d/00powersave suspend suspend:
/usr/lib/pm-utils/sleep.d/00powersave suspend suspend: success.
Running hook /etc/pm/sleep.d/15_myapp suspend suspend:
As you can see, there is no "success" line after my script is called.
But I have already checked the exit status of my hook script: it is 0 (i.e. success).
Do any of you folks have an idea what's going on here?
Why am I getting the signal description as output of pm-suspend?
Thanks.
--- UPDATE --------------------------------------------------
The problem seems to be the default behavior of the signals, as defined in
signal(7).
Apparently every signal whose default action is to terminate the process
causes pm-suspend to fail.
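If that explanation holds, one way to keep the default action from applying is for the receiving program to install its own signal handlers. A minimal sketch, assuming (hypothetically) the receiver were a shell script; a compiled myapp would instead register handlers with sigaction():

#!/bin/sh
# Hypothetical receiver: with handlers installed, SIGUSR1/SIGALRM no longer
# terminate the process; they just run the trap bodies.
trap 'echo "suspend requested"' USR1
trap 'echo "resumed"' ALRM
while :; do
    sleep 1   # placeholder for the application's main loop
done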


Related

How to disable serial console during bootup

I have stopped ttyS0 (initctl stop serial DEV=ttyS0). The ttyS0 process stops for the session but reappears after a reboot. I want to disable ttyS0 at boot because it throws errors like:
Feb 19 20:19:42 sdm2 init: serial (ttyS0) main process (608881) terminated with status 1
Feb 19 20:19:42 sdm2 init: serial (ttyS0) main process ended, respawning
Feb 19 20:19:42 sdm2 init: initLogger main process (608986) terminated with status 1
I couldn't find any /etc/init/ttyS0.conf, but serial.conf exists.
I searched for 'respawn' in an attempt to turn it off, and found it in serial.conf:
instance $DEV
respawn
pre-start exec /sbin/securetty $DEV
exec /sbin/agetty /dev/$DEV $SPEED vt100-nav
Though /etc/init/ttyS0.conf doesn't exist, I used 'echo manual | sudo tee /etc/init/ttyS0.override' to stop ttyS0 at boot time.
- I also removed ttyS0 from securetty.
- There is no mention of ttyS0 in the inittab file.
- In grub.conf I have two console entries, tty0 and console=ttyS0,115200.
- /dev/ttyS0 exists but /etc/init/ttyS0.conf does not.
Could anyone assist in stopping ttyS0 after a reboot?
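Since the getty comes from the instance job /etc/init/serial.conf rather than a per-device ttyS0.conf, here is a hedged sketch of the override approach (the job name is taken from the initctl command above; note that this would affect every instance of that job):

# Tell Upstart not to auto-start the "serial" job at boot; the override file
# sits next to serial.conf and only needs the single "manual" stanza.
echo manual | sudo tee /etc/init/serial.override
# If the instance is spawned because of the console=ttyS0,115200 kernel
# argument, removing that entry from grub.conf is another option, at the
# cost of losing the serial console entirely.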

Presto worker process mysteriously killed and restarted sometimes

In our Presto cluster (0.212) with ~200 nodes (EC2 instances), sometimes (like once per day) a few Presto worker processes mysteriously restart (usually around the same time when it happens). The EC2 instances are fine, and memory % metrics indicate 70% of memory was used.
Does the Presto worker have any kind of suicide-and-restart logic (like restart after >= M consecutive errors in a row)? Or can the Presto coordinator restart a worker under some situations? What else might kill a few worker processes around the same time?
Here is one example of the server log that shows the restart.
2018-11-14T23:16:28.78011 2018-11-14T23:16:28.776Z INFO Thread-63 io.airlift.bootstrap.LifeCycleManager Life cycle stopping...
2018-11-14T23:16:29.17181 ThreadDump 4524
2018-11-14T23:16:29.17182 ForceSafepoint 414
2018-11-14T23:16:29.17182 Deoptimize 66
2018-11-14T23:16:29.17182 CollectForMetadataAllocation 11
2018-11-14T23:16:29.17182 CGC_Operation 272
2018-11-14T23:16:29.17182 G1IncCollectionPause 2900
2018-11-14T23:16:29.17183 EnableBiasedLocking 1
2018-11-14T23:16:29.17183 RevokeBias 6248
2018-11-14T23:16:29.17183 BulkRevokeBias 272
2018-11-14T23:16:29.17183 Exit 1
2018-11-14T23:16:29.17183 931 VM operations coalesced during safepoint
2018-11-14T23:16:29.17184 Maximum sync time 197 ms
2018-11-14T23:16:29.17184 Maximum vm operation time (except for Exit VM operation) 2599 ms
2018-11-14T23:16:29.52968 ./finish: line 37: kill: (3700) - No such process
2018-11-14T23:16:29.52969 ./finish: line 37: kill: (3702) - No such process
2018-11-14T23:16:31.53563 ./finish: line 40: kill: (3704) - No such process
2018-11-14T23:16:31.53564 ./finish: line 40: kill: (3706) - No such process
2018-11-14T23:16:32.25948 2018-11-14T23:16:32.257Z INFO main io.airlift.log.Logging Logging to stderr
2018-11-14T23:16:32.26034 2018-11-14T23:16:32.260Z INFO main Bootstrap Loading configuration
2018-11-14T23:16:32.33800 2018-11-14T23:16:32.337Z INFO main Bootstrap Initializing logging
......
2018-11-14T23:16:35.75427 2018-11-14T23:16:35.754Z INFO main io.airlift.bootstrap.LifeCycleManager Life cycle starting...
2018-11-14T23:16:35.75556 2018-11-14T23:16:35.755Z INFO main io.airlift.bootstrap.LifeCycleManager Life cycle startup complete. System ready.
If relevant, these "./finish: ..." lines in the log are related to the /etc/service/presto/finish file below.
1 #!/bin/bash
2 set -e
3 exec 2>&1
4 exec 3>>/var/log/runit/runit.log
5
6 STATSD_PREFIX="runit.presto"
7 source /etc/statsd/functions
8
9 function error_handler() {
10 echo "$(date +"%Y-%m-%dT%H:%M:%S.%3NZ") Error occurred in run file at line: $1."
11 echo "$(date +"%Y-%m-%dT%H:%M:%S.%3NZ") Line exited with status: $2"
12 incr "finish.error"
13 }
14 trap 'error_handler $LINENO $?' ERR
15 echo "$(date +"%Y-%m-%dT%H:%M:%S.%3NZ") process=presto status=stopped exitcode=$1 waitcode=$2" >&3
16 # treat non-zero exit codes as a crash
17 # waitcode contains the signal if there's one (ex. 11 - SIGSEGV)
18 if [ $1 -ne 0 ]; then
19 incr "finish.crash"
20 fi
21
22
23 # ensure that we kill the entire process group.
24 # When sv force-restart runs, it will try to TERM the runit processes. If
25 # this doesn't work, it will kill (-9) the process. In case of haproxy,
26 # apache, gunicorn, etc., the master process will be killed (-9). Child processes
27 # (ie apache workers, gunicorn workers) will *not* be killed and will be
28 # around for minutes (if not hours). These child workers will keep
29 # listening on the socket, preventing the new master apache/gunicorn
30 # processes from binding to the socket. The new master process will keep
31 # crashing and be restarted by runit until the old child processes are
32 # gone.
33
34 # determine the process group id. it's the group id of the current (finish) process.
35 PGID=$(ps -o pgid= $$ | grep -o [0-9]*)
36 # kill all processes, except ourself and the PGID (which is the main process)
37 kill $(pgrep -g $PGID | egrep -v "$PGID|$$" ) || true
38 sleep 2
39 # kill -9 to be sure
40 kill -9 $(pgrep -g $PGID | egrep -v "$PGID|$$" ) || true
41
42 echo "$(date +"%Y-%m-%dT%H:%M:%S.%3NZ") process=presto status=finished" >&3
43 incr "finish.count"
44 timing "finish.duration"
Our continuous pull deploy (Salt based) restarts the Presto server process under some conditions (a dependency or config change). This was undesirable and unintentional, and the related listen_in sections have been removed.
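For anyone debugging something similar, two hedged checks (the paths assume the /etc/service/presto layout used by the finish script above):

# Did runit restart the service recently? "sv status" shows the current PID
# and how long it has been up.
sv status /etc/service/presto
# The finish script logs exit and wait codes; per its own comment, a zero
# exitcode points at a clean external restart rather than a crash.
grep 'process=presto status=stopped' /var/log/runit/runit.log | tail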

Debian init.d script fails to sleep

This problem occurs on a Pogoplug E02 running Debian jessie.
At startup the network interface takes several seconds to come online. A short delay is required after the "networking" script completes to ensure that ensuing network operations work properly.
I wrote the following script and inserted it using update-rc.d. The script was inserted correctly and executes at boot time in the proper sequence: after networking and before the network-dependent scripts, which were modified to depend on netdelay.
cat /etc/init.d/netdelay
#! /bin/sh
### BEGIN INIT INFO
# Provides: netdelay
# Required-Start: networking
# Required-Stop:
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Delay 5s after eth0 up for Pogoplug
# Description:
### END INIT INFO
PATH=/sbin:/usr/sbin:/bin:/usr/bin
. /lib/init/vars.sh
. /lib/lsb/init-functions
log_action_msg "Pausing for eth0 to come online"
/bin/sleep 5
log_action_msg "Continuing"
exit 0
When the script executes at startup there is no delay. I've used both sleep and /bin/sleep in the script, but neither effects the desired delay. The boot log showing this is attached below.
Thu Jan 1 00:00:25 1970: Configuring network interfaces...done.
Thu Jan 1 00:00:25 1970: INIT: Entering runlevel: 2
Thu Jan 1 00:00:25 1970: Using makefile-style concurrent boot in runlevel 2.
Thu Jan 1 00:00:26 1970: Starting SASL Authentication Daemon: saslauthd.
Thu Jan 1 00:00:29 1970: Pausing for eth0 to come online.
Thu Jan 1 00:00:30 1970: Continuing.
Thu Jan 1 00:00:33 1970: ntpdate updating system time.
Wed Feb 1 05:33:40 2017: Starting enhanced syslogd: rsyslogd.
(The Pogoplug has no hardware clock and has no idea what time it is until ntpdate has run.)
Can someone see where the problem might be?
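One hedged way to see what the script actually does when it runs (and whether it reaches the sleep at all) is to trace it with the shell:

# Every command is echoed as it executes, so a skipped or shortened sleep is
# immediately visible; run it as root, the way the init system would.
sh -x /etc/init.d/netdelay start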

Who is refreshing hardware watchdog in Linux?

I have a processor, an AT91SAM9G20, running a 2.6 kernel. The watchdog is enabled at the bootstrap level and configured for 16 seconds. The watchdog mode register can be configured only once.
When code hangs in the bootstrap, the bootloader or the kernel, the board reboots. But once the kernel comes up, even though the watchdog is not refreshed by any of the applications, the board is not reset after 16 seconds but only after about 15 minutes.
Who is refreshing the watchdog?
In our case, the watchdog should be driven by our applications, so that the board resets if our application hangs.
These are the running processes:
1 root init
2 root [kthreadd]
3 root [ksoftirqd/0]
4 root [watchdog/0]
5 root [events/0]
6 root [khelper]
63 root [kblockd/0]
72 root [ksuspend_usbd]
78 root [khubd]
85 root [kmmcd]
107 root [pdflush]
108 root [pdflush]
109 root [kswapd0]
110 root [aio/0]
740 root [mtdblockd]
828 root [rpciod/0]
982 root [jffs2_gcd_mtd10]
1003 root /sbin/udevd -d
1145 daemon portmap
1158 dbus dbus-daemon --system
1178 root /usr/sbin/ifplugd -i eth0 -fwI -u0 -d5 -l -q
1190 root /usr/sbin/ifplugd -i eth1 -fwI -u0 -d5 -l -q
1221 default avahi-daemon: running [SP14.local]
1226 root /usr/sbin/dropbear
1246 root /root/bin/host_app
1254 root /root/bin/mini_httpd -c *.cgi -d /root/bin -u root -E /root/bin/
1256 root -sh
1257 root /sbin/syslogd -n -m 0
1258 root /sbin/klogd -n
1259 root /usr/bin/tail -f /var/log/messages
1265 root ps -e
We are using the watchdog for soft lockups available in kernel-2.6.25-ts.at91sam9g20/kernel/softlockup.c
If you enabled the watchdog driver in your kernel, the driver sets up a kernel timer that is in charge of resetting the watchdog. The corresponding code is linux/drivers/watchdog/at91sam9_wdt.c. So it works like this:
If no application opens the /dev/watchdog file, the kernel takes care of resetting the watchdog. Since it is a timer, it won't appear as a dedicated kernel thread but is handled by the soft IRQ thread. Now, if an application opens this file, it becomes responsible for the watchdog and can reset it by writing to the file, as described in the documentation linked in Richard's post.
Is the watchdog driver configured in your kernel?
If not, you should configure it and see if the reset still happens. If it does, it is likely that your reset comes from somewhere else.
If your kernel is too old to have a proper watchdog driver (it is not present in 2.6.25), you should backport it from 2.6.28. Or you can try to disable the watchdog in your bootloader and see if the reset still occurs.
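As a rough illustration of that hand-over (the device node and the 16-second period come from this thread; the 10-second feed interval is an arbitrary choice):

#!/bin/sh
# Sketch: once this opens /dev/watchdog, the kernel driver stops feeding the
# hardware watchdog and this script must do it instead. If the loop ever hangs
# or is killed, nothing feeds the watchdog any more; on this chip the mode
# register is write-once, so the board should reset roughly 16 s later.
exec 3> /dev/watchdog
while :; do
    printf 'k' >&3      # any write counts as a keepalive
    sleep 10            # must stay below the 16 s hardware timeout
done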
In July 2016, commit 3fbfe92647 ("watchdog: change watchdog_need_worker logic") to watchdog_dev.c in the 4.7 kernel enabled the same behavior described in shodanex's answer for all watchdog timer drivers. This doesn't seem to be documented anywhere other than this thread and the source code.
/*
 * A worker to generate heartbeat requests is needed if all of the
 * following conditions are true.
 * - Userspace activated the watchdog.
 * - The driver provided a value for the maximum hardware timeout, and
 *   thus is aware that the framework supports generating heartbeat
 *   requests.
 * - Userspace requests a longer timeout than the hardware can handle.
 *
 * Alternatively, if userspace has not opened the watchdog
 * device, we take care of feeding the watchdog if it is
 * running.
 */
return (hm && watchdog_active(wdd) && t > hm) ||
       (t && !watchdog_active(wdd) && watchdog_hw_running(wdd));
This may give you a hint: http://www.mjmwired.net/kernel/Documentation/watchdog/watchdog-api.txt
It makes perfect sense to have a user space daemon handling the watchdog. It probably defaults to a 15 minute timeout.
We had a similar problem with the WDT on an AT91SAM9263. The problem was bit 29, WDIDLEHLT, of the WDT_MR register (address 0xFFFFFD44). This bit was set to 1, but it should be 0 for our application's needs.
Bit explanation from datasheet documentation:
• WDIDLEHLT: Watchdog Idle Halt
0: The Watchdog runs when the system is in idle mode.
1: The Watchdog stops when the system is in idle state.
This means that the WDT counter does not increment while the kernel is in the idle state, hence the delay of 15 minutes or more until the reset happens.
You can try "dd if=/dev/zero of=/dev/null", which prevents the kernel from entering the idle state; you should then get a reset after 16 seconds (or whatever period you have set in the WDT_MR register).
So the solution is to update the u-boot code, or whatever other piece of code sets the WDT_MR register. Remember this register is write-once...
Wouldn't the kernel be refreshing the watchdog timer? The watchdog is designed to reset the board if the whole system hangs, not just a single application.

How is it possible that kill -9 for a process on Linux has no effect?

I'm writing a plugin to highlight text strings automatically as you visit a web site. It's like highlighting search results, but automatic and for many words; it could be used, for example, by people with allergies to make certain words really stand out when they browse a food site.
But I have a problem. When I try to close an empty, fresh FF window, it somehow blocks the whole process. When I kill the process, all the windows vanish, but the Firefox process stays alive (parent PID is 1, doesn't listen to any signals, has lots of resources open, still eats CPU, but won't budge).
So two questions:
How is it even possible for a process not to listen to kill -9 (neither as user nor as root)?
Is there anything I can do but a reboot?
[EDIT] This is the offending process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
digulla 16688 4.3 4.2 784476 345464 pts/14 D Mar28 75:02 /opt/firefox-3.0/firefox-bin
Same with ps -ef | grep firefox
UID PID PPID C STIME TTY TIME CMD
digulla 16688 1 4 Mar28 pts/14 01:15:02 /opt/firefox-3.0/firefox-bin
It's the only process left. As you can see, it's not a zombie; it's running! It doesn't listen to kill -9, no matter whether I kill by PID or by name! If I try to connect with strace, the strace also hangs and can't be killed. There is no output, either. My guess is that FF hangs in some kernel routine, but which one?
[EDIT2] Based on feedback by sigjuice:
ps axopid,comm,wchan
can show you in which kernel routine a process hangs. In my case, the offending plugin was the Beagle Indexer (openSUSE 11.1). After disabling the plugin, FF was a quick and happy fox again.
As noted in comments to the OP, a process status (STAT) of D indicates that the process is in an "uninterruptible sleep" state. In real-world terms, this generally means that it's waiting on I/O and can't/won't do anything - including dying - until that I/O operation completes.
Processes in a D state will normally only be there for a fraction of a second before the operation completes and they return to R/S. In my experience, if a process gets stuck in D, it's most often trying to communicate with an unreachable NFS or other remote filesystem, trying to access a failing hard drive, or making use of some piece of hardware by way of a flaky device driver. In such cases, the only way to recover and allow the process to die is to either get the fs/drive/hardware back up and running so the I/O can complete or to give up and reboot the system. In the specific case of NFS, the mount may also eventually time out and return from the I/O operation (with a failure code), but this is dependent on the mount options and it's very common for NFS mounts to be set to wait forever.
This is distinct from a zombie process, which will have a status of Z.
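A hedged one-liner for spotting such processes and the kernel function they are blocked in (the wchan column width is arbitrary):

# Keep the header line plus every process whose state field contains D
# (uninterruptible sleep); wchan shows where in the kernel it is waiting.
ps axo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /D/'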
Double-check that the parent ID is really 1. If not, and this is Firefox, first try sudo killall -9 firefox-bin. After that, try killing the remaining process IDs individually with sudo kill -9 [process-id].
How is it even possible for a process not to listen to kill -9 (neither as user nor as root)?
If a process has gone <defunct> and then becomes a zombie with a parent of 1, you can't kill it manually; only init can. Zombie processes are already dead and gone - they've lost the ability to be killed as they are no longer processes, only a process table entry and its associated exit code, waiting to be collected. You need to kill the parent, and you can't kill init for obvious reasons.
But see here for more general information. A reboot will kill everything, naturally.
Is it possible that this process is restarted (for example by init) just at the time you kill it?
You can check this easily: if the PID is the same after kill -9 PID, the process wasn't killed; if it has changed, the process has been restarted.
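A quick sketch of that check (the process name is just the one from this question):

# Note the PID(s), send SIGKILL, then look again: a different PID means
# something respawned the process; the same PID means the kill had no effect.
pgrep -x firefox-bin
kill -9 $(pgrep -x firefox-bin)
sleep 1
pgrep -x firefox-bin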
I recently got trapped in the pitfall of the double fork and had landed on this page before finally finding my answer. The symptoms are identical even if the problem is not the same:
WYKINWYT: What You Kill Is Not What You Thought
The minimal test code is shown below, based on an example for an SNMP daemon.
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char* argv[])
{
    /* We omit the -f option (do not fork) to reproduce the problem */
    char *options[] = {"/usr/local/sbin/snmpd", /*"-f",*/ "-d", "--master=agentx",
                       "-Dagentx", "--agentXSocket=tcp:localhost:1706",
                       "udp:10161", (char*) NULL};
    pid_t pid = fork();
    if (0 > pid) return -1;
    switch (pid)
    {
        case 0:
        {   /* Child launches the SNMP daemon */
            execv(options[0], options);
            exit(-2);
            break;
        }
        default:
        {
            sleep(10);             /* Simulate "long" activity */
            kill(pid, SIGTERM);    /* kill what should be the child,
                                      i.e. the SNMP daemon I assume */
            printf("Signal sent to %d\n", pid);
            sleep(10);             /* Simulate "long" operation before closing */
            waitpid(pid, NULL, 0); /* reap the (defunct) child */
            printf("SNMP should be down now\n");
            getchar();             /* Blocking (for observation only) */
            break;
        }
    }
    printf("Bye!\n");
    return 0;
}
During the first phase the main process (7699) launches the SNMP daemon (7700), but we can see that this one is already defunct (a zombie). Besides it, we can see another process (7702) with the options we specified:
[nils#localhost ~]$ ps -ef | tail
root 7439 2 0 23:00 ? 00:00:00 [kworker/1:0]
root 7494 2 0 23:03 ? 00:00:00 [kworker/0:1]
root 7544 2 0 23:08 ? 00:00:00 [kworker/0:2]
root 7605 2 0 23:10 ? 00:00:00 [kworker/1:2]
root 7698 729 0 23:11 ? 00:00:00 sleep 60
nils 7699 2832 0 23:11 pts/0 00:00:00 ./main
nils 7700 7699 0 23:11 pts/0 00:00:00 [snmpd] <defunct>
nils 7702 1 0 23:11 ? 00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils 7727 3706 0 23:11 pts/1 00:00:00 ps -ef
nils 7728 3706 0 23:11 pts/1 00:00:00 tail
After the simulated 10 seconds we try to kill the only process we know about (7700), which we finally clean up with waitpid(). But process 7702 is still here:
[nils#localhost ~]$ ps -ef | tail
root 7431 2 0 23:00 ? 00:00:00 [kworker/u256:1]
root 7439 2 0 23:00 ? 00:00:00 [kworker/1:0]
root 7494 2 0 23:03 ? 00:00:00 [kworker/0:1]
root 7544 2 0 23:08 ? 00:00:00 [kworker/0:2]
root 7605 2 0 23:10 ? 00:00:00 [kworker/1:2]
root 7698 729 0 23:11 ? 00:00:00 sleep 60
nils 7699 2832 0 23:11 pts/0 00:00:00 ./main
nils 7702 1 0 23:11 ? 00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils 7751 3706 0 23:12 pts/1 00:00:00 ps -ef
nils 7752 3706 0 23:12 pts/1 00:00:00 tail
After giving a character to the getchar() call, our main process terminates, but the SNMP daemon with PID 7702 is still here:
[nils#localhost ~]$ ps -ef | tail
postfix 7399 1511 0 22:58 ? 00:00:00 pickup -l -t unix -u
root 7431 2 0 23:00 ? 00:00:00 [kworker/u256:1]
root 7439 2 0 23:00 ? 00:00:00 [kworker/1:0]
root 7494 2 0 23:03 ? 00:00:00 [kworker/0:1]
root 7544 2 0 23:08 ? 00:00:00 [kworker/0:2]
root 7605 2 0 23:10 ? 00:00:00 [kworker/1:2]
root 7698 729 0 23:11 ? 00:00:00 sleep 60
nils 7702 1 0 23:11 ? 00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils 7765 3706 0 23:12 pts/1 00:00:00 ps -ef
nils 7766 3706 0 23:12 pts/1 00:00:00 tail
Conclusion
Because we ignored the double-fork mechanism, we thought the kill action did not succeed. In fact we simply killed the wrong process!
By adding the -f option (do not (double) fork), everything goes as expected.
ps -ef | grep firefox
and you can see 3 processes; kill them all.
sudo killall -9 firefox
should work.
EDIT: [PID] changed to firefox
You can also do a pstree and kill the parent. This makes sure that you get the entire offending process tree and not just the leaf.
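For example (16688 is the PID from the question; -s includes the ancestors and -p prints PIDs):

# Show the hung process together with its parents, then kill the top-most
# offending ancestor so the whole tree goes away rather than just the leaf.
pstree -sp 16688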
