Detect program-controlled termination versus crash in Linux

A program running under Linux can terminate for a number of reasons: the program might finish its needed computations and simply exit (normal exit), the code might detect some problem and throw an exception (early exit), and finally, the system might stop the execution because the program tried to do something it should not (e.g., access protected memory) (crash).
Is there a reliable and consistent way I can distinguish between normal/early exit and a crash? That is,
% any_program
...time passes and prompt re-appears...
% (type something here that tells me if the program crashed)
For example, are there values of $? that indicate crashes versus program-controlled termination?

The bash man page states:
The return value of a simple command is its exit status, or 128+n if
the command is terminated by signal n.
You can check for signals that indicate a crash, such as SIGSEGV (11) and SIGABRT (6), by testing whether $? is 139 or 134 respectively. Save $? first, because every later command overwrites it:
$ any_program
$ status=$?
$ if [ "$status" -eq 139 ] || [ "$status" -eq 134 ]; then
>     echo "Crashed!"
> fi
At the very least, if $? is greater than 128, it indicates something unusual happened, although it may be that the user killed the program by hitting ctrl-c and not an actual crash.
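As a rough sketch of the check above, the status can be saved once and then classified. The function name classify_status is made up for illustration; it relies on kill -l accepting a numeric exit status and mapping 128+n back to a signal name. One assumption to keep in mind: a program can itself call exit 139, so a value above 128 is a strong hint rather than proof that a signal was involved.

```shell
# classify_status STATUS
# Hypothetical helper: label a saved $? value.
classify_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "normal exit"
    elif [ "$status" -gt 128 ]; then
        # kill -l maps 128+n back to a signal name, e.g. 139 -> SEGV
        if name=$(kill -l "$status" 2>/dev/null); then
            echo "killed by SIG$name"
        else
            echo "status $status (above 128, no matching signal)"
        fi
    else
        echo "program-controlled exit with status $status"
    fi
}

classify_status 139   # the status a shell reports after SIGSEGV
```

In practice you would call it as any_program; classify_status $? so that the status is captured immediately after the program ends.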

Related

The Linux timeout command and exit codes

In a Linux shell script I would like to use the timeout command to end another command if some time limit is reached. In general:
timeout -s SIGTERM 100 command
But I also want my shell script to exit when the command fails for some reason. If the command fails early enough, the time limit will not be reached, and timeout will exit with exit code 0. Thus the error cannot be trapped with trap or set -e; at least, I tried that and it did not work. How can I achieve what I want?

Your situation isn't very clear because you haven't included your code in the post.
timeout does exit with the exit code of the command if it finishes before the timeout value.
For example:
timeout 5 ls -l non_existent_file
# outputs ERROR: ls: cannot access non_existent_file: No such file or directory
echo $?
# outputs 2 (which is the exit code of ls)
From man timeout:
If the command times out, and --preserve-status is not set, then
exit with status 124. Otherwise, exit with the status of COMMAND. If
no signal is specified, send the TERM signal upon timeout. The TERM
signal kills any process that does not block or catch that signal.
It may be necessary to use the KILL (9) signal, since this signal
cannot be caught, in which case the exit status is 128+9 rather than
124.
See BashFAQ105 to understand the pitfalls of set -e.
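A minimal sketch of how a script could act on those exit codes, assuming GNU coreutils timeout is installed. The wrapper name run_with_limit and its messages are illustrative only; the status is captured with || so the function also behaves under set -e.

```shell
# run_with_limit SECONDS COMMAND [ARG]...
# Hypothetical wrapper around timeout(1) that reports why the command ended.
run_with_limit() {
    limit=$1; shift
    status=0
    timeout -s TERM "$limit" "$@" || status=$?
    case $status in
        0)   echo "finished in time" ;;
        124) echo "timed out after ${limit}s" ;;
        137) echo "killed with SIGKILL" ;;   # 128+9, e.g. when KILL was needed
        *)   echo "command failed with status $status" ;;
    esac
    return "$status"
}

run_with_limit 5 true
```

A script that should stop on any failure could then use run_with_limit 100 command || exit 1.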

BASH - why the infinite loop is not infinite and failing to restart the crashed process? [duplicate]

I have a python script that'll be checking a queue and performing an action on each item:
# checkqueue.py
while True:
    check_queue()
    do_something()
How do I write a bash script that will check if it's running, and if not, start it. Roughly the following pseudo code (or maybe it should do something like ps | grep?):
# keepalivescript.sh
if processidfile exists:
    if processid is running:
        exit, all ok
run checkqueue.py
write processid to processidfile
I'll call that from a crontab:
# crontab
*/5 * * * * /path/to/keepalivescript.sh

Avoid PID-files, crons, or anything else that tries to evaluate processes that aren't their children.
There is a very good reason why in UNIX, you can ONLY wait on your children. Any method (ps parsing, pgrep, storing a PID, ...) that tries to work around that is flawed and has gaping holes in it. Just say no.
Instead you need the process that monitors your process to be the process' parent. What does this mean? It means only the process that starts your process can reliably wait for it to end. In bash, this is absolutely trivial.
until myserver; do
    echo "Server 'myserver' crashed with exit code $?. Respawning.." >&2
    sleep 1
done
The above piece of bash code runs myserver in an until loop. The first line starts myserver and waits for it to end. When it ends, until checks its exit status. If the exit status is 0, it means it ended gracefully (which means you asked it to shut down somehow, and it did so successfully). In that case we don't want to restart it (we just asked it to shut down!). If the exit status is not 0, until will run the loop body, which emits an error message on STDERR and restarts the loop (back to line 1) after 1 second.
Why do we wait a second? Because if something's wrong with the startup sequence of myserver and it crashes immediately, you'll have a very intensive loop of constant restarting and crashing on your hands. The sleep 1 takes away the strain from that.
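If an endless crash loop is a concern, the same idea can be capped. The following is only a sketch: the function name respawn and the restart limit are additions for illustration, not part of the answer above.

```shell
# respawn MAX_RESTARTS COMMAND [ARG]...
# Hypothetical variant of the until loop that gives up after MAX_RESTARTS.
respawn() {
    max=$1; shift
    restarts=0
    until "$@"; do
        status=$?                      # exit status of the command that just ended
        restarts=$((restarts + 1))
        if [ "$restarts" -gt "$max" ]; then
            echo "giving up after $max restarts (last status $status)" >&2
            return "$status"
        fi
        echo "exited with status $status; restart $restarts of $max" >&2
        sleep 1                        # same back-off as in the loop above
    done
}

respawn 3 true   # a command that exits 0 is never restarted
```

Calling respawn 5 myserver would restart myserver at most five times and then return its last exit status to the caller.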
Now all you need to do is start this bash script (asynchronously, probably), and it will monitor myserver and restart it as necessary. If you want to start the monitor on boot (making the server "survive" reboots), you can schedule it in your user's cron(1) with an @reboot rule. Open your cron rules with crontab:
crontab -e
Then add a rule to start your monitor script:
@reboot /usr/local/bin/myservermonitor
Alternatively, look at inittab(5) and /etc/inittab. You can add a line in there to have myserver start at a certain init level and be respawned automatically.
Edit.
Let me add some information on why not to use PID files. While they are very popular, they are also very flawed, and there's no reason not to just do it the correct way.
Consider this:
1. PID recycling (killing the wrong process):
   - /etc/init.d/foo start: start foo, write foo's PID to /var/run/foo.pid
   - A while later: foo dies somehow.
   - A while later: any random process that starts (call it bar) takes a random PID; imagine it taking foo's old PID.
   - You notice foo's gone: /etc/init.d/foo restart reads /var/run/foo.pid, checks to see if it's still alive, finds bar, thinks it's foo, kills it, starts a new foo.
2. PID files go stale. You need over-complicated (or should I say, non-trivial) logic to check whether the PID file is stale, and any such logic is again vulnerable to 1.
3. What if you don't even have write access or are in a read-only environment?
4. It's pointless overcomplication; see how simple my example above is. No need to complicate that, at all.
See also: Are PID-files still flawed when doing it 'right'?
By the way, even worse than PID files is parsing ps! Don't ever do this.
1. ps is very unportable. While you find it on almost every UNIX system, its arguments vary greatly if you want non-standard output. And standard output is ONLY for human consumption, not for scripted parsing!
2. Parsing ps leads to a LOT of false positives. Take the ps aux | grep PID example, and now imagine someone starting a process with a number somewhere as argument that happens to be the same as the PID you started your daemon with! Imagine two people starting an X session and you grepping for X to kill yours. It's just all kinds of bad.
If you don't want to manage the process yourself; there are some perfectly good systems out there that will act as monitor for your processes. Look into runit, for example.

Have a look at monit (http://mmonit.com/monit/). It handles start, stop and restart of your script and can do health checks plus restarts if necessary.
Or do a simple script:
while true
do
    /your/script
    sleep 1
done

In-line:
while true; do <your-bash-snippet> && break; done
This restarts <your-bash-snippet> continuously if it fails; && break stops the loop if <your-bash-snippet> exits gracefully (return code 0).
To restart <your-bash-snippet> in all cases:
while true; do <your-bash-snippet>; done
e.g. #1
while true; do openconnect x.x.x.x:xxxx && break; done
e.g. #2
while true; do docker logs -f container-name; sleep 2; done

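The easiest way to do it is to use flock on a file. In a Python script you'd do:
import fcntl, os, sys
lf = open('/tmp/script.lock', 'a')  # 'a' keeps a running instance's PID intact
try:
    # flock() returns None and raises an error if the lock is already held
    fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)
except (IOError, OSError):
    sys.exit('other instance already running')
lf.truncate(0)  # we own the lock now, so replace the old PID
lf.write('%d\n' % os.getpid())
lf.flush()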
In the shell you can actually test whether it's running:
if flock -xn /tmp/script.lock -c 'true'; then
    echo "it's not running"
    # restart it here
else
    printf "it's already running with PID "
    cat /tmp/script.lock
fi
But of course you don't have to test, because if it's already running and you restart it, it'll exit with 'other instance already running'.
When a process dies, all its file descriptors are closed and all its locks are automatically released.
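The shell-side probe can be wrapped in a small function. This is a sketch that assumes flock(1) from util-linux is available and that the lock file can be opened; the name is_locked is hypothetical. If the file cannot be opened at all, flock also fails, so the function would report "locked" in that case too.

```shell
# is_locked LOCKFILE
# Hypothetical helper: succeeds if another process currently holds the lock.
is_locked() {
    # -n: do not wait; flock fails at once if the lock is already held
    ! flock -n "$1" true
}

lock=/tmp/script.lock
if is_locked "$lock"; then
    echo "already running with PID $(cat "$lock")"
else
    echo "not running"
fi
```

Because the kernel releases the lock when the holder dies, this probe does not suffer from stale state the way a PID file does.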

You should use monit, a standard unix tool that can monitor different things on the system and react accordingly.
From the docs: http://mmonit.com/monit/documentation/monit.html#pid_testing
check process checkqueue.py with pidfile /var/run/checkqueue.pid
if changed pid then exec "checkqueue_restart.sh"
You can also configure monit to email you when it does do a restart.

if ! test -f $PIDFILE || ! psgrep `cat $PIDFILE`; then
restart_process
# Write PIDFILE
echo $! >$PIDFILE
fi

watch "yourcommand"
It will restart the process if/when it stops (after a 2s delay).
watch -n 0.1 "yourcommand"
To restart it after 0.1s instead of the default 2 seconds
watch -e "yourcommand"
To stop restarts if the program exits with an error.
Advantages:
built-in command
one line
easy to use and remember.
Drawbacks:
Only display the result of the command on the screen once it's finished

I'm not sure how portable it is across operating systems, but you might check if your system contains the 'run-one' command, i.e. "man run-one".
Specifically, this set of commands includes 'run-one-constantly', which seems to be exactly what is needed.
From man page:
run-one-constantly COMMAND [ARGS]
Note: obviously this could be called from within your script, but also it removes the need for having a script at all.

I've used the following script with great success on numerous servers:
pid=`jps -v | grep $INSTALLATION | awk '{print $1}'`
echo $INSTALLATION found at PID $pid
while [ -e /proc/$pid ]; do sleep 0.1; done
Notes:
- It's looking for a Java process, so I can use jps; this is much more consistent across distributions than ps.
- $INSTALLATION contains enough of the process path that it's totally unambiguous.
- Use sleep while waiting for the process to die, to avoid hogging resources :)
This script is actually used to shut down a running instance of tomcat, which I want to shut down (and wait for) at the command line, so launching it as a child process simply isn't an option for me.

I use this for my npm Process
#!/bin/bash
for (( ; ; ))
do
    date +"%T"
    echo Start Process
    cd /toFolder
    sudo process
    date +"%T"
    echo Crash
    sleep 1
done

defer pipe process to background after text match

So I have a bash command to start a server, and it outputs some lines before getting to the point where it outputs something like "Server started, Press Control+C to exit". How do I pipe this output so that when this line occurs, I put the process in the background and continue with another script/function (i.e., to do stuff that needs to wait until the server starts, such as running tests)?
I want to end up with 3 functions
start_server
run_tests
stop_server
I've got something along the lines of:
function read_server_output {
    while read -r data; do
        printf '%s\n' "$data"
        if [[ $data == "Server started, Press Control+C to exit" ]]; then
            # do something here to put server process in the background
            # so I can run another function
            :
        fi
    done
}
function start_server {
    # start the server and pipe its output to another function to check it's running
    start-server-command | read_server_output
}
function run_tests {
    # do some stuff
    :
}
function stop_server {
    # stop the server
    :
}
# run the bash script code
start_server
run_tests
stop_server
related question possibly SH/BASH - Scan a log file until some text occurs, then exit. How?
Thanks in advance I'm pretty new to this.

First, a note on terminology...
"Background" and "foreground" are controlling-terminal concepts, i.e., they have to do with what happens when you type ctrl+C, ctrl+Z, etc. (which process gets the signal), whether a process can read from the terminal device (a "background" process gets a SIGTTIN that by default causes it to stop), and so on.
It seems clear that this has little to do with what you want to achieve. Instead, you have an ill-behaved program (or suite of programs) that needs some special coddling: when the server is first started, it needs some hand-holding up to some point, after which it's OK. The hand-holding can stop once it outputs some text string (see your related question for that, or the technique below).
There's a big potential problem here: a lot of programs, when their output is redirected to a pipe or file, produce no output until they have printed a "block" worth of output, or are exiting. If this is the case, a simple:
start-server-command | cat
won't print the line you're looking for (so that's a quick way to tell whether you will have to work around this issue as well). If so, you'll need something like expect, which is an entirely different way to achieve what you want.
Assuming that's not a problem, though, let's try an entirely-in-shell approach.
What you need is to run the start-server-command and save the process-ID so that you can (eventually) send it a SIGINT signal (as ctrl+C would if the process were "in the foreground", but you're doing this from a script, not from a controlling terminal, so there's no key the script can press). Fortunately sh has a syntax just for this.
First let's make a temporary file:
#! /bin/sh
# myscript - script to run server, check for startup, then run tests
TMPFILE=$(mktemp -t myscript.XXXXXX) || exit 1 # create /tmp/myscript.<unique>; GNU mktemp needs the X's
trap "rm -f $TMPFILE" 0 1 2 3 15 # arrange to clean up when done
Now start the server and save its PID:
start-server-command > $TMPFILE & # start server, save output in file
SERVER_PID=$! # and save its PID so we can end it
trap "kill -INT $SERVER_PID; rm -f $TMPFILE" 0 1 2 3 15 # adjust cleanup
Now you'll want to scan through $TMPFILE until the desired output appears, as in the other question. Because this requires a certain amount of polling you should insert a delay. It's also probably wise to check whether the server has failed and terminated without ever getting to the "started" point.
while ! grep '^Server started, Press Control+C to exit$' "$TMPFILE" >/dev/null; do
    # message has not yet appeared, is server still starting?
    if kill -0 $SERVER_PID 2>/dev/null; then
        # server is running; let's wait a bit and try grepping again
        sleep 1 # or other delay interval
    else
        echo "ERROR: server terminated without starting properly" 1>&2
        exit 1
    fi
done
(Here kill -0 is used to test whether the process still exists; if not, it has exited. The "cleanup" kill -INT will produce an error message, but that's probably OK. If not, either redirect that kill command's error-output, or adjust the cleanup or do it manually, as seen below.)
At this point, the server is running and you can do your tests. When you want it to exit as if the user hit ctrl+C, send it a SIGINT with kill -INT.
Since there's a kill -INT in the trap set for when the script exits (0) as well as when it's terminated by SIGHUP (1), SIGINT (2), SIGQUIT (3), and SIGTERM (15)—that's the:
trap "do some stuff" 0 1 2 3 15
part—you can simply let your script exit at this point, unless you want to specifically wait for the server to exit too. If you want that, perhaps:
kill -INT $SERVER_PID; rm -f $TMPFILE # do the pre-arranged cleanup now
trap - 0 1 2 3 15 # don't need it arranged anymore
wait $SERVER_PID # wait for server to finish exit
would be appropriate.
(Obviously none of the above is tested, but that's the general framework.)
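The polling step can be pulled out into a reusable function. The sketch below is an illustration under assumptions: the name wait_for_line, its argument order, and the retry limit are invented here, and it matches a fixed string with grep -F rather than the anchored pattern used above.

```shell
# wait_for_line FILE TEXT PID [TRIES]
# Hypothetical helper: poll FILE once per second until TEXT appears.
# Returns 0 if TEXT was found, 2 if process PID died first, 1 after TRIES polls.
wait_for_line() {
    file=$1 text=$2 pid=$3 tries=${4:-30}
    while [ "$tries" -gt 0 ]; do
        grep -qF -- "$text" "$file" && return 0
        if ! kill -0 "$pid" 2>/dev/null; then
            # process is gone; look once more in case it printed just before exiting
            grep -qF -- "$text" "$file" && return 0
            return 2
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Usage with the variables from the script above:
#   wait_for_line "$TMPFILE" "Server started" "$SERVER_PID" 60 || exit 1
```

The retry limit guards against a server that stays alive but never prints the message, a case the plain while loop would wait on forever.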

Probably the easiest thing to do is to start it in the background and block on reading its output. Something like:
{ start-server-command & } | {
    while read -r line; do
        echo "$line"
        echo "$line" | grep -q 'Server started' && break
    done
    cat &
}
echo script continues here after server outputs 'Server started' message
But this is a pretty ugly hack. It would be better if the server could be modified to perform a more specific action which the script could wait for.

Detect when process quits or is being killed due to out of memory

My bash script runs a program in the background and waits for it to stop with the wait command. But there is a high chance that the background process will be killed because it uses too much memory. I want my script to react differently to a process that ended gracefully and to one that was killed. How do I check this condition?

Make sure your command signals success (with exit code 0) when it succeeds, and failure (non-zero) when it fails.
When a process is killed with SIGKILL by the OOM killer, signaling failure is automatic. (The shell reports the exit code of a signal-terminated process as 128 + the signal number, so 128+9=137 for SIGKILL.)
You then use the fact that wait somepid exits with the same code as the command it waits on in an if statement:
yourcommand &
pid=$!
....
if wait $pid
then
    echo "It exited successfully"
else
    echo "It exited with failure"
fi
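To react differently to a kill, the status returned by wait can be inspected instead of only tested for success. This is a sketch with a hypothetical function name, wait_and_report. Status 137 only shows that SIGKILL was delivered; confirming that the OOM killer sent it would need the kernel log, which is outside this example. The function has to be called directly rather than inside $(...), because a subshell cannot wait on its parent's children.

```shell
# wait_and_report PID
# Hypothetical helper: wait for a background child and describe how it ended.
wait_and_report() {
    status=0
    wait "$1" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "exited successfully"
    elif [ "$status" -eq 137 ]; then
        echo "killed by SIGKILL (possibly the OOM killer)"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "failed with status $status"
    fi
}

sleep 1 &
wait_and_report $!
```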

Processes usually shut down in response to a signal. Try adding a signal handler function to deal with unexpected shutdowns or, in the worst case, have another monitoring process, like a task manager.
Did you try anything?
By the way, some signals can't be caught or handled, namely SIGKILL and SIGSTOP; SIGKILL is what the OOM killer sends.

A simpler solution is:
yourcommand
status=$?   # save it: the [ test below overwrites $?
if [ $status -eq 0 ] ; then
    echo "It exited successfully"
else
    echo "It exited with failure, exit code $status"
fi

Linux commands - abort operation with timeout

I'm writing a script and at some point I call "command1", which does not stop until CTRL+C is pressed.
Is there a way to state a timeout for a command? Like:
command1 arguments -timeout 10
How do I write a CTRL+C command in textual form?
Thx!

You can use the timeout command from GNU coreutils (you may need to install it first, but it comes in most, if not all, Linux distributions):
timeout [OPTION] DURATION COMMAND [ARG]...
For instance:
timeout 5 ./test.sh
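will terminate the script after 5 seconds of execution. If you want to send a KILL signal (instead of TERM), use -s KILL. The -k DURATION option is different: it sends a follow-up KILL if the command is still running that long after the initial signal.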
Here you have the full description of the timeout command.

I just tried
jekyll -server & sleep 10;pkill jekyll
Could do for your situation.

In your script you can set a delay: sleep 10 pauses for 10 seconds (wait, by contrast, waits for background jobs to finish). As for exiting the program without CTRL+C, look into the exit command. exit 0 means success. Other codes exist, but I don't know exactly what they mean off the top of my head.
exit 1
exit 2..... so on and so forth
Update
@Celada
No need to bash. You could have just said "maybe you didn't understand the question correctly" Stackoverflow is here to help people learn, not tear them down. They invented Reddit for that. As for the exit codes, you can force the program to exit by issuing the exit() command with a code. Directly from linux.die.net.
Exit Code Number | Meaning | Example | Comments
1 | Catchall for general errors | let "var1 = 1/0" | Miscellaneous errors, such as "divide by zero"
2 | Misuse of shell builtins (according to Bash documentation) | - | Seldom seen, usually defaults to exit code 1
126 | Command invoked cannot execute | - | Permission problem or command is not an executable
127 | "command not found" | - | Possible problem with $PATH or a typo
128 | Invalid argument to exit | exit 3.14159 | exit takes only integer args in the range 0 - 255 (see footnote)
128+n | Fatal error signal "n" | kill -9 $PPID of script | $? returns 137 (128 + 9)
130 | Script terminated by Control-C | - | Control-C is fatal error signal 2, (130 = 128 + 2, see above)
255* | Exit status out of range | exit -1 | exit takes only integer args in the range 0 - 255
