Different behaviour of bash script on supervisor start and restart - linux

I have a bash script which does something, for example:
[program:long_script]
command=/usr/local/bin/long.sh
autostart=true
autorestart=true
stderr_logfile=/var/log/long.err.log
stdout_logfile=/var/log/long.out.log
and it is bound to supervisor.
I want to add an if check to this script to determine whether it was executed by:
supervisor> start long_script
or
supervisor> restart long_script
I want something like this:
if [ executed by start command ]
then
    echo "start"
else
    echo "restart"
fi
but I don't know what should go in the if clause.
Is it possible to determine this?
If not, how can I achieve different behaviour of the script for the start and restart commands?
Please help.

Within the supervisor code there is currently no difference between a restart and a stop/start. restart within supervisorctl simply calls:
self.do_stop(arg)
self.do_start(arg)
There is no status within the app for "restart", though there is some discussion of allowing different signals; supervisor is already able to send different signals to the process. (Allowing more control over reload/restart has been a long-standing gap.)
This means you have at least two options, but the key to making this work is that the process needs to record some state at shutdown.
Option 1. The easiest option would be to use supervisorctl signal <signal> <process> instead of calling supervisorctl restart <process>, and record somewhere which signal was sent, so that on startup you can read back the last signal.
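As a minimal sketch of that idea (the USR1/USR2 choice and the /var/run/long/last.signal path are illustrative assumptions, not something the original setup defines), the script can trap a couple of signals, record whichever one ends it, and read the record back on the next startup:

# Record whichever signal ends us, so the next run can read it back.
mkdir -p /var/run/long
trap 'echo USR1 > /var/run/long/last.signal; exit 0' USR1
trap 'echo USR2 > /var/run/long/last.signal; exit 0' USR2

# On startup: check how the previous run was stopped.
last_signal=$(cat /var/run/long/last.signal 2>/dev/null)
case "$last_signal" in
    USR1) echo "previous run was stopped with USR1" ;;
    USR2) echo "previous run was stopped with USR2" ;;
    *)    echo "no recorded signal (first run or crash)" ;;
esac

You would then stop the process with, for example, supervisorctl signal USR2 long_script rather than restart.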
Option 2. A more interesting solution is not to expect any upstream changes, i.e. to continue allowing restart to be used, and to distinguish between stop, crash and restart.
In this case, the only information that will differ between a start and a restart is that a restart should have a much shorter gap between the shutdown of the old process and the start of the new one. So if a timestamp is recorded on shutdown, then on startup the difference between now and the last shutdown will distinguish a start from a restart.
To do this, I've got a definition like yours but with stopsignal defined:
[program:long_script]
command=/usr/local/bin/long.sh
autostart=true
autorestart=true
stderr_logfile=/var/log/long.err.log
stdout_logfile=/var/log/long.out.log
stopsignal=SIGUSR1
By making the stop from supervisord a specific signal, you can tell the difference between a crash and a normal stop event, and not interfere with normal kill or interrupt signals.
Then, as the very first line in the bash script, I set a trap for this signal:
trap "mkdir -p /var/run/long/; date +%s > /var/run/long/last.stop; exit 0" SIGUSR1
This means the date as an epoch timestamp will be recorded in the file /var/run/long/last.stop every time supervisord sends us a stop.
Then, as the immediately following lines in the script, calculate the difference between the last stop and now:
stopdiff=0
if [ -e /var/run/long/last.stop ]; then
    curtime=$(date +%s)
    stoptime=$(grep -o '[0-9]*' /var/run/long/last.stop)
    if [ -n "${stoptime}" ]; then
        stopdiff=$(( curtime - stoptime ))
    fi
else
    stopdiff=9999
fi
stopdiff will now contain the difference in seconds between the stop and the start, or 9999 if the stop file didn't exist.
This can then be used to decide what to do:
if [ ${stopdiff} -gt 2 ]; then
    echo "Start detected (${stopdiff} sec difference)"
elif [ ${stopdiff} -ge 0 ]; then
    echo "Restart detected (${stopdiff} sec difference)"
else
    echo "Error detected (${stopdiff} sec difference)"
fi
You'll have to make some choices about how long it actually takes to get from sending a stop to the script actually starting: here I've allowed only 2 seconds, and anything greater is considered a "start". If the shutdown of the script needs to happen in a specific way, you'll need a bit more complexity in the trap statement (rather than just exit 0).
Since a crash shouldn't record any timestamp to the stop file, you should also be able to tell that a startup is occurring because of a crash, provided you regularly record a running timestamp somewhere as well.
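As a rough sketch of that heartbeat idea (the /var/run/long/heartbeat path and the 30-second interval are my assumptions, not part of the answer above):

# In the script's main loop: record that we are alive on every pass.
while true; do
    date +%s > /var/run/long/heartbeat
    do_the_real_work   # placeholder for whatever long.sh actually does
    sleep 30
done

# On startup: a heartbeat newer than last.stop means the previous run
# never received SIGUSR1, i.e. it crashed rather than being stopped.
if [ /var/run/long/heartbeat -nt /var/run/long/last.stop ]; then
    echo "Crash detected"
fi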

I understand your problem, but I don't know much about supervisor. Please check whether this idea works.
Record a marker value before you enter the supervisor commands, by wrapping each of your start and restart commands in its own small bash program. (Since each script runs as a separate process, a plain shell variable won't be visible to the long-running script, so the value is written to a file here.)
program: supervisor_start.sh
#!/bin/bash
echo "Starting.."
echo "start" > /tmp/supervisor_started_command   # This is the one
supervisorctl start long_script
echo "Started.."
program: supervisor_restart.sh
#!/bin/bash
echo "ReStarting.."
echo "restart" > /tmp/supervisor_started_command   # This is the one
supervisorctl restart long_script
echo "ReStarted.."
Now, inside your script, you can check what was recorded in the supervisor_started_command file :)
#!/bin/bash
supervisor_started_command=$(cat /tmp/supervisor_started_command 2>/dev/null)
if [ "$supervisor_started_command" = "start" ]
then
    echo "start"
elif [ "$supervisor_started_command" = "restart" ]
then
    echo "restart"
fi
Well, I don't know whether this idea works for you or not.

Related

nohup node service using cron job on CentOS 7 [duplicate]


How to kill a process on no output for some period of time

I've written a program that is supposed to run for a long time, and it outputs its progress to stdout; however, under some circumstances it begins to hang, and the easiest thing to do is to restart it.
My question is: Is there a way to do something that would kill the process only if it had no output for a specific number of seconds?
I have started thinking about it, and the only thing that comes to mind is something like this:
./application > output.log &
tail -f output.log
then create a script which would look at the date and time of the last modification of output.log and restart the whole thing.
But it looks very tedious, and I would hate to go through all that if there were an existing command for it.
As far as I know, there isn't a standard utility to do it, but a good start for a one-liner would be:
timeout=10; if [ -z "$(find output.log -newermt "@$(( $(date +%s) - timeout ))")" ]; then killall -TERM application; fi
At least, this will avoid the tedious part of coding a more complex script.
Some hints:
Using the find utility to compare the last modification date of the output.log file against a time reference.
The time reference is returned by the date utility as the current time in seconds (+%s) since the epoch (1970-01-01 UTC); the @ prefix makes find's -newermt interpret it as an epoch timestamp.
Using bash arithmetic expansion $(( )) to subtract the $timeout value (10 seconds in the example).
If no output is returned by the above find, then the file wasn't changed for more than 10 seconds. This makes the if condition true, and the killall command is executed.
You can also set an alias for that, using:
alias kill_application='timeout=10; if [ -z "$(find output.log -newermt "@$(( $(date +%s) - timeout ))")" ]; then killall -TERM application; fi'
And then use it whenever you want by just issuing the command kill_application.
If you want to automatically restart the application without human intervention, you can install a crontab entry to run every minute or so and also issue the application restart command after the killall. (You may also want to change -TERM to -KILL, in case the application becomes unresponsive to catchable signals.)
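A minimal sketch of such a cron-driven watchdog, assuming a hypothetical start_application command and absolute paths (cron does not run from your working directory):

#!/bin/bash
# watchdog.sh, run from cron: * * * * * /usr/local/bin/watchdog.sh
timeout=10
logfile=/path/to/output.log
if [ -z "$(find "$logfile" -newermt "@$(( $(date +%s) - timeout ))")" ]; then
    killall -KILL application        # -KILL in case -TERM is ignored
    start_application > "$logfile" & # hypothetical restart command
fi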
inotifywait could help here; it efficiently waits for changes to files. The exit status can be checked to identify whether the event (modify) occurred within the specified interval of time.
$ inotifywait -e modify -t 10 output.log
Setting up watches.
Watches established.
$ echo $?
2
Some related info from man:
OPTIONS
-e <event>, --event <event>
Listen for specific event(s) only.
-t <seconds>, --timeout <seconds>
Exit if an appropriate event has not occurred within <seconds> seconds.
EXIT STATUS
2 The -t option was used and an event did not occur in the specified interval of time.
EVENTS
modify A watched file or a file within a watched directory was written to.
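Putting the pieces together, a small watchdog loop along these lines would kill and relaunch the process whenever no modify event arrives within 10 seconds (the application name and log path are carried over from the question; this is a sketch, not tested):

#!/bin/bash
while true; do
    inotifywait -qq -e modify -t 10 output.log
    if [ $? -eq 2 ]; then            # 2 = timed out with no modify event
        killall -TERM application
        ./application > output.log & # relaunch as in the question
    fi
done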

BASH - why the infinite loop is not infinite and failing to restart the crashed process? [duplicate]


defer pipe process to background after text match

So I have a bash command to start a server, and it outputs some lines before getting to the point where it outputs something like "Server started, Press Control+C to exit". How do I pipe this output so that when this line occurs I can put this process in the background and continue with another script/function (i.e. to do stuff that needs to wait until the server starts, such as running tests)?
I want to end up with 3 functions
start_server
run_tests
stop_server
I've got something along the lines of:
function read_server_output {
    while read -r data; do
        printf '%s\n' "$data"
        if [[ $data == "Server started, Press Control+C to exit" ]]; then
            :  # do something here to put server process in the background
               # so I can run another function
        fi
    done
}
function start_server {
    # start the server and pipe its output to another function to check it's running
    start-server-command | read_server_output
}
function run_tests {
    :  # do some stuff
}
function stop_server {
    :  # stop the server
}
# run the bash script code
start_server
run_tests
stop_server
Possibly related question: SH/BASH - Scan a log file until some text occurs, then exit. How?
Thanks in advance, I'm pretty new to this.
First, a note on terminology...
"Background" and "foreground" are controlling-terminal concepts, i.e., they have to do with what happens when you type ctrl+C, ctrl+Z, etc. (which process gets the signal), whether a process can read from the terminal device (a "background" process gets a SIGTTIN that by default causes it to stop), and so on.
It seems clear that this has little to do with what you want to achieve. Instead, you have an ill-behaved program (or suite of programs) that needs some special coddling: when the server is first started, it needs some hand-holding up to some point, after which it's OK. The hand-holding can stop once it outputs some text string (see your related question for that, or the technique below).
There's a big potential problem here: a lot of programs, when their output is redirected to a pipe or file, produce no output until they have printed a "block" worth of output, or are exiting. If this is the case, a simple:
start-server-command | cat
won't print the line you're looking for (so that's a quick way to tell whether you will have to work around this issue as well). If so, you'll need something like expect, which is an entirely different way to achieve what you want.
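One thing worth trying before reaching for expect (my suggestion, and it only helps when the server buffers through stdio) is forcing line-buffered output with GNU coreutils' stdbuf:

# Force line buffering so the "Server started" line reaches the pipe promptly.
stdbuf -oL -eL start-server-command | cat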
Assuming that's not a problem, though, let's try an entirely-in-shell approach.
What you need is to run the start-server-command and save the process-ID so that you can (eventually) send it a SIGINT signal (as ctrl+C would if the process were "in the foreground", but you're doing this from a script, not from a controlling terminal, so there's no key the script can press). Fortunately sh has a syntax just for this.
First let's make a temporary file:
#! /bin/sh
# myscript - script to run server, check for startup, then run tests
TMPFILE=$(mktemp -t myscript) || exit 1 # create /tmp/myscript.<unique>
trap "rm -f $TMPFILE" 0 1 2 3 15 # arrange to clean up when done
Now start the server and save its PID:
start-server-command > $TMPFILE & # start server, save output in file
SERVER_PID=$! # and save its PID so we can end it
trap "kill -INT $SERVER_PID; rm -f $TMPFILE" 0 1 2 3 15 # adjust cleanup
Now you'll want to scan through $TMPFILE until the desired output appears, as in the other question. Because this requires a certain amount of polling you should insert a delay. It's also probably wise to check whether the server has failed and terminated without ever getting to the "started" point.
while ! grep -q '^Server started, Press Control+C to exit$' "$TMPFILE"; do
    # message has not yet appeared, is server still starting?
    if kill -0 $SERVER_PID 2>/dev/null; then
        # server is running; let's wait a bit and try grepping again
        sleep 1 # or other delay interval
    else
        echo "ERROR: server terminated without starting properly" 1>&2
        exit 1
    fi
done
(Here kill -0 is used to test whether the process still exists; if not, it has exited. The "cleanup" kill -INT will produce an error message, but that's probably OK. If not, either redirect that kill command's error-output, or adjust the cleanup or do it manually, as seen below.)
At this point, the server is running and you can do your tests. When you want it to exit as if the user hit ctrl+C, send it a SIGINT with kill -INT.
Since there's a kill -INT in the trap set for when the script exits (0) as well as when it's terminated by SIGHUP (1), SIGINT (2), SIGQUIT (3), and SIGTERM (15), that is, the:
trap "do some stuff" 0 1 2 3 15
part, you can simply let your script exit at this point, unless you want to specifically wait for the server to exit too. If you want that, perhaps:
kill -INT $SERVER_PID; rm -f $TMPFILE # do the pre-arranged cleanup now
trap - 0 1 2 3 15 # don't need it arranged anymore
wait $SERVER_PID # wait for server to finish exit
would be appropriate.
(Obviously none of the above is tested, but that's the general framework.)
Probably the easiest thing to do is to start it in the background and block on reading its output. Something like:
{ start-server-command & } | {
    while read -r line; do
        echo "$line"
        echo "$line" | grep -q 'Server started' && break
    done
    cat &
}
echo script continues here after server outputs 'Server started' message
But this is a pretty ugly hack. It would be better if the server could be modified to perform a more specific action which the script could wait for.

How do I write a bash script to restart a process if it dies?

I have a python script that'll be checking a queue and performing an action on each item:
# checkqueue.py
while True:
check_queue()
do_something()
How do I write a bash script that will check whether it's running and, if not, start it? Roughly the following pseudocode (or maybe it should do something like ps | grep?):
# keepalivescript.sh
if processidfile exists:
if processid is running:
exit, all ok
run checkqueue.py
write processid to processidfile
I'll call that from a crontab:
# crontab
*/5 * * * * /path/to/keepalivescript.sh
Avoid PID-files, crons, or anything else that tries to evaluate processes that aren't their children.
There is a very good reason why in UNIX, you can ONLY wait on your children. Any method (ps parsing, pgrep, storing a PID, ...) that tries to work around that is flawed and has gaping holes in it. Just say no.
Instead you need the process that monitors your process to be the process' parent. What does this mean? It means only the process that starts your process can reliably wait for it to end. In bash, this is absolutely trivial.
until myserver; do
    echo "Server 'myserver' crashed with exit code $?. Respawning.." >&2
    sleep 1
done
The above piece of bash code runs myserver in an until loop. The first line starts myserver and waits for it to end. When it ends, until checks its exit status. If the exit status is 0, it means it ended gracefully (which means you asked it to shut down somehow, and it did so successfully). In that case we don't want to restart it (we just asked it to shut down!). If the exit status is not 0, until will run the loop body, which emits an error message on STDERR and restarts the loop (back to line 1) after 1 second.
Why do we wait a second? Because if something's wrong with the startup sequence of myserver and it crashes immediately, you'll have a very intensive loop of constant restarting and crashing on your hands. The sleep 1 takes away the strain from that.
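A small variation on the same pattern (my own sketch, not part of the original answer) backs off exponentially instead of sleeping a fixed second, which softens crash loops even further:

delay=1
until myserver; do
    echo "Server 'myserver' crashed with exit code $?. Respawning in ${delay}s.." >&2
    sleep "$delay"
    delay=$(( delay < 60 ? delay * 2 : 60 ))   # double the wait, cap at 60s
done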
Now all you need to do is start this bash script (asynchronously, probably), and it will monitor myserver and restart it as necessary. If you want to start the monitor on boot (making the server "survive" reboots), you can schedule it in your user's cron(1) with an @reboot rule. Open your cron rules with crontab:
crontab -e
Then add a rule to start your monitor script:
@reboot /usr/local/bin/myservermonitor
Alternatively, look at inittab(5) and /etc/inittab. You can add a line there to have myserver start at a certain init level and be respawned automatically.
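For reference, on a classic SysV init system such a line could look like this (the ms identifier and runlevels 2-5 are arbitrary choices):

ms:2345:respawn:/usr/local/bin/myserver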
Edit.
Let me add some information on why not to use PID files. While they are very popular, they are also very flawed, and there's no reason why you wouldn't just do it the correct way.
Consider this:
1. PID recycling (killing the wrong process):
   - /etc/init.d/foo start: start foo, write foo's PID to /var/run/foo.pid
   - A while later: foo dies somehow.
   - A while later: any random process that starts (call it bar) takes a random PID, imagine it taking foo's old PID.
   - You notice foo's gone: /etc/init.d/foo restart reads /var/run/foo.pid, checks to see if it's still alive, finds bar, thinks it's foo, kills it, starts a new foo.
2. PID files go stale. You need over-complicated (or should I say, non-trivial) logic to check whether the PID file is stale, and any such logic is again vulnerable to 1.
3. What if you don't even have write access or are in a read-only environment?
4. It's pointless overcomplication; see how simple my example above is. No need to complicate that, at all.
See also: Are PID-files still flawed when doing it 'right'?
By the way, even worse than PID files is parsing ps! Don't ever do this.
ps is very unportable. While you find it on almost every UNIX system, its arguments vary greatly if you want non-standard output. And standard output is ONLY for human consumption, not for scripted parsing!
Parsing ps leads to a LOT of false positives. Take the ps aux | grep PID example, and now imagine someone starting a process with a number somewhere as an argument that happens to be the same as the PID you started your daemon with! Imagine two people starting an X session and you grepping for X to kill yours. It's just all kinds of bad.
If you don't want to manage the process yourself, there are some perfectly good systems out there that will act as a monitor for your processes. Look into runit, for example.
Have a look at monit (http://mmonit.com/monit/). It handles start, stop and restart of your script and can do health checks plus restarts if necessary.
Or do a simple script:
while true
do
    /your/script
    sleep 1
done
In-line:
while true; do <your-bash-snippet> && break; done
This will continuously restart <your-bash-snippet> if it fails; && break will stop the loop if <your-bash-snippet> stops gracefully (return code 0).
To restart <your-bash-snippet> in all cases:
while true; do <your-bash-snippet>; done
e.g. #1
while true; do openconnect x.x.x.x:xxxx && break; done
e.g. #2
while true; do docker logs -f container-name; sleep 2; done
The easiest way to do it is using flock on a file. In a Python script you'd do:
import fcntl, os, sys

lf = open('/tmp/script.lock', 'w')
try:
    # LOCK_NB makes flock raise instead of blocking if the lock is taken
    fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit('other instance already running')
lf.write('%d\n' % os.getpid())
lf.flush()
In shell you can actually test whether it's running:
if [ "$(flock -xn /tmp/script.lock -c 'echo 1')" ]; then
    echo "it's not running"
    # restart it here
else
    echo -n "it's already running with PID "
    cat /tmp/script.lock
fi
But of course you don't have to test, because if it's already running and you restart it, it'll exit with 'other instance already running'.
When a process dies, all its file descriptors are closed and all its locks are automatically released.
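The same idea works straight from cron using the flock(1) utility, which makes the question's keep-alive entry safe against overlapping runs (a sketch, reusing the lock file path from above):

# At most one instance ever runs: if it has died, the next cron tick
# restarts it; otherwise flock -n exits immediately without running it.
*/5 * * * * flock -n /tmp/script.lock python /path/to/checkqueue.py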
You should use monit, a standard unix tool that can monitor different things on the system and react accordingly.
From the docs: http://mmonit.com/monit/documentation/monit.html#pid_testing
check process checkqueue.py with pidfile /var/run/checkqueue.pid
    if changed pid then exec "checkqueue_restart.sh"
You can also configure monit to email you when it does do a restart.
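A sketch of what that email configuration might look like in monitrc (the mail server and address are placeholders):

set mailserver smtp.example.com
set alert admin@example.com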
# Assumes $PIDFILE is set and psgrep is a (non-standard) helper that
# checks whether the PID read from the file is alive.
if ! test -f "$PIDFILE" || ! psgrep "$(cat "$PIDFILE")"; then
    restart_process   # must start the process in the background
    # Write PIDFILE
    echo $! > "$PIDFILE"
fi
watch "yourcommand"
It will restart the process if/when it stops (after a 2s delay).
watch -n 0.1 "yourcommand"
To restart it after 0.1s instead of the default 2 seconds.
watch -e "yourcommand"
To stop restarting if the program exits with an error.
Advantages:
built-in command
one line
easy to use and remember.
Drawbacks:
It only displays the result of the command on the screen once the command has finished.
I'm not sure how portable it is across operating systems, but you might check whether your system contains the run-one command, i.e. man run-one.
Specifically, this set of commands includes 'run-one-constantly', which seems to be exactly what is needed.
From man page:
run-one-constantly COMMAND [ARGS]
Note: obviously this could be called from within your script, but it also removes the need for having a script at all.
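Applied to the question's script, that could be as simple as (a sketch; adjust interpreter and path):

# keeps exactly one copy running, relaunching it whenever it dies
run-one-constantly python /path/to/checkqueue.py &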
I've used the following script with great success on numerous servers:
pid=$(jps -v | grep "$INSTALLATION" | awk '{print $1}')
echo "$INSTALLATION found at PID $pid"
while [ -e "/proc/$pid" ]; do sleep 0.1; done
Notes:
- It's looking for a Java process, so I can use jps; this is much more consistent across distributions than ps.
- $INSTALLATION contains enough of the process path that it's totally unambiguous.
- Use sleep while waiting for the process to die, to avoid hogging resources :)
This script is actually used to shut down a running instance of tomcat, which I want to shut down (and wait for) at the command line, so launching it as a child process simply isn't an option for me.
I use this for my npm process:
#!/bin/bash
for (( ; ; ))
do
    date +"%T"
    echo "Start Process"
    cd /toFolder
    sudo process
    date +"%T"
    echo "Crash"
    sleep 1
done
