I have a central server where I periodically start a script (from cron) that checks remote servers. The checks are performed serially: first one server, then another, and so on.
This script on the central server starts another script (let's call it update.sh) on each remote machine, and that remote script does something like this:
processID=`pgrep "processName"`
kill $processID
startProcess.sh
The process is killed, and then startProcess.sh starts it again like this:
pidof "processName"
if [ ! $? -eq 0 ]; then
    nohup "processName" "processArgs" >> "processLog" &
    pidof "processName"
    if [ ! $? -eq 0 ]; then
        echo "Error: failed to start process"
        ...
update.sh, startProcess.sh, and the binary of the process they start all live on an NFS share mounted from the central server.
Now, what sometimes happens is that the process startProcess.sh tries to start is not started and I get the error. The strange part is that it is random: sometimes the process on a machine starts, and another time on the same machine it doesn't. I'm checking about 300 servers and the errors are always random.
Another thing: the remote servers are at 3 different geographic locations (2 in America and 1 in Europe), while the central server is in Europe. What I have discovered so far is that the servers in America produce far more errors than those in Europe.
At first I thought the error had something to do with kill, so I added a sleep between the kill and startProcess.sh, but that made no difference.
It also seems that the process from startProcess.sh is either not started at all, or something happens to it right as it starts, because there is no output in the logfile even though there should be.
So I'm asking for help here: has anybody had this kind of problem, or does anyone know what might be wrong?
Thanks for any help.
(Sorry, but my original answer was fairly wrong... Here is the correction)
Using $? to get the exit status of the background process in startProcess.sh gives the wrong result. man bash states:
Special Parameters
? Expands to the status of the most recently executed foreground
pipeline.
As you mentioned in your comment, the proper way to get a background process's exit status is the wait builtin. But for this, bash has to process the SIGCHLD signal.
I made a small test environment for this to show how it can work:
Here is a script loop.sh to run as a background process:
#!/bin/bash
[ "$1" == -x ] && exit 1;
cnt=${1:-500}
while ((++c<=cnt)); do echo "SLEEPING [$$]: $c/$cnt"; sleep 5; done
If the argument is -x, it exits with status 1 to simulate an error. If the argument is a number num, it waits num*5 seconds, printing SLEEPING [<PID>]: <counter>/<max_counter> to stdout.
The second is the launcher script. It starts 3 loop.sh scripts in the background and prints their exit status:
#!/bin/bash
handle_chld() {
    local tmp=()
    for i in "${!pids[@]}"; do
        if [ ! -d /proc/${pids[i]} ]; then
            wait ${pids[i]}
            echo "Stopped ${pids[i]}; exit code: $?"
            unset pids[i]
        fi
    done
}
set -o monitor
trap "handle_chld" CHLD
# Start background processes
./loop.sh 3 &
pids+=($!)
./loop.sh 2 &
pids+=($!)
./loop.sh -x &
pids+=($!)
# Wait until all background processes are stopped
while [ ${#pids[@]} -gt 0 ]; do echo "WAITING FOR: ${pids[@]}"; sleep 2; done
echo STOPPED
The handle_chld function handles the SIGCHLD signals. Setting the monitor option enables a non-interactive script to receive SIGCHLD. Then the trap is set for the SIGCHLD signal.
Then the background processes are started, and all of their PIDs are remembered in the pids array. When SIGCHLD is received, the /proc/ directories are checked to see which child process stopped (the one whose directory is missing); this could also be checked with the kill -0 <PID> bash builtin. After wait, the exit status of the background process is stored in the familiar $? special parameter.
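The kill -0 probe mentioned above can be wrapped in a tiny helper; it sends no signal at all and merely tests whether the PID still exists and is signalable. A minimal self-contained sketch (the is_alive name is just illustrative):

```shell
#!/bin/bash
# kill -0 succeeds if the process exists and we may signal it;
# it sends no actual signal.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

sleep 60 & pid=$!
is_alive "$pid" && echo "running"

kill "$pid"
wait "$pid" 2>/dev/null   # reap the child so the PID is really gone
is_alive "$pid" || echo "stopped"
```

Note that until the parent calls wait, a dead child lingers as a zombie and kill -0 still succeeds, so the probe is only meaningful after reaping.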
The main script waits for all PIDs to stop (otherwise it could not collect the exit status of its children), and then it stops itself.
An example output:
WAITING FOR: 13102 13103 13104
SLEEPING [13103]: 1/2
SLEEPING [13102]: 1/3
Stopped 13104; exit code: 1
WAITING FOR: 13102 13103
WAITING FOR: 13102 13103
SLEEPING [13103]: 2/2
SLEEPING [13102]: 2/3
WAITING FOR: 13102 13103
WAITING FOR: 13102 13103
SLEEPING [13102]: 3/3
Stopped 13103; exit code: 0
WAITING FOR: 13102
WAITING FOR: 13102
WAITING FOR: 13102
Stopped 13102; exit code: 0
STOPPED
It can be seen that the exit codes are reported correctly.
I hope this can help a bit!
On Stack Overflow there are many solutions for stopping a script on a timeout, and many for stopping a script on an error. But how do you combine both approaches?
If an error occurs during execution of the script - close the script.
If the timeout expires - close the script.
I have following code:
#!/usr/bin/env bash
set -e
finish_time=$1
echo "finish_time=" ${finish_time}
(./execute_something.sh) & pid=$!
sleep ${finish_time}
kill $pid
But if there is an error during execution, the script still waits until the timeout expires.
First, I won't use set -e.
You'll explicitly wait on the job you want; the exit status of wait will be the exit status of the job itself.
echo "finish_time = $1"
./execute_something.sh & pid=$!
sleep "$1" & sleep_pid=$!
wait -n # Waits for either the sleep or the script to finish
rv=$?
if kill -0 $pid 2>/dev/null; then
# Script still running, kill it
# and exit
kill -s ALRM $pid
wait $pid # exit status will indicate it was killed by SIGALRM
exit
else
# Script exited before sleep
kill $sleep_pid
exit $rv
fi
There is a slight race condition here; it goes as follows:
1. wait -n returns after sleep exits, indicating the script will exit on its own.
2. The script exits before we can check whether it is still running.
3. As a result, we assume it actually exited before the sleep.
But that just means we'll treat a script that ran slightly over the threshold as having finished on time. That's probably not a distinction you care about.
Ideally, wait would set some shell parameter that indicates which process caused it to return.
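As it happens, newer bash provides exactly this: since bash 5.1, wait -n accepts -p varname, which stores the PID of the job whose status was reported. A minimal sketch (assumes bash >= 5.1; the two sleeps stand in for the script and the timeout):

```shell
#!/bin/bash
# Requires bash >= 5.1 for the -p option to wait.
sleep 10 & slow_pid=$!
sleep 1  & fast_pid=$!

wait -n -p finished_pid    # finished_pid gets the PID of the job that ended
if [ "$finished_pid" = "$fast_pid" ]; then
    echo "the short sleep finished first"
fi
kill "$slow_pid" 2>/dev/null
```

With -p there is no need for the kill -0 check at all: you know directly whether the script or the sleep caused wait to return.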
So basically I have one script that is keeping a server alive. It starts the server process and then starts it again after the process stops. Although sometimes the server becomes non responsive. For that I want to have another script which would ping the server and would kill the process if it wouldn't respond in 60 seconds.
The problem is that if I kill the server process the bash script also gets terminated.
The start script is just a while loop running sh Server.sh. That script calls another shell script with additional parameters for starting the server. The server runs on Java, so it starts a java process. If the server hangs I use kill -9 <pid>, because nothing else stops it. If the server doesn't hang, a normal restart stops it gracefully and the bash script starts the next loop iteration.
Doing The Right Thing
Use a real process supervision system -- your Linux distribution almost certainly includes one.
Directly monitoring the supervised process by PID
An awful, ugly, moderately buggy approach (for instance, able to kill the wrong process in the event of a PID collision) is the following:
while :; do
./Server.sh & server_pid=$!
echo "$server_pid" > server.pid
wait "$server_pid"
done
...and, to kill the process:
#!/bin/bash
# ^^^^ - DO NOT run this with "sh scriptname"; it must be "bash scriptname".
server_pid="$(<server.pid)"; [[ $server_pid ]] || exit
# allow 5 seconds for clean shutdown -- adjust to taste
for (( i=0; i<5; i++ )); do
if kill -0 "$server_pid"; then
sleep 1
else
exit 0 # server exited gracefully, nothing else to do
fi
done
# escalate to a SIGKILL
kill -9 "$server_pid"
Note that we're storing the PID of the server in our pidfile, and killing that directly -- thus, avoiding inadvertently targeting the supervision script.
Monitoring the supervised process and all children via lockfile
Note that this uses some Linux-specific tools -- but your question is tagged linux.
A more robust approach -- which will work across reboots even in the case of pidfile reuse -- is to use a lockfile:
while :; do
flock -x Server.lock sh Server.sh
done
...and, on the other end:
#!/bin/bash
# kill all programs having a handle on Server.lock
fuser -k Server.lock
for ((i=0; i<5; i++)); do
if fuser -s Server.lock; then
sleep 1
else
exit 0
fi
done
fuser -k -KILL Server.lock
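If you only want to probe whether the server currently holds the lock, without killing anything, flock -n fails immediately when the lock is already taken. A small self-contained sketch (the background flock holding the lock for a few seconds stands in for a running server):

```shell
#!/bin/bash
lockfile=Server.lock
touch "$lockfile"

# Simulate a running server: hold the lock for a few seconds
flock -x "$lockfile" sleep 3 &
sleep 1

# -n: fail immediately instead of blocking until the lock is free
if ! flock -n "$lockfile" true; then
    echo "server is running (lock held)"
else
    echo "server is not running"
fi
```

This is the same property the fuser approach relies on: the kernel releases the lock automatically when the holder dies, so the check can never go stale the way a pidfile can.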
When I set up a Jenkins job, I found a problem with timeouts for shell scripts.
It works like this:
Start Jenkins → control.sh is launched → test1.sh is launched in control.sh
Part code of control.sh is like:
#!/bin/sh
source func.sh
export TIMEOUT=30
# set timeout as 30s for test1.sh
( ( sleep $TIMEOUT && function_Timeout ) & ./test1.sh )
# this line of code is actually inside a loop;
# it launches test2.sh, test3.sh... one by one,
# and later I want to set a 30s timeout for each of them.
function_Timeout() {
    if [ ! -f test1_result_file ]; then
        # the test1_result_file will not be created
        # if test1.sh has not finished executing
        killall test1.sh
    fi
}
Part of func.sh is as below:
#!/bin/sh
trap_fun() {
    TRAP_CODE=$?
    {
        if [ $TRAP_CODE -ne 0 ]; then
            echo "test aborted"
        else
            echo "test completed"
        fi
    } 2>/dev/null
}
trap "trap_fun" EXIT
After control.sh is launched by the Jenkins job, the whole control.sh is terminated when the time is up and the killall test1.sh line is reached, and the Jenkins job stops and fails.
I guess it's because test1.sh is killed and its exit code is non-zero, which causes this problem.
So my question is: is there some way to make the sub-script (launched by the main one, control.sh in my case) exit with code 0 when it is terminated?
Updated on July 1:
Thanks for the answers so far. I tried @Leon's suggestion, but the code 124 sent by timeout's kill action is still caught by the trap code (trap "trap_fun" EXIT in func.sh).
I have added more details. I did a lot of googling but still haven't found a proper way to resolve this problem :(
Thanks for your kind help!
Use the timeout utility from coreutils:
#!/bin/sh
timeout 30 ./test1.sh
status=$?
if [ $status -eq 124 ] #timed out
then
exit 0
fi
exit $status
Note that this is slightly different from your version of timeout handling, where all running instances of test1.sh are being terminated if any one of them times out.
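If test1.sh ignores the default SIGTERM, timeout can escalate to SIGKILL with -k. A runnable sketch, with sleep 10 standing in for test1.sh and deliberately short durations so it's quick to try:

```shell
#!/bin/sh
# Send TERM after 2 seconds; if the command is still alive
# 1 second later, send KILL. ("sleep 10" stands in for test1.sh.)
timeout -k 1 2 sleep 10
echo "exit status: $?"
```

With GNU coreutils timeout, status 124 means the command was terminated on timeout; if the KILL escalation was needed, the status is 137 (128 + 9), so a real wrapper might treat both as "timed out".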
I resolved this problem finally, I added the code below in each testX.sh.
trap 'exit 0' SIGTERM SIGHUP
This makes test1.sh exit normally after it receives the signal from killall.
Thanks to all the help!
I am trying to find a way to monitor a process. If the process is not running, it should be checked again to make sure it has really crashed. If it has really crashed, run a script (start.sh).
I have tried monit without success. I have also tried adding this script to crontab (I made it executable with chmod +x monitor.sh).
The actual program is called program1:
case "$(pidof program1 | wc -w)" in
0) echo "Restarting program1: $(date)" >> /var/log/program1_log.txt
/home/user/files/start.sh &
;;
1) # all ok
;;
*) echo "Removed double program1: $(date)" >> /var/log/program1_log.txt
kill $(pidof program1 | awk '{print $1}')
;;
esac
The problem is that this script does not work. I added it to crontab and set it to run every 2 minutes, but if I close the program it won't restart.
Is there any other way to check a process, and run start.sh when it has crashed?
Not to be rude, but have you considered a more obvious solution?
When a shell (e.g. bash or tcsh) starts a subprocess, by default it waits for that subprocess to complete.
So why not have a shell that runs your process in a while(1) loop? Whenever the process terminates, for any reason, legitimate or not, it will automatically restart your process.
I ran into this same problem with mythtv. The backend keeps crashing on me. It's a Heisenbug. Happens like once a month (on average). Very hard to track down. So I just wrote a little script that I run in an xterm.
The, ahh, onintr business means that Ctrl-C will terminate the subprocess and not my (parent-process) script. Similarly, the sleep is in there so I can press Ctrl-C several times to kill the subprocess and then kill the parent script while it's sleeping...
Coredumpsize is limited just because I don't want to fill up my disk with core files that I cannot use.
#!/bin/tcsh -f
limit coredumpsize 0
while( 1 )
echo "`date`: Running mythtv-backend"
# Now we cannot control-c this (tcsh) process...
onintr -
# This will let /bin/ls directory-sort my logfiles based on day & time.
# It also keeps the logfile names pretty unique.
mythbackend |& tee /....../mythbackend.log.`date "+%Y.%m.%d.%H.%M.%S"`
# Now we can control-c this (tcsh) process.
onintr
echo "`date`: mythtv-backend exited. Sleeping for 30 seconds, then restarting..."
sleep 30
end
p.s. That sleep will also save you in the event your subprocess dies immediately. Otherwise the constant respawning without delay will drive your IO and CPU through the roof, making it difficult to correct the problem.
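For those not running tcsh, the same supervise-and-sleep loop can be sketched in bash; here the command to supervise is taken from the command line and the date-stamped log name is a placeholder:

```shell
#!/bin/bash
# Usage: ./keepalive.sh command [args...]
# A bash sketch of the supervise-and-restart loop above.
ulimit -c 0        # bash equivalent of tcsh's "limit coredumpsize 0"
while :; do
    echo "$(date): starting: $*"
    # Date-stamped log names keep the files unique and ls-sortable,
    # as in the tcsh version.
    "$@" 2>&1 | tee "restart.log.$(date +%Y.%m.%d.%H.%M.%S)"
    echo "$(date): exited; sleeping 30 seconds, then restarting..."
    sleep 30
done
```

As the p.s. notes, the sleep is what saves you from a tight respawn loop when the supervised command dies immediately.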
My bash script runs a program in the background and waits for it to stop with the wait command. But there is a high probability that the background process will be killed because it takes too much memory. I want my script to react differently to a process that ended gracefully and to one that was killed. How do I check this condition?
Make sure your command signals success (with exit code 0) when it succeeds, and failure (non-zero) when it fails.
When a process is killed with SIGKILL by the OOM killer, signaling failure is automatic. (The shell will consider the exit code of signal terminated processes to be 128 + the signal number, so 128+9=137 for SIGKILL).
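The 128 + signal rule is easy to verify: SIGKILL is signal 9, so a SIGKILLed process is reported as 137. A quick demonstration:

```shell
#!/bin/bash
sleep 60 &
pid=$!
kill -9 "$pid"           # stand-in for the OOM killer's SIGKILL
wait "$pid"
echo "exit status: $?"   # 128 + 9 = 137
```

So in the if wait $pid test below, an OOM-killed process lands in the failure branch automatically, and you can distinguish "killed" from an ordinary failure by checking for a status above 128.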
You then use the fact that wait somepid exits with the same code as the command it waits on in an if statement:
yourcommand &
pid=$!
....
if wait $pid
then
echo "It exited successfully"
else
echo "It exited with failure"
fi
Usually they shut down with a signal; try having a signal handler function to deal with unexpected shutdowns, or in the worst case run another monitoring process, like a task manager.
Did you try anything?
By the way, some signals cannot be caught at all, such as SIGKILL and SIGSTOP. SIGSEGV (segmentation fault) can be caught, though there is usually little you can safely do in the handler.
A simpler solution is:
yourcommand
status=$?
if [ $status -eq 0 ]; then
    echo "It exited successfully"
else
    echo "It exited with failure, exit code $status"
fi