How can I make a ksh script terminate itself if there are any issues?

I have written about six ksh scripts.
They handle large data files, around 207 MB each. While a script is running, it sometimes gets stuck and never ends.
Human intervention is then required.
In the production environment, I want it to run automatically and to end automatically if there are any issues, without any human intervention.
If there is an issue with a file, the script should end and start processing the next file.
How can I make it terminate itself if it gets stuck?

I assume that the only way you notice an issue is that the script takes too long. In that case, a simple script that kills the process after a timeout should be sufficient:
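#!/bin/bash
# killerscript: kill process PID if it is still running after TIME seconds
PID=$1
TIME=$2
typeset -i i
i=0
while [ $i -lt $TIME ] ; do
# ps -p succeeds only while the process still exists
if ps -p "$PID" > /dev/null ; then
i=$i+1
sleep 1
else
exit 0
fi
done
kill "$PID"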
Your workflow would then be something like:
#!/bin/bash
process_1 &
killerscript $! 60
process_2 &
killerscript $! 30
...
If you have other ways to detect issues in your processes, you can easily add them to the loop in your killerscript.
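If GNU coreutils is available, its timeout command can take over the role of the killer script. The following is only a minimal sketch under that assumption; run_with_limit, process_file.ksh and the data_*.dat pattern are hypothetical names, not part of the original scripts.

```shell
#!/bin/bash
# run_with_limit SECONDS CMD [ARG...]: run CMD, but give up after SECONDS.
# Assumes GNU coreutils `timeout`; the function name is hypothetical.
run_with_limit() {
    local limit=$1; shift
    # timeout sends SIGTERM after $limit seconds; -k 5 follows up with
    # SIGKILL 5 seconds later if the command ignores SIGTERM.
    timeout -k 5 "$limit" "$@"
    local status=$?
    # timeout exits with status 124 when the time limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Hypothetical usage: process each file, moving on to the next one
# whenever a run fails or hangs for more than 60 seconds.
# for f in data_*.dat; do
#     run_with_limit 60 ./process_file.ksh "$f" || echo "skipping $f" >&2
# done
```

Because the loop only looks at the exit status, a file that hangs and a file that fails outright are treated the same way: the script logs it and continues with the next file.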

Related

Parallel run and wait for processes from subshell

Hi all. I'm trying to build something like the parallel tool for the shell, simply because the functionality of parallel is not enough for my task. The reason is that I need to run different versions of a compiler.
Imagine that I need to compile 12 programs with different compilers, but I can run only 4 of them simultaneously (otherwise the PC runs out of memory and crashes :). I also want to be able to observe what's going on with each compile, so I execute every compile in a new window.
To keep things simple here, I'll replace the compiler with a small script, sleep.sh, that waits and then prints its process ID:
#!/bin/bash
sleep 30
echo $$
So the main script, parallel_run.sh, should look like this:
#!/bin/bash
for i in {0..11}; do
xfce4-terminal -H -e "./sleep.sh" &
pids[$i]=$!
pstree -p $pids
if (( $i % 4 == 0 ))
then
for pid in ${pids[*]}; do
wait $pid
done
fi
done
The problem is that $! gives me the PID of xfce4-terminal and not of the program it executes. So if I look at the pstree output for the first iteration, I see this from the main script:
xfce4-terminal(31666)----{xfce4-terminal}(31668)
|--{xfce4-terminal}(31669)
and sleep.sh says that it had pid = 30876 at that time. Thus wait doesn't work at all in this case.
Q: How do I get the right PID of the compiler that runs in the subshell?
Maybe there is another way to solve a task like this?
It seems there is no way to trace the PID from parent to child if you start the process in a new xfce4-terminal, because the terminal process dies right after it has executed the given command. So I came to a solution that is not perfect but acceptable in my situation. I run the compiler processes in the background and redirect their output to .log files. Then I run tail on these log files, and when the compilers from the current batch are done, I kill all tail processes that belong to the current $USER and start the next batch.
#!/bin/bash
for i in {1..8}; do
./sleep.sh > ./process_$i.log &
prcid=$!
xfce4-terminal -e "tail -f ./process_$i.log" &
pids[$i]=$prcid
if (( $i % 4 == 0 ))
then
for pid in ${pids[*]}; do
wait $pid
done
killall -u $USER tail
fi
done
Hopefully there will be no other tails running at that time :)
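For the "at most 4 at a time" part of the problem, bash's own job control may be enough. The sketch below assumes bash 4.3 or later (for wait -n); run_limited is a hypothetical helper name, and it starts a new job as soon as any running one finishes instead of waiting for a whole batch.

```shell
#!/bin/bash
# run_limited MAX CMD ARG...: run "CMD ARG" once per ARG, at most MAX at a time.
# Assumes bash >= 4.3 for `wait -n`; the function name is hypothetical.
run_limited() {
    local max=$1 cmd=$2; shift 2
    local arg
    for arg in "$@"; do
        # While MAX jobs are already running, block until any one of them exits.
        while [ "$(jobs -rp | wc -l)" -ge "$max" ]; do
            wait -n
        done
        "$cmd" "$arg" &
    done
    # Wait for the jobs that are still running.
    wait
}

# Hypothetical usage, with compile_one being a script or function that
# writes its own log file:
# run_limited 4 ./compile_one prog{1..12}
```

Since the helper waits on its own background jobs, no PID bookkeeping is needed, and the log files can still be followed with tail -f in separate terminals.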

Ending an mpirun process terminates a bash loop

I'm trying to schedule a series of mpi jobs on an Ubuntu 14.04 LTS machine using a bash script. Basically, I want a simulation to run on every core for a certain amount of time, then terminate and move on to the next case once that time has elapsed.
My issue arises when mpi exits at the end of the first job - it breaks the loop and returns the terminal to my control instead of heading onto the next iteration of the loop.
My script is included below. The file "case_names" is just a text file of directory names. I've tested the script with other commands and it works fine until I uncomment the mpirun call.
#!/bin/bash
while read line;
do
# Access case directory
cd $line
echo "Case $line accessed"
# Start simulation
echo "Case $line starting: $(date)"
mpirun -q -np 8 dsmcFoamPlus -parallel > log.dsmcFoamPlus &
# Wait for 10 hour runtime
sleep 36000
# Kill job
pkill mpirun > /dev/null
echo "Case $line terminated: $(date)"
# Return to parent directory
cd ..
done < case_names
Does anyone know of a way to stop mpirun from breaking the loop like this?
So far I've tried GNOME task scheduler and task-spooler, but neither have worked (likely due to aliases that have to be invoked before the commands I use become available). I'd really rather not have to resort to setting up slurm. I've also tried using the disown command to separate the mpi process from the shell I'm running the scheduling script in, and have even written a separate script just to kill processes which the scheduling script runs remotely.
Many thanks in advance!
I've managed to find a workaround that allows me to schedule tasks with a bash script like I wanted. Since this solves my issue, I'm posting it as an answer (although I would still welcome an explanation as to why mpi behaves in this way in loops).
The solution lay in writing a separate script for both calling and then killing mpi, which would itself be called by the scheduling script. Since this child bash process has no loops in it, there are no issues with mpi breaking them after being killed. Also, once this script has exited, the scheduling loop can continue unimpeded.
My (now working) code is included below.
Scheduling script:
while read line;
do
cd $line
echo "CWD: $(pwd)"
echo "Case $line accessed"
bash ../run_job
echo "Case $line terminated: $(date)"
cd ..
done < case_names
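Execution script (run_job):
mpirun -q -np 8 dsmcFoamPlus -parallel > log.dsmcFoamPlus &
# $line from the scheduling script is not visible in this child process,
# so report the current (case) directory instead
echo "Case $(basename "$PWD") starting: $(date)"
sleep 600
pkill mpirun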
I hope someone will find this useful.
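One possible refinement of the run_job idea: pkill mpirun signals every mpirun process you are allowed to signal, and the fixed sleep always waits the full time even if the job ends early. The sketch below remembers the PID from $! instead; run_for is a hypothetical name, it has not been tested with mpirun, and it does not explain or necessarily avoid the loop-breaking behaviour described above.

```shell
#!/bin/bash
# run_for SECONDS CMD [ARG...]: run CMD in the background, stop it after
# SECONDS if it is still running, and return its exit status.
# The function name is hypothetical.
run_for() {
    local limit=$1; shift
    "$@" &
    local pid=$!
    local waited=0
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$limit" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    # Signal only the process started here, not everything with the same name.
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid"
    fi
    wait "$pid"
}

# Hypothetical usage inside the case directory:
# run_for 36000 mpirun -q -np 8 dsmcFoamPlus -parallel > log.dsmcFoamPlus
```

Polling once per second keeps the helper simple; a job that finishes before the limit lets the schedule move on immediately.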

How to check if a file's size is not increasing and, if not, kill the script ($$)

I am trying to figure out a way to monitor the files my script is dumping. If the child files stop growing, the script should be killed.
I am doing this to free up resources when they are not needed. Here is what I came up with, but I think my approach is going to add load on the CPU. Can anyone please suggest a more efficient way of doing this?
The script below is supposed to poll every 15 seconds and take two size samples of the same file; if the two samples are the same, it exits.
checkUsage() {
while true; do
sleep 15
fileSize=$(stat -c%s $1)
sleep 10;
fileSizeNew=$(stat -c%s $1)
if [ "$fileSize" == "$fileSizeNew" ]; then
echo -e "[Error]: No activity noted on this window from 5 sec. Exiting..."
kill -9 $$
fi
done
}
And I am planning to call it as follow (in background):
checkUsage /var/log/messages &
A solution that monitors a tail command and exits if tail prints nothing would also work for me. To clarify, the end goal of this question is to check whether some file has been edited in the last 15 seconds and, if not, exit or throw an error.
I have achieved this with the script above, but I don't know if this is the smartest way of achieving it. I am asking to find out whether there is an alternative or better way of doing it.
I would base the check on file modification time instead of size, so something like this (untested code):
checkUsage() {
while true; do
# Test whether the file's mtime is more than 'second arg' seconds old (default: 10 seconds)
if [ $(( $(date +"%s") - $(stat -c%Y /var/log/messages) )) -gt ${2-10} ]; then
echo -e "[Error]: No activity noted on this window for ${2-10} sec. Exiting..."
return 1
fi
#Sleep 'first arg' second, 15 seconds by default
sleep ${1-15}
done
}
The idea is to compare the file's mtime with the current time; if the difference is greater than the second argument (in seconds), print the message and return.
I would then call it like this (or with no arguments to use the defaults):
checkUsage 20 10 || exit 1
This exits the script with code 1 once the function returns from its loop, which otherwise keeps running for as long as the file is being modified.
Edit: rereading my answer, the target file could be a parameter too, to make the function more reusable; that is left as an exercise for the reader.
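One way the file-as-parameter variant might look is sketched below. It assumes GNU stat (the -c %Y format, as used above); is_stale is a hypothetical name, and the polling loop is left to the caller.

```shell
#!/bin/bash
# is_stale FILE [MAX_AGE]: succeed if FILE was last modified more than
# MAX_AGE seconds ago (default 10). Returns 2 if FILE cannot be examined.
# Assumes GNU stat; the function name is hypothetical.
is_stale() {
    local file=$1 max_age=${2:-10}
    local now mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 2
    [ $(( now - mtime )) -gt "$max_age" ]
}

# Hypothetical usage: poll every 15 seconds, stop once the file has been
# idle for more than 10 seconds.
# while ! is_stale /var/log/messages 10; do sleep 15; done
# echo "[Error]: no activity noted, exiting" >&2
# exit 1
```

Keeping the check separate from the loop lets the same function be reused for several files or with different intervals.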
If you are on Linux and the file is on a local file system (Ext4, BTRFS, ...), not a network file system, then you could consider the inotify(7) facilities: something could be triggered when some file or directory changes or is accessed.
In particular, you might set up an incron job through an incrontab(5) file; maybe it could communicate with some other job ...
PS. I am not sure I understand what you really want to do...
I suppose an external programme is modifying /var/log/messages.
If this is the case, below is my script (with minor changes to yours)
#!/bin/bash
#Bash script to monitor changes to file
checkUsage() # Note that U is in caps
{
while true
do
sleep 15
fileSize=$(stat -c%s $1)
sleep 10;
fileSizeNew=$(stat -c%s $1)
if [ "$fileSize" == "$fileSizeNew" ]
then
echo -e "[Notice : ] no changes noted in $1 : gracefully exiting"
exit # previously this was kill -9 $$
# changing this to exit would end the program gracefully.
# use kill -9 to kill a process which is not under your control.
# kill -9 sends the SIGKILL signal.
fi
done
}
checkUsage $1 # I have added this to your script
#End of the script
Save the script as checkusage and run it like :
./checkusage /var/log/messages &
Edit :
Since you're looking for better solutions I would suggest inotifywait, thanks for the suggestion from the other answerer.
Below would be my code :
while inotifywait -t 10 -q -e modify $1 >/dev/null
do
sleep 15 # as you said the polling would happen in 15 seconds.
done
echo "Script exited gracefully : $1 has not been changed"
Below are the details from the inotifywait manpage:
-t <seconds>, --timeout <seconds>
Exit if an appropriate event has not occurred within <seconds> seconds. If <seconds> is zero (the default), wait indefinitely for an event.
-e <event>, --event <event>
Listen for specific event(s) only. The events which can be listened for are listed in the EVENTS section. This option can be specified more than once. If omitted, all events are listened for.
-q, --quiet
If specified once, the program will be less verbose. Specifically, it will not state when it has completed establishing all inotify watches.
modify (event)
A watched file or a file within a watched directory was written to.
Notes
You might have to install inotify-tools first to use the inotifywait command. Check the inotify-tools page on GitHub.

Linux infinite loop with background

I have a little script called "CheekyScript.sh" that looks something like this:
#!/bin/bash
nohup mvn run_something_pretty_long
This clearly works fine, as it starts a long process in the background that continues running after the session has expired and the user has logged out.
What I wish to achieve is pretty simple: introduce a little infinite loop, so that this process is run over and over again, but each run starts only AFTER the previous nohup command has completed. Of course, I still want this entire bash script, and the nohup within it, to keep running long after the session has expired and I'm logged out.
I was thinking of something like this:
#!/bin/bash
while true
do
nohup mvn run_something_pretty_long
sleep 60
done
Obviously, what this does is start the nohup process every 60 seconds. The desired behaviour would be to wait for the nohup command to finish, wait a minute, and start the loop again.
I was wondering what is the best practice solution for something like this?
Thank you very much in advance.
Use crontab and add an entry like this:
* * * * * /path/to/something
In the something script:
#!/bin/bash
LOCKFILE=/var/lock/mvn.lock
[ -f $LOCKFILE ] && exit 0
# Upon exit, remove the lockfile; the script's own exit status is preserved.
trap 'rm -f "$LOCKFILE"' EXIT
touch $LOCKFILE
mvn run_something_pretty_long
exit 0
This tries to run the script once a minute; most attempts exit immediately because the lockfile exists. But once the script has finished, the lockfile is gone and the next attempt starts it again.
By default, cron emails all output to the user that owns the job.
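Testing for the lockfile and then creating it are two separate steps, so two runs starting at almost the same moment could both get through. If flock(1) from util-linux is available (an assumption), the kernel can do the locking instead. A minimal sketch, with run_locked as a hypothetical wrapper name:

```shell
#!/bin/bash
# run_locked LOCKFILE CMD [ARG...]: run CMD only if no other process
# currently holds a lock on LOCKFILE.
# Assumes flock(1) from util-linux; the function name is hypothetical.
run_locked() {
    local lock=$1; shift
    # -n: fail immediately (non-zero status) instead of waiting when the
    # lock is already held. The lock is released when CMD exits, even if
    # CMD crashes, so no cleanup trap is needed.
    flock -n "$lock" "$@"
}

# Hypothetical usage from the cron-driven script:
# run_locked /var/lock/mvn.lock mvn run_something_pretty_long
```

The same idea also fits directly into the crontab entry, since flock takes the command to run as its arguments.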
You want to run your long running script either once, or repeatedly. And you want to run both of these using nohup. Since you already have one script that handles the first (run once) case, make two copies of your "CheekyScript.sh". The first one runs once, and the second you edit to run repeatedly (and can optionally check for a done condition).
This one runs once,
#!/bin/bash
#CheekyScriptOnce.sh
nohup mvn run_something_pretty_long
This one runs repeatedly,
#!/bin/bash
#CheekyRepeat.sh
thing="mvn run_something_pretty_long"
delay=60;
nohup bash -c "while true ; do $thing; sleep $delay; done"
But you want some way to signal done. A control file can handle that,
#!/bin/bash
#CheekyRepeatConditional.sh
thing="mvn run_something_pretty_long"
delay=60;
if [ ! -d etc ] ; then mkdir etc; fi
touch etc/Cheeky.run
nohup bash -c "while [ -f etc/Cheeky.run ] ; do $thing; sleep $delay; done"

Check if process runs if not execute script.sh

I am trying to find a way to monitor a process. If the process is not running, it should be checked again to make sure it has really crashed. If it has really crashed, a script (start.sh) should be run.
I have tried monit with no success. I have also tried adding the script below to crontab, after making it executable with chmod +x monitor.sh.
The actual program is called program1.
case "$(pidof program1 | wc -w)" in
0) echo "Restarting program1: $(date)" >> /var/log/program1_log.txt
/home/user/files/start.sh &
;;
1) # all ok
;;
*) echo "Removed double program1: $(date)" >> /var/log/program1_log.txt
kill $(pidof program1 | awk '{print $1}')
;;
esac
The problem is that this script does not work. I added it to crontab and set it to run every 2 minutes, but if I close the program it does not restart.
Is there any other way to check a process, and run start.sh when it has crashed?
Not to be rude, but have you considered a more obvious solution?
When a shell (e.g. bash or tcsh) starts a subprocess, by default it waits for that subprocess to complete.
So why not have a shell that runs your process in a while(1) loop? Whenever the process terminates, for any reason, legitimate or not, it will automatically restart your process.
I ran into this same problem with mythtv. The backend keeps crashing on me. It's a Heisenbug. Happens like once a month (on average). Very hard to track down. So I just wrote a little script that I run in an xterm.
The, ahh, onintr business means that control-c will terminate the subprocess and not my (parent-process) script. Similarly, the sleep is in there so I can control-c several times to kill the subprocess and then kill the parent-process script while it's sleeping...
Coredumpsize is limited just because I don't want to fill up my disk with corefiles that I cannot use.
#!/bin/tcsh -f
limit coredumpsize 0
while( 1 )
echo "`date`: Running mythtv-backend"
# Now we cannot control-c this (tcsh) process...
onintr -
# This will let /bin/ls directory-sort my logfiles based on day & time.
# It also keeps the logfile names pretty unique.
mythbackend |& tee /....../mythbackend.log.`date "+%Y.%m.%d.%H.%M.%S"`
# Now we can control-c this (tcsh) process.
onintr
echo "`date`: mythtv-backend exited. Sleeping for 30 seconds, then restarting..."
sleep 30
end
p.s. That sleep will also save you in the event your subprocess dies immediately. Otherwise the constant respawning without delay will drive your IO and CPU through the roof, making it difficult to correct the problem.
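The same restart loop can be written in bash. The sketch below is an approximation of the tcsh script rather than a translation: it leaves out the onintr handling and the core-dump limit, keep_running is a hypothetical name, and the optional run limit is an addition that makes the loop easier to try out.

```shell
#!/bin/bash
# keep_running MAX DELAY CMD [ARG...]: run CMD, and whenever it exits wait
# DELAY seconds and start it again. Stops after MAX runs; MAX=0 means no
# limit. The function name and the MAX parameter are hypothetical.
keep_running() {
    local max=$1 delay=$2; shift 2
    local runs=0
    while [ "$max" -eq 0 ] || [ "$runs" -lt "$max" ]; do
        echo "$(date): starting $*"
        "$@"
        # $? is expanded before echo runs, so it is still CMD's exit status.
        echo "$(date): $1 exited with status $?"
        runs=$((runs + 1))
        # The delay keeps a command that dies immediately from respawning
        # in a tight loop.
        sleep "$delay"
    done
}

# Hypothetical usage for the question above:
# keep_running 0 30 /home/user/files/start.sh
```

This only works if the command stays in the foreground; a start script that launches the program in the background and returns would be restarted every DELAY seconds.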
