I have a shell script that calls five other scripts. The first script creates 50 qsub jobs on the cluster. Individual job execution times vary from a couple of minutes to an hour. I need to know when all 50 jobs have finished, because only after they are all complete can I run the second script. How can I find out whether all the qsub jobs are done? One possible solution is an infinite loop that checks each job's status with the qstat command and its job ID, but that means polling the job status continuously, which is not a great solution. Is it possible for a qsub job to notify me by itself once it finishes, so that I don't have to monitor the job status repeatedly?
qsub is capable of handling job dependencies, using -W depend=afterok:jobid.
e.g.
#!/bin/bash
# commands to run on the cluster
COMMANDS="script1.sh script2.sh script3.sh"
# initialize the JOBIDS variable
JOBIDS=""
# queue all commands
for CMD in $COMMANDS; do
    # queue the command and append its job id, colon-separated
    JOBIDS="$JOBIDS:$(qsub $CMD)"
done
# queue post-processing, dependent on the submitted jobs
# ($JOBIDS already starts with a colon, so none is added after afterok)
qsub -W depend=afterok$JOBIDS postprocessing.sh
exit 0
More examples can be found here: http://beige.ucs.indiana.edu/I590/node45.html
I have never heard of a way to do that, and I would be really interested if someone came up with a good answer.
In the meantime, I suggest you use file tricks: either have your script write a file at the end, or check for the existence of the log files (assuming they are created only at the end).
while [ ! -e ~/logs/myscript.log-1 ]; do
    sleep 30
done
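The same idea scales to the 50-job case if each job drops its own sentinel file when it finishes. A minimal sketch, assuming each job writes a hypothetical ~/logs/job-<n>.done file on completion and the follow-up script is called second_script.sh:
# Wait until all 50 sentinel files exist (the file naming scheme is an assumption)
while [ "$(ls ~/logs/job-*.done 2>/dev/null | wc -l)" -lt 50 ]; do
    sleep 30
done
# Now it is safe to start the follow-up work
./second_script.sh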
I used the following command to submit my dependent job.
qsub current_job_file -W depend=afterany:previous_job_id
Then I found that my current job is in status 'H', and it does not automatically run after the previous job finishes. Is this how it is supposed to be, or did I make a mistake somewhere? How can I make it run automatically after the previous job finishes?
I also tried the following command. The result is the same.
qsub -W depend=afterany:previous_job_id current_job_file
That is how it is supposed to be. If your current job depends on another job with an "after"-type dependency, it is held until the other job finishes (or starts, depending on which kind of dependency you used; in your case it is "afterany", so it waits for the other job to finish). The current job is then moved to the "Q" (queued) state for the PBS scheduler to consider it for running.
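A minimal sketch of that submit-and-check sequence (the job file names are the placeholders from the question):
# Submit the first job and capture the job id that qsub prints
prev_id=$(qsub previous_job_file)
# Submit the dependent job; it sits in the 'H' (held) state while the first job runs
cur_id=$(qsub -W depend=afterany:$prev_id current_job_file)
# After the first job finishes, the dependent job moves to 'Q' and is scheduled normally
qstat $cur_id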
I can't seem to get my script to run in parallel every minute via cron on Ubuntu 14.
I have created a cron job that executes every minute. The cron job runs a script that takes much longer than a minute to finish. When the next minute comes around, it seems the new cron execution overwrites the previous one. Is this correct? Any ideas are welcome.
I need concurrent, independently running jobs. The cron job runs a script that queries a MySQL database; the idea is to poll the database and, if there is work to do, execute the script in its own process.
cron will not stop a previous execution of a process to start a new one. cron will simply kick off the new process even though the old process is still running.
If you need cron to terminate the previous process, you'll need to modify your script to handle that itself.
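If you do want each new run to replace the previous one, one common pattern is a PID file; a minimal sketch, assuming a hypothetical /tmp/myscript.pid path, placed at the top of the script:
PIDFILE=/tmp/myscript.pid
# If an earlier instance is still alive, terminate it before continuing
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    kill "$(cat "$PIDFILE")"
fi
# Record our own PID so the next run can find us
echo $$ > "$PIDFILE"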
You need a locking mechanism to identify that the script is already running.
There are several ways of doing this but you need to be careful to use an atomic method.
I use lock directories, as creating a directory is guaranteed to be atomic:
LOCKDIR=/tmp/myproc.lock
if ! mkdir "$LOCKDIR" >/dev/null 2>&1
then
    echo "Processing already running - terminating" >&2
    exit 1
fi
trap 'rm -rf "$LOCKDIR"' EXIT
This is a common occurrence. Try adding a check in your script to see if a lockfile already exists. If it does, exit. If not, continue.
Cron jobs are not overwritten; they can, however, overlap. Unless your script explicitly kills a pre-existing process, a new run will not stop the previously running one.
However, introducing lockfiles will save you from all this confusion.
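As a concrete illustration of the lockfile idea, flock(1) from util-linux can take the lock atomically; a minimal sketch for the top of the cron-driven script (the lock path is an assumption):
#!/bin/bash
# Open the lock file on a spare file descriptor and take a non-blocking exclusive lock
exec 200>/tmp/myscript.lock
flock -n 200 || { echo "Another instance is already running" >&2; exit 1; }
# ... the actual work goes below; the lock is released when the script exits ...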
I am new to SLURM. My problem is that I have a multi-stage job that needs to run on a cluster whose jobs are managed by SLURM. Specifically, I want to schedule a job which:
Grabs N nodes,
Installs software on all of them,
(once all nodes finish the installation successfully) Creates a database instance on the nodes,
Loads the database,
(once loading is done successfully) Runs a set of queries, for benchmarking purposes,
Drops the database and returns the nodes.
Each step could be run using a separate bash script, with the execution of the scripts and the transitions between stages coordinated by a master node.
My problem is that I know how to allocate nodes and call a single command or script on each of them (running as a stand-alone job on each node) using SLURM. But as soon as the command (or the called script) finishes on a node, that node returns to the pool of free resources instead of staying allocated to my job. The above use case, however, involves several stages/scripts and needs coordination between them.
I am wondering what the correct way is to design/run a set of scripts for such a use case, using SLURM. Any suggestion or example would be extremely helpful, and highly appreciated.
You simply need to encapsulate all your scripts into a single one for submission:
#!/bin/bash
#SBATCH --nodes=4 --exclusive
# Setting Bash to exit whenever a command exits with a non-zero status.
set -e
set -o pipefail
echo "Installing software on each of $SLURM_NODELIST"
srun ./install.sh
echo "Creating database instance"
./createDBInstance.sh $SLURM_NODELIST
echo "Loading DB"
./loadDB.sh params
echo Benchmarking
./benchmarks.sh params
echo Done.
You'll need to fill in the blanks... Make sure that your scripts follow the standard convention of exiting with a non-zero status on error.
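Assuming the wrapper above is saved as, say, pipeline.sbatch (a name chosen here for illustration), the whole pipeline is then submitted as a single job:
sbatch pipeline.sbatch
squeue -u $USER    # watch the one job work through its stages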
The command 'qstat -a' outputs many lines of information for completed jobs, all with status 'C'. It seems that they will stay there forever. How can I clean up this unneeded job information, since those jobs are already completed? Thanks!
This is controlled by the qmgr parameter keep_completed. keep_completed specifies the number of seconds after completion that a job should remain visible. If you would like to delete a job immediately, without waiting this amount of time, you can execute
qdel -p <jobid>
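For reference, the retention window itself can be adjusted through qmgr (a sketch, assuming administrator access; 60 seconds is only an example value):
qmgr -c "set server keep_completed = 60"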
Type qstat -r to get only the running jobs.
I've been using Sun Grid Engine to run my jobs on a node of a cluster.
Usually I wait for the job to complete before exiting, and I use:
qsub -sync yes perl Script.pl
However, I no longer use Sun Grid Engine; I now use PBS Pro 10.4, and I'm not able to find an option corresponding to -sync.
Could someone help me?
Thanks in advance
PBS Pro doesn't have a -sync equivalent, but you might be able to get the same effect by combining the -I (interactive) option with expect to tell it what code to run.
The equivalent of -sync for PBS is -Wblock=true.
This prevents qsub from exiting until the job has completed. It is perhaps unusual to need this, but I found it useful when using some software that was not designed for HPC. The software executes multiple instances of a worker program, which run simultaneously. However, it then has to wait for one (or sometimes more) of the instances to complete, and do some work with the results, before spawning the next. If the worker program completes without writing a particular file, it is assumed to have failed. I was able to write a wrapper script for the worker program, to qsub it, and used the -Wblock=true option to make it wait for the worker program job to complete.
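As an illustration of that last approach, a minimal sketch of such a wrapper (the script and output file names here are hypothetical):
#!/bin/bash
# Submit the worker job; -Wblock=true keeps qsub from returning until the job completes
qsub -Wblock=true run_worker.sh
# The worker is treated as failed if it did not write its expected output file
if [ ! -e worker_output.dat ]; then
    echo "Worker job failed" >&2
    exit 1
fi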