qsub slow on cluster with a single computer

Torque is installed on a single computer, which is used as both the head node
and the compute node. I didn't install Maui for job scheduling, but use Torque's built-in scheduler.
I find qsub is slow when submitting many jobs, for example:
for i in *tt.sh
do
    echo "$i"
    qsub "$i"
done
It takes a while to submit the jobs toward the end of the script list.
This happens even when the computer is under low load. The submission
is slow even with merely 70 scripts in the list.
Are there some options I could tweak in Torque, or do I have to install Maui
for job scheduling?
Thanks!
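For comparison, Torque also supports job arrays (qsub -t, the same PBS_ARRAYID mechanism discussed in the last answer below), which replace many qsub calls with a single submission. A minimal sketch, assuming roughly 70 *tt.sh scripts that can be indexed by the array ID; the wrapper name run_array.sh is hypothetical:
#!/bin/bash
#PBS -t 0-69
# run_array.sh (hypothetical wrapper): one array task per *tt.sh script;
# adjust the -t range to the actual number of scripts.
cd "$PBS_O_WORKDIR"                # array tasks start in $HOME by default
SCRIPTS=( *tt.sh )                 # the same scripts the loop above submits one by one
bash "${SCRIPTS[$PBS_ARRAYID]}"    # PBS_ARRAYID is set by Torque for each array task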

Related

single slurm array vs multiple sbatch calls

I can run N embarrassingly parallel jobs by using a slurm array like:
#SBATCH --array=1-N
Alternatively, I think I can achieve the same thing from a scheduling perspective (i.e. jobs scheduled independently and as soon as resources become available) by manually launching 8 jobs, for example with a simple bash script containing a loop.
Since the latter is far more flexible, I don't see the utility of using the --array option built into slurm.
Am I missing something?
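For concreteness, a minimal sketch of the two approaches being compared; the script name job.sh and the count of 8 tasks are placeholders:
# Array version: a single submission creates all the tasks;
# job.sh reads $SLURM_ARRAY_TASK_ID to pick its parameter.
sbatch --array=1-8 job.sh

# Loop version: separate submissions, passing the index as an argument.
for i in $(seq 1 8); do
    sbatch job.sh "$i"
done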
Arrays offer a simple way to create parametrised jobs without writing the Bash loop. It
(obviously) creates the jobs and assigns them a parameter;
takes care of output file name parametrisation;
makes the submission of a dependent job, which should run after all those jobs are completed, much easier;
makes the output of squeue less cluttered.
Furthermore, the jobs in an array can be managed as a whole: the squeue, scancel, etc. commands can work on the whole array, as opposed to writing another loop to cancel them one by one. This is even more interesting when you have multiple arrays running at the same time; you do not need to track each individual job yourself.
Finally, especially for large arrays, it makes the scheduler's work easier and can increase the job throughput.
If you need flexibility, then job arrays are not the solution, but maybe a workflow manager could help you.
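As an illustration of the points above, a sketch of how a whole array can be inspected, cancelled, and depended on at once; job.sh and postprocess.sh are placeholders:
# Submit an array and capture its job ID
JOBID=$(sbatch --parsable --array=1-100 job.sh)

# Inspect every task of the array with a single command
squeue -j "$JOBID"

# Submit a follow-up job that starts only after all array tasks finish successfully
sbatch --dependency=afterok:"$JOBID" postprocess.sh

# If needed, cancel the whole array with a single command
scancel "$JOBID"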

Automatic qsub job completion status notification

I have a shell script that calls five other scripts. The first script creates 50 qsub jobs on the cluster. Individual job execution time varies from a couple of minutes to an hour. I need to know when all 50 jobs have finished, because after they complete I need to run the second script. How do I find out whether all the qsub jobs are completed or not? One possible solution is an infinite loop that checks the job status with the qstat command and the job IDs, but then I would need to check the job status continuously, which is not an excellent solution. Is it possible for a qsub job to notify me by itself when it finishes, so that I don't need to monitor the job status frequently?
qsub is capable of handling job dependencies, using -W depend=afterok:jobid.
e.g.
#!/bin/bash
# commands to run on the cluster
COMMANDS="script1.sh script2.sh script3.sh"
# initialize the JOBIDS variable
JOBIDS=""
# queue all commands
for CMD in $COMMANDS; do
    # queue the command and append its job id to the list
    JOBIDS="$JOBIDS:$(qsub $CMD)"
done
# queue post-processing, dependent on all the submitted jobs
# ($JOBIDS already starts with a colon, e.g. ":123.server:124.server")
qsub -W depend=afterok$JOBIDS postprocessing.sh
exit 0
More examples can be found here: http://beige.ucs.indiana.edu/I590/node45.html
I have never heard of a way to do that, and I would be really interested if someone came up with a good answer.
In the meantime, I suggest you use file tricks: either have your script output a file at the end, or check for the existence of the log files (assuming they are created only at the end).
while [ ! -e ~/logs/myscript.log-1 ]; do
    sleep 30
done

Running a multi-stage job using SLURM

I am new to SLURM. My problem is that I have a multi-stage job, which needs to be run on a cluster, whose jobs are managed by SLURM. Specifically I want to schedule a job which:
Grabs N nodes,
Installs software on all of them,
(once all nodes finish the installation successfully) Creates a database instance on the nodes,
Loads the database,
(once loading is done successfully) Runs a set of queries, for benchmarking purposes,
Drops the database and returns the nodes.
Each step could be run using a separate bash script, while the execution of the scripts and the transitions between stages are coordinated by a master node.
My problem is that I know how to allocate nodes and call a single command or script on each of them (which then runs as a stand-alone job on each node) using SLURM. But as soon as the command (or the called script) finishes on a node, that node returns to the pool of free resources and leaves my job's allocation. The use case above, however, involves several stages/scripts and needs coordination between them.
I am wondering what the correct way is to design/run a set of scripts for such a use case, using SLURM. Any suggestion or example would be extremely helpful, and highly appreciated.
You simply need to encapsulate all your scripts into a single one for submission:
#!/bin/bash
#SBATCH --nodes=4 --exclusive
# Setting Bash to exit whenever a command exits with a non-zero status.
set -e
set -o pipefail
echo "Installing software on each of $SLURM_NODELIST"
srun ./install.sh
echo "Creating database instance"
./createDBInstance.sh $SLURM_NODELIST
echo "Loading DB"
./loadDB.sh params
echo Benchmarking
./benchmarks.sh params
echo Done.
You'll need to fill in the blanks... Make sure that your scripts follow the standard practice of exiting with a non-zero status on error.
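As an example of that last point, here is a sketch of what one of the sub-scripts might look like; install.sh and the installation steps are purely hypothetical:
#!/bin/bash
# install.sh (hypothetical sub-script): runs once per node under srun
set -e                           # stop this instance at the first failing command
tar xzf software.tar.gz          # placeholder installation steps
./configure --prefix="$HOME/soft"
make
make install
# Reaching this point means success: the script exits with status 0, so srun
# (and therefore the submission script, which also uses set -e) continues.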

Force qsub (PBS) to wait the job's end before exiting

I've been using Sun Grid Engine to run my jobs on a node of a cluster.
Usually I would wait for the job to complete before exiting and I use:
qsub -sync yes perl Script.pl
However, I now use PBS Pro 10.4 instead of Sun Grid Engine,
and I am not able to find an option corresponding to -sync.
Could someone help me?
Thanks in advance
PBS Pro doesn't have a -sync equivalent, but you might be able to use the
-I (interactive) option combined with expect to tell it what code to run in order to get the same effect.
The equivalent of -sync for PBS is -Wblock=true.
This prevents qsub from exiting until the job has completed. It is perhaps unusual to need this, but I found it useful when using some software that was not designed for HPC. The software executes multiple instances of a worker program, which run simultaneously. However, it then has to wait for one (or sometimes more) of the instances to complete, and do some work with the results, before spawning the next. If the worker program completes without writing a particular file, it is assumed to have failed. I was able to write a wrapper script for the worker program, to qsub it, and used the -Wblock=true option to make it wait for the worker program job to complete.
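A sketch of that wrapper pattern; the script names and the result file are hypothetical:
#!/bin/bash
# run_worker.sh (hypothetical wrapper invoked by the non-HPC software)
# Submit the real worker as a PBS job and block until it finishes, so the
# caller behaves as if the worker had run locally.
qsub -Wblock=true worker_job.sh
# qsub only returns once the job has completed; check for the result file
# the software expects (results.out is a placeholder) before reporting success.
[ -e results.out ]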

Does a PBS batch system move multiple serial jobs across nodes?

If I need to run many serial programs "in parallel" (because the problem is simple but time consuming - I need to read in many different data sets for the same program), the solution is simple if I only use one node. All I do is keep submitting serial jobs with an ampersand after each command, e.g. in the job script:
./program1 &
./program2 &
./program3 &
./program4
which will naturally run each serial program on a different processor. This works well on a login server or standalone workstation, and of course for a batch job asking for only one node.
But what if I need to run 110 different instances of the same program to read 110 different data sets? If I submit to multiple nodes (say 14) with a script which submits 110 ./program# commands, will the batch system run each job on a different processor across the different nodes, or will it try to run them all on the same 8-core node?
I have tried to use a simple MPI code to read different data, but various errors result, with about 100 out of the 110 processes succeeding, and the others crashing. I have also considered job arrays, but I'm not sure if my system supports it.
I have tested the serial program extensively on individual data sets - there are no runtime errors, and I do not exceed the available memory on each node.
No, PBS won't automatically distribute the jobs among nodes for you. But this is a common thing to want to do, and you have a few options.
Easiest and in some ways most advantageous for you is to bunch the tasks into one-node-sized chunks, and submit those bundles as individual jobs. This will get your jobs started faster; a 1-node job will normally get scheduled faster than a (say) 14-node job, simply because there are more one-node-sized holes in the schedule than 14-node-sized ones. This works particularly well if all the jobs take roughly the same amount of time, because then doing the division is pretty simple.
If you do want to do it all in one job (say, to simplify the bookkeeping), you may or may not have access to the pbsdsh command; there's a good discussion of it here. This lets you run a single script on all the processors in your job. You then write a script which queries $PBS_VNODENUM to find out which of the nnodes*ppn jobs it is, and runs the appropriate task.
If pbsdsh is not available, GNU Parallel is another tool which can enormously simplify these tasks. It's like xargs, if you're familiar with that, but will run commands in parallel, including on multiple nodes. So you'd submit your (say) 14-node job and have the first node run a GNU Parallel script. The nice thing is that this will do the scheduling for you even if the jobs are not all of the same length. The advice we give to users on our system for using GNU Parallel for these sorts of things is here. Note that if GNU Parallel isn't installed on your system, and for some reason your sysadmins won't install it, you can set it up in your home directory; it's not a complicated build.
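A rough sketch of the $PBS_VNODENUM approach described above; the wrapper name task.sh and the data-set naming are placeholders:
#!/bin/bash
# task.sh (hypothetical): launched on every processor of the job, e.g. with
#   pbsdsh /bin/bash $PBS_O_WORKDIR/task.sh
cd "$PBS_O_WORKDIR"
# pbsdsh numbers the job's processors 0 .. nodes*ppn-1 in PBS_VNODENUM;
# with 14 nodes x 8 cores that is 112 slots for the 110 data sets.
i=$PBS_VNODENUM
if [ "$i" -lt 110 ]; then
    ./program "dataset_$i"       # data-set naming is a placeholder
fi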
You should consider job arrays.
Briefly, you insert #PBS -t 0-109 in your shell script (where the range 0-109 can be any integer range you want, but you stated you had 110 datasets) and Torque will:
run 110 instances of your script, allocating each with the resources you specify (in the script with #PBS tags or as arguments when you submit).
assign a unique integer from 0 to 109 to the environment variable PBS_ARRAYID for each job.
Assuming you have access to environment variables within the code, you can just tell each job to run on data set number PBS_ARRAYID.
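A minimal sketch of such an array script; the data-set naming and the resource request are placeholders:
#!/bin/bash
#PBS -t 0-109
#PBS -l nodes=1:ppn=1
# Each of the 110 array tasks receives its own PBS_ARRAYID (0..109)
cd "$PBS_O_WORKDIR"
./program "dataset_$PBS_ARRAYID"    # data-set naming is a placeholder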
