Understanding the -t option in qsub (PBS)

The documentation is a bit unclear about exactly what the -t option does on a job submission with qsub:
http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/qsub.htm
From the documentation:
-t Specifies the task ids of a job array. Single task arrays are allowed.
The array_request argument is an integer id or a range of integers.
Multiple ids or id ranges can be combined in a comma delimited list.
Examples: -t 1-100 or -t 1,10,50-100
Here's an example where things go wrong. I requested 2 nodes, 8 processes per node, and an array of 16 jobs, which I had hoped would be distributed naturally across the 2 nodes, but the 16 tasks were spread ad hoc across more than 2 nodes.
$ echo 'hostname' | qsub -q gpu -l nodes=2:ppn=8 -t 1-16
52727[]
$ cat STDIN.o52727-* | sort
gpu-3.local
gpu-3.local
gpu-3.local
gpu-3.local
gpu-5.local
gpu-5.local
gpu-5.local
gpu-5.local
gpu-5.local
gpu-5.local
gpu-7.local
gpu-7.local
gpu-7.local
gpu-7.local
gpu-7.local
gpu-7.local

I suspect this will not completely answer your question, since what exactly you hope to accomplish remains unclear.
Specifying an array with qsub -t simply creates individual jobs that all share the same primary ID. Submitting the way you did creates 16 jobs, each requesting 16 cores in total (2 nodes × 8 ppn). The syntax merely makes it easier to submit a large number of jobs at once, without having to script a submission loop.
With Torque alone (i.e., disregarding the scheduler), you can force jobs to specific nodes by saying something like this:
qsub -l nodes=gpu-node01:ppn=8+gpu-node02:ppn=8
A more advanced scheduler gives you greater flexibility: Moab and Maui, for example, accept -l nodes=2:ppn=8,nallocpolicy=exactnode, which applies NODEALLOCATIONPOLICY EXACTNODE to the job when scheduling and gives you 8 cores each on exactly two nodes (any two nodes, in this case).
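For instance, combining that policy with the array request from the question might look like the following (a sketch only; it assumes a Moab/Maui-scheduled Torque cluster and the same gpu queue as above):
echo 'hostname' | qsub -q gpu -l nodes=2:ppn=8,nallocpolicy=exactnode -t 1-16
Each of the 16 array tasks then gets 8 cores on each of exactly two nodes, rather than whatever mix of nodes the default policy happens to pick.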

Related

How to determine job array size for lots of jobs?

What is the best way to process lots of files in parallel via Slurm?
I have a lot of files (say 10000) in a folder, and each one takes about 10 seconds to process. I naturally want the sbatch job array to cover all of them (#SBATCH --array=1-10000%100), but it seems I can't go beyond a certain limit (probably 1000). How do you handle job array sizes like this? Since each file takes so little time, I suspect I should assign one job to multiple files rather than one file per job, right?
Thank you
If the processing time is 10 seconds per file, you should consider packing the tasks into a single job, both because such short jobs take longer to schedule than to run and because there is a limit on the number of jobs in an array.
Your submission script could look like this:
#!/bin/bash
#SBATCH --ntasks=16 # or any other number depending on the size of the cluster and the maximum allowed wall time
#SBATCH --mem-per-cpu=...
#SBATCH --time=... # based on the number of files and number of tasks
find . -name file_pattern -print0 | xargs -I{} -0 -P $SLURM_NTASKS srun -n1 -c1 --exclusive name_of_the_program {}
Make sure to replace all the ... and file_pattern and name_of_the_program with appropriate values.
The script will look for all files matching file_pattern in the submission directory and run name_of_the_program on each of them, limiting the number of concurrent instances to the number of CPUs (more precisely, the number of tasks) requested. Note the use of --exclusive here, which is specific to this use case; in recent Slurm versions it has been superseded by --exact for job steps.
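If you would rather keep a job array but have each task work through a batch of files, as you suggested, a minimal sketch could look like this (the chunk size of 100, the array size, and the file_pattern / name_of_the_program placeholders are assumptions to adapt):
#!/bin/bash
#SBATCH --array=1-100          # 100 tasks x 100 files each covers 10000 files
#SBATCH --mem-per-cpu=...
#SBATCH --time=...             # roughly 100 files x 10 s per task, plus margin
# Gather the files once, then process only the slice belonging to this task.
mapfile -t FILES < <(find . -name 'file_pattern' | sort)
CHUNK=100
START=$(( (SLURM_ARRAY_TASK_ID - 1) * CHUNK ))
for F in "${FILES[@]:START:CHUNK}"; do
    name_of_the_program "$F"
done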

Dealing with job submission limits

I am running slurm job arrays with --array, and I would like to run about 2000 tasks/array items. However this is beyond the cluster's job submission limit of ~500 at a time.
Are there any tips/best practices for splitting this up? I'd like to submit it all at once and still be able to pass the array id arguments 1-2000 to my programs if possible. I think something like waiting to submit pieces of the array might be helpful but I'm not sure how to do this at the moment.
If the limit is on the size of an array:
You will have to split the array into several job arrays. The --array parameter accepts values of the form <START>-<END>, so you can submit four of them:
sbatch --array=1-500 ...
sbatch --array=501-1000 ...
sbatch --array=1001-1500 ...
sbatch --array=1501-2000 ...
This way you will bypass the 500-limit and still keep the SLURM_ARRAY_TASK_ID ranging from 1 to 2000.
To ease things a bit, you can write this all in one line:
paste -d- <(seq 1 500 2000) <(seq 500 500 2000) | xargs -I {} sbatch --array={} ...
(The first seq produces the chunk starts 1, 501, 1001, 1501 and the second the chunk ends 500, 1000, 1500, 2000; paste joins each pair with a dash, and xargs runs sbatch once per resulting range.)
If the limit is on the number of submitted jobs:
Then one option is to have the last job of the array submit the following chunk.
#!/bin/bash
#SBATCH ...
...
...
if [[ $((SLURM_ARRAY_TASK_ID % 500)) == 0 ]] ; then
    sbatch --array=$((SLURM_ARRAY_TASK_ID+1))-$((SLURM_ARRAY_TASK_ID+500)) $0
fi
Note that ideally the last running job of the array should submit the next chunk, and that may or may not be the one with the highest task ID, but this approach has worked for all practical purposes in many situations.
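If you want the chain to stop after the last chunk, one option is to add a cap (the 2000 below matches the total from the question and is otherwise an assumption):
if [[ $((SLURM_ARRAY_TASK_ID % 500)) == 0 && $SLURM_ARRAY_TASK_ID -lt 2000 ]] ; then
    sbatch --array=$((SLURM_ARRAY_TASK_ID+1))-$((SLURM_ARRAY_TASK_ID+500)) $0
fi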
Another option is to set up a cron job that monitors the queue and submits each chunk when possible, or to use a workflow manager that does that for you.
You can also run a script that submits your jobs and sleeps a few seconds after every 500 submissions; see https://www.osc.edu/resources/getting_started/howto/howto_submit_multiple_jobs_using_parameters
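A minimal sketch of such a throttled loop, assuming chunks of 500 and a placeholder submission script job.sh:
#!/bin/bash
# Submit 2000 array tasks in chunks of 500, sleeping between chunks so the
# number of queued jobs stays below the submission limit.
for START in $(seq 1 500 2000); do
    sbatch --array=${START}-$((START + 499)) job.sh
    sleep 60   # adjust to how quickly earlier chunks drain from the queue
done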

Cancel jobs submitted previous to a date or with JOBID lower than a given integer

I have realized that the jobs submitted with a previous version of my software are useless because of a bug, so I want to cancel them. However, I also have newer jobs that I would like to keep running. All the jobs have the same job name and are running in the same partition.
I have written the following script to cancel the jobs with an ID lower than a given one.
#!/bin/bash
if [ -n "$1" ]
then
    MAX_JOBID=$1
else
    echo "An integer value is needed"
    exit 1
fi
# %F prints the job ID (the array master ID for array jobs); -h drops the header line.
JOBIDLIST=$(squeue -u $USER -h -o "%F")
for JOBID in $JOBIDLIST
do
    if [ "$JOBID" -lt "$MAX_JOBID" ]
    then
        echo "Cancelling job $JOBID"
        scancel $JOBID
    fi
done
I would say this is a recurrent situation for anyone developing software, and I wonder whether there is a direct way to do it with Slurm commands. Alternatively, do you use tricks like appending the software commit ID to the job name to get around this kind of situation?
Unfortunately there is no direct way to cancel jobs in such scenarios.
Alternatively, as you pointed out, appending the software version/commit to the job name is useful. In that case you can run scancel --name=JOB_NAME_VERSION to cancel all the jobs with that name.
Also, newly submitted jobs can be held with scontrol hold <jobid>, and then all PENDING jobs can be cancelled with scancel --state=PENDING.
In my case, I used a similar approach to yours: pipe the output of squeue to awk and cancel the first N jobs I wanted to remove. It's a one-liner.
Something like this:
e.g.: squeue <arguments> | awk 'NR>=2 && NR<=N{print $1}' | xargs /usr/bin/scancel
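A concrete variant of that idea, cancelling every one of your jobs whose ID is below a threshold (123456 is a placeholder for the first job ID you want to keep):
squeue -u $USER -h -o "%F" | sort -u | awk -v max=123456 '$1 < max' | xargs -r scancel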
In addition to the suggestions by #j23, you can organise your jobs with:
Job arrays: if all your jobs are similar in terms of submission script, make them a job array and submit one job array per version of your software. Then you can cancel an entire job array with a single scancel command (see the sketch after the list below).
A workflow management system: these make it easy to submit and manage sets of jobs, possibly on different clusters. Some options:
Fireworks https://materialsproject.github.io/fireworks/
Bosco https://osg-bosco.github.io/docs/
Slurm pipelines https://github.com/acorg/slurm-pipeline
Luigi https://github.com/spotify/luigi
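As a sketch of the job-array-per-version idea (myjob_v1.2 and job.sh are placeholder names):
# Submit each software version as its own named job array...
sbatch --job-name=myjob_v1.2 --array=1-100 job.sh
# ...then cancelling every task of a buggy version is a single command:
scancel --name=myjob_v1.2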

Cores assigned to SLURM job

Let's say I want to submit a Slurm job specifying only the total number of tasks (--ntasks=someNumber), without specifying the number of nodes or the tasks per node. Is there a way to know, within the launched Slurm script, how many cores Slurm assigned on each of the reserved nodes? I need this information to create a machinefile for the program I'm launching, which must be structured like this:
node02:7
node06:14
node09:3
Once the job is launched, the only way I have found to see which cores have been allocated on each node is the command:
scontrol -dd show job <jobid>
Its output contains the above information (along with plenty of other details).
Is there a better way to get this info?
The way the srun documentation illustrates creating a machine file is by running srun hostname. To get the output you want you could run
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE
You should check the documentation of your program to see whether it accepts a machine file that simply repeats each hostname once per slot rather than using a :count suffix. If so, you can simplify the command to
srun hostname -s > $MACHINEFILE
And of course the first step is to make sure you actually need a machine file in the first place, as many parallel programs/libraries have Slurm support and can gather the needed information from the environment variables Slurm sets up at job start.
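Putting it together, a submission script might look like this (my_program and its -machinefile flag are placeholders for whatever your launcher expects; --ntasks=24 is just an example):
#!/bin/bash
#SBATCH --ntasks=24

MACHINEFILE=machinefile.$SLURM_JOB_ID
# One line per allocated node, formatted as node:count (e.g. node02:7).
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE

my_program -machinefile $MACHINEFILE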

qsub job array, where each job uses a single process?

I have a job script with the following PBS specifications at the beginning:
#PBS -t 0-99
#PBS -l nodes=1:ppn=1
The first line says that this is a job array, with jobs indexed from 0 to 99.
I want each individual indexed job to use only a single node and a single core per node, hence my second PBS line. But I am worried that TORQUE qsub will interpret the second line as saying that the whole job array should run sequentially on a single core.
How does TORQUE qsub interpret the PBS second line?
It interprets it as 100 jobs, each of which should use one execution slot on one node. For more information, see the qsub documentation, in particular the details of the -t switch.
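For what it's worth, each array task usually selects its own input via the array index; a minimal sketch (my_program and the input file naming scheme are placeholders):
#!/bin/bash
#PBS -t 0-99
#PBS -l nodes=1:ppn=1

cd $PBS_O_WORKDIR
# PBS_ARRAYID holds this task's index (0-99), set by Torque for -t arrays.
./my_program input_${PBS_ARRAYID}.dat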
