Job array step single execution - slurm

I have an sbatch script that submits job arrays to Slurm with several steps:
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --nodes 1
#SBATCH --time 00-01:00:00
#SBATCH --array=0-15
dir="TEST_$SLURM_ARRAY_JOB_ID"
org=base-case
dst=$dir/case-$SLURM_ARRAY_TASK_ID
#step 0 -> I'd like this step to be executed by only one task!
srun mkdir $dir
#step 1
srun cp -r $org $dst
#step 2
srun python createParamsFile.py $dst $SLURM_ARRAY_TASK_ID
#step 3
srun python simulation.py $dst
I'd like to run step 0 just once, since the rest of the jobs in the array will share the directory created.
It is not a big deal, because once the directory is created the remaining attempts just fail with an error. But it is always better to avoid error messages in the logs and aborted Slurm steps. For example, in this case:
/usr/bin/mkdir: cannot create directory 'TEST_111224': File exists
srun: error: s02r3b83: task 0: Exited with exit code 1
srun: Terminating job step 111226.0
It is true that if I execute the mkdir command without srun, step 0 does not exist and nothing is terminated abruptly, but I still get the error.

Use the -p option of mkdir so that mkdir does not complain if the directory already exists, and you will not have the errors in the log.
srun mkdir -p $dir
Note that removing srun in your case will not change anything as only one task per job is requested (--ntasks=1). The error is not because many tasks in a job create the same directory, but because many jobs in an array create the same directory.
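For completeness, the asker's script with that single change applied might look like this (everything else is taken verbatim from the question; mkdir -p is idempotent, so every array task can run it safely):
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --nodes 1
#SBATCH --time 00-01:00:00
#SBATCH --array=0-15
dir="TEST_$SLURM_ARRAY_JOB_ID"
org=base-case
dst=$dir/case-$SLURM_ARRAY_TASK_ID
#step 0: -p suppresses the "File exists" error, so it no longer matters which array task creates the directory first
srun mkdir -p $dir
#step 1
srun cp -r $org $dst
#step 2
srun python createParamsFile.py $dst $SLURM_ARRAY_TASK_ID
#step 3
srun python simulation.py $dst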

Related

Slurm: srun inside sbatch is ignored / skipped. Can anyone explain why?

I'm still exploring how to work with the Slurm scheduler and this time I really got stuck. The following batch script somehow doesn't work:
#!/usr/bin/env bash
#SBATCH --job-name=parallel-plink
#SBATCH --mem=400GB
#SBATCH --ntasks=4
cd ~/RS1
for n in {1..4};
do
echo "Starting ${n}"
srun --input none --exclusive --ntasks=1 -c 1 --mem-per-cpu=100G plink --memory 100000 --bfile RS1 --distance triangle bin --parallel ${n} 4 --out dt-output &
done
Since most of the SBATCH options are inside the batch script, the invocation is just: sbatch script.sh
The slurm-20466.out file contains only the four echo outputs:
cat slurm-20466.out
Starting 1
Starting 2
Starting 3
Starting 4
I double-checked the command without srun and it works without errors.
I must confess I am also responsible for the Slurm scheduler configuration itself, so let me know if I should try changing anything, or if more information is needed.
You start your srun commands in the background to have them run in parallel. But you never wait for the commands to finish.
So the loop runs through very quickly, echoes the "Starting ..." lines, starts the srun commands in the background, and then finishes. After that, your sbatch script is done and terminates successfully, meaning that your job is done. With that, your allocation is revoked and your srun commands are terminated as well. You might be able to see with sacct that they started.
You need to instruct the batch script to wait for the work to be done before it terminates, by waiting for the background processes to finish. To do that, you simply have to add a wait command in your script at the end:
#!/usr/bin/env bash
#SBATCH --job-name=parallel-plink
#SBATCH --mem=400GB
#SBATCH --ntasks=4
cd ~/RS1
for n in {1..4};
do
echo "Starting ${n}"
srun --input none --exclusive --ntasks=1 -c 1 --mem-per-cpu=100G plink --memory 100000 --bfile RS1 --distance triangle bin --parallel ${n} 4 --out dt-output &
done
wait
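If you also want the batch script to report which plink step failed, one possible extension (a sketch, not part of the original answer) is to record each background PID and wait on them individually, since wait <pid> returns that process's exit status:
pids=()
for n in {1..4}; do
  echo "Starting ${n}"
  srun --input none --exclusive --ntasks=1 -c 1 --mem-per-cpu=100G plink --memory 100000 --bfile RS1 --distance triangle bin --parallel ${n} 4 --out dt-output &
  pids+=($!)   # remember the PID of this background srun
done
for pid in "${pids[@]}"; do
  wait "$pid" || echo "plink step with PID ${pid} exited non-zero" >&2
done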

How to use the srun command to assign a different GPU to each task on a node with multiple GPUs?

How can I change my Slurm script below so that each Python job gets a unique GPU? The node has 4 GPUs, and I would like to run 1 Python job per GPU.
The problem is that all jobs use the first GPU while the other GPUs are idle.
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
cd ${i}
srun python gpu_code.py &
cd ..
done
wait
In your example your four jobs will be executed sequentially. I suggest submitting four separate jobs that each request a single GPU. Then each job uses only one GPU and the jobs will be executed simultaneously. If the jobs have dependencies you can use:
sbatch --dependency=afterok:${jobid_of_previous_job} submit.sh
This job will start after the previous one has finished.
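A minimal sketch of such a single-GPU submission script, assuming GPUs are exposed as a generic resource named gpu and that the working directory is passed as the first argument to sbatch (both are assumptions, adjust to your site):
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1   # request one GPU; the GRES name 'gpu' is an assumption
cd ${1}                # hypothetical: the target directory is passed to sbatch as an argument
srun python gpu_code.py
It could then be submitted four times, e.g. for i in `seq 0 3`; do sbatch submit.sh ${i}; done.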
Alternatively, if you keep everything in one job as in your script: since you do not request GPUs in the submission script, you will have to manage the CUDA_VISIBLE_DEVICES variable yourself to direct each Python script to one specific GPU.
Try with
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
cd ${i}
export CUDA_VISIBLE_DEVICES=$i
python gpu_code.py &
cd ..
done
wait

slurm job name in a for loop

I would like my job name to be a function of the loop parameters.
#!/bin/bash
#SBATCH -n 4
#SBATCH -p batch576
MAXLEVEL=8
process_id=$!
for Oh in '0.0001' '0.0005'
do
for H in '1.' '0.8'
do
mkdir half$Oh$H
cp half h.py RP.py `pwd`/half$Oh$H/
cd half$Oh$H
srun --mpi=pmi2 -J half${Oh}${H} ./half $Oh $H $MAXLEVEL &
cd ..
done
done
wait $process_id
Instead of test_min I would like: half0.00011. half0.00010.8 ...
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
658 batch576 test_min XXX R 0:06 1 node1-ib2
Do you have any ideas?
Thank you.
If you're submitting this job using sbatch, it will be a single job with multiple job steps. The -J option in srun names the job steps within your job, not the job itself. And by default, squeue does not show job step information. Try the --steps parameter for squeue to show the job step names.
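For example (a sketch; the exact output columns depend on your site's squeue defaults):
squeue -u $USER --steps   # list job steps, including the names set with srun -J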

Why does slurm assign more tasks than I asked when I "sbatch" multiple jobs with a .sh file?

I submit some cluster-mode Spark jobs, which run just fine when I submit them one by one with the sbatch specs below.
#!/bin/bash -l
#SBATCH -J Spark
#SBATCH --time=0-05:00:00 # 5 hour
#SBATCH --partition=batch
#SBATCH --qos qos-batch
###SBATCH -N $NODES
###SBATCH --ntasks-per-node=$NTASKS
### -c, --cpus-per-task=<ncpus>
### (multithreading) Request that ncpus be allocated per process
#SBATCH -c 7
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --dependency=singleton
If I use a launcher to submit the same job with different node and task numbers, the system gets confused and tries to assign tasks according to $SLURM_NTASKS, which gives 16, even though I ask for, for example, only 1 node and 3 tasks per node.
#!/bin/bash -l
for n in {1..4}
do
for t in {3..4}
do
echo "Running benchmark with ${n} nodes and ${t} tasks per node"
sbatch -N ${n} --ntasks-per-node=${t} spark-teragen.sh
sleep 5
sbatch -N ${n} --ntasks-per-node=${t} spark-terasort.sh
sleep 5
sbatch -N ${n} --ntasks-per-node=${t} spark-teravalidate.sh
sleep 5
done
done
How can I fix the error below by preventing Slurm from assigning a number of tasks per node that exceeds the limit?
Error:
srun: Warning: can't honor --ntasks-per-node set to 3 which doesn't match the
requested tasks 16 with the number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 233838: More processors requested than
permitted

Questions on alternative ways to run 4 parallel jobs

Below are three different sbatch scripts that produce roughly similar results.
(I show only the parts where the scripts differ; the ## prefix indicates the output obtained by submitting the scripts to sbatch.)
Script 0
#SBATCH -n 4
srun -l hostname -s
## ==> slurm-7613732.out <==
## 0: node-73
## 1: node-73
## 2: node-73
## 3: node-73
Script 1
#SBATCH -n 1
#SBATCH -a 1-4
srun hostname -s
## ==> slurm-7613733_1.out <==
## node-72
##
## ==> slurm-7613733_2.out <==
## node-73
##
## ==> slurm-7613733_3.out <==
## node-72
##
## ==> slurm-7613733_4.out <==
## node-73
Script 2
#SBATCH -N 4
srun -l -n 4 hostname -s
## ==> slurm-7613738.out <==
## 0: node-74
## 2: node-76
## 1: node-75
## 3: node-77
Q: Why would one choose one such approach over the others?
(I see that the jobs spawned by Script 0 all ran on the same node, but I can't tell if this is a coincidence.)
Also, the following variant of Script 2 (the only difference being -N 2 instead of -N 4) fails:
Script 3
#SBATCH -N 2
srun -l -n 4 hostname -s
## ==> slurm-7614825.out <==
## srun: error: Unable to create job step: More processors requested than permitted
Ditto for the following variant of Script 2 (the only difference between this and Script 3 is that here srun also has the flag -c 2):
Script 4
#SBATCH -N 2
srun -l -n 4 -c 2 hostname -s
## ==> slurm-7614827.out <==
## srun: error: Unable to create job step: More processors requested than permitted
Qs: are the errors I get with Script 3 and Script 4 due to wrong syntax, wrong semantics, or site-specific configs? In other words, is there something inherently wrong with these scripts (that would cause them to fail under any instance of Slurm), or are the errors only due to restrictions imposed by the particular instance of Slurm I'm submitting the jobs to? If the latter is the case, how can I pinpoint the configs responsible for the error?
Q: Why would one choose one such approach over the others?
Script 0: you request 4 tasks, to be allocated at the same time to a single job, with no other specification as to how those tasks should be allocated to nodes. Typical use: an MPI program.
Script 1: you request 4 jobs, each with 1 task. The jobs will be scheduled independently one from another. Typical use: Embarrassingly parallel jobs.
Script 2: you request 4 nodes, with one task per node. It is similar to Script 0 except that you request the tasks to be allocated to four distinct nodes. Typical use: an MPI program with a lot of I/O on local disks, for instance.
The fact that all the jobs were allocated the same first node is due to the fact that Slurm always allocates the nodes in the same order, and you probably ran all the tests one after another, so the next one started on the resources the previous one had just freed.
Script 3: You request two nodes with, implicitly, 1 task per node, so you are allocated two tasks, but then you try to use 4 tasks with srun. You should change it to
#SBATCH -N 2
#SBATCH --tasks-per-node 2
srun -l -n 4 hostname -s
to request two tasks per node, or
#SBATCH -N 2
#SBATCH -n 4
srun -l -n 4 hostname -s
to request four tasks, with no additional constraint on the distribution of tasks across nodes.
Script 4: You request two nodes with, implicitly, 1 task per node and, also implicitly, one CPU per task, so you are allocated two CPUs. But then you try to use 4 tasks with srun, each with 2 CPUs, so 8 CPUs in total. You should change it to
#SBATCH -N 2
#SBATCH --tasks-per-node 2
#SBATCH --cpus-per-task 2
srun -l -n 4 -c 2 hostname -s
or,
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --cpus-per-task 2
srun -l -n 4 -c 2 hostname -s
The bottom line: in the submission script, you request resources with the #SBATCH directives, and you cannot use more resources than that in the subsequent calls to srun.
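One way to check what was actually granted, before launching any step, is to print the environment variables that sbatch sets inside the batch script (a sketch; variable names as documented by Slurm):
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --cpus-per-task 2
# Print the allocation that sbatch granted before calling srun
echo "nodes: $SLURM_JOB_NUM_NODES, tasks: $SLURM_NTASKS, cpus per task: $SLURM_CPUS_PER_TASK"
srun -l -n 4 -c 2 hostname -s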
