I'd like to run the same program on a large number of different input files. I could just submit each as a separate Slurm submission, but I don't want to swamp the queue by dumping 1000s of jobs on it at once. I've been trying to figure out how to process the same number of files by instead creating an allocation first, then within that allocation looping over all the files with srun, giving each invocation a single core from the allocation. The problem is that no matter what I do, only one job step runs at a time. The simplest test case I could come up with is:
#!/usr/bin/env bash
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
wait
It doesn't matter how many cores I assign the allocation:
time salloc -n 1 test
time salloc -n 2 test
time salloc -n 4 test
it always takes 4 seconds. Is it not possible to have multiple job steps execute in parallel?
It turned out that the default memory per CPU was not defined, so even single-core job steps reserved all of the node's RAM.
Setting DefMemPerCPU in slurm.conf, or specifying explicit memory reservations for each step, did the trick.
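For reference, a minimal sketch of what the cluster-side setting could look like. The value 2048 is a placeholder, not taken from the original post; the parameter is expressed in megabytes per allocated CPU.

```
# slurm.conf (fragment)
# Default memory per allocated CPU, in MB. 2048 is a placeholder value;
# a sensible choice depends on each node's RAM divided by its core count.
DefMemPerCPU=2048
```

After editing slurm.conf, the change would typically be applied with scontrol reconfigure. Without admin access, the per-step alternative is to pass --mem-per-cpu (or --mem) to each srun.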
Beware that in that scenario you measure both the running time and the time spent waiting for the allocation. To time only the job steps, your submission script should look like this:
#!/usr/bin/env bash
time {
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
wait
}
and simply submit with
salloc -n 1 test
salloc -n 2 test
salloc -n 4 test
You should then observe the difference, along with messages such as "srun: Job step creation temporarily disabled, retrying" when using -n smaller than 4.
Since the OP solved their issue but didn't provide the code, I'll share my take on this problem below.
In my case, I encountered the error/warning "step creation temporarily disabled, retrying (Requested nodes are busy)". This happens because the srun command that executes first allocates all the memory, which is the same cause the OP ran into. To solve this, first (possibly optionally) specify the total memory allocation for sbatch, if you are using an sbatch script:
#SBATCH --ntasks=4
#SBATCH --mem=[XXXX]MB
And then specify the memory use per srun task:
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/4]MB sleep 1 &
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/4]MB sleep 1 &
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/4]MB sleep 1 &
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/4]MB sleep 1 &
wait
I didn't specify CPU count for srun because in my sbatch script I included #SBATCH --cpus-per-task=1. For the same reason I suspect you could use --mem instead of --mem-per-cpu in the srun command, but I haven't tested this configuration.
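Applied to the original question (one step per input file), the same idea can be sketched as below. This is only an illustration under assumptions: the job runs on a single node, SLURM_MEM_PER_NODE holds the --mem value in MB inside the job, and ./my_program and the 4000M total are hypothetical placeholders. Newer Slurm releases changed the defaults for resource sharing between steps, so it is worth checking the srun man page of the installed version for --exclusive and --exact.

```shell
#!/usr/bin/env bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
# Placeholder total memory for the job (per node).
#SBATCH --mem=4000M

# Integer share of the job's memory for one step, in MB.
per_step_mem_mb() {
    local total_mb=$1 steps=$2
    echo $(( total_mb / steps ))
}

# SLURM_MEM_PER_NODE and SLURM_NTASKS are set by Slurm inside the job;
# the fallbacks only matter when trying the script outside of Slurm.
mem_mb=$(per_step_mem_mb "${SLURM_MEM_PER_NODE:-4000}" "${SLURM_NTASKS:-4}")

# One job step per input file given on the command line.
# ./my_program is a hypothetical placeholder for the real executable.
for f in "$@"; do
    srun --exclusive --ntasks=1 --mem-per-cpu="${mem_mb}M" ./my_program "$f" &
done

# Keep the batch script alive until every step has finished.
wait
```

It could be submitted as, for example, sbatch script.sh input1 input2 input3 (file names hypothetical). If more files are passed than there are tasks, the extra steps should simply wait inside the allocation until a CPU and its memory share become free.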
I'm still exploring how to work with the Slurm scheduler and this time I really got stuck. The following batch script somehow doesn't work:
#!/usr/bin/env bash
#SBATCH --job-name=parallel-plink
#SBATCH --mem=400GB
#SBATCH --ntasks=4
cd ~/RS1
for n in {1..4};
do
echo "Starting ${n}"
srun --input none --exclusive --ntasks=1 -c 1 --mem-per-cpu=100G plink --memory 100000 --bfile RS1 --distance triangle bin --parallel ${n} 4 --out dt-output &
done
Since most of the SBATCH options are inside the batch script the invocation is just: 'sbatch script.sh'
The file slurm-20466.out contains only the output of the four echo commands:
cat slurm-20466.out
Starting 1
Starting 2
Starting 3
Starting 4
I double checked the command without srun and that works without errors.
I must confess I am also responsible for the Slurm scheduler configuration itself. Let me know if I could try to change anything or when more information is needed.
You start your srun commands in the background to have them run in parallel. But you never wait for the commands to finish.
So the loop runs through very quickly, echoes the "Starting ..." lines, starts the srun command in the background and afterwards finishes. After that, your sbatch-script is done and terminates successfully, meaning that your job is done. With that, your allocation is revoked and your srun commands are also terminated. You might be able to see that they started with sacct.
You need to instruct the batch script to wait for the work to be done before it terminates, by waiting for the background processes to finish. To do that, you simply have to add a wait command in your script at the end:
#!/usr/bin/env bash
#SBATCH --job-name=parallel-plink
#SBATCH --mem=400GB
#SBATCH --ntasks=4
cd ~/RS1
for n in {1..4};
do
echo "Starting ${n}"
srun --input none --exclusive --ntasks=1 -c 1 --mem-per-cpu=100G plink --memory 100000 --bfile RS1 --distance triangle bin --parallel ${n} 4 --out dt-output &
done
wait
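A plain wait discards the exit codes of the individual steps. If it matters whether a step failed, one possible variation is to remember each PID and wait on them one by one. The helper names launch and wait_all are hypothetical, introduced only for this sketch.

```shell
#!/usr/bin/env bash
pids=()     # PIDs of the background steps
failed=0    # number of steps that exited with a non-zero status

# Hypothetical helper: start a command in the background and remember its PID.
launch() {
    "$@" &
    pids+=("$!")
}

# Wait for every remembered PID and count the failures.
wait_all() {
    local pid
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=$(( failed + 1 ))
    done
}

# In the batch script this would replace the bare "srun ... &" and "wait":
#   for n in {1..4}; do
#       launch srun --input none --exclusive --ntasks=1 -c 1 --mem-per-cpu=100G plink [plink options]
#   done
#   wait_all
#   (( failed == 0 )) || exit 1
```

Exiting non-zero when any step failed makes the job show up as FAILED in the accounting instead of COMPLETED.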
How can I change my Slurm script below so that each Python job gets a unique GPU? The node has 4 GPUs, and I would like to run one Python job per GPU.
The problem is that all jobs use the first GPU and other GPUs are idle.
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
cd ${i}
srun python gpu_code.py &
cd ..
done
wait
In your example your four jobs will be executed sequentially. I suggest submitting four separate jobs that each request a single GPU. Then each job uses only one GPU and they can run simultaneously. If the jobs have dependencies you can use:
sbatch --dependency=afterok:${jobid_of_previous_job} submit.sh
This job will start only after the previous one has finished successfully.
As you do not request GPUs in the submission script, you will have to manage the CUDA_VISIBLE_DEVICES variable yourself to direct each Python script to one specific GPU.
Try with
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
cd ${i}
export CUDA_VISIBLE_DEVICES=$i
python gpu_code.py &
cd ..
done
wait
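A small variation on the same approach, offered only as a sketch: setting the variable for the child process alone (rather than exporting it in the loop) and doing the cd in a subshell keeps each iteration isolated. The helper name run_on_gpu is hypothetical, and the directory check is an addition for safety.

```shell
#!/bin/bash
# Run a command with CUDA_VISIBLE_DEVICES set for that command only,
# so the setting does not leak into later iterations.
run_on_gpu() {
    local gpu=$1
    shift
    CUDA_VISIBLE_DEVICES="$gpu" "$@"
}

for i in 0 1 2 3; do
    # Skip run directories that do not exist.
    [ -d "$i" ] || continue
    # The subshell keeps the cd local, so no "cd .." is needed afterwards.
    ( cd "$i" && run_on_gpu "$i" python gpu_code.py ) &
done
wait
```

The #SBATCH header lines from the script above would still go at the top when this is submitted as a batch job.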
How can I run a number of Python scripts on different nodes in Slurm?
Suppose I select 5 cluster nodes using #SBATCH --nodes=5, and I have 5 Python scripts code1.py, code2.py, ..., code5.py that I want to run on 5 different nodes simultaneously. How can I achieve this?
Do these five scripts need to run in a single job? Do they really need to run simultaneously? Is there some communication happening between them? Or are they independent from one another?
If they are essentially independent, then you should most likely put them into 5 different jobs with one node each. That way you don't have to wait for five free nodes; the first job can start as soon as a single node is free. If there are many scripts you want to start like that, it might be interesting to look into job arrays.
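A job array for the five scripts could look roughly like the sketch below. It assumes the scripts really are named code1.py to code5.py and sit in the submission directory; the helper script_for_task is hypothetical.

```shell
#!/bin/bash
# One array task per script; each array task needs only one node.
#SBATCH --array=1-5
#SBATCH --nodes=1

# Map an array index to its script name (code1.py ... code5.py).
script_for_task() {
    echo "code$1.py"
}

# SLURM_ARRAY_TASK_ID is set by Slurm for each array task; outside of a
# job array it is empty and nothing is run.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    python "$(script_for_task "$SLURM_ARRAY_TASK_ID")"
fi
```

For large arrays, a specification such as --array=1-1000%50 limits how many array tasks run at the same time, which avoids flooding the cluster.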
If you need to run them in parallel, you will need to use srun in your jobscript to start the scripts. This example shows a job where you have 10 cores per task and each node has one task.
#!/bin/bash
#[...]
#SBATCH -N 5
#SBATCH -n 5
#SBATCH -c 10
#[...]
srun -N 1 -n1 python code1.py &
srun -N 1 -n1 python code2.py &
srun -N 1 -n1 python code3.py &
srun -N 1 -n1 python code4.py &
srun -N 1 -n1 python code5.py &
wait
You need to run the srun calls in the background, as bash would otherwise wait for them to finish before executing the next one.
I have the following SLURM job script named gzip2zipslurm.sh:
#!/bin/bash
#SBATCH --mem 70G
#SBATCH --ntasks 4
echo "Task 1"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.A-B.xml.tar.gz &
echo "Task 2"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.C-H.xml.tar.gz &
echo "Task 3"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.I-N.xml.tar.gz &
echo "Task 4"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.O-Z.xml.tar.gz &
echo "Waiting for job steps to end"
wait
echo "Script complete"
I submit it to SLURM by sbatch gzip2zipslurm.sh.
When I do, the output of the SLURM log file is
Task 1
Task 2
Task 3
Task 4
Waiting for job steps to end
The tar2zip program reads the given tar.gz file and re-packages it as a ZIP file.
The Problem: Only one CPU (out of 16 available on an idle node) is doing any work. With top I can see that all in all 5 srun commands have been started (4 for my tasks and 1 implicit for the sbatch job, I guess) but there is only one Java process. I can also see it on the files being worked on, only one is written.
How do I manage that all 4 tasks are actually executed in parallel?
Thanks for any hints!
The issue might be with the memory reservation. In the submission script you set --mem 70G, which is the memory reservation for the whole job.
When srun is used within a submission script, it inherits parameters from sbatch, including --mem 70G. The first step therefore claims all of the job's memory and the other steps have to wait for it. You are implicitly running the following:
srun --mem 70G -n1 java -Xmx10g -jar ...
Try explicitly setting the memory of each step to a quarter of the total (70G / 4, rounded down to 17G) with:
srun --mem 17G -n1 java -Xmx10g -jar ...
Also, as per the documentation, you should use --exclusive with srun in such a context.
srun --exclusive --mem 17G -n1 java -Xmx10g -jar ...
The srun documentation says of --exclusive:
This option can also be used when initiating more than one job step
within an existing resource allocation, where you want separate
processors to be dedicated to each job step. If sufficient processors
are not available to initiate the job step, it will be deferred. This
can be thought of as providing a mechanism for resource management to
the job within it's allocation.
The scenario is this: I allocate resources (2 nodes, 64 CPUs) to a job with salloc:
salloc -N 1-2 -n 64 -c 1 -w cluster-node[2-3] -m cyclic -t 5
salloc: Granted job allocation 1720
Then I use srun to create steps in my job:
for i in (seq 70)
srun --exclusive -N 1 -n 1 --jobid=1720 sleep 60 &
end
Because I created more steps than there are CPUs available to my job, the extra steps are "pending" until a CPU becomes free.
When I use squeue with the -s option to list steps, I can only see the running ones.
squeue -s -O stepid:12,stepname:10,stepstate:9
1720.0 sleep RUNNING
[...]
1720.63 sleep RUNNING
My question is: do steps have states other than RUNNING, like jobs do, and if so, is there a way to view them with squeue (or another command)?
I am not sure Slurm can provide that information. One alternative would be to use GNU Parallel so that job steps are not started at all until a CPU is available. In the current setting all job steps are started at once, and those that do not have a CPU available are left waiting.
So with the same allocation as you use, replace
for i in (seq 70)
srun --exclusive -N 1 -n 1 --jobid=1720 sleep 60 &
end
with
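seq 70 | parallel -N0 -P $SLURM_NTASKS srun --exclusive -N 1 -n 1 --jobid=1720 sleep 60
Here seq 70 supplies the 70 inputs, and -N0 tells parallel to read one input per job without appending it to the command. Then the output of squeue should list RUNNING and PENDING steps.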
N.B. not sure the --jobid= option is needed here BTW
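If GNU Parallel is not installed, a similar throttling effect can be approximated in plain bash. This is a different technique from the one in the answer, sketched under the assumption that bash 4.3 or newer is available (for wait -n); the helper name throttle is hypothetical.

```shell
#!/usr/bin/env bash
# Upper bound on concurrent steps; the fallback of 4 only matters when
# trying this outside of a Slurm allocation.
max_jobs=${SLURM_NTASKS:-4}

# Block until fewer than max_jobs background jobs are running.
throttle() {
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        # Returns as soon as any one background job finishes.
        wait -n
    done
}

# Usage inside the allocation (same step command as in the question):
#   for i in $(seq 70); do
#       throttle
#       srun --exclusive -N 1 -n 1 sleep 60 &
#   done
#   wait
```

As with GNU Parallel, a step is only created once a slot is free, so squeue -s would not show the steps that have not been launched yet.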