What's the difference between the two following parallelization schemes on Slurm?
Scheme 1
Run sbatch script.sh
#!/bin/bash
#SBATCH --ntasks=8
## more options
srun echo hello
This launches 8 copies of echo hello.
Scheme 2
I've accomplished something similar using array jobs.
#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-8
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
# Each array task runs the same command
echo hello
Is there any difference between the two schemes? They both seem to accomplish the same thing.
Scheme 1 is one single job (with 8 tasks), while Scheme 2 is 8 distinct jobs (each with one task). In the first case all the tasks will be scheduled at the same time, while in the second case the 8 tasks will be scheduled independently of one another.
With the job array (Scheme 2), if 8 CPUs become available at once, the 8 tasks will all start at the same time, but if only 4 CPUs become available at first, 4 tasks will run while the other 4 remain pending. When the initial 4 are done, the remaining 4 are started. Job arrays are typically used for embarrassingly parallel workloads, where the processes do not need to communicate or synchronise, such as applying the same program to a list of files.
By contrast, with a single job (Scheme 1), Slurm will start the 8 tasks at the same time, so it needs 8 CPUs to become available simultaneously. This is typically only used for parallel jobs whose processes need to communicate with each other, for instance through a Message Passing Interface (MPI) library.
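If you want the array behaviour of Scheme 2 but also need to cap how many array tasks run at once, Slurm's % throttle on the --array range does that. A minimal sketch (the limit of 4 is an arbitrary choice):
#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --array=1-8%4        # 8 array tasks, at most 4 running at any one time
#SBATCH --ntasks=1
echo "hello from task $SLURM_ARRAY_TASK_ID"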
Related
I have a large task that I'd like to divide up into parts and run in parallel using slurm. Specifically, I'd like to divide the work up into more parts than can run concurrently on my system with the resources that I'd like to allocate. The basic idiom I have in mind is to call sbatch, dividing the work into k (greater than some N maximum number of concurrent tasks) parts, and have slurm queue tasks until there are resources available to run them. Concretely, imagine I have some file where each line represents a piece of work to be done, and I'd like to divide that work into 1000 pieces, running some single-threaded script to process each piece on a cluster with a total of 128 cpus. My current sbatch script looks something like:
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=128
split -a 4 -d -n l/1000 workfile work_part_
PART_ID=0000 srun --ntasks=1 ./do_work.sh &
PART_ID=0001 srun --ntasks=1 ./do_work.sh &
.
.
.
PART_ID=0999 srun --ntasks=1 ./do_work.sh &
wait
Where do_work.sh uses PART_ID to find the appropriate work part file to work from, and writes out some output part file. When I try to run this, I get a bunch of 'Resource temporarily unavailable' errors, and many of the job steps show as 'Cancelled' or 'Cancelled by <my user id>'. I feel like this should be a common use case, but I can't puzzle out how to get Slurm to allocate the resources properly within a single job. I know that I could just perform the split outside of Slurm and queue up 1000 separate jobs with srun, but I'd prefer to have them all grouped together under one job with sbatch.
For this specific use case, srun needs the --exclusive flag, otherwise each job step will inherit the full allocation even if you specify --ntasks=1.
Note that in recent versions of Slurm, --exclusive was deprecated in favour of --exact, and that in even more recent versions, if you specify --cpus-per-task explicitly on the srun line, --exact is implied.
You will see in the logs that all the srun commands start, but only 128 of them will do the work at any one time, while the others will complain that resources are temporarily unavailable until other steps terminate and free up resources.
Note that you can use a Bash loop or the GNU Parallel tool to avoid writing all the steps explicitly in the submission script, as sketched below.
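A minimal sketch of the Bash-loop variant (the part numbering and the do_work.sh interface are taken from the question; --exact assumes a Slurm version where it replaces --exclusive):
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=128

split -a 4 -d -n l/1000 workfile work_part_

for i in $(seq -f "%04g" 0 999); do
    # --exact (formerly --exclusive) keeps each step from inheriting the whole allocation
    PART_ID=$i srun --ntasks=1 --exact ./do_work.sh &
done
wait    # block until all 1000 steps have finished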
I am trying to run a simple parallel program on a SLURM cluster (4x raspberry Pi 3) but I have no success. I have been reading about it, but I just cannot get it to work. The problem is as follows:
I have a Python program named remove_duplicates_in_scraped_data.py. This program is executed on a single node (node=1xraspberry pi) and inside the program there is a multiprocessing loop section that looks something like:
pool = multiprocessing.Pool()
input_iter= product(FeaturesArray_1, FeaturesArray_2, repeat=1)
results = pool.starmap(refact_featureMatch, input_iter)
The idea is that when it hits that part of the program it should distribute the calculations, one thread per element in the iterator and combine the results in the end.
So, the program remove_duplicates_in_scraped_data.py runs once (not multiple times) and it spawns different threads during the pool calculation.
On a single machine (without using SLURM) it works just fine, and for the particular case of a Raspberry Pi, it spawns 4 threads, does the calculations, saves them in results and continues the program as a single thread.
I would like to exploit all the 16 threads of the SLURM cluster but I cannot seem to get it to work. And I am confident that the cluster has been configured correctly, since it can run all the multiprocessing examples (e.g. calculate the digits of pi) using SLURM in all 16 threads of the cluster.
Now, looking at the SLURM configuration with sinfo -N -l we have:
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node01 1 picluster* idle 4 4:1:1 1 0 1 (null) none
node02 1 picluster* idle 4 4:1:1 1 0 1 (null) none
node03 1 picluster* idle 4 4:1:1 1 0 1 (null) none
node04 1 picluster* idle 4 4:1:1 1 0 1 (null) none
Each node reports 4 sockets, 1 core and 1 thread, so as far as SLURM is concerned, 4 CPUs.
I wish to exploit all 16 CPUs, and if I run my program as:
srun -N 4 -n 16 python3 remove_duplicates_in_scraped_data.py
It will just run 4 copies of the main program on each node, 16 in total. But this is not what I want. I want a single instance of the program, which then spawns the 16 threads across the cluster. At least we know that with srun -N 4 -n 16 the cluster works.
So, I tried instead changing the program as follows:
#!/usr/bin/python3
#SBATCH -p picluster
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --sockets-per-node=4
sys.path.append(os.getcwd())
...
...
...
pool = multiprocessing.Pool()
input_iter= product(FeaturesArray_1, FeaturesArray_2, repeat=1)
results = pool.starmap(refact_featureMatch, input_iter)
...
...
and executing it with
sbatch remove_duplicates_in_scraped_data.py
The Slurm job is created successfully, and I can see that all the nodes of the cluster have been allocated:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
picluster* up infinite 4 alloc node[01-04]
The program starts running as a single thread on node01, but when it hits the parallel part it only spawns 4 threads on node01 and nothing on any of the other nodes.
I tried different combinations of settings, and even tried to run it via a Bash script:
#!/bin/bash
#SBATCH -p picluster
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --ntasks-per-core=1
#SBATCH --sockets-per-node=4
python3 remove_duplicates_in_scraped_data.py
but I just cannot get it to spawn on the other nodes.
Can you please help me?
Is it even possible to do this, i.e. use Python's multiprocessing pool across different nodes of a cluster?
If not, what other options do I have?
The cluster also has Dask configured. Would that work better?
Please help as I am really stuck with this.
Thanks
Python's multiprocessing package is limited to shared-memory parallelization. It spawns new processes that all have access to the main memory of a single machine.
You cannot simply scale such software out onto multiple nodes, because the different machines do not have a shared memory they can access.
To run your program on multiple nodes at once, you should have a look at MPI (Message Passing Interface). There is also a Python package for that (mpi4py).
Depending on your task, it may also be suitable to run the program 4 times (one job per node) and have each instance work on a subset of the data. That is often the simpler approach, but it is not always possible; a sketch is given below.
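A minimal sketch of that per-node approach as a Slurm job array (the --part flag is hypothetical; it stands for however the script is told which slice of the data to process):
#!/bin/bash
#SBATCH -p picluster
#SBATCH --array=0-3           # one array task per node
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4     # let multiprocessing use the 4 cores of its node

# --part is a hypothetical flag: each instance processes its own quarter of the data
python3 remove_duplicates_in_scraped_data.py --part "$SLURM_ARRAY_TASK_ID"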
So instead I ran Dask with the SLURM cluster, and the Python script seems to parallelise well. This required the least amount of code changes. The above multiprocessing pool code was changed to:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(header_skip=['--mem'],
                       queue='picluster',
                       cores=4,
                       memory='1GB')
cluster.scale(cores=16)   # scale to 16 cores, i.e. 4 workers of 4 cores each (one per node)
dask_client = Client(cluster)

lazy_results = []
for pair in input_iter:
    res = dask_client.submit(refact_featureMatch, pair[0], pair[1])
    lazy_results.append(res)
results = dask_client.gather(lazy_results)
There might of course be better ways of doing this via Dask; I am open to suggestions :)
So I have the following submission script:
#!/bin/bash
#
#SBATCH --job-name=P6
#SBATCH --output=P6.txt
#SBATCH --partition=workq
#SBATCH --ntasks=512
#SBATCH --time=18:00:00
#SBATCH --mem-per-cpu=2500
#SBATCH --cpus-per-task=1
#SBATCH --array=1-512
srun ./P6 $SLURM_ARRAY_TASK_ID
What I want to do is run 512 instances of the program P6, each with an argument from 1 to 512, and as far as I know the submission above does that. However, upon inspecting squeue and sacct, SLURM seems to have assigned 512 CPUs to each task!
What did I do wrong?
You asked for 512 tasks for every job. Ask for a single one (or the number you consider appropriate for your code):
#SBATCH --ntasks=1
BTW, there are a few minor problems in your submission script. All jobs in the job array will be named the same way (which is not really a problem), but they will also share the stdout file, so P6.txt will contain the mixed output of all the tasks. I would advise you to differentiate them with the job ID or the task ID (%j/%A/%a).
Also, you don't define the standard error destination, so if anything fails or is written to stderr, you will lose that information. My recommendation is to define the standard error too (#SBATCH --error=P6.txt.%j).
Another detail is that the working folder is not defined. It will work as long as you submit the script from the proper folder, but if you try to submit it from another place, it will fail. A corrected script along these lines is sketched below.
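A sketch pulling these suggestions together (the time, memory and partition values are copied from the question; the --chdir path is a placeholder you would replace with your own):
#!/bin/bash
#SBATCH --job-name=P6
#SBATCH --output=P6_%A_%a.out     # separate stdout per array task
#SBATCH --error=P6_%A_%a.err      # separate stderr per array task
#SBATCH --partition=workq
#SBATCH --time=18:00:00
#SBATCH --mem-per-cpu=2500
#SBATCH --ntasks=1                # one task per array element
#SBATCH --cpus-per-task=1
#SBATCH --array=1-512
#SBATCH --chdir=/path/to/workdir  # placeholder: set the working directory explicitly

srun ./P6 "$SLURM_ARRAY_TASK_ID"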
While both "Task Spooler" and "at" handle multiple queues and allow the execution of commands at a later point, the at project handles output from commands by emailing the results to the user who queued the command, while Task Spooler allows you to get at the results from the command line instead.
But what I am looking for is a way to run 5 jobs simultaneously and keep the rest of the jobs in a queue, so that when any one of the 5 finishes, the next one starts.
So, if 5 jobs are running and 4 more are in the queue, then as soon as any of them finishes, the next one starts executing, so that 5 jobs are again running simultaneously.
Is there a way to handle such a task?
It depends, of course, on how you want to start your tasks, but let's assume they are loop-based. The following would launch all N commands in the background at once:
#!/usr/bin/env bash
N=9   # total number of commands (5 + 4 in your example)
for i in $(seq 1 "$N"); do
    # do awesome command based on $i
    command $i &
done
wait
So if you want only 5 jobs running at a time, you need to keep track of what is running:
#!/usr/bin/env bash
N=9        # total number of commands
Njobs=5    # maximum number of jobs running at once
for i in $(seq 1 "$N"); do
    # Wait while the maximum number of jobs is already running
    while [[ $(jobs -p | wc -l) -ge $Njobs ]]; do
        sleep 0.1
    done
    # do awesome command based on $i
    command $i &
done
wait
If you're using task spooler you can do what you're asking. Use the -S <number> flag to specify the number of "slots" (jobs that run concurrently). You can even use -D <job id> to make different jobs depend on another specific job's completion.
So in your example, if you set tsp -S 5, task spooler would run the first 5 jobs and queue up the next 4. Once one of the original 5 jobs completed, the next queued up job (based on lowest job id) would then begin. This would continue to happen as running jobs finish and more slots open up.
Also note for anyone else reading this: on Ubuntu (and maybe other Debian-based systems) task spooler is called tsp so as not to conflict with the openssl-ts tool. On most other systems it should be called just ts, which is why even on Ubuntu task spooler will refer to itself as ts.
From the manual, regarding slots:
MULTI-SLOT
ts by default offers a queue where each job runs only after the previous finished. Nevertheless, you can change the maximum number of jobs running at once with the -S [num] parameter. We call that number the amount of slots. You can also set the initial number of jobs with the environment variable TS_SLOTS. When increasing this setting, queued waiting jobs will be run at once until reaching the maximum set. When decreasing this setting, no other job will be run until it can meet the amount of running jobs set. When using an amount of slots greater than 1, the action of some commands may change a bit. For example, -t without jobid will tail the first job running, and -d will try to set the dependency with the last job added.
-S [num]
Set the maximum amount of running jobs at once. If you don't specify num it will return the maximum amount of running jobs set.
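A minimal sketch of the workflow described above, assuming the nine jobs are scripts named job1.sh to job9.sh (hypothetical names):
tsp -S 5                 # allow 5 jobs to run concurrently
for i in $(seq 1 9); do
    tsp ./job$i.sh       # queue each job; the first 5 start, the rest wait their turn
done
tsp                      # list the queue to check progress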
You already have a tool that does this: GNU Parallel
parallel --jobs 4 bash ::: script1.sh script2.sh script3.sh script4.sh
See Parallel tutorial for examples.
For the case where fewer jobs run concurrently than there are tasks:
# set TASKS (total tasks) and JOBS (concurrent jobs) first, e.g. TASKS=9 JOBS=5
for f in $(seq $TASKS); do
  echo ${RANDOM}e-04    # emit a random sleep duration for each task
done | parallel --jobs $JOBS "echo {#} {}; sleep {}"
Example results for TASKS=9:
JOBS=1 JOBS=5
1 17994e-04 4 2844e-04
2 25155e-04 2 5752e-04
3 7859e-04 3 13084e-04
4 11812e-04 1 13749e-04
5 19851e-04 8 2546e-04
6 1568e-04 7 12086e-04
7 24074e-04 6 16087e-04
8 8435e-04 9 9826e-04
9 1407e-04 5 27257e-04
OS: CentOS
I have some 30,000 jobs (or scripts) to run. Each job takes 3-5 minutes. I have 48 CPUs (nproc = 48) and can use 40 of them to run 40 jobs in parallel. Please suggest a script or tool that can handle 30,000 jobs by running 40 of them in parallel at a time.
What I have done:
I created 40 different folders and executed the jobs in parallel by creating a shell script for each directory.
I want to know better ways to handle this kind of job next time.
As Mark Setchell says: GNU Parallel.
find scripts/ -type f | parallel
If you insist on keeping 8 CPUs free:
find scripts/ -type f | parallel -j-8
But usually it is more efficient simply to use nice, as that will give you all 48 cores when no one else needs them:
find scripts/ -type f | nice -n 15 parallel
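If you want exactly 40 jobs at a time, and with 30,000 scripts it is worth keeping a job log so you can resume after an interruption, something along these lines would work (the scripts/ path is the same as above):
find scripts/ -type f | parallel -j 40 --joblog jobs.log --resume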
To learn more:
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
I have used REDIS to do this sort of thing - it is very simple to install and the CLI is easy to use.
I mainly used LPUSH to push all the jobs onto a "queue" in REDIS and BRPOP to do a blocking remove of a job from the other end of the queue. So you would LPUSH 30,000 jobs (or script names, or parameters) at the start, then start 40 processes in the background (1 per CPU), and each process would sit in a loop using BRPOP to get a job, run it, and move on to the next.
You can add layers of sophistication to log completed jobs in another "queue".
Here is a little demonstration of what to do...
First, start a Redis server on any machine in your network:
./redis-server & # start REDIS server in background
Or, you could put this in your system startup if you use it always.
Now push 3 jobs onto queue called jobs:
./redis-cli # start REDIS command line interface
redis 127.0.0.1:6379> lpush jobs "job1"
(integer) 1
redis 127.0.0.1:6379> lpush jobs "job2"
(integer) 2
redis 127.0.0.1:6379> lpush jobs "job3"
(integer) 3
See how many jobs there are in queue:
redis 127.0.0.1:6379> llen jobs
(integer) 3
Wait with an infinite timeout for a job:
redis 127.0.0.1:6379> brpop jobs 0
1) "jobs"
2) "job1"
redis 127.0.0.1:6379> brpop jobs 0
1) "jobs"
2) "job2"
redis 127.0.0.1:6379> brpop jobs 0
1) "jobs"
2) "job3"
This last one will wait a LONG time as there are no jobs in queue:
redis 127.0.0.1:6379> brpop jobs 0
Of course, this is readily scriptable:
Put 30,000 jobs in queue:
for ((i=0;i<30000;i++)) ; do
echo "lpush jobs job$i" | redis-cli
done
If your Redis server is on a remote host, just use:
redis-cli -h <HOSTNAME>
Here's how to check progress:
echo "llen jobs" | redis-cli
(integer) 30000
Or, more simply maybe:
redis-cli llen jobs
(integer) 30000
And you could start 40 jobs like this:
#!/bin/bash
for ((i=0;i<40;i++)) ; do
./Keep1ProcessorBusy $i &
done
And then Keep1ProcessorBusy would be something like this:
#!/bin/bash
# Endless loop picking up jobs and processing them
while :
do
    # brpop prints the queue name and then the job itself; keep only the job
    job=$(redis-cli brpop jobs 0 | tail -1)
    # Set processor affinity here too if you want to force it, using the $1 parameter we were called with
    sh -c "$job"
done
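If you also want to log completed jobs in another "queue", as suggested above, a minimal sketch is to push each finished job onto a second list right after it completes (the list name done is made up):
redis-cli lpush done "$job"    # run this inside the worker loop, right after the job finishes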
Of course, the actual script or job you want to run could also be stored in Redis.
As a totally different option, you could look at GNU Parallel. And also remember that you can run the output of find through xargs with the -P option to parallelise stuff, as in the sketch below.
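A minimal xargs sketch (the scripts/ directory is borrowed from the earlier GNU Parallel answer):
# run up to 40 scripts at once, one script per bash invocation
find scripts/ -type f -print0 | xargs -0 -n 1 -P 40 bash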
Just execute those scripts; Linux will internally distribute the tasks properly amongst the available CPUs. This is up to the Linux task scheduler. But if you want, you can also execute a task on a particular CPU by using taskset (see man taskset). You can do it from a script to execute your 30,000 tasks, as sketched below. Remember that with this manual approach you should be sure about what you are doing.
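A minimal sketch of pinning with taskset (worker.sh is a placeholder for whatever each worker should run):
#!/bin/bash
# start 40 workers, pinning worker $i to CPU $i
for ((i=0;i<40;i++)) ; do
    taskset -c "$i" ./worker.sh "$i" &
done
wait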