Slurm: divide work into more parts than the concurrency degree

I have a large task that I'd like to divide into parts and run in parallel using Slurm. Specifically, I'd like to divide the work into more parts than can run concurrently on my system with the resources I'd like to allocate. The basic idiom I have in mind is to call sbatch, dividing the work into k parts (where k is greater than N, the maximum number of concurrent tasks), and have Slurm queue the job steps until resources are available to run them. Concretely, imagine I have a file where each line represents a piece of work to be done, and I'd like to divide that work into 1000 pieces, running some single-threaded script to process each piece on a cluster with a total of 128 CPUs. My current sbatch script looks something like:
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=128
split -a 4 -d -n l/1000 workfile work_part_
PART_ID=0000 srun --ntasks=1 ./do_work.sh &
PART_ID=0001 srun --ntasks=1 ./do_work.sh &
# ... (repeated for PART_ID 0002 through 0998) ...
PART_ID=0999 srun --ntasks=1 ./do_work.sh &
wait
Where do_work.sh uses PART_ID to find the appropriate work part file to work from, and writes out some output part file. When I try to run this, I get a bunch of 'Resource temporarily unavailable' errors and many of the job steps show as 'Cancelled' or 'Cancelled by <my user id>'. I feel like this should be some kind of common use-case, but I can't puzzle out how to get slurm to allocate the resources properly in a single job. I know that I could just perform the split outside of slurm and queue up 1000 separate jobs with srun, but I'd prefer to have them all grouped together under one job with sbatch.

For this specific use case, srun needs the --exclusive option, otherwise each step inherits the full allocation even if you specify --ntasks=1.
Note that in recent versions of Slurm, --exclusive was superseded by --exact for this purpose, and that in even more recent versions, --exact is implied if you specify --cpus-per-task explicitly on the srun line.
You will see in the logs that all the srun steps start, but only 128 of them will do work at any time, while the others will complain that resources are temporarily unavailable until earlier steps terminate and free up resources.
Note that you can use a Bash loop or the GNU Parallel tool to avoid writing out all the steps explicitly in the submission script.
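For illustration, here is a minimal sketch of the submission script above rewritten with a Bash loop; the work_part_ naming and do_work.sh are taken from the question, and on older Slurm versions you would replace --exact with --exclusive:
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=128
# split the work file into 1000 parts named work_part_0000 ... work_part_0999
split -a 4 -d -n l/1000 workfile work_part_
for i in $(seq -f "%04g" 0 999); do
    # --exact limits each step to the resources of one task, so at most 128 steps
    # run at a time and the remaining ones wait for CPUs to be freed
    PART_ID=$i srun --ntasks=1 --cpus-per-task=1 --exact ./do_work.sh &
done
wait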

Related

SLURM and Python multiprocessing pool on a cluster

I am trying to run a simple parallel program on a SLURM cluster (4 x Raspberry Pi 3) but without success. I have been reading about it, but I just cannot get it to work. The problem is as follows:
I have a Python program named remove_duplicates_in_scraped_data.py. This program is executed on a single node (node=1xraspberry pi) and inside the program there is a multiprocessing loop section that looks something like:
import multiprocessing
from itertools import product

pool = multiprocessing.Pool()
input_iter = product(FeaturesArray_1, FeaturesArray_2, repeat=1)
results = pool.starmap(refact_featureMatch, input_iter)
The idea is that when it hits that part of the program it should distribute the calculations, one thread per element in the iterator and combine the results in the end.
So, the program remove_duplicates_in_scraped_data.py runs once (not multiple times) and it spawns different threads during the pool calculation.
On a single machine (without using SLURM) it works just fine, and for the particular case of a Raspberry Pi, it spawns 4 threads, does the calculations, saves them in results and continues the program as a single thread.
I would like to exploit all the 16 threads of the SLURM cluster but I cannot seem to get it to work. And I am confident that the cluster has been configured correctly, since it can run all the multiprocessing examples (e.g. calculate the digits of pi) using SLURM in all 16 threads of the cluster.
Now, looking at the SLURM configuration with sinfo -N -l we have:
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node01 1 picluster* idle 4 4:1:1 1 0 1 (null) none
node02 1 picluster* idle 4 4:1:1 1 0 1 (null) none
node03 1 picluster* idle 4 4:1:1 1 0 1 (null) none
node04 1 picluster* idle 4 4:1:1 1 0 1 (null) none
Each node reports 4 sockets, 1 core per socket and 1 thread per core, so as far as SLURM is concerned, 4 CPUs.
I wish to exploit all 16 CPUs, and if I run my program as:
srun -N 4 -n 16 python3 remove_duplicates_in_scraped_data.py
It will just run 4 copies of the main program on each node, resulting in 16 copies in total. But this is not what I want. I want a single instance of the program, which then spawns the 16 threads across the cluster. At least we know that with srun -N 4 -n 16 the cluster works.
So, I tried instead changing the program as follows:
#!/usr/bin/python3
#SBATCH -p picluster
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --sockets-per-node=4
import os
import sys

sys.path.append(os.getcwd())
...
...
...
pool = multiprocessing.Pool()
input_iter= product(FeaturesArray_1, FeaturesArray_2, repeat=1)
results = pool.starmap(refact_featureMatch, input_iter)
...
...
and executing it with
sbatch remove_duplicates_in_scraped_data.py
The slurm job is created successfully and I see that all nodes have been allocated on the cluster
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
picluster* up infinite 4 alloc node[01-04]
The program starts running as a single thread on node01 but when it hits the parallel part it only spawns 4 threads on node01 and nothing on all the other nodes.
I tried different combinations of settings, and even tried to run it via a shell script
#!/bin/bash
#SBATCH -p picluster
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --ntasks-per-core=1
#SBATCH --sockets-per-node=4
python3 remove_duplicates_in_scraped_data.py
but I just cannot get it to spawn on the other nodes.
Can you please help me?
Is it even possible to do this? i.e. use python's multiprocessing pool on different nodes of a cluster?
If not, what other options do I have?
The cluster also has dask configured. Would that be able to work better?
Please help as I am really stuck with this.
Thanks
Python's multiprocessing package is limited to shared-memory parallelization: it spawns new processes that all have access to the main memory of a single machine.
You cannot simply scale such software out onto multiple nodes, because the different machines do not have a shared memory that they can access.
To run your program on multiple nodes at once, you should have a look at MPI (Message Passing Interface). There is also a Python package for that (mpi4py).
Depending on your task, it may also be suitable to run the program 4 times (so one job per node) and have each instance work on a subset of the data. That is often the simpler approach, but it is not always possible.
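As a rough sketch of that last option (one job per node, each instance working on its own chunk of the data), a job array could look like the following; the chunk file names and the --input flag are hypothetical, your program may read its subset of the data some other way:
#!/bin/bash
#SBATCH -p picluster
#SBATCH --array=0-3            # one array task per chunk of data
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4      # let multiprocessing.Pool use the 4 CPUs of one Pi
# each array task processes its own pre-split chunk of the scraped data
# (chunk_0 ... chunk_3 are hypothetical file names)
python3 remove_duplicates_in_scraped_data.py --input chunk_${SLURM_ARRAY_TASK_ID}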
So instead I ran Dask with the SLURM cluster, and the Python script seems to parallelise well. This required the least amount of code changes. The multiprocessing pool code above was changed to:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(header_skip=['--mem'],
                       queue='picluster',
                       cores=4,
                       memory='1GB'
                       )
cluster.scale(cores=16)  # request 16 cores in total, i.e. 4 single-node jobs of 4 cores each
dask_client = Client(cluster)

lazy_results = []
for pair in input_iter:
    res = dask_client.submit(refact_featureMatch, pair[0], pair[1])
    lazy_results.append(res)
results = dask_client.gather(lazy_results)
There might of course be better ways of doing this via Dask; I am open to suggestions :)

HPC SLURM and batch calls to MPI-enabled application in Master-Worker system

I am trying to implement some sort of Master-Worker system on an HPC cluster with the resource manager SLURM, and I am looking for advice on how to implement such a system.
I have to use some python code that plays the role of the Master, in the sense that between batches of calculations the Master will run 2 seconds of its own calculations, before sending a new batch of work to the Workers. Each Worker must run an external executable over a single node of the HPC. The external executable (Gromacs) is itself MPI-enabled. There will be ~25 Workers and many batches of calculations.
What I have in mind at the moment (also see EDIT further below):
What I'm currently trying:
Allocate via SLURM as many MPI tasks as I want to use nodes, within a bash script that I'm calling via sbatch run.sh
#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
module load required_env_module_for_external_executable
srun python my_python_code.py
Within my_python_code.py, catch the current MPI rank, and use rank/node 0 to run the Master python code
from mpi4py import MPI

name = MPI.Get_processor_name()
rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()

if rank == 0:  # Master
    run_initialization_and_distribute_work_to_Workers()
else:  # Workers
    start_Worker_waiting_for_work()
Within the python code of the Workers, start the external (MPI-enabled) application using MPI.COMM_SELF.Spawn()
def start_Worker_waiting_for_work():
    # here we are on a single node
    executable = 'gmx_mpi'
    exec_args = 'mdrun -deffnm calculation_n'
    # create some relationship between current MPI rank
    # and the one the executable should use ?
    mpi_info = MPI.Info.Create()
    mpi_info.Set('host', MPI.Get_processor_name())
    commspawn = MPI.COMM_SELF.Spawn(executable, args=exec_args,
                                    maxprocs=1, info=mpi_info)
    commspawn.Barrier()
    commspawn.Disconnect()
    res_analysis = do_some_analysis()  # check what the executable produced
    return res_analysis
What I would like some explanations on:
Can someone confirm that this approach seems valid for implementing the desired system ? Or is it obvious this has no chance to work ? If so, please, why ?
I am not sure that MPI.COMM_SELF.Spawn() will make the executable inherit from the SLURM resource allocation. If not, how to fix this ? I think that MPI.COMM_SELF.Spawn() is what I am looking for, but I'm not sure.
The external executable requires some environment modules to be loaded. If they are loaded at sbatch run.sh, are they still loaded when I invoke MPI.COMM_SELF.Spawn() from my_python_code.py ?
As a slightly different approach, is it possible to have something like pre-allocations/reservations to book resources for the Workers, then use MPI.COMM_WORLD.Spawn() together with the pre-allocations/reservations ? The goal is also to avoid entering the SLURM queue at each new batch, as this may waste a lot of clock time (hence the will to book all required resources at the very beginning).
Since the python Master has to always stay alive anyways, SLURM job dependencies cannot be useful here, can they ?
Thank you so much for any help you may provide !
EDIT: Simplification of the workflow
In an attempt to keep my question simple, I first omitted the fact that I actually had the Workers doing some analysis. But this work can be done on the Master using OpenMP multiprocessing, as Gilles Gouillardet suggested. It executes fast enough.
The Workers are indeed necessary, because each task takes about 20-25 min on a single Worker/Node.
I also added some bits about maintaining my own queue of tasks to be sent to the SLURM queue and ultimately to the Workers, in case the number of tasks exceeds a few tens or hundreds of jobs. This should also provide some flexibility in the future, when re-using this code for different applications.
Probably this is fine like this. I will try to go this way and update these lines. EDIT: It works fine.
At first glance, this looks overly convoluted to me:
there is no communication between a slave and GROMACS
there is some master/slave communication, but is MPI really necessary?
are the slaves really necessary? (e.g. can the master process simply serialize the computation and then directly start GROMACS?)
A much simpler architecture would be to have one process on your frontend, that will do:
prepare the GROMACS inputs
sbatch gromacs (start several jobs in a row)
wait for the GROMACS jobs to complete
analyze the GROMACS outputs
re-iterate or exit
If the slave is doing some work you do not want to serialize on the master, can you replace the MPI communication with files on a shared filesystem? In that case, you can do the computation on the compute nodes within a GROMACS job, before and after executing GROMACS. If not, maybe TCP/IP based communications can do the trick.
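A minimal sketch of that simpler frontend loop, assuming a single-node GROMACS submission script gromacs_job.sh and hypothetical helper scripts for input preparation and output analysis:
#!/bin/bash
# driver process running on the frontend; one iteration per batch of calculations
while true; do
    ./prepare_gromacs_inputs.sh                       # hypothetical: write this batch's inputs
    for i in $(seq 1 25); do
        # submit one single-node GROMACS job per task; --wait blocks until that job ends
        sbatch --wait gromacs_job.sh calculation_$i &
    done
    wait                                              # all GROMACS jobs of this batch are done
    ./analyze_gromacs_outputs.sh || break             # hypothetical: analyze, stop when finished
done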

Slurm Question: Array Job VS srun in a sbatch

What's the difference between the two following parallelization schemes on Slurm?
Scheme 1
Run sbatch script.sh
#!/bin/bash
#SBATCH --ntasks=8
## more options
srun echo hello
This summons 8 jobs that run echo hello
Scheme 2
I've accomplished something similar using array jobs.
#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-8
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
# Print this sub-job's task ID
echo hello
Is there any difference between the two schemes? They both seem to accomplish the same thing.
Scheme 1 is one single job (with 8 tasks) while Scheme 2 is 8 distinct jobs (each with one task). In the first case, all the tasks will be scheduled at the same time, while in the second case, the 8 tasks will be scheduled independently one of another.
With the job array (Scheme 2), if 8 CPUs become available at once, all 8 tasks will start at the same time, but if only 4 CPUs become available at first, 4 tasks will run while the other 4 remain pending; when the initial 4 are done, the other 4 are started. This is typically used for embarrassingly parallel jobs, where the processes do not need to communicate or synchronise, for example applying the same program to a list of files.
By contrast, with a single job (Scheme 1), Slurm will start the 8 tasks at the same time, so it needs 8 CPUs to become available at the same time. This is typically used only for parallel jobs where processes need to communicate with each other, for instance using a Message Passing Interface library.
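As a side note, if you want the job-array behaviour of Scheme 2 but also want to cap how many of its tasks run concurrently, the % throttle on --array expresses that directly (a small sketch; the limit of 4 is arbitrary):
#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --array=1-8%4     # 8 independent array tasks, at most 4 running at once
#SBATCH --ntasks=1
echo hello from task $SLURM_ARRAY_TASK_ID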

SLURM Embarrassingly parallel submission taking too many resources

So, I have the following submission script:
#!/bin/bash
#
#SBATCH --job-name=P6
#SBATCH --output=P6.txt
#SBATCH --partition=workq
#SBATCH --ntasks=512
#SBATCH --time=18:00:00
#SBATCH --mem-per-cpu=2500
#SBATCH --cpus-per-task=1
#SBATCH --array=1-512
srun ./P6 $SLURM_ARRAY_TASK_ID
What I want to do is run 512 instances of the program P6 with an argument from 1 to 512, and as far as I know the submission above does that. However, upon inspecting squeue and sacct, SLURM seems to have assigned 512 CPUs to each task!
What did I do wrong?
You asked for 512 tasks for every job. Ask for a single one (or the number you consider appropriate for your code):
#SBATCH --ntasks=1
BTW, there are a few minor problems in your submission script. All jobs in the job array will be named the same way (which is not really a problem), but they will also share the stdout file, so you will have mixed output from all tasks in P6.txt. I would advise you to differentiate them with the job ID or the array task ID (%j/%A/%a).
Also, you don't define the standard error destination, so if anything fails or is written to stderr, you will lose that information. My recommendation is to define the standard error too (#SBATCH --error=P6.txt.%j).
Another detail is that the working folder is not defined. It will work as long as you submit the script from the proper folder, but if you try to submit it from another place, it will fail.
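Putting those points together, a corrected submission script could look roughly like this; the --chdir path is only a placeholder to adapt:
#!/bin/bash
#
#SBATCH --job-name=P6
#SBATCH --output=P6_%A_%a.out      # separate stdout per array task
#SBATCH --error=P6_%A_%a.err       # separate stderr per array task
#SBATCH --partition=workq
#SBATCH --ntasks=1                 # one task per array element
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2500
#SBATCH --time=18:00:00
#SBATCH --array=1-512
#SBATCH --chdir=/path/to/workdir   # placeholder: set the working directory explicitly
srun ./P6 $SLURM_ARRAY_TASK_ID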

Running a multi-stage job using SLURM

I am new to SLURM. My problem is that I have a multi-stage job, which needs to be run on a cluster, whose jobs are managed by SLURM. Specifically I want to schedule a job which:
Grabs N nodes,
Installs a software package on all of them,
(once all nodes finish the installation successfully) creates a database instance on the nodes,
Loads the database,
(once loading is done successfully) runs a set of queries, for benchmarking purposes,
Drops the database and returns the nodes.
Each step could be run using a separate bash script; while the execution of the scripts and transitions between stages are coordinated by a master node.
My problem is that I know how to allocate nodes and call a single command or script on each (which runs as a stand-alone job on each node) using SLURM. But as soon as the command is done (or the called script is finished) on a node, that node returns to the pool of free resources of my allocation. The above use case, however, involves several stages/scripts and needs coordination between them.
I am wondering what the correct way is to design/run a set of scripts for such a use case, using SLURM. Any suggestion or example would be extremely helpful, and highly appreciated.
You simply need to encapsulate all your scripts into a single one for submission:
#!/bin/bash
#SBATCH --nodes=4 --exclusive
# Setting Bash to exit whenever a command exits with a non-zero status.
set -e
set -o pipefail
echo "Installing software on each of $SLURM_NODELIST"
srun ./install.sh
echo "Creating database instance"
./createDBInstance.sh $SLURM_NODELIST
echo "Loading DB"
./loadDB.sh params
echo Benchmarking
./benchmarks.sh params
echo Done.
You'll need to fill in the blanks... Make sure that your scripts follow the convention of exiting with a non-zero status on error.
