single slurm array vs multiple sbatch calls - slurm

I can run N embarrassingly parallel jobs by using a slurm array like:
#SBATCH --array=1-N
Alternatively, I think I can achieve the same from a scheduling perspective (i.e. jobs scheduled independently and started as soon as resources become available) by manually launching the N jobs myself, for example with a simple bash script containing a loop (sketched below).
Since the latter is far more flexible, I don't see the utility of using the --array option built into slurm.
Am I missing something?
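For concreteness, the manual alternative I have in mind could look something like this (a sketch; job.sh and the TASK_ID variable are placeholders):

#!/bin/bash
N=8   # number of independent jobs to launch
for i in $(seq 1 $N); do
    sbatch --export=ALL,TASK_ID=$i job.sh   # job.sh reads $TASK_ID to pick its input
done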

Arrays offer a simple way to create parametrised jobs without writing the Bash loop. An array:
(obviously) creates the jobs and assigns them a parameter;
takes care of output file name parametrisation;
makes the submission of a dependent job, which should run after all those jobs are completed, much easier;
makes the output of squeue less cluttered.
Furthermore, the jobs in an array can be managed as a whole; the squeue, scancel, etc. commands can work on the whole array, as opposed to writing another loop to cancel them for instance. This is even more interesting when you have multiple arrays running at the same time; you do not need to track each individual job yourself.
Finally, especially for large arrays, it makes the scheduler's work easier and can increase job throughput.
If you need flexibility, then job arrays are not the solution, but maybe a workflow manager could help you.
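As an illustration, a minimal sketch of the array approach, including a dependent job that waits for the whole array (file names and the analysis script are placeholders):

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --output=result_%A_%a.out   # %A = array job id, %a = array task id
./my_program input_${SLURM_ARRAY_TASK_ID}.dat

Submitted separately, a single dependent job can then wait for every task in the array, and the whole array can be cancelled with one scancel <array_job_id>:

sbatch --dependency=afterok:<array_job_id> analysis.sh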

Related

HPC SLURM and batch calls to MPI-enabled application in Master-Worker system

I am trying to implement some sort of Master-Worker system on an HPC cluster with the resource manager SLURM, and I am looking for advice on how to implement such a system.
I have to use some python code that plays the role of the Master, in the sense that between batches of calculations the Master will run 2 seconds of its own calculations, before sending a new batch of work to the Workers. Each Worker must run an external executable over a single node of the HPC. The external executable (Gromacs) is itself MPI-enabled. There will be ~25 Workers and many batches of calculations.
What I have in mind at the moment (also see EDIT further below), and what I'm currently trying:
Allocate via SLURM as many MPI tasks as the number of nodes I want to use, within a bash script that I submit via sbatch run.sh:
#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
module load required_env_module_for_external_executable
srun python my_python_code.py
Within my_python_code.py, get the current MPI rank, and use rank/node 0 to run the Master python code:
from mpi4py import MPI

name = MPI.Get_processor_name()
rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()

if rank == 0:  # Master
    run_initialization_and_distribute_work_to_Workers()
else:  # Workers
    start_Worker_waiting_for_work()
Within the python code of the Workers, start the external (MPI-enabled) application using MPI.COMM_SELF.Spawn()
def start_Worker_waiting_for_work():
    # here we are on a single node
    executable = 'gmx_mpi'
    exec_args = 'mdrun -deffnm calculation_n'
    # create some relationship between current MPI rank
    # and the one the executable should use ?
    mpi_info = MPI.Info.Create()
    mpi_info.Set('host', MPI.Get_processor_name())
    commspawn = MPI.COMM_SELF.Spawn(executable, args=exec_args,
                                    maxprocs=1, info=mpi_info)
    commspawn.Barrier()
    commspawn.Disconnect()
    res_analysis = do_some_analysis()  # check what the executable produced
    return res_analysis
What I would like some explanation on:
Can someone confirm that this approach seems valid for implementing the desired system? Or is it obvious that this has no chance of working? If so, why?
I am not sure that MPI.COMM_SELF.Spawn() will make the executable inherit the SLURM resource allocation. If not, how can I fix this? I think that MPI.COMM_SELF.Spawn() is what I am looking for, but I'm not sure.
The external executable requires some environment modules to be loaded. If they are loaded when sbatch run.sh is called, are they still loaded when I invoke MPI.COMM_SELF.Spawn() from my_python_code.py?
As a slightly different approach, is it possible to have something like pre-allocations/reservations to book resources for the Workers, then use MPI.COMM_WORLD.Spawn() together with the pre-allocations/reservations? The goal is also to avoid entering the SLURM queue at each new batch, as this may waste a lot of clock time (hence the wish to book all required resources at the very beginning).
Since the python Master has to stay alive the whole time anyway, SLURM job dependencies cannot be useful here, can they?
Thank you so much for any help you may provide!
EDIT: Simplification of the workflow
In an attempt to keep my question simple, I first omitted the fact that I actually had the Workers doing some analysis. But this work can be done on the Master using OpenMP multiprocessing, as Gilles Gouillardet suggested. It executes fast enough.
The Workers are indeed necessary, because each task takes about 20-25 min on a single Worker/Node.
I also added some bits about maintaining my own queue of tasks to be sent to the SLURM queue and ultimately to the Workers, in case the number of tasks would exceed a few tens/hundreds of jobs. This should also provide some flexibility in the future, when re-using this code for different applications.
Probably this is fine like this. I will try to go this way and update these lines. EDIT: It works fine.
At first glance, this looks overly convoluted to me:
there is no communication between a slave and GROMACS;
there is some master/slave communication, but is MPI really necessary?
are the slaves really necessary? (e.g. can the master process simply serialize the computation and then directly start GROMACS?)
A much simpler architecture would be to have one process on your frontend (sketched below) that will:
prepare the GROMACS inputs
sbatch gromacs (start several jobs in a row)
wait for the GROMACS jobs to complete
analyze the GROMACS outputs
re-iterate or exit
If the slave is doing some work you do not want to serialize on the master, can you replace the MPI communications with files on a shared filesystem? In that case, you can do the computation on the compute nodes within a GROMACS job, before and after executing GROMACS. If not, maybe TCP/IP-based communications can do the trick.
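A rough Bash sketch of that frontend loop, assuming a shared filesystem, ~25 single-node GROMACS jobs per batch, and sbatch --wait (available in recent Slurm versions) to block until each job finishes; prepare_inputs.py, analyze_outputs.py and gromacs_job.sh are placeholders:

#!/bin/bash
while true; do
    python prepare_inputs.py                      # master: prepare the next batch
    for i in $(seq 1 25); do
        sbatch --wait gromacs_job.sh input_$i &   # one single-node GROMACS job per worker
    done
    wait                                          # block until all jobs of the batch are done
    python analyze_outputs.py || break            # master: analysis; exit when finished
done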

Do not use all MATLAB pool workers

I have set up a local Matlab (R2015b) pool of workers according to my CPU configuration (quad-core with multi-threading, so 8 workers in total).
I have simulations that last 24h but I want to be able to use my computer at the same time. Therefore, I limit myself to 4 simulations a day (sent via batch) so that I can keep working at the same time.
My question is this: how can I queue several jobs without eating up the 8 workers? Another related question is if I reduce the size of the pool to 4 workers, will I still be able to run Matlab smoothly?
Thank you very much for your answer.
I would say that the best solution to your problem is to do it via bash instead of Matlab. In bash you have a command called nice which allows you to down-prioritize the simulation. This means that when you are using the computer, you get the processing power, and when you are not, the power goes to the computations.
Regarding the second part of your question. The easiest way to queue all the jobs is to make a bash script something like the following:
for f in $(find . -name "name_of_matlab_script*")
do
    nice -n 10 matlab -nodisplay < "$f"
done
where the Matlab scripts share the same base name and the wildcard takes care of the rest. The scripts will then run one after another, while giving priority to whatever else you are using your computer for.
If you want more advanced scheduling software, I normally use Slurm.
Regarding using 4 workers instead of 8: as Ander Biguri says in the comments, use as few as possible, as long as you do not add too much extra time.

Waiting on many parallel shell commands with Perl

Concise-ish problem explanation:
I'd like to be able to run multiple (we'll say a few hundred) shell commands, each of which starts a long running process and blocks for hours or days with at most a line or two of output (this command is simply a job submission to a cluster). This blocking is helpful so I can know exactly when each finishes, because I'd like to investigate each result and possibly re-run each multiple times in case they fail. My program will act as a sort of controller for these programs.
for all commands in parallel {
    submit_job_and_wait()
    tries = 1
    while ! job_was_successful and tries < 3 {
        resubmit_with_extra_memory_and_wait()
        tries++
    }
}
What I've tried/investigated:
I was so far thinking it would be best to create a thread for each submission which just blocks waiting for input. There is enough memory for quite a few waiting threads. But from what I've read, perl threads are closer to duplicate processes than in other languages, so creating hundreds of them is not feasible (nor does it feel right).
There also seem to be a variety of event-loop-ish cooperative systems like AnyEvent and Coro, but these seem to require you to rely on asynchronous libraries, otherwise you can't really do anything concurrently. I can't figure out how to run multiple shell commands concurrently with them. I've tried using AnyEvent::Util::run_cmd, but after I submit multiple commands, I have to specify the order in which I want to wait for them. I don't know in advance how long each submission will take, so I can't recv without sometimes getting very unlucky. This isn't really parallel.
my $cv1 = run_cmd("qsub -sync y 'sleep $RANDOM'");
my $cv2 = run_cmd("qsub -sync y 'sleep $RANDOM'");
# Now should I $cv1->recv first or $cv2->recv? Who knows!
# Out of 100 submissions, I may have to wait on the longest one before processing any.
My understanding of AnyEvent and friends may be wrong, so please correct me if so. :)
The other option is to run the job submission in its non-blocking form and have it communicate its completion back to my process, but the inter-process communication required to accomplish and coordinate this across different machines daunts me a little. I'm hoping to find a local solution before resorting to that.
Is there a solution I've overlooked?
You could instead use scientific workflow software such as FireWorks or Pegasus, which are designed to help scientists submit large numbers of computing jobs to shared or dedicated resources. They can also do much more, so they might be overkill for your problem, but they are still worth having a look at.
If your goal is to find the tightest memory requirements for your job, you could also simply submit your job with a large amount of requested memory, and then extract the actual memory usage from accounting (qacct) or, cluster policy permitting, log on to the compute node(s) where your job is running and view the memory usage with top or ps.
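For example, with Grid Engine (which the qsub/qacct commands suggest) that could look roughly like this; the resource name h_vmem, the 16G request and myjob.sh are assumptions about your site's configuration:

qsub -l h_vmem=16G myjob.sh        # submit with a deliberately generous memory request
qacct -j <job_id> | grep maxvmem   # after completion, read the actual peak memory from accounting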

How do I process a partial order of tasks concurrently using Perl?

I have a partially ordered set of tasks, where for each task all of the tasks that are strictly before it in the partial order must be executed before it can be executed. I want to execute tasks which are not related (either before or after one other) concurrently to try to minimise the total execution time - but without starting a task before its dependencies are completed.
The tasks will run as (non-perl) child processes.
How should I approach solving a problem like this using Perl? What concurrency control facilities and data structures are available?
I would use a hash of arrays. For each task, all its prerequisites will be listed in the corresponding array:
$prereq{task1} = [qw/task2 task3 task4/];
I would keep completed tasks in a different hash, and then just
my @prereq = @{ $prereq{$task} };
if (@prereq == grep exists $completed{$_}, @prereq) {
    run($task);
}
Looks like a full solution is NP-complete.
As for a partial solution, I would use some form of reference counting to determine which jobs are ready to run, Forks::Super::Job to run the background jobs and check their statuses, and POSIX::pause to sleep when the maximum number of jobs has been spawned.
No threads are involved since you're already dealing with separate processes.
Read the first link for possible algorithms/heuristics to determine runnable jobs' priorities.

Does a PBS batch system move multiple serial jobs across nodes?

If I need to run many serial programs "in parallel" (because the problem is simple but time consuming - I need to read in many different data sets for the same program), the solution is simple if I only use one node. All I do is keep submitting serial jobs with an ampersand after each command, e.g. in the job script:
./program1 &
./program2 &
./program3 &
./program4
which will naturally run each serial program on a different processor. This works well on a login server or standalone workstation, and of course for a batch job asking for only one node.
But what if I need to run 110 different instances of the same program to read 110 different data sets? If I submit to multiple nodes (say 14) with a script which submits 110 ./program# commands, will the batch system run each job on a different processor across the different nodes, or will it try to run them all on the same 8-core node?
I have tried to use a simple MPI code to read different data, but various errors result, with about 100 out of the 110 processes succeeding, and the others crashing. I have also considered job arrays, but I'm not sure if my system supports it.
I have tested the serial program extensively on individual data sets - there are no runtime errors, and I do not exceed the available memory on each node.
No, PBS won't automatically distribute the jobs among nodes for you. But this is a common thing to want to do, and you have a few options.
Easiest, and in some ways most advantageous for you, is to bunch the tasks into 1-node-sized chunks and submit those bundles as individual jobs. This will get your jobs started faster; a 1-node job will normally get scheduled faster than a (say) 14-node job, just because there are more one-node-sized holes in the schedule than 14-node ones. This works particularly well if all the jobs take roughly the same amount of time, because then doing the division is pretty simple.
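For example, each bundle could be a one-node job that starts 8 instances in the background and waits for all of them (the data-set naming is a placeholder):

#!/bin/bash
#PBS -l nodes=1:ppn=8
cd $PBS_O_WORKDIR
for i in 1 2 3 4 5 6 7 8; do
    ./program < dataset_$i.dat > output_$i.log &
done
wait   # keep the job alive until all 8 instances have finished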
If you do want to do it all in one job (say, to simplify the bookkeeping), you may or may not have access to the pbsdsh command; there's a good discussion of it here. This lets you run a single script on all the processors in your job. You then write a script which queries $PBS_VNODENUM to find out which of the nnodes*ppn jobs it is, and runs the appropriate task.
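A sketch of that pbsdsh variant, assuming PBS exports $PBS_O_WORKDIR and $PBS_VNODENUM to the spawned tasks; run_one_task.sh and the data-set naming are hypothetical:

#!/bin/bash
# run_one_task.sh - executed once per allocated core by pbsdsh
cd "$PBS_O_WORKDIR"
./program < dataset_${PBS_VNODENUM}.dat > output_${PBS_VNODENUM}.log

The job script then only allocates the nodes and calls pbsdsh:

#PBS -l nodes=14:ppn=8
pbsdsh $PBS_O_WORKDIR/run_one_task.sh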
If pbsdsh is not available, GNU parallel is another tool which can enormously simplify these tasks. It's like xargs, if you're familiar with that, but will run commands in parallel, including on multiple nodes. So you'd submit your (say) 14-node job and have the first node run a GNU parallel script. The nice thing is that this will do scheduling for you even if the jobs are not all of the same length. The advice we give to users on our system for using GNU parallel for these sorts of things is here. Note that if GNU parallel isn't installed on your system, and for some reason your sysadmins won't install it, you can set it up in your home directory; it's not a complicated build.
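A sketch of the GNU parallel variant, assuming parallel is installed on every node, the filesystem is shared, datasets.txt lists one input file per line, and ./program takes its input file as an argument:

#PBS -l nodes=14:ppn=8
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE > nodes.txt   # one entry per node instead of one per core
parallel --jobs 8 --sshloginfile nodes.txt --workdir $PBS_O_WORKDIR \
    ./program {} :::: datasets.txt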
You should consider job arrays.
Briefly, you insert #PBS -t 0-109 in your shell script (where the range 0-109 can be any integer range you want, but you stated you had 110 datasets) and torque will:
run 110 instances of your script, allocating each with the resources you specify (in the script with #PBS tags or as arguments when you submit).
assign a unique integer from 0 to 109 to the environment variable PBS_ARRAYID for each job.
Assuming you have access to environment variables within the code, you can just tell each job to run on data set number PBS_ARRAYID.
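A minimal sketch of what that looks like with Torque; the data-set naming is an assumption:

#!/bin/bash
#PBS -t 0-109
#PBS -l nodes=1:ppn=1
cd $PBS_O_WORKDIR
./program < dataset_${PBS_ARRAYID}.dat > output_${PBS_ARRAYID}.log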
