How to run make in several threads?

In the OpenCV docs I came across the following text:
From build directory execute make, it is recommended to do this in several threads
Is there an easy way to do this?

GNU make supports the -j option to run multiple jobs in parallel:
-j [jobs], --jobs[=jobs]
Specifies the number of jobs (commands) to run simultaneously. If there is more than one -j option, the last one is effective. If the -j option is given without an argument, make will not limit the number of jobs that can run simultaneously. When make invokes a sub-make, all instances of make will coordinate to run the specified number of jobs at a time; see the section PARALLEL MAKE AND THE JOBSERVER for details.
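For example, a common choice is to match the job count to the number of available CPU cores (nproc is a GNU coreutils command):

make -j"$(nproc)"

or to cap it explicitly, e.g. make -j4 to run at most four jobs at a time.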

Related

single slurm array vs multiple sbatch calls

I can run N embarrassingly parallel jobs by using a slurm array like:
#SBATCH --array=1-N
Alternatively, I think I can achieve the same from a scheduling perspective (i.e. scheduled independently and as soon as resources become available) by manually launching 8 jobs, for example with a simple bash script containing a loop.
Since the latter is far more flexible, I don't see the utility of using the --array option built into Slurm.
Am I missing something?
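For reference, the loop I have in mind would look something like this (a sketch; job.sh stands for my actual job script):

#!/bin/bash
# submit 8 independent jobs, passing the task index as an argument
for i in $(seq 1 8); do
    sbatch job.sh "$i"
done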
Arrays offer a simple way to create parametrised jobs without writing the Bash loop. An array:
- (obviously) creates the jobs and assigns each of them a parameter;
- takes care of output file name parametrisation (see the example at the end of this answer);
- makes it much easier to submit a dependent job that should run after all the array's jobs are complete;
- makes the output of squeue less cluttered.
Furthermore, the jobs in an array can be managed as a whole; the squeue, scancel, etc. commands can work on the whole array, as opposed to writing yet another loop, for instance to cancel them. This is even more interesting when you have multiple arrays running at the same time; you do not need to track each individual job yourself.
Finally, especially for large arrays, it makes the scheduler's work easier and can increase job throughput.
If you need flexibility, then job arrays are not the solution, but maybe a workflow manager could help you.
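A minimal example of such an array job (the data file naming is just an illustration; %A and %a expand to the master job ID and the array task ID):

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --output=result_%A_%a.out   # per-task output files, no loop needed
./my_program data_"${SLURM_ARRAY_TASK_ID}".in

A post-processing job that must wait for the whole array can then be submitted with sbatch --dependency=afterok:<array_job_id> postprocess.sh.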

HPC SLURM and batch calls to MPI-enabled application in Master-Worker system

I am trying to implement some sort of Master-Worker system on an HPC cluster with the resource manager SLURM, and I am looking for advice on how to implement such a system.
I have to use some Python code that plays the role of the Master, in the sense that between batches of calculations the Master will run 2 seconds of its own calculations before sending a new batch of work to the Workers. Each Worker must run an external executable over a single node of the HPC cluster. The external executable (Gromacs) is itself MPI-enabled. There will be ~25 Workers and many batches of calculations.
What I have in mind at the moment (also see the EDIT further below), and what I am currently trying:
Allocate via SLURM as many MPI tasks as the number of nodes I want to use, within a bash script that I submit via sbatch run.sh:
#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
module load required_env_module_for_external_executable
srun python my_python_code.py
Within my_python_code.py, query the current MPI rank, and use rank/node 0 to run the Master Python code:
from mpi4py import MPI

name = MPI.Get_processor_name()
rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()

if rank == 0:  # Master
    run_initialization_and_distribute_work_to_Workers()
else:  # Workers
    start_Worker_waiting_for_work()
Within the Python code of the Workers, start the external (MPI-enabled) application using MPI.COMM_SELF.Spawn():

def start_Worker_waiting_for_work():
    # here we are on a single node
    executable = 'gmx_mpi'
    # Spawn expects the arguments as a list of strings, not a single string
    exec_args = ['mdrun', '-deffnm', 'calculation_n']
    # create some relationship between the current MPI rank
    # and the one the executable should use?
    mpi_info = MPI.Info.Create()
    mpi_info.Set('host', MPI.Get_processor_name())
    commspawn = MPI.COMM_SELF.Spawn(executable, args=exec_args,
                                    maxprocs=1, info=mpi_info)
    commspawn.Barrier()
    commspawn.Disconnect()
    res_analysis = do_some_analysis()  # check what the executable produced
    return res_analysis
What I would like some explanations on:
Can someone confirm that this approach seems valid for implementing the desired system? Or is it obvious that this has no chance of working? If so, why?
I am not sure that MPI.COMM_SELF.Spawn() will make the executable inherit the SLURM resource allocation. If not, how can I fix this? I think that MPI.COMM_SELF.Spawn() is what I am looking for, but I'm not sure.
The external executable requires some environment modules to be loaded. If they are loaded when sbatch run.sh is executed, are they still loaded when I invoke MPI.COMM_SELF.Spawn() from my_python_code.py?
As a slightly different approach, is it possible to have something like pre-allocations/reservations to book resources for the Workers, and then use MPI.COMM_WORLD.Spawn() together with the pre-allocations/reservations? The goal is also to avoid entering the SLURM queue at each new batch, as this may waste a lot of clock time (hence the wish to book all required resources at the very beginning).
Since the Python Master has to stay alive at all times anyway, SLURM job dependencies cannot be useful here, can they?
Thank you so much for any help you may provide!
EDIT: Simplification of the workflow
In an attempt to keep my question simple, I first omitted the fact that I actually had the Workers doing some analysis. But this work can be done on the Master using OpenMP multiprocessing, as Gilles Gouillardet suggested. It executes fast enough.
The Workers are indeed necessary, though, because each task takes about 20-25 min on a single Worker/Node.
I also added some bits about maintaining my own queue of tasks to be sent to the SLURM queue and ultimately to the Workers, in case the number of tasks were to exceed a few tens/hundreds of jobs. This should also provide some flexibility in the future, when re-using this code for different applications.
Probably this is fine like this. I will try to go this way and update these lines. EDIT: It works fine.
At first glance, this looks overly convoluted to me:
- there is no communication between a slave and GROMACS;
- there is some master/slave communication, but is MPI really necessary?
- are the slaves really necessary? (e.g. can the master process simply serialize the computation and then directly start GROMACS?)
A much simpler architecture would be to have one process on your frontend (sketched below) that would:
- prepare the GROMACS inputs
- sbatch gromacs (start several jobs in a row)
- wait for the GROMACS jobs to complete
- analyze the GROMACS outputs
- re-iterate or exit
If the slave is doing some work you do not want to serialize on the master, can you replace the MPI communications with files on a shared filesystem? In that case, you can do the computation on the compute nodes within a GROMACS job, before and after executing GROMACS. If not, maybe TCP/IP-based communications can do the trick.
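A rough sketch of that frontend process, assuming hypothetical prepare/analysis scripts and using sbatch --wait, which blocks until the submitted job terminates:

#!/bin/bash
while true; do
    python prepare_gromacs_inputs.py            # hypothetical preparation step
    for n in $(seq 1 25); do
        sbatch --wait gromacs_job_"$n".sh &     # each sbatch blocks until its job ends
    done
    wait                                        # wait for the whole batch
    python analyze_gromacs_outputs.py || break  # hypothetical analysis; break when done
done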

Specify multiple GRES type options in SLURM

I've been using SLURM to request specific GPUs, like so:
--gres=gpu:TYPE:1
On the cluster I'm using there are 4 different GPUs available, all with their specific gres types.
For some jobs I don't care which GPU is used, so I can specify:
--gres=gpu:1
However, sometimes I'd like one of a few specific types, but among those I don't really care which one I get; basically the first one that is available.
So I would hope to specify something like:
--gres=gpu:TYPE1:1 OR --gres=gpu:TYPE2:1
So that it would pick whichever is available first.
However, I've been unable to find such an option. Does this option exist in SLURM?
Contrary to the --constraint option, the --gres option does not allow logical constructs. One workaround would be to submit two jobs and scancel the one that starts later.
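For example (job.sh is a placeholder; --parsable makes sbatch print only the job ID):

jid1=$(sbatch --parsable --gres=gpu:TYPE1:1 job.sh)
jid2=$(sbatch --parsable --gres=gpu:TYPE2:1 job.sh)
# once one of them starts, cancel the other, e.g.:
scancel "$jid2"

If your site also exposes the GPU types as node features, --constraint does accept a logical OR (e.g. --constraint="TYPE1|TYPE2"), but whether that works depends on the cluster configuration.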

Force qsub (PBS) to wait for the job's end before exiting

I've been using Sun Grid Engine to run my jobs on a node of a cluster.
Usually I wait for the job to complete before exiting, and for that I use:
qsub -sync yes perl Script.pl
However, I no longer use Sun Grid Engine; I now use PBS Pro 10.4, and I am not able to find an option corresponding to -sync.
Could someone help me?
Thanks in advance
PBS Pro doesn't have a direct -sync equivalent, but you might be able to use the -I (interactive) option combined with expect, to tell it what code to run in order to get the same effect.
The equivalent of -sync for PBS is -Wblock=true.
This prevents qsub from exiting until the job has completed. It is perhaps unusual to need this, but I found it useful when using some software that was not designed for HPC. The software executes multiple instances of a worker program, which run simultaneously. However, it then has to wait for one (or sometimes more) of the instances to complete, and do some work with the results, before spawning the next. If the worker program completes without writing a particular file, it is assumed to have failed. I was able to write a wrapper script for the worker program, to qsub it, and used the -Wblock=true option to make it wait for the worker program job to complete.
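For example (my_job_script.sh being the job script to run):

qsub -Wblock=true my_job_script.sh
echo "worker job finished"   # reached only once the job has completed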

Does a PBS batch system move multiple serial jobs across nodes?

If I need to run many serial programs "in parallel" (because the problem is simple but time consuming - I need to read in many different data sets for the same program), the solution is simple if I only use one node. All I do is keep submitting serial jobs with an ampersand after each command, e.g. in the job script:
./program1 &
./program2 &
./program3 &
./program4 &
wait   # do not exit the job script until all background programs have finished
which will naturally run each serial program on a different processor. This works well on a login server or standalone workstation, and of course for a batch job asking for only one node.
But what if I need to run 110 different instances of the same program to read 110 different data sets? If I submit to multiple nodes (say 14) with a script which submits 110 ./program# commands, will the batch system run each job on a different processor across the different nodes, or will it try to run them all on the same 8-core node?
I have tried to use a simple MPI code to read different data, but various errors result, with about 100 out of the 110 processes succeeding, and the others crashing. I have also considered job arrays, but I'm not sure if my system supports it.
I have tested the serial program extensively on individual data sets - there are no runtime errors, and I do not exceed the available memory on each node.
No, PBS won't automatically distribute the jobs among nodes for you. But this is a common thing to want to do, and you have a few options.
The easiest option, and in some ways the most advantageous for you, is to bunch the tasks into 1-node-sized chunks and submit those bundles as individual jobs. This will get your jobs started faster; a 1-node job will normally get scheduled faster than a (say) 14-node job, simply because there are more one-node-sized holes in the schedule than 14-node-sized ones. This works particularly well if all the jobs take roughly the same amount of time, because then doing the division is pretty simple.
If you do want to do it all in one job (say, to simplify the bookkeeping), you may or may not have access to the pbsdsh command; there's a good discussion of it here. This lets you run a single script on all the processors in your job. You then write a script which queries $PBS_VNODENUM to find out which of the nnodes*ppn jobs it is, and runs the appropriate task.
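Such a wrapper might look like this (a sketch; the dataset_N naming is hypothetical, and depending on your PBS setup you may need extra steps to propagate the environment to the pbsdsh-spawned copies):

#!/bin/bash
# wrapper.sh, launched once per allocated processor via: pbsdsh /path/to/wrapper.sh
TASK=$PBS_VNODENUM               # unique index among the nnodes*ppn copies
if [ "$TASK" -lt 110 ]; then
    cd "$PBS_O_WORKDIR"          # the submission directory
    ./program dataset_"$TASK"
fi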
If pbsdsh is not available, GNU parallel is another tool which can enormously simplify these tasks. It's like xargs, if you're familiar with that, but runs commands in parallel, including on multiple nodes. So you'd submit your (say) 14-node job and have the first node run a GNU parallel script. The nice thing is that this will do the scheduling for you even if the jobs are not all of the same length. The advice we give to users on our system for using GNU parallel for these sorts of things is here. Note that if GNU parallel isn't installed on your system, and for some reason your sysadmins won't install it, you can set it up in your home directory; it's not a complicated build.
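For instance, from the first node of the job, something along these lines (a sketch; the dataset_N naming is hypothetical, and -j 8 runs 8 tasks per node at a time):

# $PBS_NODEFILE usually lists each node once per core; keep unique names
sort -u "$PBS_NODEFILE" > nodes.txt
# run ./program over data sets 0..109 across all nodes
parallel -j 8 --sshloginfile nodes.txt --workdir "$PWD" ./program dataset_{} ::: $(seq 0 109)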
You should consider job arrays.
Briefly, you insert #PBS -t 0-109 in your shell script (where the range 0-109 can be any integer range you want, but you stated you had 110 datasets) and torque will:
run 110 instances of your script, allocating each with the resources you specify (in the script with #PBS tags or as arguments when you submit).
assign a unique integer from 0 to 109 to the environment variable PBS_ARRAYID for each job.
Assuming you have access to environment variables within the code, you can just tell each job to run on data set number PBS_ARRAYID.
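A minimal job script along those lines (the dataset_N naming is an assumption):

#!/bin/bash
#PBS -t 0-109
#PBS -l nodes=1:ppn=1
cd "$PBS_O_WORKDIR"
# each of the 110 instances receives its own PBS_ARRAYID (0..109)
./program dataset_"$PBS_ARRAYID"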
