Hybrid parallelization with OpenMP and MPI - multithreading

I'm trying to set up a program that runs across a cluster of 20 nodes, each with 12 cores. The idea is to have the head process distribute some data out to each node, and have each node operate on its data using OpenMP to utilize the 12 cores. I'm relatively new to this and not sure about the best way to set it up.
We use PBS as the scheduler, and my original plan was to create a single MPI process on each node and let OpenMP spawn 12 threads per process.
#PBS -l nodes=20:ppn=1
But when I run this, OpenMP seems to only create 1 thread per process. How can I set this up so OpenMP will always create 12 threads per MPI process?
Edit: As soon as I specify more than 1 process per node in PBS, OpenMP starts using 6 threads per process; I can't figure out why requesting only 1 process per node isn't working.
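One likely cause (hedged; it depends on how the site's PBS is configured) is that `ppn=1` grants the job only one core per node, so OpenMP sizes its default thread team to the single core it can see. The usual hybrid pattern is to request all 12 cores per node, set `OMP_NUM_THREADS` explicitly, and limit the MPI launcher to one rank per node. A sketch of such a job script; the program name is a placeholder, and the one-rank-per-node flag varies by MPI implementation:

```shell
#!/bin/bash
#PBS -l nodes=20:ppn=12     # request all 12 cores on each of the 20 nodes

cd "$PBS_O_WORKDIR"

# Tell OpenMP explicitly how many threads to create per MPI rank,
# instead of letting it default to however many cores it can see.
export OMP_NUM_THREADS=12

# Launch one MPI rank per node. The flag name differs by MPI:
# Open MPI uses -npernode (or --map-by node), MPICH/Hydra uses -ppn.
mpirun -npernode 1 -np 20 ./my_hybrid_program
```

With this layout each of the 20 ranks owns a whole node and its 12-thread OpenMP team has 12 real cores to land on.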

Related

How can I assign multiple cores to every process in my OS?

I have a VM running CentOS 7, and I want every process in the system to be able to use more than one core in parallel.
My server has 24 CPUs, but no process can use more than 1 core, and that core then sits at 100%. I need to utilize the CPU better by spreading the load across the other cores.
How can I make a large process use multiple cores instead of just one?

Can threads of a process be made to run on different CPUs?

I'd like to know whether the threads of a process can be made to run on different sets of CPUs in Linux.
For instance, say we start a process with 30 threads; can the first 15 threads be pinned to cores 0-14 using the taskset program, and the rest to cores 15-29?
Is this configuration possible?
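It is possible: under Linux each thread has its own TID and its own affinity mask, so `taskset` can pin threads individually. A minimal sketch; `sleep` stands in for the real 30-thread program, and core 0 is used so the sketch runs on any machine (on the 30-core box in the question the ranges would be `0-14` and `15-29`):

```shell
# Launch a command pinned to a CPU set. On a 30-core box this would be
# "taskset -c 0-14 ./my_program"; core 0 keeps the sketch portable.
taskset -c 0 sleep 2 &
pid=$!

# Every Linux thread has its own TID, listed under /proc/<pid>/task/,
# so half the threads could be re-pinned with "taskset -p -c 15-29 <tid>".
# Here we just re-pin our lone task to demonstrate the mechanism:
for tid in /proc/"$pid"/task/*; do
    taskset -p -c 0 "$(basename "$tid")"
done

# A pinned command sees the restriction in its own affinity mask:
taskset -c 0 grep Cpus_allowed_list /proc/self/status
wait
```

Splitting 15/15 as described just means running the `taskset -p` loop over the first and second halves of the TID list with different `-c` ranges. Programs can also do this internally with `pthread_setaffinity_np`.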

How to run binary executables multi-threaded on an HPC cluster?

I have this tool called cgatools from Complete Genomics (http://cgatools.sourceforge.net/docs/1.8.0/). I need to run some genome analyses on a high-performance computing cluster. I tried to run the job allocating more than 50 cores and 250 GB of memory, but it only uses one core and stays under 2 GB of memory. What would be my best option in this case? Is there a way to run binary executables on an HPC cluster so that they use all the allocated resources?
The scheduler just runs the binary you provide on the first node allocated. The onus of splitting the job and running it in parallel is on the binary itself. Hence you see one core in use out of the fifty allocated.
Parallelising at the code level
You will need to make sure that the binary that you are submitting as a job to the cluster has some mechanism to understand the nodes that are allocated (interaction with the Job Scheduler) and a mechanism to utilize the allocated resources (MPI, PGAS etc.).
If it is parallelized, submitting the binary through a job submission script (through a wrapper like mpirun/mpiexec) should utilize all the allocated resources.
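For an already-parallelized binary, the wrapper picks the allocation up from the scheduler. A sketch for PBS (the binary name is a placeholder; many MPI builds read the nodefile automatically, but passing it explicitly is harmless):

```shell
#!/bin/bash
#PBS -l nodes=4:ppn=12

cd "$PBS_O_WORKDIR"

# PBS writes the allocated hosts, one line per slot, to $PBS_NODEFILE.
# Its line count is therefore the total number of MPI slots granted.
NP=$(wc -l < "$PBS_NODEFILE")

mpirun -np "$NP" -machinefile "$PBS_NODEFILE" ./my_mpi_binary
```

If the binary is MPI-parallel, this single command occupies all 48 allocated slots rather than just the first core of the first node.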
Running black box serial binaries in parallel
If not, the only other workload distribution mechanism available is data parallelism: use the cluster to feed multiple inputs to the same binary and run the resulting processes in parallel, effectively reducing the total time taken to solve the problem.
You can set the granularity based on the memory required for each run. For example, if each process needs 1 GB of memory, you can run 16 processes per node (assuming 16 cores and 16 GB of memory per node).
The parallel submission of multiple inputs on a single node can be done through the tool Parallel. You can then submit multiple jobs to the cluster, with each job requesting 1 node (exclusive access and the parallel tool) and working on different input elements respectively.
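A sketch of that single-node pattern with GNU parallel, using the 16-processes-per-node sizing from above (the binary and input naming are hypothetical, and GNU parallel must be available on the compute nodes):

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=16

cd "$PBS_O_WORKDIR"

# 16 GB of node memory / 1 GB per run = 16 concurrent runs, which also
# matches the 16 cores, so -j 16 saturates the node without swapping.
# {/.} is GNU parallel's "basename without extension" placeholder.
parallel -j 16 ./serial_binary {} -o results/{/.}.out ::: inputs/*.dat
```

Submitting several copies of this job, each pointed at a different slice of the inputs, spreads the work over as many nodes as the scheduler will grant.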
If you do not want to launch 'n' separate jobs, you can use mechanisms provided by the scheduler, such as blaunch, to specify dynamically the machine on which a task is to be run. You can parse the names of the machines allocated by the scheduler and use blaunch in a script to emulate the submission of n jobs from the first node.
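Where the scheduler offers no blaunch-style helper, the same effect can be emulated from the first node by parsing the allocation and launching over ssh. A sketch for PBS; it assumes passwordless ssh between compute nodes and hypothetical per-node input files:

```shell
#!/bin/bash
#PBS -l nodes=4:ppn=16

cd "$PBS_O_WORKDIR"

# $PBS_NODEFILE lists one line per slot; sort -u yields each node once.
# Pair each allocated node with one input and run them concurrently.
i=0
for host in $(sort -u "$PBS_NODEFILE"); do
    ssh "$host" "cd $PBS_O_WORKDIR && ./serial_binary input_$i.dat" &
    i=$((i + 1))
done
wait   # the job ends only when every remote run has finished
```

This is effectively a hand-rolled version of submitting n single-node jobs, so it only pays off when per-job submission overhead or queue limits are the bottleneck.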
Note: this class of applications is better off running on a cloud-like setup rather than a typical HPC system [effective utilization of the cluster at all available levels of parallelism (cluster, thread and SIMD) is a key part of HPC].

Multithreading on SLURM

I have a Perl script that forks using the Parallel::ForkManager module.
To my knowledge, if I fork 32 child processes and ask the SLURM scheduler for 4 nodes with 8 processors per node, the child processes will run one per core.
Someone in my lab said that if I run a job on multiple nodes that the other nodes are not used, and I'm wasting time and money. Is this accurate?
If I use a script that forks am I limited to one node with SLURM?
As far as I know Parallel::ForkManager doesn't make use of MPI, so if you're using mpirun I don't see how it's going to communicate across nodes. A simple test is to have each child output hostname.
One thing that commonly happens with non-MPI software launched with mpirun is that you duplicate all your effort across all nodes, so that they are all doing the exact same thing instead of sharing the work. If you use Parallel::MPI it should work just fine.
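Since fork() cannot cross node boundaries, the safe request for a Parallel::ForkManager script is all the cores of a single node. A sketch of the sbatch script (the script name is a placeholder):

```shell
#!/bin/bash
#SBATCH --nodes=1             # fork() stays on one machine
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32    # one CPU per forked child

# All 32 forked children inherit this node's CPU allocation; asking
# for 4 nodes of 8 CPUs would leave 3 nodes idle (or, under mpirun,
# duplicate the whole workload on each node).
perl forker.pl
```

The hostname test from the answer above is a quick sanity check: if every child prints the same hostname, the job is confined to one node as expected.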

Hybrid MPI / OpenMP

I've been trying to use Open MPI with OpenMP, and when I run 2 MPI processes with 4 threads each on one machine, all the threads execute on the same core at 25% usage each instead of on 4 separate cores. I was able to fix this by rebuilding Open MPI with --enable-mpi-threads, but now I'm having an issue because this is a dual-CPU machine.
There are 8 cores per processor and 2 processors in each server. If I run 2 MPI processes with 8 threads each, everything is fine as long as the 2 processes start on separate processors; but if I try 1 MPI process with 16 threads, it reverts to stacking every thread on one core.
Has anyone had any experience running OpenMPI and OpenMP together?
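A likely culprit is Open MPI's default process binding rather than the thread-support build flag (which governs MPI_THREAD_* safety levels, not placement). The placement described can usually be requested explicitly; this is a sketch in the Open MPI 1.8+ flag style, which varies between versions, with a hypothetical program name:

```shell
# 2 ranks, one per socket, each bound to 8 cores for its 8 threads:
export OMP_NUM_THREADS=8
mpirun -np 2 --map-by socket:PE=8 --bind-to core ./hybrid_app

# 1 rank spanning all 16 cores across both sockets:
export OMP_NUM_THREADS=16
mpirun -np 1 --map-by ppr:1:node:PE=16 --bind-to core ./hybrid_app

# --report-bindings prints each rank's core mask at startup, which
# makes the "every thread stacked on one core" failure easy to spot.
mpirun -np 2 --map-by socket:PE=8 --bind-to core --report-bindings ./hybrid_app
```

The PE=n modifier widens each rank's binding to n processing elements, so the OpenMP threads spawned inside the rank have n distinct cores to spread over instead of inheriting a single-core mask.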

Resources