MPI & pthreads: nodes with different numbers of cores - multithreading

Introduction
I want to write a hybrid MPI/pthreads code. My goal is to have one MPI process started on each node and have each of those processes split into multiple threads that will actually do the job, but with communication only happening between the separate MPI processes.
There are quite a few tutorials describing this situation, called hybrid programming, but they typically assume a homogeneous cluster. However, the one I am using has heterogeneous nodes: they have different processors and different numbers of cores, i.e. the nodes are a combination of 4/8/12/16 core machines.
I am aware that running an MPI process across this cluster will make my code slow down to the speed of the slowest CPU used; I accept that fact. With that I would like to get to my question.
Is there a way to start N MPI processes -- with one MPI process per node -- and let each know how many physical cores are available to it at that node?
The MPI implementation I have access to is OpenMPI. The nodes are a mix of Intel and AMD CPUs. I thought of using a machinefile with each node specified as having one slot, then figuring out the number of cores locally. However, there seem to be problems with doing that. I am surely not the first person with this problem, but somehow searching the web didn't point me in the right direction yet. Is there a standard way of solving this problem other than finding oneself a homogeneous cluster?

Launching one process only per node is very simple with Open MPI:
mpiexec -pernode ./mympiprogram
The -pernode argument is equivalent to -npernode 1 and it instructs the ORTE launcher to start one process per node present in the host list. This method has the advantage that it works regardless of how the actual host list is provided, i.e. works both when it comes from tight coupling with some resource manager (e.g. Torque/PBS, SGE, LSF, SLURM, etc.) and with manually provided hosts. It also works even if the host list contains nodes with multiple slots.
Knowing the number of cores is a bit tricky and very OS-specific. But Open MPI ships with the hwloc library which provides an abstract API to query the system components, including the number of cores:
hwloc_topology_t topology;
/* Allocate and initialize topology object. */
hwloc_topology_init(&topology);
/* Perform the topology detection. */
hwloc_topology_load(topology);
/* Get the number of cores */
unsigned nbcores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
/* Destroy topology object. */
hwloc_topology_destroy(topology);
If you want to make the number of cores across the cluster available to each MPI process in your job, a simple MPI_Allgather is what you need:
/* Obtain the number or MPI processes in the job */
int nranks;
MPI_Comm_size(MPI_COMM_WORLD, &nranks);
unsigned cores[nranks];
MPI_Allgather(&nbcores, 1, MPI_UNSIGNED,
cores, 1, MPI_UNSIGNED, MPI_COMM_WORLD);

Related

Will running a parallel process on twice the machines using half the cores result in a speed increase?

I'm designing very compute heavy algorithm, and I'm more constrained by time than access to remote machines on which to run the algorithm.
My question is the following:
Let's say each machine I have access to has 24 cores, and I have 48 tasks to run. Currently, I'm dispatching the algorithm to two machines, each of which use their 24 cores to handle 24 of the tasks.
If I instead dispatched the same process to 4 machines which each spawned 12 threads, would it (likely) result in the tasks being completed quicker? I'm curious if having some extra cores available on the machine means computations are performed faster than if every single core is occupied running an individual thread.
This is highly dependent of the actual algorithm, the actual dataset, the target hardware including the interconnection network if data communicate and the input/data data are heavy (or if the algorithm runs very quickly). Some applications scale better on many machines with few cores and some scale better on few machines with many cores. In high-performance computing researchers worked during decades so to understand the performance of hybrid applications and there is no clear answer to this: it depends (Note that the question is already quite hard to answer for a given well defined application with a well defined dataset so for people to write research papers on it).
If your tasks are memory-bound, then using more machine with less core is often better. If the amount of transferred data is big or the algorithm require a low-latency, then using fewer machines is often better (typically one big SMP). There are many others things to consider since machines are not just a bag of core. NUMA effect should be considered for example as well as caches, the storage device system and even the OS (not all subsystem scale on a given machine regarding the OS).

How to execute an application using a specific core or cores?

I'm writing an application that needs to be executed on a specific core of a processor.
For Example:
If we have 4 cores and i want to execute code on 2nd core only. I need help how to do this.
I'm writing an application that needs to be executed on a specific core of a processor.
This is extremely improbable on most platforms (since most multi-core processors are homogeneous). You really need to explain, motivate and justify such an usual requirement.
You can't do that in general. And if you could do that, how exactly you should proceed is operating system specific, and platform specific. Most multi-core processors are homogeneous (all the cores are the same), some are not.
On Linux/x86-64, the kernel scheduler sees all cores as the same, and will move a task (e.g. a thread of a multi-threaded process) from one core to another at arbitrary moments. Since scheduling is preemptive.
On some processors, moving periodically (e.g dozen of times per second) a task from one core to another is actually recommended (and done automagically by the kernel, or the firmware - e.g. SMM) to avoid overheating of that core. Read about dark silicon.
Some unusual hardware (e.g. ARM big.LITTLE) have two sets of different cores (e.g. 2 high-end ARM cores with 2 low-end ones, all sharing the same memory). If your platform is such, please state that in your question, and ask how to achieve processor affinity on your specific platform. Very likely your OS has appropriate system calls for that purpose.
Some high-end motherboards are multi-sockets. In such case, a RAM module is closer to one socket (in timing) than to another. You then care about non-uniform memory access.
So read more about processor affinity and non-uniform memory access. Most OSes have some support for both. On Linux, see pthread_setaffinity_np(3), sched_setaffinity(2), numa(7) etc...
To learn more about OSes, read Operating Systems: Three Easy pieces.
Notice that by pinning some thread to a some fixed core, you might lower the performance of your program. Since processor affinity is rarely useful.
The programmer can prescribe his/her own affinities (hard affinities) but
Rule of thumb: use the default scheduler unless a good reason not to.
here is a C/C++ function to assign a thread to a certain core
Kernel scheduler API
#include <sched.h>
int sched_setaffinity(pid_t pid, unsigned int len, unsigned long * mask);
sets the current affinity mask of process 'pid' to *mask
'len' is the system word size: sizeof(unsigned int long)
To query affinity of a running process:
[~]$ taskset -p 3935
pid 3945's current affinity mask: f

available threads in Knights Landing

I am programming on a Knights Landing node which has 68 cores and 4 hyperthreads/core. I am working on a hybrid MPI/OpenMP application.
My question is if the 4 hyperthreads are meant to be used as OpenMP
threads or how could I use them? When I run my program using the
following scheme:
export OMP_NUM_THREADS=1
mpirun -np 68 ./app
it runs much more faster than when I use the scheme:
export OMP_NUM_THREADS=4
mpirun -np 68 ./app
Maybe the problem is that the threads for a certain MPI are not close to
each other. However, I don't know how to do it.
In summary, can I use the 4 hyperthreads/core as OpenMP threads?
Thanks.
As you're probably using Intel MPI and OpenMP runtimes, allow me to forward you some links with valuable information for pinning MPI and OpenMP threads into processor cores/threads. Process/thread binding is a must nowadays to achieve high performance. Even though the OS tries to do its best, moving one process/thread from one core/thread to another location implies that the data needs to be transferred as well. For that matter, take a look at Running an MPI/OpenMP Program and Environment Variables for Process Pinning. For instance, if you run with 68 MPI ranks, then you probably start placing each MPI rank into a different core. You can double check if mpirun is honoring your requests by setting I_MPI_DEBUG environment variable (as described here).

Local and Global size influence on program execution - OpenCl

After reading a lot of definitions regarding global work size and local work size I still don't really understand what they are and how they work.
I think that global work size determine how many times kernel function will be called, but local work size?
I thought that local work size determine how many threads are gonna be used in the same time in parallel, but am I really correct?
Is local size a number of threads executing one kernel program per one global size value? I mean when we have global size = 1 and local size = 1, then kernel function will be called one time and only one thread will be working on it.
But when we have Global Size = 4096 and local size (if allowed that high) is 1024 then we have 4096 calls of kernel function and each call have 1024 threads working on it at the same time? Am I correct?
Here is some example code i found:
and my another question is: how local size change influence that code?
As i see it is clearly working on global_id's, no local one's so is local size change to bigger one than lets say 1 will influence time spent executing that algorithm?
And when we would have for loop in that algorithm, is it changing anything then regarding local size influence? Do we need to use local_id's to see any difference when changing local size?
I tested that on few of my programs, and even when I used only global_id's changing local work size gave me significantly shorter executing times.
So how does it work? I don't get it.
Thank you in advance!
I thought that local work size determine how many threads are gonna be
used in the same time in parallel, but am I really correct?
Correct but it is per compute unit, not whole device. If there are more compute units than local thread groups, then device is not fully used. When there are more thread groups than compute units but not exact multiple, some compute units wait for other at the end. When both values equal(or exact multiple), then "how many times" is important to fully occupy all ALUS.
For example a 8-core cpu could define 8 compute units(maybe +8 more with hardware multithreads). But a GPU with similar price can have 20 to 64 compute units. Then, even within a single compute unit, many groups of threads can be "in-flight" which is not explicitly tuned but changed by resource usage per thread and per compute unit and maybe per gpu.
how local size change influence that code? As i see it is clearly
working on global_id's, no local one's so is local size change to
bigger one than lets say 1 will influence time spent executing that
algorithm?
Vectorizable/parallelizable kernel codes could have advantage of distributing threads to ALUs, SIMDs of a core or wider SIMDs of a gpu compute unit. For a CPU, 8 scalar instructions could be issued at the same time. For a GPU, it could be as large as thousands. So when you decrease local size to 1, you limit width of parallel thread issue to 1 ALU which cripples performance for many architectures. When you make local size too big, resource per thread falls and performance takes a hit. If you don't have any idea, opencl api can tune local size for you if you give a null to its parameter.
And when we would have for loop in that algorithm, is it changing
anything then regarding local size influence? Do we need to use
local_id's to see any difference when changing local size?
For old and static scheduling architectures, loop unrolling is advised with a unroll step size equal to width of basic SIMD width. No, local id is just a query of a threads id in its compute unit so no need to query if you don't need it.
I tested that on few of my programs, and even when I used only
global_id's changing local work size gave me significantly shorter
executing times. So how does it work?
If kernel needs insane resources, you could think of 1 thread per local group. If kernel doesn't need any resource except immediate values, you should make it maximum local value. Resource allocation per thread(because of kernel codes) is important. New architectures have load balancing so it may not matter in future if you let api choose the optimum value.
To keep all ALUs busy, scheduler issues many threads per core, when one thread is waiting for memory operation, another thread can do ALU operation at the same time. This is good when resource usage is small. When you use %50 of all resources of a compute unit, it can have only 2 threads in flight. Threads share sharable resources such as L1 cache,local memory,register file.
Codes such as c[i]=a[i]+b[i] for scalar floats, are vectorizable. You can have better performance using float8,float16 and similar structs if compiler is not already doing it in background. This way it needs less threads to accomplish all work and also accesses to memory is faster. You can also add a loop in kernel to decrase number of threads even more, which is good for CPU since less thread dispatching is needed between 2 data blocks. For GPU, it may not matter.
Trivial example for a CPU:
4 core, local size = 10, global size = 100
core 1 and 2 have 3 thread groups each. Core 3 and 4 have only 2 thread groups.
1: 30 threads --> fully performant
2: 30 threads
3: 20 threads --> less performant, better preemption for other jobs
4: 20 threads
while instruction pipelining doesn't have much bubbles for cores 1 and 2, bubbles start after some time for cores 3 and 4 so they can be used for other jobs such as a second kernel running in parallel or operating system or some array copying. When you use all cores equally such as for 120 threads, then they finish more work per second but CPU cannot do array copies if kernels already using memory.(unless OS does preemption for other threads)

Matlabpool number of threads vs core

I have a laptop running Ubuntu on Intel(R) Core(TM) i5-2410M CPU # 2.30GHz. According to Intel website for the above processor (located here), this processor has two cores and can run 4 threads at a time in parallel (because although it has 2 physical cores it has 4 logical cores).
When I start matlabpool it starts with local configuration and says it has connected to 2 labs. I suppose this means that it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?
In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overwritten with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer thanks to a quick lesson from #RodyOldenhuis in the comments.
Hyper-threading implies a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously. The idea goes more like this: If one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores, therefore with one Matlab thread (ie one lab) running through it, a core may have very little "down-time" and hence there will be very little advantage to hyper-threading. My desktop is a core-i7 with 4 physical cores but 8 logical cores. However, I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower due to the start-up costs associated with initializing the extra labs.
Of course, this is probably all complicated by other external factors such as what other programs you might be running simultaneously to Matlab too.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.

Resources