How to limit the number of cores used by the Erlang VM (BEAM)?

I'm running experiments on a node with 2 x Quad-Core Xeon E5520 2.2GHz, 24.0GB RAM, and Erlang R15B02 (SMP enabled). I wonder if I can limit the number of cores used by the Erlang VM so that I can temporarily disable some cores and increase the number step by step in order to test scalability.
I don't have root access on this node, so I'm looking for a method that works either by passing parameters to erl or from Erlang code.

You can limit the number of cores Erlang uses via the +S option to erl, which allows you to set the number of scheduler kernel threads Erlang creates. See the erl man page for more details.
Note that Erlang linked-in port drivers and native implemented functions (NIFs) can create their own threads, and those threads can use additional cores independently of the schedulers configured via +S; none of the standard drivers or NIFs do this, however. The +A option to erl also creates a pool of asynchronous threads for use by drivers, which can likewise affect the number of cores used. By default the async thread pool has 10 threads (prior to Erlang/OTP R16B it was empty by default).

Related

Is it possible to find out which NUMA system memory bank the current thread belongs to?

I'm writing a NUMA-aware algorithm and need this information for optimal memory placement. It would be nice if you know a solution for the JVM (for example using oshi), but I can't find one even for C/C++.
Threads are not bound to a given core by default (and therefore not to a NUMA node), so it does not make sense to ask for the NUMA node of a thread: it can migrate from one node to another at any time. If your threads are bound to cores, or at least to NUMA nodes (for example using taskset or a pthread call on Linux), then this is possible. AFAIK it cannot be done in Java, but it can in C, though certainly not portably.
On Linux, in C, you can get the current CPU of a running thread using sched_getcpu(). Note that AFAIK a "CPU" here does not mean a whole processor package but a core, or even a hardware thread in practice (this is what you see in /proc/cpuinfo, for example). You can then use libnuma to get the NUMA node information; more specifically, the numa_node_of_cpu function gives you the NUMA node of the target CPU.
The only portable C library I know of that can do this is hwloc (which uses libnuma internally on Linux).
AFAIK controlling the NUMA allocation policy from the JVM is not possible in general (especially if threads are not bound). If threads are bound (done manually) and the JVM performs local allocation (very likely but not guaranteed), then the default first-touch policy (which can be tuned with numactl on Linux and may differ on other platforms) should map the pages of the referenced data onto the NUMA node of the thread doing the first write.

available threads in Knights Landing

I am programming on a Knights Landing node which has 68 cores and 4 hyperthreads/core. I am working on a hybrid MPI/OpenMP application.
My question is whether the 4 hyperthreads are meant to be used as OpenMP threads, and how I could use them. When I run my program using the following scheme:
export OMP_NUM_THREADS=1
mpirun -np 68 ./app
it runs much faster than when I use the scheme:
export OMP_NUM_THREADS=4
mpirun -np 68 ./app
Maybe the problem is that the threads belonging to a given MPI rank are not placed close to each other, but I don't know how to control that.
In summary, can I use the 4 hyperthreads/core as OpenMP threads?
Thanks.
As you're probably using the Intel MPI and OpenMP runtimes, allow me to forward you some links with valuable information for pinning MPI ranks and OpenMP threads to processor cores/threads. Process/thread binding is a must nowadays to achieve high performance: even though the OS tries to do its best, moving a process/thread from one core/thread to another implies that its data has to be moved as well. For that matter, take a look at "Running an MPI/OpenMP Program" and "Environment Variables for Process Pinning" in the Intel MPI documentation. For instance, if you run with 68 MPI ranks, each rank is probably placed on a different core. You can double-check whether mpirun is honoring your requests by setting the I_MPI_DEBUG environment variable.

How do I force MPI to not run on all cores if I have more threads than cores?

Context: I'm debugging a simulation code that requires that the number of MPI threads does not change when continuing the simulation from a restart file. This code was running on a large cluster, but I'm debugging it on a smaller local machine so that I don't have to wait to submit the job to a queue. The code requires 72 threads, which is more than the number of cores on the local machine. This is not a problem in itself - I can run with more threads than cores, and just take the performance hit, which is not a major issue when debugging.
The Problem: I want to leave some cores free for other tasks and other users. For instance, if my small local computer has 48 cores, I want to run my 72 threads on, say, 36 cores, and leave 12 cores free. I want to debug my large code locally without completely taking over the machine.
Assuming I'm willing to deal with the memory and performance issues of running on more threads than cores, how do I actually do this? Do I have to get into the back-end of the scheduler somehow? Does it depend on whether I'm using MPICH or Open-MPI etc?
I'm essentially looking for something like mpirun -np 72 --cpus-per-proc 0.5, if that were possible.
taskset -c 0-35 mpiexec -np 72 ./a.out should do the trick if the processes are all to be launched on the same host, and it should work with basically all MPI distributions (Open MPI, MPICH, Intel MPI, etc.). Also, make sure to disable any process binding by the MPI library, i.e. pass --bind-to none for Open MPI 1.8+, -bind-to none for MPICH with Hydra, or -genv I_MPI_PIN=0 for Intel MPI.

Matlabpool number of threads vs core

I have a laptop running Ubuntu on an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz. According to the Intel website, this processor has two cores and can run 4 threads at a time in parallel (because although it has 2 physical cores, it has 4 logical cores).
When I start matlabpool it starts with local configuration and says it has connected to 2 labs. I suppose this means that it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?
In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overwritten with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer thanks to a quick lesson from @RodyOldenhuis in the comments.
Hyper-threading implies a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously. The idea goes more like this: If one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores, so with one Matlab thread (i.e. one lab) running through it, a core may have very little "down-time" and hence there will be very little advantage to hyper-threading. My desktop is a core-i7 with 4 physical cores but 8 logical cores, yet I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower due to the start-up costs of initializing the extra labs.
Of course, this is probably all complicated by other external factors such as what other programs you might be running simultaneously to Matlab too.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.

How can I change the default processor affinity in Linux?

I want to run a number of benchmarks on a multi-core system running Linux. I want to reserve one of the cores for my benchmarks. I know that I can use sched_setaffinity to limit my benchmarks to that core. How can I keep all other processes off my core? In other words, how can I set the default affinity of all processes to not include my core?
Even if you keep all the other processes off your "reserved for benchmarking" core, bear in mind that you can't stop them from consuming a variable and unpredictable proportion of the limited memory bandwidth to a multi-core chip, and that you can't stop them making variable demands on the shared L2 and L3 caches.
IMHO reproducible, scientific benchmarking needs a machine all to itself.