HPC cluster: selecting the number of threads in SLURM sbatch when running Ansys Fluent - slurm

I'm running Fluent on an HPC cluster via Slurm and have a question: Fluent uses processors (threads) to solve. Currently I can only request a core count in Slurm, using --ntasks=64. Is there a way to request the number of threads instead?
Example: I have 2 nodes:
Node 1: 32 cores, 64 threads
Node 2: 32 cores, 32 threads
When I run with --ntasks=64, Slurm uses both nodes, so I lose 32 threads, and I cannot request more than 64 threads (I think Slurm interprets the 64 as cores).
Actually, when we ask for 64 threads, only Node 1 should be needed. Right?
The above is just an example; my HPC cluster actually has many nodes like that, and I want to request threads without naming specific nodes, because the nodes' configurations differ.
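One way to express this in Slurm is to request whole physical cores and tell Slurm to count hardware threads. The sketch below assumes your site built Slurm with hyper-threaded nodes configured (so --hint=multithread is honored); the Fluent arguments (3ddp, journal.jou) are placeholders, and the -t flag is Fluent's standard process-count option:

```shell
#!/bin/bash
#SBATCH --nodes=1                # stay on one node instead of spilling over
#SBATCH --ntasks=32              # one task per physical core
#SBATCH --cpus-per-task=2        # claim both hardware threads of each core
#SBATCH --hint=multithread       # schedule hardware threads, not just cores

# SLURM_CPUS_ON_NODE reports what was actually allocated, so the Fluent
# process count adapts to whichever node type the job lands on.
fluent 3ddp -g -t"$SLURM_CPUS_ON_NODE" -i journal.jou
```

On a node without hyper-threading, the same script simply gets 32 CPUs instead of 64, so you do not have to hard-code per-node thread counts.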

Related

Curious about how to specify the number of cores for MPI in order to get the fastest scientific computation

I have been running several scientific program packages in conjunction with MPI using the following command:
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the values of N and M depend on the physical CPU cores of my machine. For example, my machine has the following specification:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
Stepping: 7
In this case, I first tried N = 24 and M = 1, and the calculation ran very slowly. Then I changed N and M to 12 and 2, respectively, and found that the latter clearly gave the fastest computation.
I was wondering why setting N and M to 12 and 2 provides higher performance than the first case?
There is no absolute rule on how to run an MPI+OpenMP application.
The only advice is not to run an OpenMP process on more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today most systems are NUMA),
then just experiment.
Some apps run best in flat MPI (i.e. one thread per task), while others work best with one MPI task per socket and all available cores for OpenMP.
Last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library binds the MPI tasks as expected.
For example, if you run with 12 OpenMP threads but MPI binds each task to one core, you will end up time-sharing and performance will be horrible.
Or if you run with 12 OpenMP threads and the MPI task is bound to 12 cores, make sure the 12 cores are on the same socket (and not 6 on each socket).
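With Open MPI, for instance, the one-task-per-socket layout on the 2-socket, 12-core machine above could be written as follows (a sketch using Open MPI 1.8+ syntax; other MPI libraries use different binding flags, and program.exe is the placeholder from the question):

```shell
# One MPI task per socket, each bound to 6 cores on that socket,
# with 6 OpenMP threads filling those cores.
mpirun -np 2 --map-by socket:PE=6 --bind-to core \
       -x OMP_NUM_THREADS=6 program.exe < input > output

# --report-bindings prints the actual placement, so you can verify
# that no task's 6 cores straddle a socket boundary.
mpirun -np 2 --map-by socket:PE=6 --bind-to core --report-bindings true
```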
There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.
Indeed, there is no absolute rule on how to run an MPI+OpenMP application,
and I agree with everything Gilles said,
so I want to talk about the CPU in your case.
The specification you give shows that the system has hyper-threading enabled,
but this does not always help: your computer in fact has 12 physical cores.
So I advise you to try some combinations with M * N from 12 to 24,
like 12*1, 6*2, 6*3.
Which one is best depends on your application.
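A throwaway script along these lines makes that experiment mechanical (program.exe, input, and the mpirun invocation are taken from the question; /usr/bin/time writes the elapsed seconds to stderr):

```shell
# Time several N (MPI ranks) x M (OpenMP threads per rank) combinations.
for combo in "24 1" "12 1" "12 2" "6 2" "6 3" "4 6" "2 12"; do
    set -- $combo; N=$1; M=$2
    echo "Timing N=$N M=$M" >&2
    /usr/bin/time -f "%e seconds" \
        mpirun -np "$N" -x OMP_NUM_THREADS="$M" program.exe \
        < input > "output_${N}x${M}"
done
```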

How do I force MPI to not run on all cores if I have more threads than cores?

Context: I'm debugging a simulation code that requires that the number of MPI threads does not change when continuing the simulation from a restart file. This code was running on a large cluster, but I'm debugging it on a smaller local machine so that I don't have to wait to submit the job to a queue. The code requires 72 threads, which is more than the number of cores on the local machine. This is not a problem in itself - I can run with more threads than cores, and just take the performance hit, which is not a major issue when debugging.
The Problem: I want to leave some cores free for other tasks and other users. For instance, if my small local computer has 48 cores, I want to run my 72 threads on, say, 36 cores, and leave 12 cores free. I want to debug my large code locally without completely taking over the machine.
Assuming I'm willing to deal with the memory and performance issues of running on more threads than cores, how do I actually do this? Do I have to get into the back-end of the scheduler somehow? Does it depend on whether I'm using MPICH or Open-MPI etc?
I'm essentially looking for something like mpirun -np 72 --cpus-per-proc 0.5, if that were possible.
taskset -c 0-35 mpiexec -np 72 ./a.out should do the trick if the processes are all to be launched on the same host, and it should work with basically all MPI distributions (Open MPI, MPICH, Intel MPI, etc.). Also, make sure to disable any process binding by the MPI library, i.e. pass --bind-to none for Open MPI 1.8+, -bind-to none for MPICH with Hydra, or -genv I_MPI_PIN=0 for Intel MPI.
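This works because the affinity mask set by taskset is inherited by every child process, including the MPI ranks mpiexec spawns. You can check the mechanism without any MPI program at all (a generic Linux sketch, pinning to core 0 just for demonstration):

```shell
# Pin a child process to core 0 and read back its affinity mask;
# ranks launched under taskset inherit the mask the same way.
taskset -c 0 bash -c 'grep Cpus_allowed_list /proc/self/status'
# prints a line like: Cpus_allowed_list:  0
```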

MPI & pthreads: nodes with different numbers of cores

Introduction
I want to write a hybrid MPI/pthreads code. My goal is to have one MPI process started on each node and have each of those processes split into multiple threads that will actually do the job, but with communication only happening between the separate MPI processes.
There are quite a few tutorials describing this situation, called hybrid programming, but they typically assume a homogeneous cluster. However, the one I am using has heterogeneous nodes: they have different processors and different numbers of cores, i.e. the nodes are a combination of 4/8/12/16 core machines.
I am aware that running an MPI process across this cluster will make my code slow down to the speed of the slowest CPU used; I accept that fact. With that I would like to get to my question.
Is there a way to start N MPI processes -- with one MPI process per node -- and let each know how many physical cores are available to it at that node?
The MPI implementation I have access to is OpenMPI. The nodes are a mix of Intel and AMD CPUs. I thought of using a machinefile with each node specified as having one slot, then figuring out the number of cores locally. However, there seem to be problems with doing that. I am surely not the first person with this problem, but somehow searching the web didn't point me in the right direction yet. Is there a standard way of solving this problem other than finding oneself a homogeneous cluster?
Launching one process only per node is very simple with Open MPI:
mpiexec -pernode ./mympiprogram
The -pernode argument is equivalent to -npernode 1 and it instructs the ORTE launcher to start one process per node present in the host list. This method has the advantage that it works regardless of how the actual host list is provided, i.e. works both when it comes from tight coupling with some resource manager (e.g. Torque/PBS, SGE, LSF, SLURM, etc.) and with manually provided hosts. It also works even if the host list contains nodes with multiple slots.
Knowing the number of cores is a bit tricky and very OS-specific. But Open MPI ships with the hwloc library which provides an abstract API to query the system components, including the number of cores:
#include <hwloc.h>

hwloc_topology_t topology;
/* Allocate and initialize topology object. */
hwloc_topology_init(&topology);
/* Perform the topology detection. */
hwloc_topology_load(topology);
/* Get the number of cores */
unsigned nbcores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
/* Destroy topology object. */
hwloc_topology_destroy(topology);
If you want to make the number of cores across the cluster available to each MPI process in your job, a simple MPI_Allgather is what you need:
/* Obtain the number of MPI processes in the job. */
int nranks;
MPI_Comm_size(MPI_COMM_WORLD, &nranks);
unsigned cores[nranks];
MPI_Allgather(&nbcores, 1, MPI_UNSIGNED,
              cores, 1, MPI_UNSIGNED, MPI_COMM_WORLD);

Matlabpool number of threads vs core

I have a laptop running Ubuntu on an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz. According to the Intel website for this processor (located here), it can run 4 threads at a time in parallel (because although it has 2 physical cores, it has 4 logical cores).
When I start matlabpool it starts with local configuration and says it has connected to 2 labs. I suppose this means that it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?
In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overwritten with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer thanks to a quick lesson from @RodyOldenhuis in the comments.
Hyper-threading implies a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously. The idea goes more like this: If one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores, therefore with one Matlab thread (ie one lab) running through it, a core may have very little "down-time" and hence there will be very little advantage to hyper-threading. My desktop is a core-i7 with 4 physical cores but 8 logical cores. However, I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower due to the start-up costs associated with initializing the extra labs.
Of course, this is probably all complicated by other external factors such as what other programs you might be running simultaneously to Matlab too.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.

How to limit the number of cores used by the Erlang VM (BEAM)?

I'm running experiments on a node with 2 x Quad-Core Xeon E5520 2.2GHz, 24.0GB RAM, and Erlang R15B02 (SMP enabled). I wonder if I can limit the number of cores used by the Erlang VM so that I can temporarily disable some cores and increase the number step by step in order to test scalability.
I don't have root access on this node. So I'm expecting some method which is either by specifying parameters to erl or by Erlang code.
You can limit the number of cores Erlang uses via the +S option to erl, which allows you to set the number of scheduler kernel threads Erlang creates. See the erl man page for more details.
Note that Erlang linked-in port drivers and native implemented functions (NIFs) can both create their own threads and thus affect how many cores an Erlang process will use independently of the threads specified via the +S option, though none of the standard drivers or NIFs do this. Also the +A option to erl creates a pool of asynchronous threads for use by drivers that could also affect the number of cores used, and by default the async thread pool has 10 threads (it was empty by default prior to Erlang/OTP version R16B).
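As a quick sketch of the +S option in action (assuming erl is on the PATH), you can start the VM with a reduced scheduler count and confirm it from Erlang itself via erlang:system_info/1:

```shell
# Start BEAM with 4 scheduler threads (4 created, 4 online),
# print the number of online schedulers, then exit.
erl +S 4:4 -noshell \
    -eval 'io:format("~p~n", [erlang:system_info(schedulers_online)]), init:stop().'
# prints 4
```

For the scalability experiment described above, you would rerun your benchmark with +S 1:1, +S 2:2, and so on up to the number of logical cores.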