Impossible CPU core/threads configuration - slurm

I am trying to set up a test instance of slurmd and seemingly cannot get it to accept my CPU.
Here's the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i9-12900K
CPU family: 6
Model: 151
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 2
CPU max MHz: 6700.0000
CPU min MHz: 800.0000
BogoMIPS: 6374.40
Flags: ...
Now the obvious choice of config IMHO would be:
Sockets=1 CoresPerSocket=16 ThreadsPerCore=2
but
slurmd: error: Thread count (24) not multiple of core count (16)
Also, setting CPUs=16 doesn't work either:
slurmd: error: Thread count (24) not multiple of core count (16)
slurmd: error: Node configuration differs from hardware: CPUs=16:24(hw) Boards=1:1(hw) SocketsPerBoard=1:1(hw) CoresPerSocket=16:16(hw) ThreadsPerCore=2:1(hw)
Setting ThreadsPerCore=1 doesn't change anything.
I think I tried every possible combination of settings, and it always fails due to a mismatch between the detected numbers and the expected multiplication results.
So,
How do I make slurm believe that my CPU actually exists and preferably even start?
Why do these config settings exist at all when the daemon seemingly only trusts its own hardware detection? Do they have any effect?

How do I make slurm believe that my CPU actually exists and preferably even start?
You can set CPUs=24 in the node definition line and set the config_overrides parameter.
From the man page: "If set, consider the configuration of each node to be that specified in the slurm.conf configuration file"
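A minimal sketch of the relevant slurm.conf lines (in recent Slurm versions config_overrides is an option of SlurmdParameters; the node name here is a placeholder):
SlurmdParameters=config_overrides
NodeName=testnode CPUs=24 State=UNKNOWN
With that in place, slurmd should take the node's CPU count from slurm.conf instead of refusing to start on the mismatch.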
Why do these config settings exist at all when the daemon seemingly only trusts its own hardware detection? Do they have any effect?
I think this acts as a sanity check on the one hand; on the other hand, it allows the Slurm controller to know the node configuration in advance, which it needs in order to build its internal data structures at startup.

Related

Parallel computing: how to share computing resources among users?

I am running a simulation on a Linux machine with the following specs.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 3099.902
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
This is how the run script for my solver is invoked:
/path/to/meshfree/installation/folder/meshfree_run.sh # on 1 (serial) worker
/path/to/meshfree/installation/folder/meshfree_run.sh N # on N parallel MPI processes
I share the system with another colleague of mine. He uses 10 cores for his solution. What would be the fastest option for me in this case? Using 30 MPI processes?
I am a Mechanical Engineer with very little knowledge of parallel computing, so please excuse me if the question is too stupid.
Q : "What would be the fastest option for me in this case? ...running short on time. I am already in the middle of a simulation."
Salutes to Aachen. If it were not for the ex-post remark, the fastest option would be to pre-configure the computing eco-system as follows:
check the full details of your NUMA topology using lstopo, or lstopo-no-graphics -.ascii, rather than lscpu
initiate your jobs with as many MPI-worker processes as possible mapped onto physical CPU cores (ideally each one mapped exclusively onto its own private core), as these carry the core FEM / meshing processing workload
if your FH policy does not forbid it, you may ask the system administrator to introduce CPU-affinity mapping (which will protect your in-cache data from eviction and expensive re-fetches): 10 CPUs mapped exclusively for your colleague, the said 30 CPUs mapped exclusively for your application runs, and the remaining 40 CPUs "shared" for use by both, via your respective CPU-affinity masks (a sketch follows below).
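A sketch of what that could look like in practice, assuming the hwloc tools are installed and that the run script inherits the affinity mask of the shell that launches it; the core list 10-39 is only a placeholder, pick the real IDs from the lstopo output:
lstopo-no-graphics -.ascii
taskset -c 10-39 /path/to/meshfree/installation/folder/meshfree_run.sh 30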
Q : "Using 30 MPI processes?"
No, this is not the fastest option for ASAP processing. Use as many CPUs for workers as possible for an already MPI-parallelised FEM simulation: such solvers have a high degree of parallelism and most often a by-nature "narrow" locality (be it represented as a sparse-matrix / N-band-matrix solver), so the parallel portion is often very high compared to other numerical problems. Amdahl's Law explains why.
Sure, there may be academic objections that a slight difference is possible in cases where the communication overheads get slightly reduced with one fewer worker, yet the need for brute-force processing rules in FEM / meshed solvers: communication costs are typically far less expensive than the large-scale, FEM-segmented numerical computing part, since only a small amount of the neighbouring blocks' "boundary"-node state data is sent.
htop will show you the actual state (you may notice processes wandering between CPU cores, due to HT / CPU-core thermal-balancing tricks, which decreases the resulting performance).
And do consult the meshfree Support for their Knowledge Base sources on Best Practices.
Next time, the best option would be to acquire a less restrictive computing infrastructure for processing critical workloads (under business-critical conditions, consider this a risk to smooth BAU, all the more if it impacts your business continuity).

Understanding output of lscpu

Here is the output of the lscpu command:
jack@042:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 2600.000
CPU max MHz: 2600.0000
CPU min MHz: 1200.0000
BogoMIPS: 5201.37
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
I can see that there are 2 sockets (which is like a processor??) and inside each socket we have 14 cores, so in total 2x14=28 physical cores. Normally a CPU can contain multiple cores, so the number of CPUs can never be greater than the number of cores. But the output shows CPU(s): 56, and this is what is confusing me.
I can see that Thread(s) per core: 2, so these 28 cores can behave like 2x28=56 logical cores.
Question 1: What does CPU(s): 56 denote? Does CPU(s) denote the number of virtual/logical cores, since it cannot be the number of physical cores at least?
Question 2: What does this NUMA node mean? Does it represent the socket?
“CPU(s): 56” represents the number of logical cores, which equals “Thread(s) per core” × “Core(s) per socket” × “Socket(s)”. One socket is one physical CPU package (which occupies one socket on the motherboard); each socket hosts a number of physical cores, and each core can run one or more threads. In your case, you have two sockets, each containing a 14-core Xeon E5-2690 v4 CPU, and since that CPU supports hyper-threading, each core can run two threads.
“NUMA node” describes the memory architecture; “NUMA” stands for “non-uniform memory access”. In your system, each socket is attached to certain DIMM slots, and each physical CPU package contains a memory controller which handles part of the total RAM. As a result, not all physical memory is equally accessible from all CPUs: one physical CPU can directly access the memory it controls, but has to go through the other physical CPU to access the rest of memory. In your system, logical cores 0–13 and 28–41 are in one NUMA node, the rest in the other. So yes, one NUMA node equals one socket, at least in typical multi-socket Xeon systems.
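If numactl is installed, it can show the node-to-CPU mapping and the relative access distances directly; a quick check:
numactl --hardware
On this machine it should list two nodes, each with 28 logical CPUs and its own share of the RAM.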
NUMA stands for Non-Uniform Memory Access. The value of NUMA nodes has to do with performance in terms of accessing the memory, and it's not involved in calculating the number of CPU's you have.
The calculation of 56 CPUs you are getting is based on
CPUs = number of sockets x number of cores per socket x number of threads per core
Here, 2 threads per core indicates that hyper-threading is enabled.
So, you don't have 56 physical processors, but rather a combination of sockets, cores and hyper-threading. The bottom line is that you can run 56 threads in parallel. You can think of a socket as the equivalent of a physical processor.
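Plugging in the numbers from this lscpu output as a check: 2 sockets x 14 cores per socket x 2 threads per core = 56, matching the reported CPU(s): 56.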
-- edited based on the excellent comment by Margaret Bloom.
Threads per core: A hardware thread is a sufficient set of registers to represent the current state of one software thread. A core with two hardware threads can execute instructions on behalf of two different software threads without incurring the overhead of context switches between them. The amount of real parallelism it can achieve will vary depending on what the threads are doing and on the processor make and model.
Cores per Socket: A core is what we traditionally think of as a processor or a CPU, and a socket is the interface between one or more cores and the memory system. A socket is also the physical connection between a chip or a multi-chip module and the main board. In addition to the cores, a chip/module typically will have at least two levels of memory cache. Each core typically will have its own L1 cache, and then all of the cores on the chip/module will have to share (i.e., compete for) access to at least one higher-level cache and to the main memory.
Socket(s): see above. Big systems (e.g., rack servers) often have more than one. Personal computers, less often.
NUMA...: I can't tell you much about NUMA except to say that communication between threads running on different NUMA nodes works differently from, and is more expensive than, communication between threads running on the same node.

Curious about how to specify the number of cores for MPI in order to get the fastest scientific computation

I have been running several scientific program packages in conjunction with MPI using the following command
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the values of N and M depend on the physical CPU cores of my machine. For example, my machine has the following specification:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
Stepping: 7
In this case, I first tried setting N = 24 and M = 1, and the calculation ran very slowly. Then I changed N and M to 12 and 2 respectively, and found that this clearly gave the fastest computation.
Why does setting N and M to 12 and 2 give higher performance than the first case?
there is no absolute rule on how to run an MPI+OpenMP application.
the only advice is not to run an OpenMP process on more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today, most systems are NUMA)
then just experiment.
some apps run best in flat MPI (i.e. one thread per task), while some others work best with one MPI task per socket and all available cores used for OpenMP.
last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library binds the MPI tasks as expected.
for example, if you run with 12 OpenMP threads but MPI binds each task to one core, you will end up time-sharing that core and performance will be horrible.
or if you run with 12 OpenMP threads and each MPI task is bound to 12 cores, make sure those 12 cores are on the same socket (and not 6 on each socket).
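As a concrete sketch with Open MPI on the machine above (2 sockets of 6 physical cores), one MPI task per socket with 6 OpenMP threads each, and each task bound to its socket so its threads stay there:
mpirun -np 2 --map-by socket --bind-to socket -x OMP_NUM_THREADS=6 program.exe < input > output
The exact binding options differ between MPI implementations, so check the man page of your mpirun.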
There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.
indeed there is no absolute rule on how to run an MPI+OpenMP application.
I agree with all Gilles said.
so I want to talk about the CPU in your case.
the specification you give shows that the system has hyper-threading enabled.
but this does not always help. your computer has 12 physical cores in fact.
so I advise you to try some combinations that make M * N between 12 and 24,
like 12*1, 6*2, 6*3
which one is best depends on your application.

Figuring out the number of processors for using openmpi

I have compiled a weather forecasting package with openmpi in double precision on Ubuntu 14.04 with the Intel ifort compiler. However, I am not able to figure out a few issues. I need to figure out the number of processors to pass to mpirun. This is the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Stepping: 3
CPU MHz: 800.000
BogoMIPS: 6784.93
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
This is the command that I am using to run my software
mpirun -np 4 aaa
But when I do this, I get these errors:
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1001.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
When I set np to 1 it runs successfully but does not use the CPU completely. CPU usage varies from 3% to 35%, but memory usage is almost 100%, and the system freezes for about ten minutes before exiting with the error message
forrtl: severe (41): insufficient virtual memory.
I have run WRF (the software associated with this question is not WRF) with multiple semaphores and I have not experienced any speed or memory issues.
I could recompile to single precision but before I do that I want to be able to figure out the number of cores(processors) to be sent to mpirun.
Most Intel CPUs (including the one you are using) have a virtual execution unit that allows two simultaneous instruction streams commonly called "hyperthreading." To the Linux kernel, this appears as an extra CPU core. Hence, lscpu tells you there are four CPU cores (CPU(s): 4). Looking carefully at the rest of the output, you will see that there are, in fact, only two CPU cores:
Thread(s) per core: 2 <--- this is hyperthreading
Core(s) per socket: 2
Socket(s): 1
I don't generally recommend running multiple MPI processes on a single physical CPU core even if there is hyperthreading. It tends to lead to detrimental performance, and, in your case, sometimes crashes. Try using mpiexec -np 2 aaa and see what happens. If it crashes again, there is something else wrong.
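One way to sanity-check how many physical cores you actually have (unique core/socket pairs) before picking -np; on this machine it should print 2:
lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l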
When I set np to 1 it runs successfully but does not use the CPU completely. CPU usage varies from 3% to 35%, but memory usage is almost 100%, and the system freezes for about ten minutes before exiting with the error message forrtl: severe (41): insufficient virtual memory.
You may need to run a smaller problem size. This machine doesn't have enough physical memory to satisfy the requested allocations and is using virtual memory (essentially hard disk space) to try to fulfill them, but still running out. In any case, you don't want to be using virtual memory when running a simulation (it's ~1000x slower than main memory which is already slow).

_mm_pause usage in gcc on Intel

I have referred to this web page:
https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops ; the following I cannot understand:
the pause instruction gives a hint to the processor that the calling thread is in a "spin-wait" loop. In addition, the pause instruction is a no-op when used on x86 architectures that do not support Intel SSE2, meaning it will still execute without doing anything or raising a fault. While this means older x86 architectures that don’t support Intel SSE2 won’t see the benefits of the pause, it also means that you can keep one straightforward code path that works across the board.
I would like to know: lscpu in Linux shows CPU information, but I have no idea whether the CPU I have supports SSE2 or not. How can I check it myself?
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
Stepping: 2
CPU MHz: 3599.882
BogoMIPS: 6804.22
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
Also, currently I use _mm_pause or __asm volatile ("pause" ::: "memory");
this drives the idle time of that core down to zero, but the following code using nanosleep is too slow for me:
while (1) {
    nanosleep(&req, NULL);   /* see the timespec setup in the Edit below */
    dosomething..... ;
}
I observe that nanosleep is delayed by about 60 microseconds on my box. Is there any solution that is faster than nanosleep but also does not exhaust a CPU core the way _mm_pause() or __asm volatile ("pause" ::: "memory") does?
Edit :
#include <time.h>

struct timespec req = {0};
req.tv_sec  = 0;
req.tv_nsec = 100;        /* request a 100 ns sleep */
nanosleep(&req, NULL);
This nanosleep costs about 60 microseconds on the box whose CPU is shown above; I have no idea how that happens.
To check if your platform supports SSE2
gcc -march=native -dM -E - </dev/null | grep SSE
But you don't need to check for support: The pause instruction safely decodes as a NOP on CPUs that don't recognize it as pause. (The encoding is basically rep nop). It's unlikely that a nop instead of a 5 or 100 cycle pause in the pipeline could be a correctness problem for your code.
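Alternatively, you can read the kernel's view of the CPU flags directly; this should print sse2 once if the CPU supports it:
grep -m1 -o -w sse2 /proc/cpuinfo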
_mm_pause won't release the CPU to the scheduler; as you mentioned, it's designed for another purpose, namely as a hint to microarchitecture components.
nanosleep, if used correctly, should give you finer control than ~60 µs (you might need to change the scheduler to RT). I suggest you check your code to see whether the arguments are set correctly, etc.
--Edit--
The accuracy of the nanosleep function depends on the kernel, and its behavior for short sleeps in glibc is just a busy loop (see reference). It's also impossible to yield to the scheduler for an interval (say, a few nanoseconds) that is shorter than a scheduler tick (determined by CONFIG_HZ, which is normally 250, 1000, etc.), since the scheduler only context-switches when the timer fires.
Also, just idling the CPU for a few nanoseconds won't actually save power. CPU power is saved either by C-states or P-states: P-states use frequency scaling, while C-states shut down components of the CPU. Although there is a halt instruction that can trigger such a state transition, it takes time to do so (latency in the µs range), which makes it expensive.
Reference:
http://tldp.org/HOWTO/IO-Port-Programming-4.html
http://ena-hpc.org/2014/pdf/paper_06.pdf
I think an easy solution (faster than nanosleep) is to use multiple pause instructions.
Also, please note that:
It is important to note that the number of cycles delayed by the pause instruction may vary from one processor family to another. You should avoid using multiple pause instructions, assuming you will introduce a delay of a specific cycle count.
Mentioned in Benefitting Power and Performance Sleep Loops
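As a sketch of that idea in C (the flag, the burst length of 8, and the memory ordering are placeholder choices to tune for your own code, keeping in mind the warning above about not relying on a specific per-pause cycle count):
#include <immintrin.h>   /* _mm_pause */
#include <stdatomic.h>

static void spin_wait(atomic_int *flag)
{
    /* burst a few pause hints between checks instead of sleeping */
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        for (int i = 0; i < 8; ++i)
            _mm_pause();
    }
}
This reacts far faster than nanosleep's ~60 µs, but the core is still occupied; the pauses only reduce how aggressively it spins.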
