Multi threaded behavior changes from machine to other

Multi threaded behavior changes from machine to other - multithreading

I have an application that uses multithreading as its main operation is divided into same block of code executed on independent pieces of data structure.
consider it as a tree where each node executes an operation independently on others. so I create thread for each node's operation.
I tested the performance of this code on 2 machines and the execution time vs no of threads's graph is shown..
My question is ... given the same code . why such difference could happen ? (why it saturates fast on of the machine than the other )
also, running the same code for 48 machine gives worse results ?
RED line machine specs:
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
NUMA node(s): 2
Blue Line machine specs :
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 1
same core speed for both and same caches values.
Confirmed from the answer ::
tried
numactl --cpunodebind=0 --membind=0 {exe}
to run on single numa node and results are consistent.. it was numa issue

The machines are very different. One is NUMA, the other is not. Threads running on different NUMA nodes have vastly increased synchronization costs. Even the way memory is allocated matters a lot for performance.
Writing parallel code which scales well to large NUMA machines can be very hard. It's important to avoid unnecessary synchronization between threads and to allocate memory on the NUMA node where it is primarily used. It is also very costly if one cache line is frequently written by one or more threads and read from a different NUMA node. (This is what makes synchronization with regular concurrency primitives such as mutexes or read-write locks so expensive on NUMA machines.) Spinlocks can have very poor performance as well.
As a stop-gap measure, you might get better performance in the NUMA case if you pin the process to cores which are located on the same NUMA node.

Related

Does 100% use of some core impacts performance of a process ( C++, multithreaded ) which is running on different core in Linux?

In a 32 core system, a process(A) consumes 4 core fully (400% cpu usage in top). Rest of the cores are avialble. Does it impact the performance of another process(B)? Will process(B) run better if process(A) is not running , then why ?
Process(B) is using boost and multiple threds ( say 24).
I was expecting performance of Process-B is not impacted by Process-A as there are 32 cores.

In general, yes, running a process can slow down others even though not all cores are active. In practice, the impact is strongly dependent of the code being executed.
This can happen because some hardware resources are shared. The most common ones are storage devices, the network, the RAM, the LLC cache (typically a L3). For example, few cores are generally enough to saturate the RAM bandwidth so using more than 8 cores is generally not significantly faster if the two processes are memory bound. HDD storage devices tends to not be faster in parallel so when 2 processes try to massively use it at the same time they are often significantly slower. In practice, they can be more than 2 times slower because HDD have a high fetch time and a process doing many random accesses can drastically slow down a process reading/writing large contiguous files.
On NUMA systems, things can be a bit complex since 2 processes operating on the same NUMA node can be slower than 2 processes running on different NUMA node due to a saturation of the RAM of the target node and the NUMA allocation policy. In some rare case, 2 processes running on different NUMA nodes can be slower than running on the same NUMA node. This is true if the processes communicate each other (due to the higher latency between core belonging to different NUMA nodes) or if the processes communicate with hardware resources bound to specific NUMA nodes that is not the ones where the processes are running (eg. a GPU with a high-performance interconnect, a high-performance Infiniband device, etc.)
Note that some software resources can also be shared. The operating system can lock them so to ease the maintenance of some parts of its code or just because the resource cannot fundamentally be used in parallel in a way that can scale. Historically, some OS used a giant lock preventing nearly all system call to scale. Such lock has been progressively replaced with finer-grained locks or no lock at all (eg. atomics) due to the democratisation of the multi-core processors. Note that even atomic data structures do not scale very well on most processors so system calls operating on the same data structure tends to impact other running processes on many-core systems. Still, the biggest issue is generally the saturation of shared hardware resources.

Understanding output of lscpu

You can see the output from lscpu command -
jack#042:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 # 2.60GHz
Stepping: 1
CPU MHz: 2600.000
CPU max MHz: 2600.0000
CPU min MHz: 1200.0000
BogoMIPS: 5201.37
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
I can see that there are 2 sockets (which is like a processor ??) and inside each of the socket we have 14 cores. So, in total 2x14=28 physical cores. Normally, a CPU can contain multiple cores, so number of CPUs can never be smaller than number of Cores. But, as shown in the output CPUs(s): 56 and this is what is confusing me.
I can see that Thread(s) per core: 2, so these 28 cores can behave like 2x28=56 logical cores.
Question 1: What does this CPUs(s): 56 denote? Does CPU(s) denote number of Virtual/Logical cores, as it cannot be a Physical core atleast?
Question 2: What does this NUMA node mean? Does it represent the socket?

(Copied at the OP’s request.)
“CPU(s): 56” represents the number of logical cores, which equals “Thread(s) per core” × “Core(s) per socket” × “Socket(s)”. One socket is one physical CPU package (which occupies one socket on the motherboard); each socket hosts a number of physical cores, and each core can run one or more threads. In your case, you have two sockets, each containing a 14-core Xeon E5-2690 v4 CPU, and since that supports hyper-threading with two threads, each core can run two threads.
“NUMA node” represents the memory architecture; “NUMA” stands for “non-uniform memory architecture”. In your system, each socket is attached to certain DIMM slots, and each physical CPU package contains a memory controller which handles part of the total RAM. As a result, not all physical memory is equally accessible from all CPUs: one physical CPU can directly access the memory it controls, but has to go through the other physical CPU to access the rest of memory. In your system, logical cores 0–13 and 28–41 are in one NUMA node, the rest in the other. So yes, one NUMA node equals one socket, at least in typical multi-socket Xeon systems.

NUMA stands for Non-Uniform Memory Access. The value of NUMA nodes has to do with performance in terms of accessing the memory, and it's not involved in calculating the number of CPU's you have.
The calculation of 56 CPUs you are getting is based on
CPU's = number of sockets x number of cores per socket x number of threads per core
Here, 2 threads per core indicate that hyper-threading is enabled.
So, you don't have 56 physical processors, but rather a combination of sockets, cores and hyper-threading. The bottom line is that you can run 56 threads in parallel. You can think of sockets to be equivalent of a physical processor.
-- edited based on the excellent comment by Margaret Bloom.

Threads per core: A hardware thread is a sufficient set of registers to represent the current state of one software thread. A core with two hardware threads can execute instructions on behalf of two different software threads without incurring the overhead of context switches between them. The amount of real parallelism that it can achieve will vary depending on what the threads are doing and, on the processor make and model.
Cores per Socket: A core is what we traditionally think of as a processor or a CPU, and a socket is the interface between one or more cores and the memory system. A socket also is the physical connection between a chip or a multi-chip module and the main board. In addition to the cores, a chip/module typically will have at least two levels of memory cache. Each core typically will have its own L1 cache, and then all of the cores on the chip/module will have to share (i.e., compete for) access to at least one higher level cache and, to the main memory.
Socket(s): see above. Big systems (e.g., rack servers) often have more than one. Personal computers, less often.
NUMA...: I can't tell you much about NUMA except to say that communication between threads running on different NUMA nodes works differently from, and is more expensive than, communication between threads running on the same node.

c++ std::async : faster on 4 cores compared to 8 cores

I have 16000 jobs to perform.
Each job is independent. There is no shared memory, no interprocess communication, no lock or mutex.
I am on ubuntu 16.06. c++11. Intel® Core™ i7-8550U CPU # 1.80GHz × 8
I use std::async to split jobs between cores.
If I split the jobs into 8 (2000 per core), computation time is 145.
If I split the jobs into 4 (4000 per core), computation time is 60.
Output after reduce is the same in both case.
If I monitor the CPU during computation (just using htop), things happen as expected (8 cores are used at 100% in first case, only 4 cores are used 100% in second case).
I am very confused why 4 cores would process much faster than 8.

The i7-8550U has 4 cores and 8 threads.
What is the difference? Quoting How-To Geek:
Hyper-threading was Intel’s first attempt to bring parallel
computation to consumer PCs. It debuted on desktop CPUs with the
Pentium 4 HT back in 2002. The Pentium 4’s of the day featured just a
single CPU core, so it could really only perform one task at a
time—even if it was able to switch between tasks quickly enough that
it seemed like multitasking. Hyper-threading attempted to make up for
that.
A single physical CPU core with hyper-threading appears as two logical
CPUs to an operating system. The CPU is still a single CPU, so it’s a
little bit of a cheat. While the operating system sees two CPUs for
each core, the actual CPU hardware only has a single set of execution
resources for each core. The CPU pretends it has more cores than it
does, and it uses its own logic to speed up program execution. In
other words, the operating system is tricked into seeing two CPUs for
each actual CPU core.
Hyper-threading allows the two logical CPU cores to share physical
execution resources. This can speed things up somewhat—if one virtual
CPU is stalled and waiting, the other virtual CPU can borrow its
execution resources. Hyper-threading can help speed your system up,
but it’s nowhere near as good as having actual additional cores.
By splitting the jobs to more cores than available - you are paying a big penalty.

Curious about how to specify the number of core for MPI in order to get the fastest scientific computation

I have been running several scientific program package in conjunction with MPI by using the following command
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the value of N and M depend on the physical CPU cores of my machine. For example, my machine has the specification like this
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 # 2.40GHz
Stepping: 7
In this case, I first tried setting with N = 24 and M = 1, so the calculation ran very slowly. Then I changed N and M to 12 and 2 respectively. So I found that the latter had obviously provided me the fastest computation.
I was wondering that why did I set N & M are 12 and 2 provide more performance higher than the first case ?

there is no absolute rule on how to run MPI+OpenMP application.
the only advice is not to run an OpenMP process on more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today, most systems are NUMA)
then just experiment.
some apps run best in flat MPI (e.g. one thread per task), while some other work best with one MPI task per socket, and all available cores for OpenMP.
last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library bound the MPI tasks as expected.
for example, if you run with 12 OpenMP threads but MPI bind tasks to one core, you will end up doing time sharing and performance will be horrible.
or if you run with 12 OpenMP threads, and MPI task was bound to 12 cores, make sure the 12 cores are on the same socket (and not 6 on each socket)

There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.

indeeed there is no absolute rule on how to run MPI+OpenMP application.
I agree with all Gilles said.
so I want to talk about the CPU in your case.
in the specification you give, it shows the system enables hyper-thread.
but this not always helps. your computer has 12 physical cores in fact.
so I advice you try some combinations that make M * N = 12 to 24,
like 12*1, 6*2, 6*3
which one is best, depends on how well your application.

Figuring out the number of processors for using openmpi

I have compiled a weather forecasting software with openmpi in double precision on Ubuntu 14.04 and Intel ifort compiler. However I am not able to figure out few issues. I need to figure out the number of processors I need to send to mpirun. This is the output of lscpu
x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Stepping: 3
CPU MHz: 800.000
BogoMIPS: 6784.93
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3output of lscpu
This is the command that I am using to run my software
mpirun -np 4 aaa. But When I do this I get these errors -
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1001.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
When I set np to 1 it runs successfully but does not use the CPU completely. CPU usage varies from 3% to 35% but the memory usage is almost 100% and the system freezes for about ten minutes and exits with the error message
forrtl severe(41) insufficient virtual memory.
I have run WRF (the software associated with this question is not WRF) with multiple semaphores and I have not experienced any speed or memory issues.
I could recompile to single precision but before I do that I want to be able to figure out the number of cores(processors) to be sent to mpirun.

Most Intel CPUs (including the one you are using) have a virtual execution unit that allows two simultaneous instruction streams commonly called "hyperthreading." To the Linux kernel, this appears as an extra CPU core. Hence, lscpu tells you there are four CPU cores (CPU(s): 4). Looking carefully at the rest of the output, you will see that there are, in fact, only two CPU cores:
Thread(s) per core: 2 <--- this is hyperthreading
Core(s) per socket: 2
Socket(s): 1
I don't generally recommend running multiple MPI processes on a single physical CPU core even if there is hyperthreading. It tends to lead to detrimental performance, and, in your case, sometimes crashes. Try using mpiexec -np 2 aaa and see what happens. If it crashes again, there is something else wrong.
When I set np to 1 it runs successfully but does not use the CPU completely. CPU usage varies from 3% to 35% but the memory usage is almost 100% and the system freezes for about ten minutes and exits with the error message forrtl severe(41) insufficient virtual memory.
You may need to run a smaller problem size. This machine doesn't have enough physical memory to satisfy the requested allocations and is using virtual memory (essentially hard disk space) to try to fulfill them, but still running out. In any case, you don't want to be using virtual memory when running a simulation (it's ~1000x slower than main memory which is already slow).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string