Understanding output of lscpu - multithreading

You can see the output from lscpu command -
jack#042:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 2600.000
CPU max MHz: 2600.0000
CPU min MHz: 1200.0000
BogoMIPS: 5201.37
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
I can see that there are 2 sockets (which is like a processor??) and inside each socket we have 14 cores. So, in total 2x14=28 physical cores. Normally, a CPU can contain multiple cores, so the number of CPUs can never be smaller than the number of cores. But, as shown in the output, CPU(s): 56, and this is what is confusing me.
I can see that Thread(s) per core: 2, so these 28 cores can behave like 2x28=56 logical cores.
Question 1: What does this CPU(s): 56 denote? Does CPU(s) denote the number of virtual/logical cores, since it cannot be the number of physical cores?
Question 2: What does this NUMA node mean? Does it represent the socket?

(Copied at the OP’s request.)
“CPU(s): 56” represents the number of logical cores, which equals “Thread(s) per core” × “Core(s) per socket” × “Socket(s)”. One socket is one physical CPU package (which occupies one socket on the motherboard); each socket hosts a number of physical cores, and each core can run one or more threads. In your case, you have two sockets, each containing a 14-core Xeon E5-2690 v4 CPU, and since that CPU supports hyper-threading, each core can run two threads.
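If you want to check that arithmetic on the machine itself, the same numbers can be read back with standard tools (a minimal sketch; the awk parsing assumes the English lscpu field names shown above):
nproc --all                          # logical CPUs the kernel knows about (56 here)
lscpu | awk -F: '
  /^Socket\(s\)/          { s = $2 }
  /^Core\(s\) per socket/ { c = $2 }
  /^Thread\(s\) per core/ { t = $2 }
  END { print s * c * t }            # 2 x 14 x 2 = 56
'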
“NUMA node” represents the memory architecture; “NUMA” stands for “non-uniform memory access”. In your system, each socket is attached to certain DIMM slots, and each physical CPU package contains a memory controller which handles part of the total RAM. As a result, not all physical memory is equally accessible from all CPUs: one physical CPU can directly access the memory it controls, but has to go through the other physical CPU to access the rest of memory. In your system, logical cores 0–13 and 28–41 are in one NUMA node, the rest in the other. So yes, one NUMA node equals one socket, at least in typical multi-socket Xeon systems.
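To see which NUMA node each logical CPU belongs to, two quick checks (assuming util-linux's lscpu and the numactl package are installed):
lscpu -p=CPU,CORE,SOCKET,NODE | grep -v '^#'   # one line per logical CPU: cpu,core,socket,node
numactl --hardware                             # per-node CPU lists and local memory sizes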

NUMA stands for Non-Uniform Memory Access. The number of NUMA nodes has to do with performance when accessing memory, and it is not involved in calculating the number of CPUs you have.
The calculation of 56 CPUs you are getting is based on
CPUs = number of sockets x number of cores per socket x number of threads per core
Here, 2 threads per core indicate that hyper-threading is enabled.
So, you don't have 56 physical processors, but rather a combination of sockets, cores and hyper-threading. The bottom line is that you can run 56 threads in parallel. You can think of a socket as the equivalent of a physical processor.
-- edited based on the excellent comment by Margaret Bloom.

Threads per core: A hardware thread is a sufficient set of registers to represent the current state of one software thread. A core with two hardware threads can execute instructions on behalf of two different software threads without incurring the overhead of context switches between them. The amount of real parallelism it can achieve will vary depending on what the threads are doing and on the processor make and model. (A small pinning sketch follows this list.)
Cores per Socket: A core is what we traditionally think of as a processor or a CPU, and a socket is the interface between one or more cores and the memory system. A socket is also the physical connection between a chip or a multi-chip module and the main board. In addition to the cores, a chip/module typically has at least two levels of memory cache. Each core typically has its own L1 cache, and then all of the cores on the chip/module have to share (i.e., compete for) access to at least one higher-level cache and to the main memory.
Socket(s): see above. Big systems (e.g., rack servers) often have more than one. Personal computers, less often.
NUMA...: I can't tell you much about NUMA except to say that communication between threads running on different NUMA nodes works differently from, and is more expensive than, communication between threads running on the same node.
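As a concrete illustration of the hardware-thread point above (a hedged sketch, not a tuning recommendation; ./your_program is a placeholder), you can list which logical CPUs are hyper-thread siblings of the same physical core and then pin a run with taskset to compare the two cases:
lscpu -p=CPU,CORE | grep -v '^#'   # logical CPUs that share a CORE value are siblings on one physical core
taskset -c 0,28 ./your_program     # two sibling hardware threads of one core (0 and 28 on the machine above; confirm against the list)
taskset -c 0,1 ./your_program      # two separate physical cores instead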

Related

Parallel computing: how to share computing resources among users?

I am running a simulation on a Linux machine with the following specs.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 3099.902
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
This is the run command line script for my solver.
/path/to/meshfree/installation/folder/meshfree_run.sh # on 1 (serial) worker
/path/to/meshfree/installation/folder/meshfree_run.sh N # on N parallel MPI processes
I share the system with another colleague of mine. He uses 10 cores for his solution. What would be the fastest option for me in this case? Using 30 MPI processes?
I am a Mechanical Engineer with very little knowledge on parallel computing. So please excuse me if the question is too stupid.
Q : "What would be the fastest option for me in this case? ...running short on time. I am already in the middle of a simulation."
Salutes to Aachen. Were it not for the ex-post remark, the fastest option would be to pre-configure the computing eco-system so that you:
check the full details of your NUMA topology, using lstopo or lstopo-no-graphics -.ascii rather than lscpu;
launch your jobs with as many MPI worker processes as possible mapped onto physical CPU cores, ideally each one mapped exclusively onto its own private core, as these workers carry the core FEM / meshing processing workload;
if your FH policy does not forbid it, ask the system administrator to introduce CPU-affinity mapping (which protects your in-cache data from eviction and expensive re-fetches): for example, 10 CPUs mapped exclusively for your colleague's use, the said 30 CPUs mapped exclusively for your application runs, and the remaining ~40 CPUs left "shared" for use by both, via your respective CPU-affinity masks. A hedged binding example follows this list.
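For example (hedged: the exact flags depend on your Open MPI version, and ./your_solver stands in for however meshfree_run.sh ultimately launches the solver), Open MPI can do the core mapping itself and report the result:
lstopo-no-graphics                                                            # inspect the socket/core/NUMA layout first
mpirun --map-by core --bind-to core --report-bindings -np 30 ./your_solver   # one rank per physical core, with the binding printed
numactl --cpunodebind=0 --membind=0 mpirun -np 20 ./your_solver              # or keep a whole run on one NUMA node's CPUs and memory
Which specific cores you end up on still has to be coordinated with your colleague, e.g. via the CPU-affinity masks described above.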
Q : "Using 30 MPI processes?"
No, this is not a reasonable assumption for ASAP processing. Use as many CPUs for workers as possible: already MPI-parallelised FEM simulations have a high degree of parallelism and most often a by-nature "narrow" locality (be it represented as a sparse-matrix / N-band-matrix solver), so the parallel portion is usually very high compared to other numerical problems; Amdahl's Law explains why.
Sure, there might be academic objections that some slight difference is possible in cases where the communication overheads get slightly reduced with one fewer worker, yet the need for brute-force processing rules in FEM / meshed solvers: the communication costs (sending only a small amount of the neighbouring blocks' "boundary"-node state data) are typically far less expensive than the large-scale, FEM-segmented numerical computing part.
htop will show you the actual state (you may notice processes wandering from CPU core to CPU core due to HT / CPU-core thermal-balancing tricks, which decreases the resulting performance).
And do consult the meshfree Support for their Knowledge Base sources on Best Practices.
Next time, the best option would be to acquire a less restrictive computing infrastructure for processing critical workloads (given business-critical conditions, consider this a risk to smooth BAU, the more so if it impacts your business continuity).

How many threads can be run on 20 CPUs with 1 thread per CPU?

My Configuration:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 20
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Stepping: 0
CPU MHz: 2700.000
BogoMIPS: 5400.00
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-19
The above configuration is what my machine is running. How many threads, at most, can run on this configuration?
As many as you create.
To put it bluntly: your question doesn't really make sense. Probably, you are actually asking "what would be an ideal number of threads?"
But that question can only be answered when knowing context, in other words: details about the workload to run.
"More threads" aren't necessarily going to speed up things.
When your workload is CPU intensive, say loading large files from memory and en/decrypting them, then you want to restrict the number of software threads to the number the hardware supports (like 20 in your case).
If your workload is I/O intensive, and your software is waiting for I/O a lot, then you might easily use hundreds, if not thousands, of threads.
Long story short: hardware details alone aren't sufficient to determine the number of threads to use.
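For the CPU-bound case, the hardware limit referred to above can be read directly (a minimal check, not a sizing rule):
nproc                       # logical CPUs available to this process (20 here)
getconf _NPROCESSORS_ONLN   # the same count via getconf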

Multithreaded behavior changes from one machine to another

I have an application that uses multithreading: its main operation is divided into the same block of code executed on independent pieces of a data structure.
Consider it as a tree where each node executes an operation independently of the others, so I create a thread for each node's operation.
I tested the performance of this code on 2 machines, and the graph of execution time vs. number of threads is shown.
My question is: given the same code, why could such a difference happen? (Why does it saturate faster on one machine than on the other?)
Also, why does running the same code with 48 threads give worse results?
RED line machine specs:
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
NUMA node(s): 2
Blue Line machine specs :
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 1
Same core speed for both, and the same cache sizes.
Confirmed from the answer: I tried
numactl --cpunodebind=0 --membind=0 {exe}
to run on a single NUMA node and the results are consistent; it was a NUMA issue.
The machines are very different. One is NUMA, the other is not. Threads running on different NUMA nodes have vastly increased synchronization costs. Even the way memory is allocated matters a lot for performance.
Writing parallel code which scales well to large NUMA machines can be very hard. It's important to avoid unnecessary synchronization between threads and to allocate memory on the NUMA node where it is primarily used. It is also very costly if one cache line is frequently written by one or more threads and read from a different NUMA node. (This is what makes synchronization with regular concurrency primitives such as mutexes or read-write locks so expensive on NUMA machines.) Spinlocks can have very poor performance as well.
As a stop-gap measure, you might get better performance in the NUMA case if you pin the process to cores which are located on the same NUMA node.
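The numactl line quoted further up is exactly this stop-gap; to find the right node and CPU numbers on a given machine, a small hedged sketch (./your_app is a placeholder):
numactl --hardware                               # how many nodes there are, and which CPUs and memory belong to each
lscpu -p=CPU,NODE | grep -v '^#'                 # per-CPU NUMA node assignment
numactl --cpunodebind=0 --membind=0 ./your_app   # run the threads and their allocations on node 0 only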

Figuring out the number of processors for using openmpi

I have compiled weather-forecasting software with openmpi in double precision on Ubuntu 14.04 using the Intel ifort compiler. However, I am not able to figure out a few issues. I need to figure out the number of processors to pass to mpirun. This is the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Stepping: 3
CPU MHz: 800.000
BogoMIPS: 6784.93
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
This is the command that I am using to run my software:
mpirun -np 4 aaa
But when I do this I get these errors:
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1001.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
When I set np to 1 it runs successfully, but it does not use the CPU completely. CPU usage varies from 3% to 35%, while memory usage is almost 100%; the system freezes for about ten minutes and exits with the error message
forrtl: severe (41): insufficient virtual memory.
I have run WRF (the software associated with this question is not WRF) with multiple semaphores and I have not experienced any speed or memory issues.
I could recompile to single precision, but before I do that I want to figure out the number of cores (processors) to pass to mpirun.
Most Intel CPUs (including the one you are using) have a virtual execution unit that allows two simultaneous instruction streams, commonly called "hyper-threading". To the Linux kernel, each such extra stream appears as an additional CPU, so lscpu reports four logical CPUs (CPU(s): 4). Looking carefully at the rest of the output, you will see that there are, in fact, only two physical CPU cores:
Thread(s) per core: 2 <--- this is hyperthreading
Core(s) per socket: 2
Socket(s): 1
I don't generally recommend running multiple MPI processes on a single physical CPU core even if hyper-threading is available. It tends to be detrimental to performance and, in your case, can lead to crashes. Try mpiexec -np 2 aaa and see what happens. If it crashes again, there is something else wrong.
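If you want to derive the physical-core count without reading the fields by eye, a hedged one-liner that counts the distinct socket/core pairs in lscpu's parseable output:
lscpu -p=SOCKET,CORE | grep -v '^#' | sort -u | wc -l   # prints 2 on this machine: the number of ranks to pass to mpirun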
When I set np to 1 it runs successfully but does not use the CPU completely. CPU usage varies from 3% to 35% but the memory usage is almost 100% and the system freezes for about ten minutes and exits with the error message forrtl severe(41) insufficient virtual memory.
You may need to run a smaller problem size. This machine doesn't have enough physical memory to satisfy the requested allocations and is using virtual memory (essentially hard disk space) to try to fulfill them, but still running out. In any case, you don't want to be using virtual memory when running a simulation (it's ~1000x slower than main memory which is already slow).
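To confirm that a run really is exhausting RAM and spilling into swap, a quick check with standard tools while the solver is running:
free -h      # physical RAM and swap currently in use
vmstat 1     # watch the si/so (swap-in/swap-out) columns during the run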

CPU ordering in Linux (with hyper threading)

I'm curious what the CPU ordering is in Linux. Say I bind a thread to cpu0 and another to cpu1 on a hyper-threaded system: are they both going to be on the same physical core? Given a Core i7 920 with 4 cores and hyper-threading, the output of /proc/cpuinfo has me thinking that cpu0 and cpu1 are different physical cores, and cpu0 and cpu4 are on the same physical core.
Thanks.
The physical cpu/socket is listed as physical id.
The physical core is listed as core id.
A processor entry created by hyperthreading gets its own processor number, but shares its core id and physical id with another entry.
Note that each physical cpu (physical id) can have multiple cores (core id), which can further be broken up into additional logical cpus by hyperthreading. The logical cpus are overall ordered by processor id.
There's a detailed explanation with examples here: archive.richweb.com/cpu_info via web.archive.org
You can use likwid-topology -g to get a graphical topology of the CPU. It shows each CPU's physical cores along with their hyperthread siblings.
See the pointer provided in this link. The information is all in /proc/cpuinfo with regards to physical processors, cores, and hyperthreading, but you have to match info from multiple entries in that file to identify which ones group together.
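A hedged example of extracting that grouping from /proc/cpuinfo and sysfs (it assumes the usual x86 layout, where the physical id and core id lines follow each processor entry):
awk -F': *' '/^processor/{p=$2} /^physical id/{s=$2} /^core id/{print "cpu" p " -> socket " s ", core " $2}' /proc/cpuinfo
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # logical CPUs that share cpu0's physical core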
