I'm curious what the CPU ordering is in Linux. Say I bind a thread to cpu0 and another to cpu1 on a hyperthreaded system: will they both end up on the same physical core? Given a Core i7 920 with 4 cores and hyperthreading, the output of /proc/cpuinfo has me thinking that cpu0 and cpu1 are different physical cores, and that cpu0 and cpu4 are on the same physical core.
The physical cpu/socket is listed as physical id.
The physical core is listed as core id.
A logical processor created by hyperthreading gets its own processor entry, but shares its core id and physical id with another entry.
Note that each physical cpu (physical id) can have multiple cores (core id), which can further be broken up into additional logical cpus by hyperthreading. The logical cpus are overall ordered by processor id.
There's a detailed explanation with examples here: archive.richweb.com/cpu_info via web.archive.org
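If you want to verify the mapping yourself, a minimal check (the exact set of fields in /proc/cpuinfo varies a bit by kernel version) is to print just the three relevant fields and see which processor numbers report the same physical id and core id:
$ grep -E '^(processor|physical id|core id)' /proc/cpuinfo
On the i7 920 described above, processor 0 and processor 4 should report the same physical id and core id, confirming they are hyperthread siblings. To pin a thread to a particular logical CPU you can launch it with, for example, taskset -c 0 ./your_program (your_program being a placeholder for your own binary).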
You can use likwid-topology -g to get a graphical (ASCII-art) view of the CPU topology. It shows each physical core along with its hyperthread siblings.
See the pointer provided in this link. The information is all in /proc/cpuinfo regarding physical processors, cores, and hyperthreading, but you have to match information from multiple entries in that file to identify which ones group together.
Here is the output of the lscpu command:
jack@042:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 2600.000
CPU max MHz: 2600.0000
CPU min MHz: 1200.0000
BogoMIPS: 5201.37
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
I can see that there are 2 sockets (which is like a processor??), and inside each socket we have 14 cores, so in total 2x14=28 physical cores. Normally a CPU can contain multiple cores, so the number of CPUs can never be smaller than the number of cores. But the output shows CPU(s): 56, and this is what is confusing me.
I can see that Thread(s) per core: 2, so these 28 cores can behave like 2x28=56 logical cores.
Question 1: What does CPU(s): 56 denote? Does CPU(s) denote the number of virtual/logical cores, as it can't be the number of physical cores, at least?
Question 2: What does this NUMA node mean? Does it represent the socket?
(Copied at the OP’s request.)
“CPU(s): 56” represents the number of logical cores, which equals “Thread(s) per core” × “Core(s) per socket” × “Socket(s)”. One socket is one physical CPU package (which occupies one socket on the motherboard); each socket hosts a number of physical cores, and each core can run one or more threads. In your case, you have two sockets, each containing a 14-core Xeon E5-2690 v4 CPU, and since that supports hyper-threading with two threads, each core can run two threads.
“NUMA node” represents the memory architecture; “NUMA” stands for “non-uniform memory architecture”. In your system, each socket is attached to certain DIMM slots, and each physical CPU package contains a memory controller which handles part of the total RAM. As a result, not all physical memory is equally accessible from all CPUs: one physical CPU can directly access the memory it controls, but has to go through the other physical CPU to access the rest of memory. In your system, logical cores 0–13 and 28–41 are in one NUMA node, the rest in the other. So yes, one NUMA node equals one socket, at least in typical multi-socket Xeon systems.
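As a quick cross-check (a sketch; column names can differ slightly between util-linux versions), lscpu can also print one row per logical CPU together with the core, socket and NUMA node it belongs to, which shows both of the above at once:
$ lscpu -e=CPU,CORE,SOCKET,NODE
On the machine above this should print 56 rows, with each of the 28 values in the CORE column appearing twice (the two hyperthreads of each core) and the NODE column splitting the CPUs across the two NUMA nodes exactly as in the NUMA node0/node1 lines of the plain lscpu output.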
NUMA stands for Non-Uniform Memory Access. The value of NUMA nodes has to do with performance in terms of accessing the memory, and it's not involved in calculating the number of CPU's you have.
The calculation of 56 CPUs you are getting is based on
CPUs = number of sockets x number of cores per socket x number of threads per core
Here, 2 threads per core indicate that hyper-threading is enabled.
So, you don't have 56 physical processors, but rather a combination of sockets, cores and hyper-threading. The bottom line is that you can run 56 threads in parallel. You can think of a socket as the equivalent of a physical processor.
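Two hedged one-liners that reproduce the same count on most Linux systems (nproc reports the logical CPUs available to the calling process, so it can be lower if an affinity mask is in place):
$ nproc
56
$ grep -c ^processor /proc/cpuinfo
56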
-- edited based on the excellent comment by Margaret Bloom.
Threads per core: A hardware thread is a sufficient set of registers to represent the current state of one software thread. A core with two hardware threads can execute instructions on behalf of two different software threads without incurring the overhead of context switches between them. The amount of real parallelism that it can achieve will vary depending on what the threads are doing and on the processor make and model.
Cores per Socket: A core is what we traditionally think of as a processor or a CPU, and a socket is the interface between one or more cores and the memory system. A socket also is the physical connection between a chip or a multi-chip module and the main board. In addition to the cores, a chip/module typically will have at least two levels of memory cache. Each core typically will have its own L1 cache, and then all of the cores on the chip/module will have to share (i.e., compete for) access to at least one higher-level cache and to main memory.
Socket(s): see above. Big systems (e.g., rack servers) often have more than one. Personal computers, less often.
NUMA...: I can't tell you much about NUMA except to say that communication between threads running on different NUMA nodes works differently from, and is more expensive than, communication between threads running on the same node.
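If the numactl package is installed, you can inspect the node layout directly; a sketch, assuming a two-node machine like the one above:
$ numactl --hardware
The output lists which logical CPUs and how much memory belong to each node, followed by a "node distances" table that makes the non-uniformity explicit: accessing the other node's memory is reported as more costly than accessing local memory.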
I have 16000 jobs to perform.
Each job is independent. There is no shared memory, no interprocess communication, no lock or mutex.
I am on Ubuntu 16.06, C++11, Intel® Core™ i7-8550U CPU @ 1.80GHz × 8.
I use std::async to split jobs between cores.
If I split the jobs into 8 (2000 per core), computation time is 145.
If I split the jobs into 4 (4000 per core), computation time is 60.
The output after the reduce is the same in both cases.
If I monitor the CPU during computation (just using htop), things happen as expected (8 cores are used at 100% in the first case, only 4 cores are used at 100% in the second case).
I am very confused why 4 cores would process much faster than 8.
The i7-8550U has 4 cores and 8 threads.
What is the difference? Quoting How-To Geek:
Hyper-threading was Intel’s first attempt to bring parallel computation to consumer PCs. It debuted on desktop CPUs with the Pentium 4 HT back in 2002. The Pentium 4’s of the day featured just a single CPU core, so it could really only perform one task at a time—even if it was able to switch between tasks quickly enough that it seemed like multitasking. Hyper-threading attempted to make up for that.
A single physical CPU core with hyper-threading appears as two logical CPUs to an operating system. The CPU is still a single CPU, so it’s a little bit of a cheat. While the operating system sees two CPUs for each core, the actual CPU hardware only has a single set of execution resources for each core. The CPU pretends it has more cores than it does, and it uses its own logic to speed up program execution. In other words, the operating system is tricked into seeing two CPUs for each actual CPU core.
Hyper-threading allows the two logical CPU cores to share physical execution resources. This can speed things up somewhat—if one virtual CPU is stalled and waiting, the other virtual CPU can borrow its execution resources. Hyper-threading can help speed your system up, but it’s nowhere near as good as having actual additional cores.
By splitting the jobs across more threads than there are physical cores, you are paying a big penalty.
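One practical consequence: if you want to size a worker pool by physical cores rather than logical CPUs, count the unique (core, socket) pairs. A shell sketch, assuming util-linux lscpu is available:
$ lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l
On a 4-core/8-thread part like the i7-8550U this prints 4, which matches the four workers that performed best here.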
When we have a CPU that supports some form of multithreading, each logical CPU has its own set of registers (as a minimum), including a CR3 register.
Since we are working in the virtual address space of the same process when executing different threads, and a context switch never happens (nor does the TLB get invalidated when switching between threads of the same process), why do we need a CR3 register pointing to the page table and page directory in each logical CPU?
Isn't the value always the same as the value in the CR3 of the physical CPU?
Since we are working in the virtual address space of the same process when executing different threads
That's not all HT is capable of. I think you're confusing "hardware thread" (execution context / logical core) with "software thread".
Two logical cores run on one physical core, with one physical iTLB / dTLB / L2TLB. The logical cores are very much independent, and don't have to be running threads from the same process.
This is a desirable property in an SMT design like Intel's HT: If the OS had to carefully avoid scheduling threads with different page tables onto different logical cores of the same physical core, it would require more synchronization between cores.
Two threads of different processes (with separate CR3 page tables) can share one TLB because the entries are tagged with a PCID (process-context ID). IIRC, hardware virtualization also uses similar (or the same?) tagging to avoid needing TLB flushes on VM exits or when switching between guests.
The OS can set a PCID (low 12 bits of CR3) to avoid needing TLB flushes on context switches, and as a bonus enables concurrent TLB usage by 2 processes. Does Linux use x86 CPU's PCID feature for TLB? If not, why? (According to that, Linux doesn't generally use PCID, but I assume it does for HT.)
Hmm, I'm not sure I have the details exactly right, but physically there is some kind of tagging of TLB entries to keep them separate even when the two logical cores have different CR3.
According to an Intel forum thread, SnB-family CPUs statically partition the iTLB (so each logical core gets half the entries). That automatically solves any sharing problems.
The dTLB and L2TLB are competitively shared, so they do need tagging.
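To check whether a given CPU advertises PCID (and the INVPCID instruction) at all, you can grep the flags lines of /proc/cpuinfo; a minimal sketch:
$ grep -woE 'pcid|invpcid' /proc/cpuinfo | sort | uniq -c
Each flag appears once per flags line, i.e. once per logical CPU, so on a supporting CPU the counts should equal the number of logical CPUs; no output means the feature isn't advertised.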
Google's Machine types page states that:
For the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge)...etc
Assuming that a single physical CPU core with hyper-threading appears as two logical CPUs to an operating system, the n1-standard-2 machine that is described as 2 virtual CPUs and 7.5 GB of memory essentially means 1 physical CPU core, right?
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4vCPUs, correct?
Typically, when software tells you how many cores it needs, it doesn't take hyper-threading into account. Remember, AMD didn't even have that (Hyper-Threading, or rather its SMT equivalent) until very recently. So 2 cores means 2 vCPUs. Yes, a single hyper-threaded CPU core shows up as 2 CPUs to the OS, but it does NOT quite perform as 2 truly independent CPU cores.
That's correct, you should aim for a GCE machine type that has 4 vCPUs... When you're migrating from an on-premises world, you're used to physical cores which have hyperthreading. In GCP, these are called vCPUs or virtual CPUs. A vCPU is equivalent to one hyperthread core. Therefore, if you have a single-core hyperthreaded CPU on premises, that would essentially be two virtual CPUs to one physical core. So always keep that in mind, as oftentimes people will immediately do a test and say, "I have a four-core physical machine and I'm going to run four cores in the cloud," and then ask why the performance isn't the same.
the n1-standard-2 machine that is described as 2 virtual CPUs and 7.5 GB of memory essentially means 1 physical CPU core, right?
I believe, yes.
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4vCPUs, correct?
I think they mean 2 physical cores regardless of hyper-threading (HT), because HT performance is not a stable reference.
But IMO, the recommendation should also include the speed of each physical core.
If the software recommends 2 CPU cores, you need 4 vCPUs on GCP.
https://cloud.google.com/compute/docs/cpu-platforms says:
On Compute Engine, each virtual CPU (vCPU) is implemented as a single hardware multithread on one of the available CPU processors. On Intel Xeon processors, Intel Hyper-Threading Technology supports multiple app threads running on each physical processor core. You configure your Compute Engine VM instances with one or more of these multithreads as vCPUs. The specific size and shape of your VM instance determines the number of its vCPUs.
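From inside a guest you can confirm the mapping yourself; a hedged check (exact values depend on the machine type):
$ lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'
On an n1-standard-2 you would expect CPU(s): 2 with Thread(s) per core: 2, i.e. both vCPUs are hyperthreads of a single physical core, which is what the answers above conclude.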
Long ago and far away, there was a 1 to 1 equivalence between a 'CPU' (such as what one sees in the output of "top"), a socket, a core, and a thread. (And "processor" and/or "chip" too if you like.)
So, many folks got into the habit of using two or more of those terms interchangeably. Particularly "CPU" and "core."
Then CPU designers started putting multiple cores on a single die/chip. So a "socket" or "processor" or "chip" was no longer a single core, but a "CPU" was still 1 to 1 with a "core." So, interchanging those two terms was still "ok."
Then CPU designers started putting multiple "threads" (eg hyperthreads) in a single core. The operating systems would present each hyperthread as a "CPU" so there was no longer a 1 to 1 correspondence between "CPU" and "thread" and "core."
And, different CPU families can have different numbers of threads per core.
But referring to "cores" when one means "CPUs" persists.
I'm on an Intel(R) Core(TM)2 Duo CPU T6600 @ 2.20GHz (as told to me by cat /proc/cpuinfo), but I need to go into as much depth as possible regarding the architecture for working on parallel programming (likely using pthreads). Any pointers?
The sys filesystem knows all about this:
$ ls /sys/devices/system/cpu
cpu0 cpu2 cpuidle possible sched_mc_power_savings
cpu1 cpu3 online present
$ ls /sys/devices/system/cpu/cpu0/topology/
core_id core_siblings_list thread_siblings
core_siblings physical_package_id thread_siblings_list
Here's the documentation
Using this filesystem, you can find out how many CPUs you have, how many threads they have, which CPUs are next to which other cpus, and which CPUs share caches with which other ones.
For example - Q: which CPUs does cpu0 share its L2 cache with?
$ cat /sys/devices/system/cpu/cpu0/cache/index2/{type,level,shared_cpu_list}
Unified
2
0-1
A: It shares its unified L2 cache with cpu1 (and itself).
Another example: Q: which CPUs are in the same physical package as cpu0 (on a larger machine)?
$ cat /sys/devices/system/cpu/cpu0/topology/core_siblings
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000055
A: cpus 0, 2, 4 and 6 (taken from the bit pattern above: 0x55 = 01010101, lsb = cpu0).
Note: not all Linux systems have the sys filesystem available, and it's not always mounted at /sys (possibly under /proc/sys?). The thread_siblings_list form is not always available, but the thread_siblings (bit pattern) one is.
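Where thread_siblings_list is available it answers the very first question directly; an illustrative run (hypothetical output, assuming the cpu0/cpu4 pairing described for the i7 920 above):
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,4
i.e. cpu0 and cpu4 are the two hyperthreads of the same physical core.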
I found lstopo from the hwloc project quite useful. This will give you graphical output (based on information found in /proc and /sys, as Alex Brown described) of the topology of your system (see their webpage for an example). From the graphical output you can easily see
if hyperthreaded cores are present
which cpu numbers correspond to different hyperthreads on the same physical core
how many CPU sockets are used
which cores share the L3 cache
if the main memory is common between CPU sockets or whether you are on a NUMA system
etc.
If you need to access this information programmatically, there is some documentation on how hwloc can be used as a library.
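If you just want a quick look in a terminal, hwloc also ships a text-mode variant of the same tool; a sketch:
$ lstopo-no-graphics
This prints the package/core/PU/cache hierarchy as indented text in the console, while lstopo itself opens a graphical window when a display is available.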
To get the lowest-level detail about the processor, a system engineer could run their equivalent of CPU-Z and drill down to the lowest levels, where permitted by law for the home or office where the processor is to be installed and used. Software engineers and developers are not usually allowed to know that level of detail, but I can tell you that the architecture can be of the type where a processor spawns or instantiates a virtual second processor, giving close to a 1.75x processing equivalent, or there can be real physical multiple cores on the same die, using technology and control methods that are enhancements of previous implementations of the design. The Core processor has an interface for system programmers and application developers to use, and this mode of access would be Intel's expectation; details are available from them.