SMT and hyperthreading: threads vs. processes - multithreading

I understand SMT in general and the concept of hardware threads (I think). I wanted my understanding to be validated or corrected here.
Basically, HW threads are different from SW threads. We could run different SW threads or even different processes on an SMT core simultaneously, right? The SMT core does not differentiate between process 1 and process 2; to the HW, they are just two threads.
Is that correct?

Yes, your understanding is correct: the concept of hardware threads doesn't really relate to the distinction between (OS-level) threads and processes. For example, it doesn't somehow limit two SMT threads to only running software threads from the same process¹.
The use of the term hardware thread is a bit confusing, since thread already had a specific meaning in the software world. As Peter pointed out in the comments, you might prefer logical core instead. So a single hyperthreaded package might have 2 physical cores and 4 logical cores. We refer to that as 2c4t (yes, the t is again for thread).
It might be easiest to think of this in terms of abstractions. The key abstraction hardware offers to software is the CPU. 15 years ago, the 1 CPU your desktop reported was the same thing as the 1 physical chip you'd see under the fan if you opened the case. Today, a single physical package (the thing you see plugged into the socket under the fan) usually appears as multiple CPUs to the operating system.
In particular, a 2c4t physical CPU will mostly appear as 4 CPUs to the OS. The OS mostly doesn't care whether that's 2 physical cores with 2 logical cores each, or 1 physical core with 4 logical cores (not common on Intel but common elsewhere), or 4 physical cores with 1 logical thread each, or even 4 separate physical CPUs with 1 core each on a big server motherboard. The way the hardware implements the presented CPUs is only a performance concern, not really a functional one. In user software, for example, when you query the number of CPUs, you really get the total number of hardware threads, no matter how they are physically implemented².
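For example, a quick userland check (a minimal sketch, assuming a Linux/Unix box; both calls simply report however many logical CPUs the OS presents, with no hint of how they map to physical cores):

    // Both report the count of logical CPUs (hardware threads) the OS exposes.
    #include <iostream>
    #include <thread>
    #include <unistd.h>   // sysconf (POSIX)

    int main() {
        unsigned hw = std::thread::hardware_concurrency(); // e.g. 4 on a 2c4t CPU
        long online = sysconf(_SC_NPROCESSORS_ONLN);       // same count via POSIX
        std::cout << "logical CPUs: " << hw << " / " << online << '\n';
    }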
So understanding that abstraction helps answer this:
We could run different SW threads or even different processes on an SMT core simultaneously, right?
Yes - whatever you could do on 2 physical CPUs, you can do on 2 cores, or 2 logical cores on the same physical core. The abstraction the hardware presents is the same.
Then there is the question of software processes and threads. This is mostly an abstraction the operating system presents to userland software. The CPU doesn't really have this concept at all: it just offers an "execution context" per CPU on which to run something, plus a bunch of additional services that modern OSes need, such as various privilege levels (to implement the user/kernel split), memory protection, interrupts, paging/memory-management-unit services, etc.
The operating system uses that to implement its concepts of processes and threads, but the CPU doesn't care. For example, processes usually have separate virtual memory spaces, while threads share them. The CPU supports this by having an MMU, but it doesn't have a binary concept of process vs. thread: you could very well have something in the middle that shares only part of the memory space. Much of the non-virtual-memory difference between processes and threads is totally outside the domain of the CPU: separate sets of open files, separate permissions and capabilities, working directories, environment variables and so on.
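Linux's clone() makes this spectrum visible (a hedged sketch, Linux-specific; the flag choice is just an illustration of "something in the middle", not a recommended pattern):

    #include <sched.h>      // clone, CLONE_VM (g++ on Linux defines _GNU_SOURCE)
    #include <sys/wait.h>
    #include <csignal>      // SIGCHLD
    #include <cstdio>
    #include <cstdlib>

    static int child(void*) {
        std::puts("child sees the parent's address space");
        return 0;
    }

    int main() {
        const std::size_t stack_size = 1024 * 1024;
        char* stack = static_cast<char*>(std::malloc(stack_size));
        // CLONE_VM shares the address space (thread-like), but files, signal
        // handlers, etc. are NOT shared (process-like). The kernel composes
        // these pieces; the CPU just sees another execution context.
        pid_t pid = clone(child, stack + stack_size, CLONE_VM | SIGCHLD, nullptr);
        waitpid(pid, nullptr, 0);
        std::free(stack);
    }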
Understanding the process/thread abstraction helps answer this other part of your question:
The SMT core does not differentiate between process 1 and process 2; to the HW, they are just two threads[?]
Correct. Not only does SMT not care about processes versus threads, CPUs in general don't care. They offer some functionality the OS can use to set up various sharing arrangements between execution contexts (the memory mapping being the big one), but they don't care how it is used. You won't even really find a discussion of a binary distinction between "process" and "thread" in the system programming manual for a CPU.
¹ That seemed to be one of your concerns, but it wasn't entirely clear.
² To be clear, no modern OS will be totally ignorant of the mapping between physical cores and the 1 or more logical cores they contain - among other things, it uses that information to optimize scheduling. For example, if you had two busy processes running on your 2c4t box, it would usually be silly for both to run on the same physical core, leaving the other idle, since performance will generally be lower that way. This is no different from something like NUMA - there is a fundamental high-level abstraction (a single homogeneous shared memory space) alongside low-level performance concerns which leak through the abstraction (not all memory access is uniform). The goal is that the lowest levels of the software stack (OS, threading libraries, memory allocators, etc.) mostly handle this stuff so user software can keep working with the high-level abstraction.
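The physical-vs-logical mapping the scheduler uses is visible from userland too (a minimal sketch, Linux-specific; it just reads the sysfs topology files):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        for (int cpu = 0; ; ++cpu) {
            std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                            "/topology/thread_siblings_list");
            if (!f) break;                 // ran out of logical CPUs
            std::string siblings;
            std::getline(f, siblings);     // e.g. "0,4" on a 4c8t part
            std::cout << "cpu" << cpu << " shares a physical core with: "
                      << siblings << '\n';
        }
    }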

Related

Does 100% use of some cores impact the performance of a process (C++, multithreaded) which is running on different cores in Linux?

In a 32-core system, a process (A) fully consumes 4 cores (400% CPU usage in top). The rest of the cores are available. Does this impact the performance of another process (B)? Will process (B) run better if process (A) is not running, and if so, why?
Process (B) uses Boost and multiple threads (say 24).
I was expecting the performance of process B not to be impacted by process A, since there are 32 cores.
In general, yes, running a process can slow down others even though not all cores are busy. In practice, the impact strongly depends on the code being executed.
This can happen because some hardware resources are shared. The most common ones are storage devices, the network, the RAM, and the last-level cache (LLC, typically an L3). For example, a few cores are generally enough to saturate RAM bandwidth, so using more than 8 cores is generally not significantly faster if the two processes are memory-bound. HDDs tend not to be faster in parallel, so when 2 processes try to use one heavily at the same time they are often significantly slower. In practice, they can be more than 2 times slower, because HDDs have a high seek/fetch time, and a process doing many random accesses can drastically slow down a process reading/writing large contiguous files.
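A hedged sketch of the RAM-bandwidth point: the streaming loop below can be launched with 1, 2, 4, ... threads, and the reported GB/s typically stops scaling well before all cores are in use (buffer size and thread counts are arbitrary illustration values):

    #include <chrono>
    #include <cstdint>
    #include <cstdlib>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main(int argc, char** argv) {
        const int nthreads = argc > 1 ? std::atoi(argv[1]) : 4;
        const std::size_t words = 64u * 1024 * 1024;       // 512 MiB per thread
        std::vector<std::thread> workers;
        auto t0 = std::chrono::steady_clock::now();
        for (int t = 0; t < nthreads; ++t)
            workers.emplace_back([words] {
                std::vector<std::uint64_t> buf(words, 1);
                std::uint64_t sum = 0;
                for (std::size_t i = 0; i < words; ++i) sum += buf[i]; // streaming reads
                volatile std::uint64_t sink = sum; (void)sink;         // keep the loop alive
            });
        for (auto& w : workers) w.join();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        double bytes = double(nthreads) * words * sizeof(std::uint64_t);
        std::cout << nthreads << " threads: " << bytes / dt.count() / 1e9
                  << " GB/s of reads (plus the initial fill)\n";
    }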
On NUMA systems, things can be a bit more complex, since 2 processes operating on the same NUMA node can be slower than 2 processes running on different NUMA nodes, due to saturation of the target node's RAM and the NUMA allocation policy. In some rare cases, 2 processes running on different NUMA nodes can be slower than when running on the same node. This is true if the processes communicate with each other (due to the higher latency between cores belonging to different NUMA nodes), or if the processes communicate with hardware resources bound to specific NUMA nodes that are not the ones the processes are running on (e.g. a GPU with a high-performance interconnect, a high-performance InfiniBand device, etc.).
Note that some software resources can also be shared. The operating system can lock them so as to ease the maintenance of some parts of its code, or simply because the resource fundamentally cannot be used in parallel in a way that scales. Historically, some OSes used a giant lock that prevented nearly all system calls from scaling. Such locks have been progressively replaced with finer-grained locks, or with no locks at all (e.g. atomics), as multi-core processors became ubiquitous. Note that even atomic data structures do not scale very well on most processors, so system calls operating on the same data structure tend to impact other running processes on many-core systems. Still, the biggest issue is generally the saturation of shared hardware resources.

Can a hyper-threaded processor core execute two threads at the exact same time?

I'm having a hard time understanding hyper-threading. If the logical core doesn't actually exist, what's the point of using hyper-threading? The Wikipedia article states that:
For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible.
If the two logical cores share the same execution units, that means one of the threads will have to be put on hold while the other executes. That being the case, I don't understand how hyper-threading can be useful, since you're not actually introducing new execution units. I can't wrap my head around this.
See my answer on a softwareengineering.SE question for some details about how modern CPUs find and exploit instruction-level parallelism (ILP) by running multiple instructions at once (including a block diagram of Intel Haswell's pipeline, and links to more CPU microarchitecture details). Also see Modern Microprocessors: A 90-Minute Guide!
You have a CPU with lots of execution units and a front-end that can keep them mostly supplied with work to do, but only under good conditions. Stalls like cache misses or branch mispredicts, or just limited parallelism (e.g. a loop that does one long chain of FP additions, bottlenecking on FP latency at one (scalar or SIMD) add per 4 or 5 clocks instead of one or two per clock) will result in throughput of much less than 4 instructions per cycle, and leave execution units idle.
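A hedged sketch of the low-ILP case (illustrative code, not taken from the linked answer): the first loop is one long FP dependency chain, so it runs at roughly one add per FP-add latency; the second keeps several independent chains in flight, so the adders stay busier. (With default FP rules; an auto-vectorizing compiler given -ffast-math could transform the first loop itself.)

    #include <vector>

    double sum_serial(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;             // each add waits for the previous one
        return s;
    }

    double sum_unrolled(const std::vector<double>& v) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0; // four independent dependency chains
        std::size_t i = 0;
        for (; i + 4 <= v.size(); i += 4) {
            s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
        }
        for (; i < v.size(); ++i) s0 += v[i];
        return (s0 + s1) + (s2 + s3);
    }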
The point of HT (and Simultaneous Multithreading (SMT) in general) is to keep those hungry execution units fed with work to do, even when running code with low ILP or lots of stalls (cache misses / branch mispredicts).
SMT only adds a bit of extra logic to the pipeline so it can keep track of two separate architectural contexts at the same time. So it costs a lot less die area and power than having twice or 4x as many full cores. (Knight's Landing Xeon Phi runs 4 threads per core, mainstream Intel CPUs run 2. Some non-x86 chips run 8 threads per core, aimed at database-server type workloads.) But of course having to divide out-of-order execution resources between logical threads often means the throughput gain is significantly below 2x or 4x, often far below, and for some workloads is negative.
Also related: What is the difference between Hyperthreading and Multithreading? Does AMD Zen use either? - AMD's SMT is basically the same as Intel's, just without the "Hyperthreading" trademark. See also other links in my answer there, like https://www.realworldtech.com/nehalem/3/ and especially https://www.realworldtech.com/alpha-ev8-smt/ for an intro with diagrams to what SMT is all about. (Many members of the Alpha EV8 design team were hired by Intel after DEC folded, and went on to implement SMT in Netburst (Pentium 4), which Intel branded Hyperthreading.)
Common misconceptions
Hyperthreading is not just optimized context switching. Simpler designs that switch to the other thread on a cache miss are possible, but HT is more advanced than that. (Switch-on-stall, or round-robin "barrel processor").
With two threads active, the front-end alternates between threads every cycle (in the fetch, decode, and issue/rename stages), but the out-of-order back-end can actually execute uops from both logical cores in the same cycle. The issue/rename stage is 4 uops wide on Intel before Ice Lake.
In pipeline stages that normally alternate, any time one thread is stalled, the other thread gets all the cycles in that stage. HT is much better than just fixed alternating, because one thread can get lots of work done while the other is recovering from a branch mispredict or waiting for a cache miss.
Note that up to 10 or 12 cache misses can be outstanding at once from L1D cache in Intel CPUs (this is the number of Line Fill Buffers, LFBs), and memory requests are pipelined. But if the address of the next load depends on an earlier load (e.g. pointer chasing through a tree or linked list), the CPU doesn't know where to load from and can't keep multiple requests in flight. So it is actually useful for both threads to be waiting on cache misses in parallel.
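A hedged illustration of that difference (made-up types, just to show the dependency pattern): in the list walk each load's address comes from the previous load, so misses serialize; in the array sum the addresses are independent, so many misses can be in flight at once:

    #include <cstddef>

    struct Node { Node* next; long value; };

    long walk_list(const Node* head) {
        long sum = 0;
        for (const Node* n = head; n; n = n->next)  // next address unknown until the load returns
            sum += n->value;
        return sum;
    }

    long walk_array(const long* a, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i)         // all addresses independent: misses overlap
            sum += a[i];
        return sum;
    }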
Some resources are statically partitioned when two threads are active, some are competitively shared. See this pdf of slides for some details. (For more details about how to actually optimize asm for Intel and AMD CPUs, see Agner Fog's microarchitecture PDF.)
When one logical core "sleeps" (i.e. the kernel runs a HLT instruction, or MWAIT to enter a deeper sleep), the physical core transitions to single-thread mode and lets the still-active logical core have all the resources (including the full re-order buffer size and the other statically-partitioned resources), so its ability to find and exploit ILP in the single thread still running increases beyond what it has when the other thread is simply stalled on a cache miss.
BTW, some workloads actually run slower with HT. If your working set barely fits in L2 or L1D cache, then running two threads on the same core will lead to a lot more cache misses. For very well-tuned high-throughput code that can already keep the execution units saturated (like an optimized matrix multiply in high-performance computing), it can make sense to disable HT. Always benchmark.
On Skylake, I've found that video encoding (with x265 -preset slower, 1080p) is about 15% faster with 8 threads instead of 4, on my quad-core i7-6700k. I didn't actually disable HT for the 4-thread test, but Linux's scheduler is good at not bouncing threads around and running threads on separate physical cores when there are enough to go around. A 15% speedup is pretty good considering that x265 has a lot of hand-written asm and runs very high instructions-per-cycle even when it has a whole core to itself. (Slower presets like I used tend to be more CPU-bound than memory-bound.)

Single-threaded/event-based software vs cores and H/W threads

I'm a bit confused here about cores and threads on CPUs
Often in configuration files (e.g. nginx, Go) you have to define the number of cores to get the best performance.
If you look at this CPU
http://ark.intel.com/products/52213/Intel-Core-i7-2600-Processor-8M-Cache-up-to-3_80-GHz
How many "cores" does it have?
In the specs it has 4 cores and 8 threads. Does that mean 4 * 8 = 32 "cores"?
No, the CPU you linked to has four cores. It can, however, run two threads at the same time per core with a technology called Hyper-Threading (HT), thus has 8 "threads". The OS will be presented with 8 processors, unless you disable HT in the BIOS or elsewhere.
Note that hyper-threading works in a special way: it uses a core's otherwise-unused execution units (in the sense of a superscalar processor) for the second thread. AFAIK there are really good algorithms that re-order instructions for this to be most effective, but bear in mind that the hyper-threads may not bring the best performance for all applications. For example: if the four "real" threads already use all the floating-point execution units all the time, the hyper-threads will not be able to use them most of the time.

Regarding relationship between cores and ranks of a MPI program

As far as I know, in a multiprocessor environment any thread/process can be allocated to any core/processor, so what is meant by the following line:
the number of MPI ranks used on an Intel Xeon Phi coprocessor should be substantially fewer than the number of cores in no small part because of limited memory on the coprocessor.
I mean, what are the issues if #cores <= #MPI ranks?
That quote is correct only when it is applied to a memory-size-constrained problem; in general it would be an incorrect statement. In general you should use more tasks than you have physical cores on the Xeon Phi in order to hide memory latency¹.
To answer your question "What are the issues if the number of cores is fewer than the number of MPI ranks?": you run the risk of too much context switching. On many problems it is advantageous to use more tasks than you have cores, to hide memory latency².
¹ I don't even feel like I need to cite a reference for this because of how loudly it is advertised; however, it is mentioned in the OpenCL design and programming guide: http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor
² This advice applies to the Xeon Phi specifically, not necessarily to other pieces of hardware.
Well, if you make the number of MPI tasks higher than the number of cores, it makes no sense, because you start forcing 2 tasks onto one processing unit and therefore exhaust its computing resources.
As for why a substantially lower number of tasks than cores is preferred on the Xeon Phi: maybe they prefer threads over processes. The architecture of the Xeon Phi is quite peculiar, and the overhead introduced by maintaining an MPI task can seriously cripple computing performance. I will not hide that I do not know the technical reason behind it, but maybe someone will fill it in.
If I recall correctly, the communication bus there is a ring (or two rings), so maybe all-to-all communication and barriers pollute the bus and turn out to be ineffective.
Using threads or the native execution mode they provide has less overhead.
Also, I think you should look at it more like a multicore CPU, not a multi-CPU machine. For greater performance you don't want to run 4 MPI tasks on a 4-core CPU either; you want to run one 4-threaded MPI task.
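A hedged sketch of that hybrid layout (illustrative, not a tuned configuration): one MPI rank that fills its cores with threads, launched with e.g. mpirun -np 1 and OMP_NUM_THREADS=4:

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        // Threads inside a rank share its memory, so there is no per-task MPI overhead between them.
        #pragma omp parallel
        std::printf("rank %d of %d, thread %d of %d\n",
                    rank, size, omp_get_thread_num(), omp_get_num_threads());
        MPI_Finalize();
    }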

Kernel Scheduling for 1024 CPUs

Azul Systems has an appliance that supports thousands of cache coherent CPUs. I would love insight into what changes would need to occur to an operating system in order to schedule thousands of simultaneously running threads.
Scheduling thousands of threads is not a big deal, but scheduling them on hundreds of CPUs is. What you need, first and foremost, is very fine-grained locking, or, better yet, lock-free data structures and algorithms. You just can't afford to leave 200 CPUs waiting while one CPU executes a critical section.
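A hedged sketch of the idea (an illustrative userspace structure, not taken from any particular kernel): instead of one lock that hundreds of CPUs queue on, give each thread its own cache-line-padded slot and only combine on read:

    // Compile with -std=c++17 or later (over-aligned allocation in std::vector).
    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct alignas(64) PaddedCounter {            // one cache line per slot: no false sharing
        std::atomic<long> value{0};
    };

    class ShardedCounter {
        std::vector<PaddedCounter> shards_;
    public:
        explicit ShardedCounter(std::size_t nthreads) : shards_(nthreads) {}
        void add(std::size_t tid, long n) {       // each thread touches only its own line
            shards_[tid].value.fetch_add(n, std::memory_order_relaxed);
        }
        long read() const {                       // readers pay the cost; writers scale
            long total = 0;
            for (const auto& s : shards_) total += s.value.load(std::memory_order_relaxed);
            return total;
        }
    };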
You're asking for possible changes to the OS, so I presume there's a significant engineering team behind this effort.
There are also a few pieces of clarifying info that would help define the problem parameters:
How much IPC (inter process communication) do you need?
Do they really have to be threads, or can they be processes?
If they're processes, is it okay if they have to talk to each other through sockets, and not by using shared memory?
What is the memory architecture? Are you straight SMP with 1024 cores, or is there some other NUMA (Non-Uniform Memory Access) or MPP arrangement going on here? What are your page tables like?
Knowing only the smallest bit of info about Azul Systems, I would guess that you have very little IPC, and that a simple "run one kernel per core" model might actually work out just fine. If processes need to talk to each other, they can create sockets and transfer data that way. Does your hardware support this model? (You would likely end up needing one IP address per core as well, and at 1024 IP addresses this might be troublesome, although they could all be NAT'd, and maybe it's not such a big deal.) Of course, this model would lead to some inefficiencies, like extra page tables and a fair bit of RAM overhead, and may not even be supported by your hardware system.
Even if "1 kernel per core" doesn't work, you could likely run 1024/8 kernels, and be just fine, letting each kernel control 8 physical CPUs.
That said, if you wanted to run 1 thread per core in a traditional SMP machine with 1024 cores (and only a few physical CPUs), then I would expect the old-fashioned O(1) scheduler is what you'd want. It's likely that your CPU[0] will end up spending nearly 100% of its time in the kernel doing interrupt handling, but that's just fine for this use case, unless you need more than 1 core to handle your workload.
Making Linux scale has been a long and ongoing project. The first multiprocessor capable Linux kernel had a single lock protecting the entire kernel (the Big Kernel Lock, BKL), which was simple, but limited scalability.
Subsequently the locking has been made more fine-grained, i.e. there are many locks (thousands?), each covering only a small portion of data. However, there are limits to how far this can be taken, as fine-grained locking tends to be complicated, and the locking overhead starts to eat up the performance benefit, especially considering that most multi-CPU Linux systems have relatively few CPUs.
Another thing is that, as far as possible, the kernel uses per-CPU data structures. This is very important, as it avoids the cache-coherency performance issues of shared data, and of course there is no locking overhead. E.g. every CPU runs its own process scheduler, requiring only occasional global synchronization.
Also, some algorithms are chosen with scalability in mind. E.g. some read-mostly data is protected by Read-Copy-Update (RCU) instead of traditional mutexes; this allows readers to proceed during a concurrent update.
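A hedged userspace stand-in for that read-mostly pattern (illustrative only; the kernel's real RCU works differently, using grace periods rather than reference counts): readers take an atomic snapshot and never block, while updaters copy, modify, and publish a new version:

    #include <atomic>
    #include <memory>
    #include <vector>

    struct Config { std::vector<int> allowed_ports; };

    std::shared_ptr<const Config> g_config = std::make_shared<const Config>();

    std::shared_ptr<const Config> read_config() {   // readers: one atomic load, no lock
        return std::atomic_load(&g_config);
    }

    void add_port(int port) {                        // updaters: copy, modify, publish
        // (Concurrent updaters would still need a writer-side lock, as with real RCU.)
        auto next = std::make_shared<Config>(*std::atomic_load(&g_config));
        next->allowed_ports.push_back(port);
        std::atomic_store(&g_config,
                          std::shared_ptr<const Config>(std::move(next)));
    }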
As for memory, Linux tries hard to allocate memory from the same NUMA node as where the process is running. This provides better memory bandwidth and latency for the applications.
My uneducated guess would be that there is a run queue per processor and a work-stealing algorithm for when a processor is idle. I could see this working in an M:N model, where there is a single process per CPU and lightweight processes as the work items. This would then feel similar to a work-stealing thread pool, such as the one in Java 7's fork-join library.
If you really want to know, go pick up Solaris Internals or dig into the Solaris kernel code. I'm still reading The Design and Implementation of the FreeBSD Operating System, with Solaris Internals next on my list, so all I can do is make wild guesses at the moment.
I am pretty sure that the SGI Altix we have at work (which does ccNUMA) uses special hardware for cache coherency.
There is a huge overhead connected with keeping 4 MB of cache per core coherent. It's unlikely to happen in software only.
In an array of 256 CPUs you would need 768 MB of RAM just to hold the cache-invalidation bits:
12 MB of cache / 128 bytes per cache line * 256² cores ≈ 6.4 billion bits ≈ 768 MB.
Modifying the OS is one thing, but using unchanged application code is a waste of hardware. When going over some limit (depending on the hardware), the effort to keep coherency and synchronization in order to execute generic code is simply too much. You can do it, but it will be very expensive.
From the OS side you'll need a complex affinity model, i.e. not jumping CPUs just because yours is busy. Schedule threads based on the hardware topology - keep cooperating threads on CPUs that are "close" to minimize penalties. Simple work stealing is not a good solution; you must consider topology. One solution is hierarchical work stealing - steal work by distance: divide the topology into sectors and try to steal from the closest first.
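A hedged sketch of that hierarchical stealing (illustrative data structures; a real scheduler would use lock-free deques and a topology it discovers at boot): each CPU has its own queue, and an idle CPU tries victims in order of topological distance:

    #include <deque>
    #include <mutex>
    #include <optional>
    #include <vector>

    using Task = int;   // placeholder work item

    struct PerCpuQueue {
        std::mutex m;
        std::deque<Task> tasks;
    };

    // victims[cpu] lists the other CPUs sorted by topological distance
    // (same physical core first, then same socket, then remote sockets).
    std::optional<Task> get_task(std::vector<PerCpuQueue>& queues,
                                 const std::vector<std::vector<int>>& victims,
                                 int cpu) {
        {   // fast path: our own queue
            std::lock_guard<std::mutex> g(queues[cpu].m);
            if (!queues[cpu].tasks.empty()) {
                Task t = queues[cpu].tasks.front();
                queues[cpu].tasks.pop_front();
                return t;
            }
        }
        for (int v : victims[cpu]) {                 // steal from the closest victim first
            std::lock_guard<std::mutex> g(queues[v].m);
            if (!queues[v].tasks.empty()) {
                Task t = queues[v].tasks.back();     // steal from the far end
                queues[v].tasks.pop_back();
                return t;
            }
        }
        return std::nullopt;                         // nothing runnable anywhere nearby
    }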
Touching a bit on the lock issue: you'll still use spin-locks and such, but with totally different implementations. This is probably the most patented field in CS these days.
But, again, you will need to program specifically for such massive scale. Or you'll simply under-use it. No automatic "parallelizers" will do it for you.
The easiest way to do this is to bind each process/thread to a few CPUs, and then only those CPUs would have to compete for a lock on that thread. Obviously, there would need to be some way to move threads around to even out the load, but on a NUMA architecture you have to minimize this as much as possible.
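A minimal sketch of the per-thread version of that binding (Linux-specific; the CPU number passed in is illustrative):

    #include <pthread.h>   // pthread_setaffinity_np (GNU extension; g++ on Linux)
    #include <sched.h>
    #include <cstdio>

    // Pin the calling thread to one logical CPU so the scheduler never migrates it.
    void pin_current_thread(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0) std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    }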
Even on dual-core Intel systems, I'm pretty sure that Linux can already handle thousands of threads with native POSIX threads.
(glibc and the kernel both need to be configured to support this, but I believe most systems have that by default these days.)
