Why does a hyper-threaded or multi-threaded CPU matter?

A single CPU core can only execute one instruction at a time, so what a multi-threaded CPU basically does is switch back and forth between multiple threads within a single core. Since a single-threaded, single-core CPU can already multitask by context switching between processes, why does a multi-threaded CPU matter?

You're mixing up quite a few things here ...
First of all: hardware threads have next to nothing in common with software threads. As far as I know, there can only be n hardware threads on a CPU, where n is the number of real or virtual CPU cores (an ALU, for example).
Context switching is done to allow the illusion of parallelism on one single core.
Now: since nearly all current desktop and server CPUs have several cores, they support MT, which enables real parallelism. Multiple calculations can be done at the same time, yet the result has to be pipelined.
Modern CPUs even simulate additional cores. That's possible because there is a time gap between result delivery and command dispatch, AFAIR, and this gap can be used for additional calculations. That's called hyperthreading, and it can boost your performance a bit.

Related

What does it mean when we say "4 cores 8 threads"?

When I run lscpu on my host, it shows
CPU(s): 8
Thread(s) per core: 2
Core(s) per socket: 4
My host has 4 physical cores, but 8 logical CPUs due to 2 threads per core. OK, so "2 threads per core" means one core can execute 2 threads simultaneously so as if we have doubled the CPU capacity? So is this a parallel concept?
While we have another concept that "one process can have multiple threads", I believe this means one process can handle multiple threads concurrently by switching context, but not necessarily in parallel. In most cases one CPU can execute one thread at a time, right?
I'd like to confirm my understanding above is correct. Thanks
Ref for concurrent and parallel difference: What is the difference between concurrency and parallelism?
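The arithmetic behind the lscpu lines quoted in the question can be sketched in a few lines of Python. This is only an illustration: the sample text reuses the three lines from the question, the "Socket(s): 1" line is an assumption (the question does not show it), and parse_lscpu and topology are hypothetical helper names.

```python
# Sample text built from the lscpu lines quoted in the question.
# "Socket(s): 1" is an assumption added for illustration.
SAMPLE = """\
CPU(s): 8
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
"""

def parse_lscpu(text):
    """Return a dict of the integer-valued 'Key: value' fields in lscpu-style text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            fields[key.strip()] = int(value)
    return fields

def topology(fields):
    """Return (physical cores, logical CPUs) computed from the parsed fields."""
    threads_per_core = fields["Thread(s) per core"]
    cores_per_socket = fields["Core(s) per socket"]
    sockets = fields.get("Socket(s)", 1)  # assume one socket if the line is absent
    physical = cores_per_socket * sockets
    logical = physical * threads_per_core
    return physical, logical

fields = parse_lscpu(SAMPLE)
physical_cores, logical_cpus = topology(fields)
print(physical_cores, logical_cpus)  # 4 8
```

The logical CPU count matches the "CPU(s): 8" line, which is the number of hardware threads the OS can schedule onto, not the number of physical cores.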

This concept is called simultaneous multithreading (SMT). It is implemented in many processors, from x86-64 (both AMD and Intel) to POWER. The idea is to execute 2 threads concurrently on one core. Some operations can run in parallel, depending on the specific target architecture.
one core can execute 2 threads simultaneously so as if we have doubled the CPU capacity?
No. Hardware threads (also called logical cores) are not equivalent to physical cores. Some processor units are statically partitioned between the hardware threads, while others are dynamically allocated, meaning the threads share the available resources.
The initial idea was to execute something useful when a core was stalling on some operations like memory reads. With 2 hardware threads, a core can execute the instructions of another thread if the current one is waiting on memory, for example due to a cache miss. Memory-bound parallel codes that are limited by the latency of the RAM like naive transpositions or linked-list traversals can benefit from this mechanism.
The SMT implementation has improved significantly over time, especially in recent x86-64 processors. Nowadays, hardware threads of modern processors can execute computing instructions truly in parallel. For example, an Intel Skylake processor can execute up to 4 arithmetic instructions per cycle, thanks to its 4 ALUs. A single thread can execute 4 instructions per cycle only if the instructions are independent (during the target cycles). This is not always possible, as some loops are inherently sequential and do not contain enough independent instructions per iteration (e.g. a cumulative sum).
With 2-way SMT enabled, 2 software threads can be scheduled on the same core, and the core can execute 2 instructions of each thread completely in parallel in a given cycle. It can even balance the number of instructions according to the needs of each thread in real time (e.g. 1 vs 3 instructions per cycle). In the end, latency-bound codes can be up to 2 times faster on a 2-way SMT processor like Skylake.
That being said, SMT does not speed up codes that can already fully use all the available computing units. For example, a parallel matrix multiplication using an optimized BLAS library will nearly always be slower with 2 software threads running per core than with only 1 software thread per core. The execution can be slower because hardware threads share some resources, like caches, and 2 threads running simultaneously on one core can conflict with each other. To put it shortly, efficient codes should not benefit from it, but people tend to write inefficient code, and it is not rare for compilers to fail to generate efficient code that saturates the computing units of a core (they often need some help).
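To make the "independent instructions" point concrete, here is a small Python sketch contrasting a loop whose iterations are independent with the cumulative sum mentioned above. Pure Python will not reveal the hardware effect (the interpreter overhead dominates); the sketch only shows the shape of the data dependencies, and both function names are made up for illustration.

```python
def elementwise_add(a, b):
    # Each output element depends only on a[i] and b[i]. The iterations are
    # independent, so a core with several ALUs could, in compiled code,
    # work on several of them in the same cycle.
    return [x + y for x, y in zip(a, b)]

def cumulative_sum(a):
    # Each output depends on the previous one (a loop-carried dependency),
    # so in this naive form the additions form one serial chain.
    out = []
    total = 0
    for x in a:
        total += x
        out.append(total)
    return out

print(elementwise_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(cumulative_sum([1, 2, 3, 4]))              # [1, 3, 6, 10]
```

A loop shaped like cumulative_sum leaves execution units idle in a single thread, which is the kind of slack a second hardware thread can fill.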
While we have another concept that "one process can have multiple threads", I believe this means one process can handle multiple threads concurrently by switching context, but not necessarily in parallel.
I would like to set the record straight: software threads and hardware threads are two very different things despite the name.
A software thread is a logical OS unit that can be scheduled on a hardware thread. A hardware thread can be seen as a physical part of a processor core (although this is a simplistic view). A software thread is part of an OS process. The OS is responsible for scheduling the ready software threads. Processes are not scheduled; software threads are (at least on a modern OS). 2 software threads of 2 different processes can run in parallel on a processor with multiple cores (or even on a single 2-way SMT core).
In most cases one CPU can execute one thread at a time, right?
The term "CPU" is not clear here: it can mean different things depending on the context.
If "one CPU" means a modern microprocessor chip, which is typically multicore nowadays, then definitely not. Software threads can truly run in parallel on different cores, for example.
If "one CPU" means a core (as is common in high-performance computing), then it depends: a 1-way SMT core can execute only 1 thread at a time, while a 2-way SMT core can execute 2 threads at a time.
On old microprocessor chips with 1 core and no SMT, it was true that only one thread was running at a time, and context switches were used to execute threads concurrently from the user's point of view, but not in parallel. That time is long gone (for nearly 2 decades), except maybe on some embedded microprocessor chips.

Is this...parallel?
Maybe.
Hyperthreading is Intel's trademark* for processor cores that have two complete sets of context registers. A hyperthreaded CPU can concurrently execute code on behalf of two threads without any intervention by the operating system (i.e., with no need for context switching.)
The extent to which those two concurrent executions actually are parallel executions varies from CPU model to model, and it depends on what the two threads actually are doing. For example (I'm just making this part up because it's been a few decades since I've needed to worry about any particular CPU architecture) if some "hyperthreaded" CPU has two integer ALUs per core, then the two threads might both be able to perform integer operations in parallel, but if it has only one FPU per core, then they would have to take turns using it.
Some Hyperthreaded CPU models have more duplicate execution units than others have, and so can parallelize more parts of the execution.
* AMD calls their similar capability, "2-way simultaneous multithreading."

Is multi-threading dependent on the architecture of the machine?

I have been reading lately about system architecture, and the topic of multi-threading has not been covered in detail with the latest improvements in technology. I did my share of searching, but could not find answers to the following.
The questions I have are:
1) Is multi-threading dependent on the system architecture (CPU)? Do all CPUs (single core) support multi-threading? If not, what happens to multi-threaded applications when they run on those machines?
It is cited here that
Intel CPUs support multithreading, but only two threads per CPU. AMD CPUs do not support multithreading and AMD often cites Microsoft's recommendations to turn off Hyperthreading on Intel CPUs when running applications like PeopleSoft and Exchange.
2) So what does it mean to say "only two threads per CPU" here? At any given time, a CPU (single core) can process only one thread, and the other thread is waiting to be processed, correct?
3) How is it different from an application that spawns, say, 10 threads and waits for them to be executed? If the CPU can tackle at most two threads, shouldn't the programmer take that fact into consideration when writing multi-threaded applications?
Even with multi-core processors (say quad-core), at most 8 threads can be queued, but only 4 threads can be processed at the same time.
P.S.: I have read a little about hyper-threading, but I am not sure if that is relevant here and if all processors support hyper-threading.

1) It depends on the operating system more than anything. Even on single-core architectures, multi-threading can be supported, but the threads do not execute in parallel: the OS context-switches between them.
2) Intel usually supports two-way hardware threading (also called simultaneous multi-threading), where each hardware thread has its own register state and the two share the core's execution resources. So if you have a process with two threads, they can both execute on the same core simultaneously.
3) See 1. Basically, the operating system is going to allocate as many threads as it can to hardware before it plans to context-switch between the threads it couldn't allocate. This process depends on the OS's scheduler, and you can read about the Linux one to get a good idea of what's going on.
Edit: Hyperthreading is basically the hardware threading feature I mentioned.

In your question, CPU means core.
1) It does. I believe memory access on ARM is in words, so a write to a char is not atomic. Memory ordering also differs.
Modern OSes (anything but DOS) support context switching: while one thread executes, the others wait. The total number of threads across all Windows processes is about 1000. A common time quantum (the time a thread holds the CPU) is 1-10 ms. Multithreading on one core doesn't improve computational power, but it allows asynchronous tasks. For example, the GUI doesn't freeze during network activity: one thread waits on the network while another responds to user activity.
2) Yes.
3) It is common practice to spawn a number of threads equal to the number of (virtual) cores, i.e. the number of cores in the system for AMD and twice that for Intel. This is true only for computational threads. Web server threads usually wait on the network and don't load the CPU much, so it is better to spawn thousands of threads.
Hyperthreading is good for tasks that wait on RAM: while one thread waits for data, another one executes. For math it usually does not increase performance. It is good for work with data that is not cache-friendly: lists, trees, and hash tables that don't fit into the cache.

What is the relationship between threads (in a Java or a C++ program) and number of cores in the CPU?

Can someone shed some light on it?
An i7 processor can run 8 threads, but I am pretty sure we can create more than 8 threads in a Java or C++ program (not sure though). I have an i5 processor, and while studying concurrency I have created 10 threads for assignments. I am just trying to understand how the core rating of a CPU is related to threads.

The thread you are referring to is called a software thread, and you can create as many software threads as you need, as long as your operating system allows it. Each software thread, or code snippet, can run concurrently with the others.
For each core, there is at least one hardware thread to which the operating system can assign a software thread. If you have 8 cores, for example, then you have a hardware thread pool of capacity 8. You can map tens or hundreds of software threads to this 8-slot pool, where only 8 threads are actually running on hardware at the same time, i.e. in parallel.
Software threads are like people sharing the same computer. Each one can use the computer for some time, not necessarily long enough to complete their task, and then gives it up to another.
Hardware threads are like people who each have their own computer. All of them can proceed with their tasks at the same time.
Note: For i7, there are two hardware threads (so-called hyper-threading) in each core. So you can have up to 16 threads running in parallel.
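The "8-slot pool" picture can be mimicked with a toy round-robin scheduler. This is a deliberately simplified simulation, not how any real OS scheduler works: simulate_round_robin is a made-up name, every thread needs a fixed amount of work, and there are no priorities or I/O waits.

```python
from collections import deque

def simulate_round_robin(work, hardware_threads, quantum=1):
    """Toy scheduler: `work` maps a software-thread name to the time units it needs.

    Each tick, up to `hardware_threads` ready threads run for `quantum` units;
    the rest wait in the queue. Returns (ticks_elapsed, max_running_at_once).
    """
    ready = deque(work.items())
    ticks = 0
    max_running = 0
    while ready:
        # Pick the threads that get a hardware thread during this tick.
        running = [ready.popleft() for _ in range(min(hardware_threads, len(ready)))]
        max_running = max(max_running, len(running))
        ticks += 1
        for name, remaining in running:
            remaining -= quantum
            if remaining > 0:
                ready.append((name, remaining))  # preempted: back of the queue

    return ticks, max_running

work = {f"t{i}": 2 for i in range(20)}  # 20 software threads, 2 units each
print(simulate_round_robin(work, hardware_threads=8))  # (5, 8)
print(simulate_round_robin(work, hardware_threads=1))  # (40, 1)
```

All 20 software threads finish in both runs; the number of hardware threads only changes how many of them make progress in the same tick.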

There are already a couple of good answers talking about the hardware side of things, but there isn't much talk about the software side of things.
The essential fact that I believe you're missing is that not all threads have to be executing all the time. When you have thousands of threads on an 8 core machine, only a few of them are actually running at any given time. The others are sitting around doing nothing until some processor time becomes free. This has huge advantages because threads might be waiting on other resources, too. For example, if I have one thread trying to read a file from disk, then there's no reason for it to be taking up CPU time while it's waiting for the hard disk data to load into RAM. Another example is when the thread is waiting for a response from some other machine (such as a web request over the internet). When you have more threads than your processor can handle at once, the operating system and/or runtime (It depends on the OS and runtime implementation.) are responsible for deciding which threads should get available processor time. This kind of set up lets you maximize your machine's productivity because CPU cycles are doing something useful almost all the time.
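A minimal Python sketch of that point: far more threads than cores, each spending its time waiting rather than computing. The time.sleep call is a stand-in for a disk read or web request (an assumption for illustration), and fake_io_request and run_requests are hypothetical names.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_request(i, delay=0.05):
    # time.sleep stands in for a disk read or network call: the thread is
    # blocked and needs no CPU time while it waits.
    time.sleep(delay)
    return i

def run_requests(n_requests, n_threads):
    """Run n_requests fake I/O calls on n_threads threads; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(fake_io_request, range(n_requests)))
    return results, time.perf_counter() - start

results, elapsed = run_requests(n_requests=40, n_threads=40)
print(f"logical CPUs: {os.cpu_count()}, threads: 40, elapsed: {elapsed:.2f}s")
```

Run one after another, the 40 waits would take about 2 seconds; with 40 threads they overlap, regardless of how many cores the machine has, because a waiting thread does not occupy a core.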

A "thread" is a software abstraction which defines a single, self-consistent path of execution through the program: in most modern systems, the number of threads is basically only limited by memory. However, only a comparatively small number of threads can be run simultaneously by the CPU. Broadly speaking, the "core count" is how many threads the CPU can run truly in parallel: if there are more threads that want to run than there are cores available, the operating system will use time-slicing of some sort to let all the threads get some time to execute.
There are a whole bunch of terms which are thrown around when it comes to "cores:"
Processor count: the number of physical CPU chips on a system's motherboard. This used to be the only number that mattered, until CPUs with multiple cores became available.
Logical core count: the number of threads that the system hardware can run in parallel
Physical core count: the number of copies of the CPU execution hardware that the system has -- this is not always equal to the logical core count, due to features like SMT ("simultaneous multithreading") which use a single piece of hardware to run multiple threads in parallel
Module count: Recent (Bulldozer-derived) AMD processors have used an architecture which is a hybrid between SMT and the standard one-logical-core-per-physical-core model. On these CPUs, there is a separate copy of the integer execution hardware for each logical core, but two logical cores share a floating-point unit and the fetch-and-decode frontend; AMD calls the unit containing two logical cores a module.
It should also be briefly mentioned that GPUs, graphics cards, have enormous core counts and run huge numbers (thousands) of threads in parallel. The trade-off is that GPU cores have very little memory and often a substantially restricted programming model.

Threads are handled by the OS's scheduler. The number of cores in a CPU determines how many threads it can run at the same time.
Note that threads are constantly switched in and out by the scheduler to give the "illusion" that everything is running at the same time.
More here, if you're interested.

No, no, no... Your i7 has eight execution threads and can run 8 threads at once.
1000 threads or more can be waiting for processor time.
Calling Thread.sleep moves a thread off the execution core and back into memory, where it waits until woken.

Multithreads on kernel

In Galvin, I came across
Finally, many operating system kernels are now multithreaded; several threads operate in the kernel, and each thread performs a specific task.
Question 1
It does not imply that all of them will run at the same time, since at a given time only 1 process/thread can acquire control over the processor right? Though they could be doing various work, like one on CPU, other working on I/O like getting key strokes in the buffer etc., right?
Question 2
Multithreading will show better performance on multiprocessor systems only right?

Answer 1: Every core of your CPU can execute one command at any given time. Since nearly all modern CPUs are multi-core, you'll get better performance if your app is multithreaded.
Answer 2: Multithreading will show better performance in most cases, even on systems with single-core CPUs. Your app will become more responsive to user input if you dispatch your time-intensive jobs to multiple threads.
The parallelization levels are as below:
Multi Computers
Multi Processors
Multi Cores
Multi Threads
At higher levels you see more benefit from threading. E.g. your multithreaded app will run better on multi-core CPUs compared with single-core (multi-threaded) CPUs.

Dual-Core Hyperthreading: Should I use 4 threads or 3 or 2?

If you're spawning multiple threads (or processes) concurrently, is it better to spawn as many as the number of physical processors or the number of logical processors, assuming the task is CPU-bound? Or is it better to do something in between (say, 3 threads)?
Does the performance depend on the kind of instructions that are getting executed (say, would non-local memory access be much different from cache hits)? If so, in which cases is it better to take advantage of hyperthreading?
Update:
The reason I'm asking is, I remember reading somewhere that if you have as many tasks as the number of virtual processors, tasks on the same physical core can sometimes starve some CPU resources and prevent each other from getting as many resources as needed, possibly decreasing performance. That's why I'm wondering if having as many threads as virtual cores is a good idea.

The performance depends on a huge variety of factors. Most tasks are not strictly CPU bound, since even if all of the data is in memory it is usually not on-board in the processor cache. I have seen examples (like this one) where memory access patterns can dramatically change the performance profile of a given 'parallel' process.
In short, there is no perfect number for all situations.

Chances are pretty good that you will see a performance improvement running 2 threads per core with HyperThreading enabled. Jobs that appear to be entirely CPU bound usually aren't, and HyperThreading can extract a few "extra" cycles out of the occasional interrupt or context switch.
On the other hand, with a Core iX processor that has Turbo Boost, you might actually do better running 1 thread per core to encourage the CPU to overclock itself.
At work, we routinely run many-core servers at full CPU doing various kinds of calculation for days at a time. A while back we measured the performance difference with and without HT. We found that on average, with HyperThreading and running twice as many jobs at once, we could complete the same number of jobs about 10% faster than without HyperThreading.
Assume that 2 × cores is a good place to start, but the bottom line is: measure!
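A small timing harness is one way to act on "measure!". The sketch below is an assumption-laden illustration: placeholder_task is a hypothetical stand-in for the real job, physical_cores = 2 is taken from the dual-core machine in the question, and the helper names are made up. Note that for CPU-bound pure-Python work, threads will not run in parallel because of the interpreter lock, so concurrent.futures.ProcessPoolExecutor would have to be passed as executor_cls instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def time_with_workers(task, n_tasks, n_workers, executor_cls=ThreadPoolExecutor):
    """Run `task` n_tasks times on n_workers workers; return wall-clock seconds."""
    start = time.perf_counter()
    with executor_cls(max_workers=n_workers) as pool:
        list(pool.map(task, range(n_tasks)))
    return time.perf_counter() - start

def sweep(task, n_tasks, candidates, executor_cls=ThreadPoolExecutor):
    """Time each candidate worker count and return {workers: seconds}."""
    return {n: time_with_workers(task, n_tasks, n, executor_cls) for n in candidates}

def placeholder_task(_):
    # Hypothetical stand-in for the real job; replace with the actual workload.
    time.sleep(0.02)

physical_cores = 2  # assumed: the dual-core machine from the question
candidates = [physical_cores, physical_cores + 1, 2 * physical_cores]  # 2, 3, 4
timings = sweep(placeholder_task, n_tasks=24, candidates=candidates)
best = min(timings, key=timings.get)
for n, seconds in timings.items():
    print(f"{n} workers: {seconds:.3f}s")
print("fastest in this run:", best)
```

With the sleep-based placeholder, more workers will always look better; the numbers only become meaningful once the placeholder is replaced by the real workload and the sweep is repeated a few times.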

I remember reading that hyperthreading can give you up to a 30% performance boost. In general, you'd better treat them as 4 different cores. Of course, in some specific circumstances (e.g. having the same long-running task bound to each core) you can divide your processing better by taking into account that some cores are just logical ones.
more info about hyperthreading itself here

Using Hyperthreading to run two threads on the same core, when both threads have similar memory access patterns but access disjoint data structures, would be very roughly equivalent to running them on two separate cores each with half the cache. If the memory-access patterns are such that half the cache would be sufficient to prevent thrashing, performance may be good. If the memory-access patterns are such that halving the cache induces thrashing, there may be a ten-fold performance hit (implying one would have been much better off without hyperthreading).
On the other hand, there are some situations where hyperthreading may be a huge win. If many threads will all be reading and writing the same shared data using lock-free data structures, and all threads must see a consistent view of the data, trying to run the threads on separate processors may cause thrashing, since only one processor at a time may have read-write access to any given cache line; running such threads on two cores may take longer than running only one at a time. Such cache arbitration is not required, however, when a piece of data is accessed by multiple threads on a single core. In those cases, hyperthreading can be a huge win.
Unfortunately, I don't know any way to give the scheduler any "hints" to suggest that some threads should share a core when possible, while others should run separately when possible.

HT allows a boost of approximately 10-30% for mostly CPU-bound tasks that use the extra virtual cores. Although these tasks may seem CPU-bound, unless they are custom-made assembly, they will usually suffer from IO waits between RAM and local cache. This allows one thread running on a physical HT-enabled core to work while the other thread is waiting for IO. This does come with a disadvantage though: two threads share the same cache/bus, which leaves fewer resources for each and may cause both threads to pause while waiting for IO.
In the last case, running a single thread will decrease the maximum simultaneous theoretical processing power (by 10-30%) in favor of running a single thread without the slowdown of cache thrashing, which may be very significant in some applications.
Choosing which cores to use is just as important as choosing how many threads to run. If each thread is CPU-bound for roughly the same duration it is best to set the affinity such that threads using mostly different resources find themselves on different physical cores and threads using common resources be grouped to the same physical cores(different virtual core) so that common resources can be used from the same cache without extra IO wait.
Since each program has different CPU-usage characteristics and cache thrashing may or may not be a major slowdown(it usually is) it is impossible to determine what the ideal number of threads should be without profiling first. One last thing to note is that the OS/Kernel will also require some CPU and cache space. It is usually ideal to keep a single (physical)core set aside for the OS if real-time latency is required on CPU-bound threads so as to avoid sharing cache/cpu resources. If threads are often waiting for IO and cache thrashing is not an issue, or if running a real-time OS specifically designed for the application, you can skip this last step.
http://en.wikipedia.org/wiki/Thrashing_(computer_science)
http://en.wikipedia.org/wiki/Processor_affinity
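On Linux, affinity can be set from Python with os.sched_setaffinity, and the kernel exposes which logical CPUs share a physical core under /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The sketch below picks one logical CPU per physical core and pins the current process to that set. It is Linux-only, the helper names are made up, and the worked example uses invented sibling lists for a 2-core, 4-thread machine.

```python
import os

def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,2' or '0-1,8-9' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def pick_one_per_core(allowed, siblings_of):
    """Keep the lowest-numbered allowed logical CPU of each physical core."""
    chosen = set()
    for cpu in allowed:
        group = [c for c in siblings_of[cpu] if c in allowed]
        chosen.add(min(group))
    return sorted(chosen)

def read_siblings(allowed):
    """Read the sibling list of every allowed logical CPU from sysfs (Linux only)."""
    siblings_of = {}
    for cpu in allowed:
        path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
        with open(path) as f:
            siblings_of[cpu] = parse_cpu_list(f.read())
    return siblings_of

# Worked example with invented sibling lists: CPUs 0 and 2 share one core,
# CPUs 1 and 3 share the other.
example = {0: [0, 2], 1: [1, 3], 2: [0, 2], 3: [1, 3]}
print(pick_one_per_core({0, 1, 2, 3}, example))  # [0, 1]

if hasattr(os, "sched_setaffinity"):
    original = os.sched_getaffinity(0)
    try:
        chosen = pick_one_per_core(original, read_siblings(original))
        os.sched_setaffinity(0, chosen)    # one hardware thread per physical core
        print("pinned to", sorted(os.sched_getaffinity(0)))
        os.sched_setaffinity(0, original)  # restore the original mask
    except OSError:
        pass  # topology files unavailable, e.g. in some containers or VMs
```

The opposite grouping (placing two cooperating threads on the siblings of one core) uses the same sibling lists; whether either layout helps still has to be confirmed by profiling.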

All of the other answers already give lots of excellent info. But one more point to consider is that the SIMD unit is shared between logical cores on the same physical core. So, if you are running threads with SSE code, do you run them on all 4 logical cores, or just spawn 2 threads (assuming you have two physical cores)? For this odd case, it is best to profile with your app.
