I am trying to understand more about parallelism, but I've noticed there are a lot of different terms out there and some seem to mean the same thing while others have a notable difference. So, what are all the different types of parallelism, how do they differ from each other, and do any have specific applications or purposes?
(To keep this more focused, I'm hoping for an answer that brings clarity to all the terminology associated with parallelism, including terms not listed below; technical comparisons between the different types would be nice, but will probably push this question off-topic - then again, I don't really know, hence the question.)
Note:
this is not a question about concurrency and goes beyond the "simple" question: "what is parallelism?", although a clarifying definition might be warranted.
First, I have taken notice of the difference between parallelism and threading, but some of the differences between the following terms are still confusing.
To add clarity to my question, here is a list of terms that I have found that are related to parallelism: parallel computing, parallel processing, multithreading, multiprocessing, multicore programming, Hyper-threading (Intel), Simultaneous MultiThreading (SMT), Switch-on-Event MultiThreading. (If possible, definitions or references to definitions for each of these terms would also be appreciated.)
My very specific question: what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism? (and any other x-level parallelism)?
In a multi-core processor, can parallelism occur within a single core? Is that what Hyper-threading is, and does that require a single core having, for example, two ALU's that can be used in parallel?
Last one: is there a difference between hardware vs software parallelism, aside from the obvious distinction that one happens in hardware while the other in software?
Related resources:
- Process vs Thread,
- Parallelism on a GPU,
- Hyper-threading,
- Concurrency vs Parallelism,
- Hyper-threading and gaming.
Q: What is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism?
While the subject matter is indeed immensely wide, I will try to present this view, even at the risk of many opponents objecting that it simplifies the subject matter ( but the StackOverflow format is no substitute for other sources of complete reference, is it ? ):
A: the main difference is WHAT / WHO / HOW is responsible for keeping things to execute in true-[PARALLEL]
Instruction Level Parallelism - ILP - is the simplest case: the CPU-architecture has designed and "hardwired" this particular form of hardware-based parallelism. Having processors with ILP4 ( 4 instructions executed at once ), or having processors with a per-instruction width of this form of parallel-instruction execution, be it ILP2 for some instructions but ILP1 for others - again the silicon architecture decides what can indeed happen in parallel at the instruction level. Some awkward surprises may arise from further details, as memory-controller channels may block ILP-mode in cases where REG/MEMORY uops have to wait for a free channel to access the instructed MEMORY.
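To make the ILP idea a bit more tangible, here is a tiny illustrative sketch (the function names and the four-way split are mine, not tied to any particular architecture): both loops do the same additions, but the second exposes independent operations that an ILP4-capable superscalar core may overlap.

    #include <cstddef>

    // One long dependency chain: every addition depends on the previous one,
    // so the core cannot overlap the adds, no matter how many ALUs it has.
    double sum_dependent(const double* x, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            s += x[i];                      // s depends on the previous value of s
        return s;
    }

    // Four independent accumulators: the four adds inside one iteration have no
    // data dependencies on each other, so the hardware may issue them in parallel.
    // (Assumes n is a multiple of 4; floating-point rounding may differ slightly.)
    double sum_independent(const double* x, std::size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (std::size_t i = 0; i + 3 < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }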
hardware-threads are the next level of granularity. Given a CPU-core is declared to support two hardware threads, these are the only streams-of-code execution that may flow in parallel ( if no O/S request comes to instantiate and schedule another thread to get executed, mapped onto one of the available CPU-core hardware-threads ). From the user-perspective, there are O/S tools that permit one to explicitly "nail"-down a process-level-PID / thread-level-PID affinity onto particular CPU-core(s) and thus limit or even eliminate any "disturbance", so as to move from a "just"-[CONCURRENT] flow of code-execution closer to a true-[PARALLEL] one.
We will knowingly skip all the crowds of threads that are just a tool for latency-masking ( be it on the SIMT / SMX warp-wide GPU-scheduler, or the more relaxed, MIMT O/S-kernel driven multithreading ).
software-operated distributed-systems parallelism is the one that ought to be mentioned for completeness, but it carries by far the highest adverse costs: the need to invent, define, implement and operate the setup / coordination in software ( which makes overheads grow remarkably ), in the sense of the re-formulated Amdahl's Law, right due to the need to somehow design and keep operational the non-native orchestration of both the distributed process execution and all the dataflow it depends on.
hardware-based true-[PARALLEL] systems are at the highest level of orchestration, where both the silicon ( like the InMOS' network of meshed Transputers ) and also the programming language ( like the InMOS' occam or occam-pi ) provide the carefully engineered, conceptually crafted true-[PARALLEL] code-execution.
- MIMT: Multiple Instruction Multiple Threads, a non-restricted thread-execution fabric / policy, where any thread may and does issue a different instruction to the processor for execution, as opposed to SIMT
- SIMT: Single Instruction Multiple Threads, typically a GPU Streaming Multiprocessor code-execution architecture
- SMX: Streaming Multiprocessor eXecution unit, typically a GPU SIMT building block, onto which the GPU-kernel code-units can be directed ( addressed ) for being TaskQueue-scheduled and later executed, as coordinated by the WARP-wide SIMT-code scheduler
what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism?
In 1, different CPU cores execute different streams of instructions.
In 2, a single CPU core executes different instructions from a single instruction stream in parallel (these instructions are either consecutive instructions in the stream, or otherwise very close to each other).
3 is the same as 1; the difference is cosmetic. It's just the default settings about which memory pages are shared across threads and which aren't. But these settings are user-adjustable with process creation flags, shared memory sections, dynamic libraries, and other system APIs, which is why, at the lower level, the difference between processes and threads is not a big deal.
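A minimal sketch of that default difference, assuming a POSIX system (Linux/glibc) and using a hypothetical global counter purely for illustration: a thread shares the parent's memory pages, while a forked child process gets its own copy-on-write copy, so its write is not visible back in the parent.

    #include <cstdio>
    #include <thread>
    #include <sys/wait.h>
    #include <unistd.h>

    int counter = 0;   // lives in the process's address space

    int main() {
        // A thread shares the pages of its process: the increment is visible here.
        std::thread t([] { counter++; });
        t.join();
        std::printf("after thread: counter = %d\n", counter);   // prints 1

        // A forked child gets its own (copy-on-write) copy of the pages:
        // its increment is NOT visible back in the parent.
        pid_t pid = fork();
        if (pid == 0) { counter++; _exit(0); }
        waitpid(pid, nullptr, 0);
        std::printf("after fork:   counter = %d\n", counter);   // still prints 1
    }

Built with something like g++ -pthread, both lines print 1, unless shared memory sections or similar APIs are used to change the defaults, as noted above.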
and any other x-level parallelism
Another important one is SIMD-level parallelism. For this one, the CPU applies the same instruction to multiple operands stored in special wide registers. With SSE we have 128-bit wide registers, and we can e.g. multiply a vector of 4 single-precision floating-point numbers in one register by another 4 values in another register, producing 4 products in parallel with a single mulps instruction. ARM NEON is similar, also with 128-bit registers; the instruction to multiply 4 floats by 4 floats is vmul.f32. AVX operates on 256-bit registers, so it can multiply 8 floats at once with a single vmulps instruction.
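For illustration, here is a minimal sketch of that 4-float multiply using the SSE intrinsics that compile down to mulps (x86 only; requires a compiler with SSE support):

    #include <xmmintrin.h>   // SSE intrinsics
    #include <cstdio>

    int main() {
        alignas(16) float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        alignas(16) float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
        alignas(16) float r[4];

        __m128 va = _mm_load_ps(a);       // load 4 floats into one 128-bit register
        __m128 vb = _mm_load_ps(b);
        __m128 vr = _mm_mul_ps(va, vb);   // 4 multiplications with a single mulps
        _mm_store_ps(r, vr);

        std::printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);   // 10 40 90 160
    }

The AVX (vmulps) and NEON (vmul.f32) versions look essentially the same, only with wider registers or different intrinsic names.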
can parallelism occur within a single core?
Yes.
Is that what Hyper-threading is
Yes, also it’s what instruction-level parallelism is, and SIMD parallelism, too.
does that require a single core having, for example, two ALU's that can be used in parallel?
Modern CPUs have more than two ALUs per core, but HT was introduced in the P4 and it's not a requirement. The benefit of HT is not just keeping multiple ALUs loaded; it's also using the core while a thread is waiting for data to arrive from caches or from system RAM, and using the core while it's stalled because of a data dependency between nearby instructions. HT allows a CPU core to compute something else on another hardware thread while it's waiting, thereby improving ALU utilization. Without HT, the core would likely just sit and wait for hundreds of cycles in the case of RAM latency, or for dozens of cycles in the case of data-dependency latency.
is there a difference between hardware vs software parallelism
When you have a single hardware thread and multiple OS threads that compute stuff, only 1 thread will be running at any given time. The rest of the threads will be waiting. The OS will periodically (often at ~50-100Hz) switch which one's running, with the goal of giving all threads a fair slice of CPU time. You can call that software parallelism if you want, but I wouldn't call such a thing parallel at all.
Can anyone tell me what architectural pattern or approach is best for using multithreading in a high-load system?
I have read about the multiplexing approach. Is there something else?
Thanks.
I'm not exactly sure what you mean by High Load System, but I'll assume you mean a commercial server environment. The trend for high-end server chips these days is many replicated cores, each of which allows some degree of multi-threading. It's hard to say which multi-threading technique is best, since each offers advantages that may be more appropriate given a certain application workload.
Take the Sun UltraSPARC T1 for example. It has 8 cores, each of which can support up to 4 threads on a single shared pipeline. A core is able to switch between threads with no delay. This approach is called fine-grained temporal multi-threading: fine-grained because threads can switch every cycle, temporal because threads are interleaved across cycles.
Another approach, called Simultaneous Multithreading (SMT) allows instructions from multiple threads to be in the same pipeline stage at the same time. This technique requires that the processor be superscalar, that is, be able to issue multiple instructions to the pipeline in a single cycle. You will tend to not see as much of this in the server market because superscalar processors tend to be bigger and more power-hungry, not in line with the economies of scale that server farms and data centers require.
I don’t want to make this subjective...
If I/O and other input/output-related bottlenecks are not a concern, then do we need to write multithreaded code? Theoretically, single-threaded code will fare better since it will get all the CPU cycles. Right?
Would JavaScript or ActionScript have fared any better, had they been multithreaded?
I am just trying to understand the real need for multithreading.
I don't know if you have paid any attention to trends in hardware lately (last 5 years), but we are heading into a multicore world.
A general wake-up call was this "The free lunch is over" article.
On a dual core PC, a single-threaded app will only get half the CPU cycles. And CPUs are not getting faster anymore; that part of Moore's law has died.
In the words of Herb Sutter, the free lunch is over, i.e. the future performance path for computing will be in terms of more cores, not higher clock speeds. The thing is that adding more cores typically does not scale the performance of software that is not multithreaded, and even then it depends entirely on the correct use of multithreaded programming techniques, hence multithreading is a big deal.
Another obvious reason is maintaining a responsive GUI, when e.g. a click of a button initiates substantial computations, or I/O operations that may take a while, as you point out yourself.
The primary reason I use multithreading these days is to keep the UI responsive while the program does something time-consuming. Sure, it's not high-tech, but it keeps the users happy :-)
Most CPUs these days are multi-core. Put simply, that means they have several processors on the same chip.
If you only have a single thread, you can only use one of the cores - the other cores will either idle or be used for other tasks that are running. If you have multiple threads, each can run on its own core. You can divide your problem into X parts, and, assuming each part can run independently, you can finish the calculations in close to 1/Xth of the time it would normally take.
By definition, the fastest algorithm running in parallel will spend at least as much CPU time as the fastest sequential algorithm - that is, parallelizing does not decrease the amount of work required - but the work is distributed across several independent units, leading to a decrease in the real-time spent solving the problem. That means the user doesn't have to wait as long for the answer, and they can move on quicker.
10 years ago, when multi-core was unheard of, then it's true: you'd gain nothing if we disregard I/O delays, because there was only one unit to do the execution. However, the race to increase clock speeds has stopped; and we're instead looking at multi-core to increase the amount of computing power available. With companies like Intel looking at 80-core CPUs, it becomes more and more important that you look at parallelization to reduce the time solving a problem - if you only have a single thread, you can only use that one core, and the other 79 cores will be doing something else instead of helping you finish sooner.
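To make the "divide the problem into X parts" idea from this answer concrete, here is a minimal sketch using standard C++ threads (the data size and the partial-sum structure are just illustrative):

    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<double> data(10'000'000, 1.0);
        unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());

        std::vector<double> partial(nthreads, 0.0);
        std::vector<std::thread> workers;
        std::size_t chunk = data.size() / nthreads;

        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = (t + 1 == nthreads) ? data.size() : begin + chunk;
            // Each worker sums its own independent slice, so the threads can
            // run on different cores at the same time.
            workers.emplace_back([&, t, begin, end] {
                partial[t] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0.0);
            });
        }
        for (auto& w : workers) w.join();

        double total = std::accumulate(partial.begin(), partial.end(), 0.0);
        std::printf("sum = %g using %u threads\n", total, nthreads);
    }

On a machine with X idle cores this finishes in roughly 1/Xth of the single-threaded time, minus the overhead of starting and joining the threads.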
Much of the multithreading is done just to make the programming model easier when doing blocking operations while maintaining concurrency in the program - sometimes languages/libraries/APIs give you little other choice, or the alternatives make the programming model too hard and error-prone.
Other than that, the main benefit of multithreading is to take advantage of multiple CPUs/cores - one thread can only run on one processor/core at a time.
No. You can't continue to gain the new CPU cycles, because they exist on a different core, and the core that your single-threaded app runs on is not going to get any faster. A multi-threaded app, on the other hand, will benefit from another core. Well-written parallel code can go up to about 95% faster on a dual core, which is what all the new CPUs of the last five years are. Double that again for a quad core. So while your single-threaded app isn't getting any more cycles than it did five years ago, my quad-threaded app has four times as many and is vastly outstripping yours in terms of response time and performance.
Your question would be valid had we only had single cores. The thing is, though, we mostly have multicore CPUs these days. If you have a quadcore and write a single-threaded program, you will have three cores which are not used by your program.
So actually you will have at most 25% of the CPU cycles and not 100%. Since the trend today is to add more cores rather than more clock speed, threading will be more and more crucial for performance.
That's kind of like asking whether a screwdriver is necessary if I only need to drive this nail. Multithreading is another tool in your toolbox to be used in situations that can benefit from it. It isn't necessarily appropriate in every programming situation.
Here are some answers:
You write "If input/output related problems are not bottlenecks...". That's a big "if". Many programs do have issues like that, remembering that networking issues are included in "IO", and in those cases multithreading is clearly worthwhile. If you are writing one of those rare apps that does no IO and no communication then multithreading might not be an issue
"The single threaded code will get all the CPU cycles". Not necessarily. A multi-threaded code might well get more cycles than a single threaded app. These days an app is hardly ever the only app running on a system.
Multithreading allows you to take advantage of multicore systems, which are becoming almost universal these days.
Multithreading allows you to keep a GUI responsive while some action is taking place. Even if you don't want two user-initiated actions to be taking place simultaneously you might want the GUI to be able to repaint and respond to other events while a calculation is taking place.
So in short, yes there are applications that don't need multithreading, but they are fairly rare and becoming rarer.
First, modern processors have multiple cores, so a single thread will never get all the CPU cycles.
On a dualcore system, a single thread will utilize only half the CPU. On an 8-core CPU, it'll use only 1/8th.
So from a plain performance point of view, you need multiple threads to utilize the CPU.
Beyond that, some tasks are also easier to express using multithreading.
Some tasks are conceptually independent, and so it is more natural to code them as separate threads running in parallel, than to write a singlethreaded application which interleaves the two tasks and switches between them as necessary.
For example, you typically want the GUI of your application to stay responsive, even if pressing a button starts some CPU-heavy work process that might go for several minutes. In that time, you still want the GUI to work. The natural way to express this is to put the two tasks in separate threads.
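A minimal sketch of that pattern, with a console loop standing in for the GUI event loop and std::async standing in for "start the heavy work on another thread" (the function name and the sleep are just placeholders for real work):

    #include <chrono>
    #include <cstdio>
    #include <future>
    #include <thread>

    // Stand-in for the long-running work a button click might start.
    long long heavy_computation() {
        std::this_thread::sleep_for(std::chrono::seconds(3));
        return 42;
    }

    int main() {
        // Launch the work on another thread; this thread is free immediately.
        auto result = std::async(std::launch::async, heavy_computation);

        // Fake event loop: keep "handling events" while polling for completion.
        while (result.wait_for(std::chrono::milliseconds(200))
               != std::future_status::ready) {
            std::printf("GUI still responsive...\n");
        }
        std::printf("result = %lld\n", result.get());
    }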
Most of the answers here make the conclusion multicore => multithreading look inevitable. However, there is another way of utilizing multiple processors - multi-processing. On Linux especially, where, AFAIK, threads are implemented as just processes (perhaps with some restrictions) and processes are cheap compared to Windows, there are good reasons to avoid multithreading. So, there are software architecture issues here that should not be neglected.
Of course, if the concurrent lines of execution (either threads or processes) need to operate on common data, threads have an advantage. But this is also the main reason for headaches with threads. Can such a program be designed so that the pieces are as autonomous and independent as possible, allowing us to use processes? Again, a software architecture issue.
I'd speculate that multi-threading today is what memory management was in the days of C:
- it's quite hard to do it right, and quite easy to mess up;
- thread-safety bugs, same as memory leaks, are nasty and hard to find.
Finally, you may find this article interesting (follow the first link on the page). I admit that I've only read the abstract, though.
I recently learned that sometimes people will lock specific processes or threads to specific processors or cores, and it's thought that this manual tuning will best distribute the load. This is a bit counter-intuitive to me - I would think the OS scheduler would be able to make a better decision than a human about how to spread the load. I could see it being true for older operating systems that perhaps weren't aware of issues like there being more latency between specific pairs of cores, or shared cache between one pair of cores but not another pair. But I assume 'modern' OSs like Linux, Solaris 10, OS X, and Vista should have schedulers that know this information. Am I mistaken about their capabilities? Am I mistaken that it's a problem the OS can actually solve? I'm particularly interested in the answer for Solaris and Linux.
The consequence is whether or not I need to inform users of my (multithreaded) software of how they might consider balancing on their box.
First of all, 'lock' is not the correct term to describe it. 'Affinity' is a more suitable term.
In most cases, you don't need to care about it. However, in some cases, manually setting CPU/process/thread affinity can be beneficial.
Operating systems are usually oblivious to the details of modern multicore architecture. For example, say we have two quadcore processors in two sockets, and the processors support SMT (= HyperThreading). In this case, we have 2 processors, 8 cores, and 16 hardware threads, so the OS will see 16 logical processors. If an OS does not recognize such a hierarchy, it is highly likely to lose some performance gains. The reasons are:
Caches: in our example, the two different processors (installed in two different sockets) do not share any on-chip caches. Say an application has 4 busy-running threads and a lot of data is shared by the threads. If the OS schedules the threads across the processors, then we may lose some cache locality, resulting in a performance loss. However, if the threads do not share much data (having distinct working sets), then separating them onto different physical processors would be better, increasing effective cache capacity. Also, trickier scenarios can happen, which are very hard for the OS to be aware of.
Resource conflict: let's consider the SMT (= HyperThreading) case. SMT shares a lot of important CPU resources such as caches, the TLB, and execution units. Say there are only two busy threads. However, an OS may naively schedule these two threads on two logical processors of the same physical core. In that case, significant resources are contended by the two logical threads.
One good example is Windows 7. Windows 7 now supports a smart scheduling policy that considers SMT (related article). Windows 7 actually prevents the above case 2. Here is a snapshot of the task manager in Windows 7 with 20% load on a Core i7 (quadcore with HyperThreading = 8 logical processors):
(Task Manager screenshot; source: egloos.com)
The CPU usage history is very interesting, isn't it? :) You may see that only a single CPU in each pair is utilized, meaning Windows 7 avoids scheduling two threads on the same core simultaneously as far as possible. This policy will definitely reduce the negative effects of SMT such as resource conflict.
I'd like to say that OSes are not very smart about modern multicore architectures, with their many caches, shared last-level cache, SMT, and even NUMA. So, there can be good reasons why you may need to manually set CPU/process/thread affinity.
However, I won't say this is really needed. Only try it when you fully understand your workload patterns and your system architecture, and then measure the results to see whether your attempt is actually effective.
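If you do decide to experiment, here is a minimal Linux-only sketch of pinning the calling thread with sched_setaffinity (glibc assumed; the choice of logical CPU 2 is arbitrary and will fail on a machine with fewer CPUs):

    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);   // allow this thread to run only on logical CPU 2

        // A pid of 0 means "the calling thread".
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }
        std::printf("pinned to logical CPU 2\n");
    }

The same effect can be had from the shell with taskset -c 2 ./a.out; measurement should decide whether the pinning actually helps.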
For general-purpose applications, there is no reason to set the CPU affinity; you should just allow the OS scheduler to choose which CPU should run the process or thread. However, there are instances where it is necessary to set the CPU affinity, for example in real-time systems, where the cost of migrating a thread from one core to another (which can happen at any time if the CPU affinity has not been set) can introduce unpredictable delays that cause tasks to miss their deadlines and preclude real-time guarantees.
You can take a look at this article about a multi-core aware implementation of real-time CORBA that, among other things, had to set the CPU affinity so that CPU migration could not result in missed deadlines.
The paper is: Real-Time Performance and Middleware for Multiprocessor and Multicore Linux Platforms
For applications designed with parallelism and multiple cores in mind, OS-default thread affinity is sometimes not enough. There are many approaches to parallelism, but so far all require involvement of the programmer and knowledge - at some level at least - of the architecture on which the solution will be mapped. This includes the machines, CPU's and threads that are involved.
This is an actively researched subject, and there is an excellent course on MIT's OpenCourseWare that delves into these issues: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-189January--IAP--2007/CourseHome/
Well, something many people haven't considered here is the idea of forbidding two processes from running on the same processor (socket). It might be worthwhile to help the system by binding different heavily used processes to different processors. This can avoid contention if the scheduler is not clever enough to figure it out itself.
But this is more a system admin task than one for the programmers. I have seen optimizations like this for a few high-performance database servers.
Most modern operating systems will do an effective job of allocating work between cores. They also attempt to keep threads running on the same core, to get the cache benefits you mentioned.
In general, you should never be setting your thread affinity unless you have a very good reason to. You don't have as good an insight as the OS into the other work that threads on the system are doing. Kernels are constantly being updated based on new processor technology (single CPU per socket, to hyper-threading, to multiple cores per socket). Any attempt by you to set hard affinity may backfire on future platforms.
This article from MSDN Magazine, Using concurrency for scalability, gives a good overview of multithreading on Win32. Regarding CPU affinity,
Windows automatically employs so-called ideal processor affinity in an attempt to maximize cache efficiency. For example, a thread running on CPU 1 that gets context switched out will prefer to run again on CPU 1 in the hope that some of its data will still reside in cache. But if CPU 1 is busy and CPU 2 is not, the thread could be scheduled on CPU 2 instead, with all the negative cache effects that implies.
The article also warns that CPU affinity shouldn't be manipulated without a deep understanding of the problem. Based on this information, my answer to your question would be No, except for very specific, well-understood scenarios.
I am not even sure you can pin processes to a specific CPU on linux. So, my answer is "NO" - let the OS handle it, it's smarter than you most of the time.
Edit:
It seems that on win32 you have some control over which CPUs this process is going to run on. Now I am just waiting for someone to prove me wrong also on linux/posix ...
I am looking to get into operating system kernel development and figured my contribution would be to extend the SANOS operating system in order to support multiple core machines. I have been reading books on operating systems (Tannenbaum) as well as studying how BSD and Linux have tackled this challenge but still am stuck on several concepts.
Does SANOS need to have more sophisticated scheduling algorithms when it runs on multiple CPUs or will what is currently in place work fine?
I know that it is a good idea for threads to have affinity to a core that they were started on, but is this handled via scheduling or by changing the implementation of how threads are created?
What would need to be considered such that SANOS could run on a machine with hundreds of cores? From what I can tell, BSD and Linux at best only support a maximum of a dozen or so cores.
Your reading material is good, so no problems there. Also take a peek at the downloadable CS lectures on operating system design from Stanford.
The scheduling algorithm may need to be more sophisticated. This depends on the types of applications running and how greedy they are. Do they yield themselves, or are they forced to? That kind of thing. This is more a question of what your processes want, or expect. An RTOS will have more complex scheduling than a desktop.
Threads should have an affinity to one core, because two threads in one process can execute in parallel ... but not at the same real time on the same core. Putting them on different cores allows them to really run in parallel. Also, caching can be optimized for core affinity. This is really a mix of your thread implementation and your scheduler. The scheduler may want to ensure threads are started at the same time on different cores, rather than ad hoc, to reduce the amount of time threads wait on each other and such. If your thread library is user-space, maybe it assigns the core, or maybe it lets the scheduler decide based on capacity or recent deaths.
Scalability is often a kernel limit (which can be arbitrary). In Linux, if I recall, the limits are due to static sizing of the arrays that hold CPU information structs in the scheduler, hence they are a fixed size. This can be changed by recompiling the kernel. Most good scheduling algorithms will support a very large number of cores. As your core or processor count gets higher, you need to be careful that you don't fragment a process's execution too much. If a program has 2 threads, try to schedule them in close time-proximity, because causation may exist (through shared data) between them.
You also need to decide how your threads are implemented, and how a process is represented (be it heavy or lightweight) in the kernel. Are threads kernel managed? user-space managed? These things all have an impact on scheduler design. Look at how POSIX threads are implemented in various operating systems. There is just so much for you to think about :)
In short, there are not really any clear-cut answers as to where the logic does, or should, reside. It all comes down to design, application expectations, time constraints (on the programs), and so on.
Hope this helps, I am not an expert here however.