Hardware Multithreading and Simultaneous Multithreading (SMT)

I'm reading Multithreading (computer architecture) - Wiki, aka hardware threading, and I'm trying to understand the second paragraph:
(p2): Where multiprocessing systems include multiple complete processing units in one or more cores, multithreading aims to increase utilization of a single core by using thread-level parallelism, as well as instruction-level parallelism.
while the link to thread-level parallelism says:
(Link): Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. This type of parallelism is found largely in applications written for commercial servers such as ...
which is not so useful... So I read task parallelism above, since I guess TLP is a subtype of it:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—across different processors.
Question: If thread-level parallelism is task parallelism, and task parallelism is for parallelization across multiple processors, how does "increase utilization of a single core by using thread-level parallelism" work?
Guessing: I guess for TLP, it should mean across multiple logical processors, i.e., hardware threads from the perspective of the OS, correct?
Another minor issue is that for my first link, Multithreading:
In computer architecture, multithreading is the ability of a central processing unit (CPU) (or a single core in a multi-core processor) to execute multiple processes or threads concurrently, supported by the operating system.
And yet (p2) says it aims to increase utilization of a single core by using thread-level parallelism? What a contradiction.

I don't think we should base this on the wiki definitions; the wording there is not accurate enough to merit searching for contradictions.
First, I would describe task parallelism as a form of parallelism inherent to some algorithm or problem, where there could be a functional decomposition into multiple tasks with different nature, that can run concurrently. Alternative forms of parallelism include for example spatial or data decomposition, where the problem can be broken into different parts of the data or the input layout (e.g., array ranges, matrix tiles, image parts...).
Thread-level parallelism belongs to a different taxonomy: it is any form of parallelism that can be extracted and utilized by a multi-threaded system. It requires the decomposition to be coarse-grained enough to allow the different threads to run independently (otherwise the synchronization overhead required would make it useless).
The alternative to that is, for example, ILP (instruction-level parallelism), which is when a single thread context can extract parallelism within the code by running over a deep out-of-order machine that can schedule based on readiness. This usually allows more fine-grained parallelism with less programmer involvement, but limits the parallelism to the depth of the OOO window.
On a related topic - be careful not to confuse simultaneous execution and concurrent one.
Thread-level parallelism can be used by extracting task-level parallelism or other forms of algorithmic decomposition from the code. It can then be run on a system that is single-core (preemptive) or multi-threaded. The latter can be achieved through multi-core systems, simultaneous multi-threading, or both (common processors usually have many cores, and many of them support SMT on top of that).
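To make that concrete, here is a minimal sketch in C++ (the task bodies are placeholders of my own): two tasks from a functional decomposition run as two threads, and the very same code is merely concurrent on a single core but can be truly parallel on a multi-core or SMT machine.

```cpp
#include <iostream>
#include <thread>

// Two tasks of a different nature (a functional decomposition);
// they share no mutable state, so no synchronization is needed.
void parse_input()   { std::cout << "parsing input\n"; }
void render_output() { std::cout << "rendering output\n"; }

int main() {
    std::thread t1(parse_input);    // thread-level parallelism:
    std::thread t2(render_output);  // two independent execution flows
    t1.join();
    t2.join();
    // On a single core the OS preemptively time-slices t1 and t2;
    // on a multi-core or SMT machine they can run simultaneously.
}
```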

I think my intuition should be correct; it is either that:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments.
should be updated to:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple logical processors in parallel computing environments.
to include the possibility of hyper-threading.
Or: no update needed, but task parallelism focuses on cross-processor parallelism, while TLP should mean
Thread-level parallelism is a form of parallelization of computer code across multiple logical processors in parallel computing environments.
again, to include the possibility of hyper-threading.
Useful resources:
https://en.wikipedia.org/wiki/Hyper-threading
https://www.youtube.com/watch?v=wnS50lJicXc
https://en.wikipedia.org/wiki/Simultaneous_multithreading
especially this line:
The name multithreading is ambiguous, because not only can multiple threads be executed simultaneously on one CPU core, but also multiple tasks (with different page tables, different task state segments, different protection rings, different I/O permissions, etc.).
So for the minor issue, see Concurrent computing - #Introduction - p1:
The concept of concurrent computing is frequently confused with the related but distinct concept of parallel computing,[2][3] although both can be described as "multiple processes executing during the same period of time". In parallel computing, execution occurs at the same physical instant: for example, on separate processors of a multi-processor machine, with the goal of speeding up computations—parallel computing is impossible on a (one-core) single processor, as only one computation can occur at any instant (during any single clock cycle).

Related

About multithreading, concurrency, and parallelism

Recently I got confused trying to understand the concepts: multithreading, concurrency, and parallelism. To reduce the confusion, I've tried to organize my understanding of these and draw my conclusions. My question is,
Is there a misunderstanding or something wrong in the conclusions below?
References I took can be found here.
1. Concurrency and parallelism are different levels of categorization.
It is not a choice between concurrent and parallel. It is either concurrent or not, and either parallel or not.
For example,
Not concurrent (Sequential) / Not parallel
Not concurrent (Sequential) / Parallel
Concurrent / Not parallel
Concurrent / Parallel
2. Parallelism is not a subset of concurrency.
3. How do threading or multithreading relate to concurrency and parallelism?
The definition of a thread clarifies this. A thread is a "unit of execution flow". This "execution flow" can be managed independently by a scheduler, which is typically a part of the operating system.
Having a thread means having one unit of execution flow.
Having multiple threads (Multithreading) means having multiple units of execution flow.
And,
Having multiple units of execution flow means having multiple things making progress, which is the definition of concurrency.
And,
Multiple units of execution flow are run by time slicing in a single-core hardware environment.
Multiple units of execution flow are run in parallel in a multi-core hardware environment.
4. Is multithreading concurrent or parallel?
Multithreading, or having multiple units of execution flow, is concurrent.
Multithreading itself is just having multiple units of execution flow. This has nothing to do with parallelism.
How the operating system deals with multiple units of execution flow relates to parallelism.
Parallelism is achieved by operating system and hardware environment.
"Code can be concurrent, but not parallel." (Parallelism implies concurrency but not the other way round right?, stackexchange)
Detailed description will be truly appreciated.
Parallelism refers to any system in which a single application can make use of more computing hardware than a single CPU can provide. There are a number of different types of parallel computing architecture, but when people say "parallelism" they often are talking about one in particular...
...A Symmetric MultiProcessing (SMP) system is a computer with one memory system, and two or more traditional CPUs that have equal access to it. Most modern workstations, most mobile devices, and many server systems* are SMP.
Multithreading is a model of concurrent computing.** A computer scientist might tell you that two threads run concurrently when the order in which the operations they perform are interleaved is not strictly determined by the program itself. A software developer is more likely to say that two threads run concurrently with each other when both threads have been started and neither of them has finished.
One way to achieve parallelism in an application running on an SMP system is to use multiple concurrent threads (sketched below, after the footnotes).
* Some servers are NUMA, which is a close cousin to SMP. In a NUMA system, the CPUs all access the same memory system, just like in SMP, except that each CPU "owns" part of the physical memory space, and it can access its own memory locations more quickly than it can access memory locations that are owned by other CPUs.
** There are other models of concurrent computing. Some, such as Actors, are used in production software. Others are mostly of academic interest.
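To make that concrete, here is a minimal sketch (C++ with std::thread; illustrative, not tuned): two threads, each summing half of an array in the one memory system that all CPUs of an SMP machine share.

```cpp
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1'000'000, 1);  // one memory system, shared by all CPUs
    long lo_sum = 0, hi_sum = 0;
    auto mid = data.begin() + data.size() / 2;

    // Two concurrent threads; on an SMP machine the OS can place them
    // on different CPUs, turning the concurrency into actual parallelism.
    std::thread lo([&] { lo_sum = std::accumulate(data.begin(), mid, 0L); });
    std::thread hi([&] { hi_sum = std::accumulate(mid, data.end(), 0L); });
    lo.join();
    hi.join();

    std::cout << lo_sum + hi_sum << '\n';  // prints 1000000
}
```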

What are all the different types of parallelism?

I am trying to understand more about parallelism, but I've noticed there are a lot of different terms out there and some seem to mean the same thing while others have a notable difference. So, what are all the different types of parallelism, how do they differ from each other, and do any have specific applications or purposes?
(To keep this more focused, I'm hoping for an answer that provides clarity to all the terminology associated with parallelism, including terms not listed below; technical comparisons between each different type would be nice, but will probably result in this question becoming off-topic - then again, I don't really know, hence the question).
Note:
this is not a question about concurrency and goes beyond the "simple" question: "what is parallelism?", although a clarifying definition might be warranted.
First, I have taken notice of the difference between parallelism and threading, but some of the differences between the following terms are still confusing.
To add clarity to my question, here is a list of terms that I have found that are related to parallelism: parallel computing, parallel processing, multithreading, multiprocessing, multicore programming, Hyper-threading (Intel), Simultaneous MultiThreading (SMT), Switch-on-Event MultiThreading. (If possible, definitions or references to definitions for each of these terms would also be appreciated).
My very specific question: what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism? (and any other x-level parallelism)?
In a multi-core processor, can parallelism occur within a single core? Is that what Hyper-threading is, and does that require a single core having, for example, two ALU's that can be used in parallel?
Last one: is there a difference between hardware vs software parallelism, aside from the obvious distinction that one happens in hardware while the other in software?
Related resources:
- Process vs Thread,
- Parallelism on a GPU,
- Hyper-threading,
- Concurrency vs Parallelism,
- Hyper-threading and gaming.
Q: What is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism?
While the subject matter is indeed immensely wide, I would try to offer this view, even at the risk of making many opponents present their objections to simplifying the subject matter (but the StackOverflow format does not substitute for other sources of complete reference, does it?):
A: the main difference is WHAT / WHO / HOW is responsible for keeping things executing in true-[PARALLEL]
Instruction Level Parallelism - ILP - is the simplest case: the CPU architecture has designed and "hardwired" this particular form of hardware-based parallelism. Having processors with ILP4 (4 instructions executed at once), or processors with a per-instruction width of this form of parallel instruction execution (be it ILP2 for some instructions but ILP1 for others), again the silicon architecture decides what can indeed happen in parallel at the instruction level. Some awkward surprises may arise from further details, as memory-controller channels may block ILP mode in cases where REG/MEMORY uops have to wait for a free channel to access the instructed MEMORY.
hardware-threads are the next level of granularity. Given a CPU core is declared to support two hardware threads, these are the only streams of code execution that may flow in parallel (unless an O/S request comes to instantiate and schedule another thread to get executed, mapped onto one of the available CPU-core hardware threads). From the user perspective, there are O/S tools that permit one to explicitly "nail down" a process-level / thread-level affinity onto particular CPU core(s), and thus limit or even eliminate any "disturbance", so as to move from a "just"-[CONCURRENT] flow of code execution closer to a true-[PARALLEL] one (a small affinity sketch appears after the glossary below).
We will knowingly skip all the crowds of threads that are just a tool for latency-masking (be it on the SIMT / SMX warp-wide GPU-scheduler, or the more relaxed MIMT O/S-kernel driven multithreading).
software-operated distributed-systems parallelism ought to be mentioned for completeness, but it has principally the highest adverse costs: the need to invent, define, implement and operate the setup / coordination in software (which all causes overheads to grow remarkably), in the sense of the re-formulated Amdahl's Law, due to the need to design and keep operational the non-native orchestration of both the distributed process execution and all the dataflow it depends on.
hardware-based true-[PARALLEL] systems are at the highest level of orchestration, where both the silicon ( like the InMOS' network of meshed Transputers ) and also the programming language ( like the InMOS' occam or occam-pi ) provide the carefully engineered, conceptually crafted true-[PARALLEL] code-execution.
- MIMT: Multiple Instruction Multiple Threads, a non-restricted thread-execution fabric / policy, where any thread may and does issue a different instruction to the processor for execution, as opposed to SIMT
- SIMT: Single Instruction Multiple Threads, typically a GPU Streaming Multiprocessor code-execution architecture
- SMX: Streaming Multiprocessor eXecution unit, typically a GPU SIMT building block, onto which the GPU-kernel code-units can be directed (addressed) to be TaskQueue-scheduled and later executed, as coordinated by the WARP-wide SIMT-code scheduler
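On the affinity remark above, a hedged, Linux-specific sketch of nailing a thread down onto one CPU core (pthread_setaffinity_np is a GNU extension; the hot-loop body is a placeholder):

```cpp
#include <pthread.h>  // pthread_setaffinity_np (GNU extension)
#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET
#include <iostream>
#include <thread>

int main() {
    std::thread worker([] {
        // ... hot loop that benefits from never migrating between cores ...
    });

    // Pin the worker onto CPU core 0: the O/S scheduler will no longer
    // move it, reducing "disturbance" on the way to true-[PARALLEL].
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    int rc = pthread_setaffinity_np(worker.native_handle(),
                                    sizeof(cpu_set_t), &set);
    if (rc != 0)
        std::cerr << "pthread_setaffinity_np failed: " << rc << '\n';

    worker.join();
}
```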
what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism?
In 1, different CPU cores execute different streams of instructions.
In 2, a single CPU core executes different instructions from a single instruction stream in parallel (these instructions are either consecutive instructions in the stream, or otherwise very close to each other).
3 is the same as 1; the difference is cosmetic. It's just the default settings about which memory pages are shared across threads and which aren't. But these settings are user-adjustable with process-creation flags, shared memory sections, dynamic libraries, and other system APIs; that's why, at the lower level, the difference between processes and threads is not a big deal.
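The instruction-level case (2 above) can be made visible from ordinary code. In the hedged sketch below (illustrative only; actual timings depend on compiler flags and silicon), both loops perform the same number of multiplications, but the first forms one long dependency chain while the second exposes four independent chains that an out-of-order core can overlap:

```cpp
#include <chrono>
#include <cstdio>

int main() {
    const long N = 200'000'000;
    volatile double seed = 1.000000001;  // volatile blocks constant folding

    auto t0 = std::chrono::steady_clock::now();
    double a = seed;
    for (long i = 0; i < N; ++i)
        a *= 1.000000001;                // one serial dependency chain: no ILP

    auto t1 = std::chrono::steady_clock::now();
    double b = seed, c = seed, d = seed, e = seed;
    for (long i = 0; i < N; i += 4) {    // four independent chains the core
        b *= 1.000000001;                // can execute in parallel
        c *= 1.000000001;
        d *= 1.000000001;
        e *= 1.000000001;
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("serial chain: %.0f ms, independent chains: %.0f ms (%g)\n",
                ms(t1 - t0).count(), ms(t2 - t1).count(), a + b + c + d + e);
}
```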
and any other x-level parallelism
Another important one is SIMD-level parallelism. For this one, the CPU applies the same instruction to multiple operands stored in special wide registers. With SSE we have 128-bit wide registers, and we can e.g. multiply a vector of 4 single-precision floating-point numbers in one register by another 4 values in another register, making 4 products in parallel, with a single mulps instruction. ARM NEON is similar, also 128-bit registers; the instruction to multiply 4 floats by 4 floats is vmul.f32. AVX operates on 256-bit registers, so it can multiply 8 floats at once, with a single vmulps instruction.
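For instance, the SSE multiply described above looks like this with C++ intrinsics (a minimal sketch for x86; _mm_mul_ps compiles to the mulps instruction):

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main() {
    // Two vectors of 4 single-precision floats in 128-bit registers.
    // Note: _mm_set_ps lists the lanes from highest to lowest.
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  // {1, 2, 3, 4}
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);  // {5, 6, 7, 8}

    __m128 prod = _mm_mul_ps(a, b);  // 4 products in one mulps instruction

    float out[4];
    _mm_storeu_ps(out, prod);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 5 12 21 32
}
```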
can parallelism occur within a single core?
Yes.
Is that what Hyper-threading is
Yes, also it’s what instruction-level parallelism is, and SIMD parallelism, too.
does that require a single core having, for example, two ALU's that can be used in parallel?
Modern CPUs have more than two ALUs per core, but HT was introduced in the P4 and it's not a requirement. The profit from HT is not just keeping multiple ALUs loaded; it's also using the core while a thread is waiting for data to arrive from caches or from system RAM, and using the core while it's stalled because of a data dependency between nearby instructions. HT allows a CPU core to compute something else on another hardware thread while it's waiting, thereby improving ALU utilization. Without HT, the core would likely just sit and wait for hundreds of cycles in the case of RAM latency, or for dozens of cycles in the case of data-dependency latency.
is there a difference between hardware vs software parallelism
When you have a single hardware thread and multiple OS threads that compute stuff, only 1 thread will be running at any given time. The rest of the threads will be waiting. The OS will periodically (often at ~50-100Hz) switch which one is running, with the goal of giving all threads a fair slice of CPU time. You can call that software parallelism if you want, but I wouldn't call such a thing parallel at all.

Level of Parallelism present in multiple threads per core

So I have been looking into some of the technologies that implement multiple threads per core (like Intel's Hyper-Threading) and I am wondering what the extent of parallelism is in these kinds of technologies. Is it true parallelism or just more effective concurrency? It seems they still share the same execution units and core resources; basically it seems like it's just virtualizing the usage. So I am unsure how true parallelism could occur. And if that is the case, then what is the benefit? You can achieve concurrency through effective thread context switching.
I'm no expert, but from what I've read (Long Duration Spin-wait Loops on Hyper-Threading Technology Enabled Intel Processors):
Each physical processor has two logical processors. The logical processors each have their own independent architectural state, but share nearly all other resources on the physical processor, such as caches, execution units, branch predictor, control logic and buses.
So, basically, if one logical processor is using a physical unit (e.g., FPU, the Floating-point unit), the other logical processor is allowed to use another resource (e.g., ALU, the arithmetic logic unit).
From what I've read, you can expect a performance increase of 15-20% in the best-case scenario. I don't have any actual numbers, but don't expect the same level of performance increase as you'd get from adding another physical processor.
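A quick way to see those logical processors from software is the standard query below (a minimal sketch; the returned value is only a hint and may be 0 if unknown):

```cpp
#include <iostream>
#include <thread>

int main() {
    // On a 4-core CPU with two-way Hyper-Threading this typically
    // prints 8: it counts logical processors, not physical cores.
    std::cout << std::thread::hardware_concurrency() << '\n';
}
```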
So there are a lot of factors that determine the benefits present in Hyper-Threading. First off, since the threads are sharing resources, there is obviously no true parallelism, but there is some increase in concurrency depending on the type of processor.
There are three types of hardware threading. Fine-grained switches threads in a round-robin fashion with the goal of increased throughput, at the cost of increased individual thread latency; switching is done on a clock-by-clock basis. Coarse-grained is more like a context switch, where the processor switches threads when a stall or some sort of memory fetch occurs. Then there is simultaneous, in which instructions from multiple threads issue in the same clock, meaning data from multiple threads is in the Reorder Buffer and pipeline at the same time.
Hyper-Threading corresponds to SMT in this taxonomy. And the effectiveness of the design depends primarily on one thing: how busy the pipeline is. In dynamically scheduled processors, where the goal is to keep the pipeline and execution units as busy as possible, the advantages see diminishing returns of around 0 to 5 percent from what I have seen. For statically scheduled processors, where the pipeline has a lot of stalls, the benefits are much more prevalent, with gains of around 20 to 40% depending on the capabilities of the compiler in reordering the instructions.

What is the difference between multicore and concurrent programming

Can anyone help me out? I am working on a presentation and would like to include a bit about 'the difference between multicore and concurrent programming'. I have googled a bit but have not turned up many good descriptions; any help appreciated! :)
Thanks,
Eamonn
Concurrent (occurring or existing simultaneously) implies that different code MAY execute at the exact same cycle. It means that things can possibly happen in parallel if multiple processors or a processor with multiple cores is available and the program is crafted correctly. Just adding threads does not imply concurrent execution.
The reason I say MAY and possibly is that anytime the program's separate threads need to share volatile/mutable state, other threads that need access to that state cannot continue executing and will have to wait their turn to access that state, and things start happening serially again.
Typically this is implemented in a single program as more than one thread executing code concurrently, at the same exact cycle as another thread, given that there are no resource contentions as listed above. This requires multiple physical processors or cores. Other models run multiple heavyweight OS processes that can execute concurrently.
Concurrent programming is very hard to do correctly with mutable shared state.
You can write a concurrent program that runs serially on a single-core processor, but scales up to execute more things at the same time when more processors, or cores, or even multiple processors with multiple cores are present.
You can also cause single-threaded programs to appear concurrent on a multi-core / multi-processor system if they can operate on independent ranges of input data at the same time. Example: a single-threaded 3D rendering program on a dual-core machine can run as 2 separate instances, the first rendering all the odd frames and the second rendering all the even frames, as long as they don't try to share any mutable resources.
Multi-core means that a single CPU has multiple Processor cores that can execute threads or processes concurrently and typically appears as multiple processors to mainstream operating systems.
It does NOT imply that programs that are single threaded gain any concurrency behaviors or benefits from the additional processor cores available.
Concurrent Programming is more broad - it just refers to writing software that will run "concurrently" - ie: more than one thing will happen at a time.
"Multi-core" programming is really referring to a specific subset of concurrent programming, in which you are targetting multiple available CPU cores on a specific machine. This is the most common form of concurrent programming (typically single process running on a single computer), but still only one form of concurrent programming.
You can do concurrent programming on a machine that has only a single CPU core. The operating system provides the illusion that more than one thread is running at the same time; it rapidly switches back-and-forth between them.
A machine with multiple cores simply needs to do this context switching less often, since two threads can run at the same time on two cores. It is only a bit special because threading bugs can make your life difficult much more quickly: the odds that two threads try to access a shared memory location at the same time are much higher.
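A minimal sketch of exactly that kind of bug, and its standard fix: two threads increment a shared counter, and without the mutex some of the increments get lost.

```cpp
#include <iostream>
#include <mutex>
#include <thread>

int counter = 0;  // shared, mutable state
std::mutex m;

// Data race: ++counter is a read-modify-write, so concurrent
// increments can overwrite each other and updates get lost.
void bump_unsafe() {
    for (int i = 0; i < 100000; ++i) ++counter;
}

// The fix: serialize access to the shared location with a mutex.
void bump_safe() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> guard(m);
        ++counter;
    }
}

int main() {
    std::thread a(bump_safe), b(bump_safe);  // try bump_unsafe to see losses
    a.join();
    b.join();
    std::cout << counter << '\n';  // always 200000 with the mutex
}
```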
At a high level, multi-core is an attribute of the processor chip in your computer. Multi-core means it has multiple processing cores. There are several types of multi-processor computers: the old-style supercomputers with thousands of computers connected via Ethernet, systems with more than one processor (like 2 Pentium 4s), and contemporary multi-core systems where every processor package has multiple processing cores (like the Intel i7). The third type is often called multi-core or Chip Multiprocessor (CMP).
Concurrent programming is an attribute of software. Concurrent programming is about writing code which is split into multiple tasks that can execute concurrently if processors are available. While concurrent programs do leverage multi-core, concurrent programming is broader in two dimensions:
Concurrent programs can run on a single core or multiple cores.
Concurrent programs can be used on any type of multi-processors I mentioned above.
Thus, to summarize:
Concurrent programming is about software that can use multiple processors if available. Those processors can be on the same chip (multi-core or Chip Multiprocessor) or on different chips (often known as SMP). You can have systems where you put two multi-core chips in the same system, making it a CMP and an SMP at the same time. Concurrent programming will work for that as well.
Concurrent programming regards operations that appear to overlap and is primarily concerned with the complexity that arises due to non-deterministic control flow. The quantitative costs associated with concurrent programs are typically both throughput and latency. Concurrent programs are often IO bound but not always, e.g. concurrent garbage collectors are entirely on-CPU. The pedagogical example of a concurrent program is a web crawler. This program initiates requests for web pages and accepts the responses concurrently as the results of the downloads become available, accumulating a set of pages that have already been visited. Control flow is non-deterministic because the responses are not necessarily received in the same order each time the program is run. This characteristic can make it very hard to debug concurrent programs. Some applications are fundamentally concurrent, e.g. web servers must handle client connections concurrently. Erlang, F# asynchronous workflows and Scala's Akka library are perhaps the most promising approaches to highly concurrent programming.
Multicore programming is a special case of parallel programming. Parallel programming concerns operations that are overlapped for the specific goal of improving throughput. The difficulties of concurrent programming are evaded by making control flow deterministic. Typically, programs spawn sets of child tasks that run in parallel and the parent task only continues once every subtask has finished. This makes parallel programs much easier to debug than concurrent programs. The hard part of parallel programming is performance optimization with respect to issues such as granularity and communication. The latter is still an issue in the context of multicores because there is a considerable cost associated with transferring data from one cache to another. Dense matrix-matrix multiply is a pedagogical example of parallel programming and it can be solved efficiently by using Strassen's divide-and-conquer algorithm and attacking the sub-problems in parallel. Cilk is perhaps the most promising approach for high-performance parallel programming on multicores and it has been adopted in both Intel's Threading Building Blocks and Microsoft's Task Parallel Library (in .NET 4).
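A minimal C++ sketch of that fork-join discipline (the threshold is an illustrative granularity knob; std::async forks the child tasks and get() joins them, keeping control flow deterministic):

```cpp
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Divide-and-conquer parallel sum: the parent forks a child task for
// one half, works on the other half, and only continues after joining.
long parallel_sum(const std::vector<int>& v, size_t lo, size_t hi) {
    if (hi - lo < 100000)  // small enough: solve serially
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
    size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,
                           parallel_sum, std::cref(v), lo, mid);  // fork
    long right = parallel_sum(v, mid, hi);
    return left.get() + right;                                    // join
}

int main() {
    std::vector<int> v(1'000'000, 1);
    std::cout << parallel_sum(v, 0, v.size()) << '\n';  // prints 1000000
}
```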

Programming for Multi-core Processors

As far as I know, the multi-core architecture in a processor does not affect the program. The actual instruction execution is handled in a lower layer.
My question is:
Given that you have a multicore environment, can I use any programming practices to utilize the available resources more effectively? How should I change my code to gain more performance in multicore environments?
That is correct. Your program will not run any faster (except for the fact that the core is handling fewer other processes, because some of the processes are being run on the other core) unless you employ concurrency. If you do use concurrency, though, more cores improves the actual parallelism (with fewer cores, the concurrency is interleaved, whereas with more cores, you can get true parallelism between threads).
Making programs efficiently concurrent is no simple task. If done poorly, making your program concurrent can actually make it slower! For example, if you spend lots of time spawning threads (thread construction is really slow), and do work on a very small chunk size (so that the overhead of thread construction dominates the actual work), or if you frequently synchronize your data (which not only forces operations to run serially, but also has a very high overhead on top of it), or if you frequently write to data in the same cache line between multiple threads (which can lead to the entire cache line being invalidated on one of the cores), then you can seriously harm the performance with concurrent programming.
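The cache-line pitfall in particular is easy to reproduce. A hedged sketch (timings vary by machine): both runs do identical work, but in the first the two counters share one 64-byte cache line, which ping-pongs between the cores.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Packed {                        // both counters in one cache line
    std::atomic<long> a{0}, b{0};
};
struct Padded {                        // one 64-byte cache line each
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class Counters>
double hammer(Counters& c) {           // returns elapsed milliseconds
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a++; });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b++; });
    t1.join();
    t2.join();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    Packed p;
    Padded q;
    std::printf("packed: %.0f ms, padded: %.0f ms\n", hammer(p), hammer(q));
}
```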
It is also important to note that if you have N cores, that DOES NOT mean that you will get a speedup of N. That is the theoretical limit to the speedup. In fact, maybe with two cores it is twice as fast, but with four cores it might be about three times as fast, and then with eight cores it is about three and a half times as fast, etc. How well your program is actually able to take advantage of these cores is called the parallel scalability. Often communication and synchronization overhead prevent a linear speedup, although, ideally, if you can avoid communication and synchronization as much as possible, you can hopefully get close to linear.
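That sublinear curve is conventionally modeled by Amdahl's law: for a serial fraction s of the work and N cores, the speedup is bounded as below (the worked numbers are illustrative, assuming s = 0.1):

```latex
S(N) = \frac{1}{s + \frac{1-s}{N}}

% Worked example with s = 0.1 (10% of the work is serial):
S(2) = \frac{1}{0.1 + 0.9/2} \approx 1.8, \quad
S(4) = \frac{1}{0.1 + 0.9/4} \approx 3.1, \quad
S(8) = \frac{1}{0.1 + 0.9/8} \approx 4.7, \quad
S(\infty) = \frac{1}{0.1} = 10
```

So even a modest serial fraction caps the speedup far below the N-times theoretical limit, which matches the flattening described above.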
It would not be possible to give a complete answer on how to write efficient parallel programs on StackOverflow. This is really the subject of at least one (probably several) computer science courses. I suggest that you sign up for such a course or buy a book. I'd recommend a book to you if I knew of a good one, but the parallel algorithms course I took did not have a textbook. You might also be interested in writing a handful of programs using a serial implementation, a parallel implementation with multithreading (regular threads, thread pools, etc.), and a parallel implementation with message passing (such as with Hadoop, Apache Spark, Cloud Dataflow, asynchronous RPCs, etc.), and then measuring their performance, varying the number of cores in the case of the parallel implementations. This was the bulk of the course work for my parallel algorithms course and can be quite insightful. Some computations you might try parallelizing include computing Pi using the Monte Carlo method (this is trivially parallelizable, assuming you can create a random number generator where the random numbers generated in different threads are independent), performing matrix multiplication, computing the row echelon form of a matrix, summing the squares of the numbers 1...N for some very large N, and I'm sure you can think of others.
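For instance, here is a hedged sketch of the Monte Carlo Pi computation with one separately seeded generator per thread (the seeding scheme is a simplification; truly independent streams need more care):

```cpp
#include <iostream>
#include <random>
#include <thread>
#include <vector>

int main() {
    const int nthreads = 4;
    const long shots = 2'000'000;            // samples per thread
    std::vector<long> hits(nthreads, 0);
    std::vector<std::thread> pool;

    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&hits, t, shots] {
            std::mt19937_64 rng(12345 + t);  // simplistic per-thread seeding
            std::uniform_real_distribution<double> u(0.0, 1.0);
            long local = 0;                  // thread-local count, written once
            for (long i = 0; i < shots; ++i) {
                double x = u(rng), y = u(rng);
                if (x * x + y * y <= 1.0) ++local;  // inside the quarter circle
            }
            hits[t] = local;
        });
    for (auto& th : pool) th.join();

    long total = 0;
    for (long h : hits) total += h;
    std::cout << 4.0 * total / (nthreads * shots) << '\n';  // ~3.1416
}
```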
I don't know if it's the best possible place to start, but I subscribed to the article feed from Intel Software Network some time ago and have found a lot of interesting things there, presented in a pretty simple way. You can find some very basic articles on fundamental concepts of parallel computing, like this. Here you have a quick dive into OpenMP, which is one possible approach to start parallelizing the slowest parts of your application without changing the rest (if those parts present parallelism, of course). Also check the Intel Guide for Developing Multithreaded Applications. Or just go and browse the article section; the articles are not too many, so you can quickly figure out what suits you best. They also have a forum and a weekly webcast called Parallel Programming Talk.
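As a taste of the OpenMP approach (a minimal sketch; compile with -fopenmp on GCC or Clang), a single directive parallelizes one hot loop without changing the rest of the program:

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int N = 10'000'000;
    double sum = 0.0;

    // The reduction clause gives every thread a private partial sum
    // and combines them at the end, avoiding a shared-data race.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 1; i <= N; ++i)
        sum += 1.0 / ((double)i * (double)i);

    // The series converges to pi^2/6 ~= 1.644934.
    std::printf("sum = %.6f (threads available: %d)\n",
                sum, omp_get_max_threads());
}
```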
Yes, simply adding more cores to a system without altering the software would yield you no results (with the exception that the operating system would be able to schedule multiple concurrent processes on separate cores).
To have your operating system utilise your multiple cores, you need to do one of two things: increase the thread count per process, or increase the number of processes running at the same time (or both!).
Utilising the cores effectively, however, is a beast of a different colour. If you spend too much time synchronising shared data access between threads/processes, your level of concurrency will take a hit as threads wait on each other. This also assumes that you have a problem/computation that can relatively easily be parallelised, since the parallel version of an algorithm is often much more complex than the sequential version thereof.
That said, especially for CPU-bound computations with work units that are independent of each other, you'll most likely see a linear speed-up as you throw more threads at the problem. As you add serial segments and synchronisation blocks, this speed-up will tend to decrease.
I/O heavy computations would typically fare the worst in a multi-threaded environment, since access to the physical storage (especially if it's on the same controller, or the same media) is also serial, in which case threading becomes more useful in the sense that it frees up your other threads to continue with user interaction or CPU-based operations.
You might consider using programming languages designed for concurrent programming. Erlang and Go come to mind.
