How do GPU cores communicate with each other? - multithreading

GPUs, when used for general-purpose computing, put a lot of emphasis on fine-grained parallelism with SIMD and SIMT. They perform best on regular number-crunching workloads with high arithmetic intensity.
Nonetheless, to be applicable to as many workloads as they have been, they must also be capable of coarse-grained MIMD parallelism, where different cores execute different instruction streams on different chunks of data.
This means different cores on the GPU must synchronize with each other after executing different instruction streams. How do they do it?
On a CPU the answer would be cache coherence plus a set of communication primitives chosen to work well with it, such as CAS or LL/SC. But as I understand it, GPUs do not have cache coherence - avoiding that overhead is the biggest reason they are more efficient than CPUs in the first place.
So what method do GPU cores use for synchronizing with each other? If the answer to how they exchange data is by writing to shared main memory, then how do they synchronize so the sender can inform the recipient when to read the data?
If the answer depends on the particular architecture, then I'm particularly interested in modern Nvidia GPUs that support CUDA.
Edit: From the document Booo linked, here is my understanding so far:
They seem to use the word 'stream' for a quantity of stuff that gets done synchronously (including fine-grained parallelism like SIMD); the problem is then how to synchronize/communicate between multiple streams.
As I surmised, this is much more explicit than it is on CPUs. In particular, they talk about:
Page-locked memory
cudaDeviceSynchronize()
cudaStreamSynchronize(streamid)
cudaEventSynchronize(event)
So streams can communicate by writing data to main memory (or L3 cache?) and there is nothing like the cache coherence there is on CPUs; instead there is page-locking of memory and/or an explicit synchronization API.
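To check whether I've got this right, here is a minimal sketch of that explicit style (the produce/consume kernels and buffer names are placeholders I made up): one stream writes data to global memory, an event marks when it is done, a second stream waits on that event before reading, and the host finally calls cudaDeviceSynchronize().

    #include <cuda_runtime.h>

    // Hypothetical producer/consumer kernels, just to have something to order.
    __global__ void produce(int *buf)                 { buf[threadIdx.x] = threadIdx.x; }
    __global__ void consume(const int *buf, int *out) { out[threadIdx.x] = 2 * buf[threadIdx.x]; }

    int main() {
        int *buf, *out;
        cudaMalloc(&buf, 32 * sizeof(int));
        cudaMalloc(&out, 32 * sizeof(int));

        cudaStream_t s1, s2;
        cudaEvent_t done;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        cudaEventCreate(&done);

        produce<<<1, 32, 0, s1>>>(buf);      // writer runs in stream s1
        cudaEventRecord(done, s1);           // mark the point where buf is ready
        cudaStreamWaitEvent(s2, done, 0);    // stream s2 waits for that point
        consume<<<1, 32, 0, s2>>>(buf, out); // reader is now ordered after the writer

        cudaDeviceSynchronize();             // host side: wait for everything
        cudaFree(buf);
        cudaFree(out);
        return 0;
    }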

My understanding is that there are several ways to "synchronise" using CUDA:
CUDA Streams (at the function level): cudaDeviceSynchronize() synchronises across the whole device. In addition, you can synchronise a particular stream with cudaStreamSynchronize(cudaStream_t stream), or synchronise an event recorded in some stream with cudaEventSynchronize(cudaEvent_t event). Ref 1, Ref 2.
Cooperative Groups (CUDA 9.0+ and CC 3.0+): you can synchronise at the group level; a group can be a set of coalesced threads, a thread block, or grids spanning multiple devices. This is much more flexible. Define your own group using
(1) auto group = cooperative_groups::coalesced_threads() for current coalesced set of threads, or
(2) auto group = cooperative_groups::this_thread_block() for the current thread block; you can further define fine-grained groups within the block, such as auto group_warp = cooperative_groups::tiled_partition<32>(group), or
(3) auto group = cooperative_groups::this_grid() or auto group = cooperative_groups::this_multi_grid() for grid(s) across multiple devices.
Then, you can just call group.sync() for synchronisation. You need a device that supports cooperativeLaunch or cooperativeMultiDeviceLaunch, though. Note that the traditional block-level sync (used together with shared memory) is already available via __syncthreads(). Ref 1, Ref 2.
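For illustration, here is a minimal sketch of a grid-wide synchronisation with cooperative groups (kernel and buffer names are made up; it assumes a device reporting cooperativeLaunch support and that partial has been zeroed beforehand, e.g. with cudaMemset):

    #include <cooperative_groups.h>
    #include <cuda_runtime.h>
    namespace cg = cooperative_groups;

    // Naive grid-wide sum: every block accumulates into its own partial slot,
    // then the whole grid synchronises before one thread folds the partials.
    __global__ void grid_sum(const float *in, float *partial, float *total, unsigned n) {
        cg::grid_group grid = cg::this_grid();

        float v = 0.f;
        for (unsigned long long i = grid.thread_rank(); i < n; i += grid.size())
            v += in[i];
        atomicAdd(&partial[blockIdx.x], v);

        grid.sync();  // every thread in every block has finished the loop above

        if (grid.thread_rank() == 0) {
            float s = 0.f;
            for (int b = 0; b < gridDim.x; ++b) s += partial[b];
            *total = s;
        }
    }

Note that a kernel containing grid.sync() has to be launched with cudaLaunchCooperativeKernel() rather than the plain <<<...>>> syntax, and all of its blocks must fit co-resident on the device.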

Related

relationship between computer logical cores and nodejs threadpool

I have read lots of articles on Stack Overflow, but failed to find any references about the relationship between a computer's logical cores and the Node.js threadpool. I believe this is not a duplicate question.
I am using a 2017 MacBook Pro which has 2 physical cores with 4 threads (4 logical cores),
and I believe that Node.js uses a threadpool size of 4 (per the libuv reference) when doing heavy work such as pbkdf2 (a function inside the crypto module) or I/O operations.
My question is: what is the relationship between the computer's thread count and Node.js's threadpool size?
Actually, I have never thought about the computer's threads before.
It may sound crazy, but so far I believed a computer only has physical cores, and that if an application such as a Node.js instance supports a thread pool (in this case with a size of 4 by default), then the computer can utilize multi-threading.
so what exactly is the relationship between those two?
Do I have to change THREADPOOL_SIZE to the number of the computer's logical cores to maximize performance?
node.js does not dynamically adjust the threadpool size based on the CPUs or logical CPUs that are present. It has a preset value (4) unless you customize it.
Because the threadpool is often used for blocking operations such as disk I/O, it's not necessarily true that an optimal value for the thread pool size is the number of logical CPUs you have (unlike the typical recommendation for clustering).
Instead, it will likely depend upon which specific types of operations you most want to optimize for and exactly how you're using those operations. For example, it probably doesn't help much to issue more and more parallel disk operations that are all trying to access the same physical disk: there's only one position the read/write head can be in at a time, so having lots of parallel operations all contending for the same read/write head may not speed things up (it could even make things slower).
If you have a specific operation you're trying to optimize for, then your best bet is to create a reproducible benchmark test and then time it with several different sizes for the thread pool.
As was pointed out in a comment, you can study some of the relevant thread pool code here.

Hardware Multithreading and Simultaneous Multithreading (SMT)

I'm reading Multithreading (computer architecture) - Wiki, aka hardware threading, and I'm trying to understand the second paragraph:
(p2): Where multiprocessing systems include multiple complete processing units in one or more cores, multithreading aims to increase utilization of a single core by using thread-level parallelism, as well as instruction-level parallelism.
while the link to thread-level parallelism says:
(Link): Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. This type of parallelism is found largely in applications written for commercial servers such as ...
which is not so useful... So I read task parallelism above, since I guess TLP is a subtype of it:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—across different processors.
Question: If thread-level parallelism is task parallelism, and task parallelism is parallelization across multiple processors, how does "increase utilization of a single core by using thread-level parallelism" work?
Guessing: I guess for TLP, it should mean across multiple logical processors, i.e. hardware threads in the perspective of OS, correct?
Another minor issue is that for my first link, Multithreading:
In computer architecture, multithreading is the ability of a central processing unit (CPU) (or a single core in a multi-core processor) to execute multiple processes or threads concurrently, supported by the operating system.
And yet in (p2) it aims to increase utilization of a single core by using thread-level parallelism? What a contradiction.
I don't think we should base this off wiki definitions; the wording there is not accurate enough to merit searching for contradictions.
First, I would describe task parallelism as a form of parallelism inherent to some algorithm or problem, where there is a functional decomposition into multiple tasks of a different nature that can run concurrently. Alternative forms of parallelism include, for example, spatial or data decomposition, where the problem can be broken into different parts of the data or the input layout (e.g., array ranges, matrix tiles, image parts...).
Thread-level parallelism is a different taxonomy: it is any form of parallelism that can be extracted for utilization by a multi-threaded system. It requires the decomposition to be coarse-grained enough to allow the different threads to run independently (otherwise the synchronization overhead required would make it useless).
The alternative to that is, for example, ILP (instruction-level parallelism), where a single thread context can extract parallelism within the code by running it over a deep out-of-order machine that can schedule based on readiness. This allows more fine-grained parallelism and usually less programmer involvement, but limits the parallelism to the depth of the OOO window.
On a related topic - be careful not to confuse simultaneous execution and concurrent one.
Thread-level parallelism can be used by extracting task-level parallelism or other forms of algorithm decomposition from the code. It can then be run on a system that is single-core (preemptive), or multi-threaded. The latter can be achieved through multi-core systems, simultaneous multi-threading, or both (common processors usually have many cores, and many of them support SMT on top of that).
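As a toy illustration of that relationship (plain C++; the two "tasks" are arbitrary stand-ins), here is a functional decomposition mapped onto two OS threads. Whether they actually run in parallel depends on what sits underneath: two cores, one SMT core exposing two hardware threads, or a single hardware thread that the OS time-slices preemptively.

    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(1 << 20, 1);
        long long sum = 0, sumsq = 0;

        // Task decomposition: two different functions over the same data,
        // each given to its own thread (thread-level parallelism).
        std::thread t1([&] { sum = std::accumulate(data.begin(), data.end(), 0LL); });
        std::thread t2([&] { for (int x : data) sumsq += 1LL * x * x; });

        t1.join();
        t2.join();
        std::printf("sum=%lld sumsq=%lld\n", sum, sumsq);
        return 0;
    }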
I think my intuition should be correct, it is either:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments.
should be updated to:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple logical processors in parallel computing environments.
to include the possibility of hyper threading.
No update, but task parallelism focuses on cross-processor parallelism, while TLP should mean
Thread-level parallelism is a form of parallelization of computer code across multiple logical processors in parallel computing environments.
again, to include the possibility of hyper threading.
Useful resource:
https://en.wikipedia.org/wiki/Hyper-threading
https://www.youtube.com/watch?v=wnS50lJicXc
https://en.wikipedia.org/wiki/Simultaneous_multithreading
especially this line:
The name multithreading is ambiguous, because not only can multiple threads be executed simultaneously on one CPU core, but also multiple tasks (with different page tables, different task state segments, different protection rings, different I/O permissions, etc.).
So for the minor issue, see Concurrent computing - #Introduction - p1:
The concept of concurrent computing is frequently confused with the related but distinct concept of parallel computing,[2][3] although both can be described as "multiple processes executing during the same period of time". In parallel computing, execution occurs at the same physical instant: for example, on separate processors of a multi-processor machine, with the goal of speeding up computations—parallel computing is impossible on a (one-core) single processor, as only one computation can occur at any instant (during any single clock cycle).

What are all the different types of parallelism?

I am trying to understand more about parallelism, but I've noticed there are a lot of different terms out there and some seem to mean the same thing while others have a notable difference. So, what are all the different types of parallelism, how do they differ from each other, and do any have specific applications or purposes?
(To keep this more focused, I'm hoping for an answer that provides clarity to all the terminology associated with parallelism, including terms not listed below; technical comparisons between each different type would be nice, but will probably result in this question becoming off-topic - then again, I don't really know, hence the question).
Note:
this is not a question about concurrency and goes beyond the "simple" question: "what is parallelism?", although a clarifying definition might be warranted.
First, I have taken notice of the difference between parallelism and threading, but some of the differences between the following terms are still confusing.
To add clarity to my question here is a list of terms that I have found that are related to parallelism: parallel computing, parallel processing, multithreading, multiprocessing, multicore programming, Hyper-threading (Intel) 2, Simultaneous MultiThreading (SMT) 3, Switch-on-Event MultiThreading 3. (If possible, definitions or references to definitions for each of these terms would also be appreciated).
My very specific question: what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism? (and any other x-level parallelism)?
In a multi-core processor, can parallelism occur within a single core? Is that what Hyper-threading is, and does that require a single core having, for example, two ALU's that can be used in parallel?
Last one: is there a difference between hardware vs software parallelism, aside from the obvious distinction that one happens in hardware while the other in software?
Related resources:
- Process vs Thread,
- Parallelism on a GPU,
- Hyper-threading,
- Concurrency vs Parallelism,
- Hyper-threading and gaming.
Q: What is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism?
While the subject matter is indeed immensely wide, I would try to offer this view, even at the risk of making many opponents present their objections about simplifying the subject matter (but the StackOverflow format does not substitute for other sources of complete reference, does it?):
A: the main difference is WHAT / WHO / HOW is responsible for keeping things executing in true-[PARALLEL]
Instruction Level Parallelism - ILP - is the simplest case: the CPU architecture has designed and "hardwired" this particular form of hardware-based parallelism. There are processors with ILP4 (4 instructions executed at once), or processors with a per-instruction width of this form of parallel-instruction execution, be it ILP2 for some instructions but ILP1 for others; again, the silicon architecture decides what can indeed happen in parallel at the instruction level. Some awkward surprises may arise from further details, as memory-controller channels may block ILP-mode in cases where REG/MEMORY uops have to wait for a free channel to access the instructed MEMORY.
hardware-threads are the next level of granularity. Given a CPU-core is declared to support two hardware threads, these are the only streams of code execution that may flow in parallel (if no O/S request comes to instantiate and schedule another thread to get executed, mapped onto one of the available CPU-core hardware-threads). From the user perspective, there are O/S tools that permit one to explicitly "nail" down a process-level-PID / thread-level-PID affinity onto a particular CPU-core(s) and thus limit or even eliminate any "disturbance", so as to move from a "just"-[CONCURRENT] flow of code execution closer to a true-[PARALLEL] one.
We will knowingly skip all the crowds of threads, that are just a tool for latency-masking ( be it on the SIMT / SMX warp-wide GPU-scheduler, or the more relaxed, MIMT O/S-kernel driven multithreading )
software-operated distributed-systems parallelism is the one that ought to be mentioned for completeness, but it has principally the highest adverse costs, coming from the need to invent, define, implement and operate the setup / coordination in software (which all causes overheads to grow remarkably), in the sense of the re-formulated Amdahl's Law, right due to the need to somehow design and keep operational the non-native orchestration of both the distributed process execution and all the dataflow it depends on.
hardware-based true-[PARALLEL] systems are at the highest level of orchestration, where both the silicon ( like the InMOS' network of meshed Transputers ) and also the programming language ( like the InMOS' occam or occam-pi ) provide the carefully engineered, conceptually crafted true-[PARALLEL] code-execution.
- MIMT: Multiple Instruction Multiple Threads, a non-restricted thread-execution fabric / policy, where any thread may and does issue a different instruction to the processor for execution, as opposed to SIMT
- SIMT: Single Instruction Multiple Threads, typically a GPU Streaming Multiprocessor code-execution architecture
- SMX: Streaming Multiprocessor eXecution unit, typically a GPU SIMT building block, onto which the GPU-kernel code-units can be directed (addressed) for being TaskQueue-scheduled and later executed, as coordinated by the WARP-wide SIMT-code scheduler
what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism?
In 1, different CPU cores execute different streams of instructions.
In 2, a single CPU core executes different instructions from a single instruction stream in parallel (these instructions are either consecutive instructions in the stream, or otherwise very close to each other).
3 is the same as 1; the difference is cosmetic. It's just the default settings about which memory pages are shared across threads and which aren't. But these settings are user-adjustable with process creation flags, shared memory sections, dynamic libraries, and other system APIs; that's why, at the lower level, the difference between processes and threads is not a big deal.
and any other x-level parallelism
Another important one is SIMD-level parallelism. For this one, the CPU applies the same instruction to multiple operands stored in special wide registers. With SSE we have 128-bit wide registers, and we can e.g. multiply a vector of 4 single-precision floating-point numbers in one register by another 4 values in another register, making 4 products in parallel with a single mulps instruction. ARM NEON is similar, also with 128-bit registers; the instruction to multiply 4 floats by 4 floats is vmul.f32. AVX operates on 256-bit registers, so it can multiply 8 floats at once with a single vmulps instruction.
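To make the SIMD case concrete with intrinsics rather than raw assembly (a small sketch; the compiler lowers _mm_mul_ps to a single mulps):

    #include <cstdio>
    #include <xmmintrin.h>   // SSE intrinsics

    int main() {
        __m128 a = _mm_set_ps(4.f, 3.f, 2.f, 1.f);  // one 128-bit register holding {1,2,3,4}
        __m128 b = _mm_set_ps(8.f, 7.f, 6.f, 5.f);  // another holding {5,6,7,8}
        __m128 c = _mm_mul_ps(a, b);                // 4 products computed in parallel

        float r[4];
        _mm_storeu_ps(r, c);
        std::printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);  // prints 5 12 21 32
        return 0;
    }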
can parallelism occur within a single core?
Yes.
Is that what Hyper-threading is
Yes, also it’s what instruction-level parallelism is, and SIMD parallelism, too.
does that require a single core having, for example, two ALU's that can be used in parallel?
Modern CPUs have more than two per core, but HT was introduced in the P4 and it's not a requirement. The profit from HT is not just keeping multiple ALUs loaded; it's also using the core while a thread is waiting for data to arrive from caches or from system RAM, and using the core while it's stalled because of a data dependency between nearby instructions. HT allows a CPU core to compute something else on another hardware thread while it's waiting, therefore improving ALU utilization. Without HT, the core would likely just sit and wait for hundreds of cycles in the case of RAM latency, or for dozens of cycles in the case of data-dependency latency.
is there a difference between hardware vs software parallelism
When you have a single hardware thread and multiple OS threads that compute stuff, only one thread will be running at any given time. The rest of the threads will be waiting. The OS will periodically (often around 50-100 Hz) switch which one is running, with the goal of giving all threads a fair slice of CPU time. You can call that software parallelism if you want, but I wouldn't call such a thing parallel at all.

Flaws in Shared Memory of Massively Multi-Threaded Designs

I am trying to create my first application of multi-threading, one that is scalable to multi-core technology. Its inspiration comes from the concept of a event-driven spiking neural network.
The design is a little like this: the data structure of the algorithm is stored in one location in memory, in the form of instances of classes. An example of a task that can be performed on this structure is a neuron spiking: it will modify several values in the neuron and connected neurons, and identify any future tasks that may need to be performed. The tasks to be performed are added to a queue. There are several threads whose only function is to pull a task from the queue, perform the task, and lather, rinse, repeat. Any updates to values can be performed in any order, as long as they are performed. Small but rare errors that result from this parallelism would have a statistically insignificant effect on the performance of the system.
This design does not use any memory other than shared memory (except for possibly a small amount of dedicated memory used for calculations). I've recently watched a few lectures where the speaker implied that the use of shared memory in multi-core and GPU applications was very slow. Even though I have a few ideas as to why that might be the case, I'd like to find out from people who have experience with this sort of thing, and maybe be directed to a useful resource to help me out.
Accessing shared state from multiple threads in a multicore system can be slow due to the CPU cache-coherency protocol: every change to the shared state must be reflected in the cache lines of all the cores.
http://msdn.microsoft.com/en-us/magazine/cc163715.aspx#S2 provides good explanation why accessing shared data from multiple threads can be slow and what can be done about it.
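As a rough sketch of the effect (a toy example of my own, not taken from the linked article): two threads incrementing counters that share a cache line will bounce that line between cores on every write; giving each counter its own line, as below, usually removes most of the slowdown.

    #include <atomic>
    #include <thread>

    // Each counter gets its own 64-byte slot (64 bytes is the typical x86 cache-line size).
    struct Padded {
        alignas(64) std::atomic<long> value{0};
    };

    Padded counters[2];  // adjacent plain atomics would falsely share one line

    int main() {
        auto work = [](int i) {
            for (long n = 0; n < 10000000; ++n)
                counters[i].value.fetch_add(1, std::memory_order_relaxed);
        };
        std::thread t0(work, 0), t1(work, 1);
        t0.join();
        t1.join();
        return 0;
    }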

Critical sections with multicore processors

With a single-core processor, where all your threads are run from the one single CPU, the idea of implementing a critical section using an atomic test-and-set operation on some mutex (or semaphore or etc) in memory seems straightforward enough; because your processor is executing a test-and-set from one spot in your program, it necessarily can't be doing one from another spot in your program disguised as some other thread.
But what happens when you do actually have more than one physical processor? It seems that simple instruction level atomicity wouldn't be sufficient, b/c with two processors potentially executing their test-and-set operations at the same time, what you really need to maintain atomicity on is access to the shared memory location of the mutex. (And if the shared memory location is loaded into cache, there's the whole cache consistency thing to deal with, too..)
This seems like it would incur far more overhead than the single core case, so here's the meat of the question: How much worse is it? Is it worse? Do we just live with it? Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
Multi-core/SMP systems are not just several CPUs glued together. There's explicit support for doing things in parallel. All the synchronization primitives are implemented with the help of hardware, along the lines of atomic CAS. The instruction either locks the bus shared by the CPUs and the memory controller (and devices that do DMA) and updates the memory, or just updates the memory relying on cache snooping. This in turn causes the cache-coherency algorithm to kick in, forcing all involved parties to flush their caches. Disclaimer - this is a very basic description; there are more interesting things here like virtual vs. physical caches, cache write-back policies, memory models, fences, etc.
If you want to know more about how OS might use these hardware facilities - here's an excellent book on the subject.
The vendors of multi-core CPUs have to take care that the different cores coordinate themselves when executing instructions which guarantee atomic memory access.
On Intel chips, for instance, you have the cmpxchg instruction. It compares the value stored at a memory location to an expected value and exchanges it for the new value if the two match. If you precede it with the lock prefix, it is guaranteed to be atomic with respect to all cores.
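For illustration, here is a minimal spinlock sketch built on exactly that primitive; std::atomic's compare_exchange compiles down to lock cmpxchg on x86 (a generic sketch, not anyone's production lock):

    #include <atomic>

    class SpinLock {
        std::atomic<int> state{0};   // 0 = free, 1 = held
    public:
        void lock() {
            int expected = 0;
            // Atomically swap 0 -> 1; on failure 'expected' is overwritten, so reset it and retry.
            while (!state.compare_exchange_weak(expected, 1,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
                expected = 0;
            }
        }
        void unlock() { state.store(0, std::memory_order_release); }
    };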
You would need a test-and-set that forces the processor to notify all the other cores of the operation so that they are aware. Yes, that introduces an overhead and you have to live with it. It's a reason to design multithreaded applications in such a way that they don't wait for synchronization primitives too often.
Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
That would cancel out the whole point of multithreading. When you are using a lock, semaphore, or other synchronization technique, you are relying on the OS to make sure that these operations are interlocked, no matter how many cores you are using.
The time to switch to a different thread after a lock has been released is mostly determined by the cost of a context switch. This SO thread deals with the context switching overhead, so you might want to check that.
There are some other interesting threads also:
What are the differences between various threading synchronization options in C#?
Threading best practices
You should read this MSDN article also: Understanding the Impact of Low-Lock Techniques in Multithreaded Apps.
Memory accesses are handled by the memory controller, which should take care of multi-core issues, i.e. it shouldn't allow simultaneous access to the same addresses (probably handled on either a memory-page or memory-line basis). So you can use a flag to indicate whether another processor is updating the memory contents of some block (this is to avoid a type of dirty read where part of the record is updated, but not all of it).
A more elegant solution is to use a HW semaphore block if the processor has such a feature. A HW semaphore is a simple queue which could be of size no_of_cores -1. This is how it is in TI's 6487/8 processor. You can either query the semaphore directly (and loop until it is released) or do an indirect query which will result in an interrupt once your core gets the resource. The requests are queued and served in the order they were made. A semaphore query is an atomic operation.
Cache consistency is another issue and you might need to do cache writebacks and refreshes in some cases. But this is a very cache implementation specific thing. With 6487/8 we needed to do that on a few operations.
Well, depending on what type of computers you have laying around the house, do the following: Write a simple multithreaded application. Run this application on a single core (Pentium 4 or Core Solo) and then run it on a multicore processor (Core 2 Duo or similar) and see how big the speed up is.
Granted these are unfair comparisons since Pentium 4 and Core Solo are much slower regardless of cores than a Core 2 Duo. Maybe compare between a Core 2 Duo and a Core 2 Quad with an application that can use 4 or more threads.
You raise a number of valid points. Multiple processors introduce a lot of headache and overhead. However, we just have to live with them, because the speed boost from parallelism can far outweigh them, if the critical sections are made long enough.
As for your final suggestion about having all threads on the same physical core, that completely defeats the point of a multi-core computer!
