Assuming that we have one node with 12 cores. What are the differences between:
Running one MPI process that manages 12 threads, one per core.
Running 12 MPI processes, one per core.
The former communicates via shared memory, and the latter via IPC. So which one is faster? Are the differences negligible or significant?
Well, it depends ...
Most MPI implementations use shared memory (instead of the interconnect or even IPC) for intra-node communications.
Generally speaking, MPI+X is used for hybrid programming:
MPI for inter-node communications
X within the same node
OpenMP is commonly used as X. MPI RMA (i.e. one-sided communication) can also be used, and other options are available.
From a performance point of view, once again, it depends:
some applications run faster in flat MPI (i.e. one MPI process per core),
whereas others run faster in hybrid MPI+OpenMP. Keep in mind that OpenMP was designed for shared-memory systems with flat access to memory, so hybrid runs generally use one MPI task per NUMA domain (most of the time, a socket) rather than one MPI task per node.
Last but not least, the memory overhead and wire-up time of MPI+OpenMP are generally lower than those of flat MPI, and this can be an important factor.
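The layouts discussed above can be compared with a little arithmetic. This is only a sketch: it assumes a hypothetical 12-core node with 2 sockets of 6 cores each (the question does not give the socket count) and one NUMA domain per socket. `layout` is an illustrative helper, not part of any MPI library.

```python
def layout(cores_per_node, sockets_per_node, ranks_per):
    """Return (mpi_ranks, threads_per_rank) for one node.

    ranks_per is "core" (flat MPI), "socket" (one rank per NUMA domain,
    assuming one NUMA domain per socket) or "node" (one rank per node).
    """
    ranks = {"core": cores_per_node, "socket": sockets_per_node, "node": 1}[ranks_per]
    return ranks, cores_per_node // ranks


# Assumed hardware: 12 cores, 2 sockets (the socket count is an assumption).
for choice in ("core", "socket", "node"):
    ranks, threads = layout(12, 2, choice)
    print(f"one rank per {choice}: {ranks} ranks x {threads} threads")
```

With these assumed numbers, flat MPI gives 12 ranks of 1 thread, the per-socket hybrid gives 2 ranks of 6 threads, and the per-node hybrid gives 1 rank of 12 threads.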
What is the difference between software threads, hardware threads and java threads?
Are software threads, java threads and hardware threads independent or interdependent?
I am asking this because I know Java threads are created inside a process within the JVM (java.exe).
Also, is it true that these different processes are executed on different hardware threads?
A "hardware thread" is a physical CPU or core. So, a 4 core CPU can genuinely support 4 hardware threads at once - the CPU really is doing 4 things at the same time.
One hardware thread can run many software threads. In modern operating systems, this is often done by time-slicing - each thread gets a few milliseconds to execute before the OS schedules another thread to run on that CPU. Since the OS switches back and forth between the threads quickly, it appears as if one CPU is doing more than one thing at once, but in reality, a core is still running only one hardware thread, which switches between many software threads.
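The time-slicing described above can be illustrated with a toy round-robin scheduler. This is a simplified sketch, not how an OS is implemented: each software thread is modelled as a Python generator, every `yield` stands for the end of one time slice, and switching is cooperative, whereas a real OS preempts threads with a timer interrupt. The names `task` and `run_round_robin` are made up for this example.

```python
from collections import deque


def task(name, steps):
    """A software thread modelled as a generator; each yield ends one time slice."""
    for i in range(steps):
        yield f"{name}{i}"


def run_round_robin(tasks):
    """Run all tasks on one simulated core, one slice at a time."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # run the task for one slice
            ready.append(current)        # then send it to the back of the queue
        except StopIteration:
            pass                         # task finished, drop it
    return trace


print(run_round_robin([task("A", 3), task("B", 2)]))
# ['A0', 'B0', 'A1', 'B1', 'A2']
```

Only one task runs at any moment, yet the trace interleaves A and B, which is the "appears to do several things at once" effect.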
Modern JVMs map Java threads directly to the native threads provided by the OS, so there is no inherent overhead introduced by Java threads vs native threads. As to hardware threads, the OS tries to map threads to cores, if there are sufficient cores. So, if you have a Java program that starts 4 threads, and you have 4 or more cores, there's a good chance your 4 threads will run truly in parallel on 4 separate cores, if the cores are idle.
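The mapping from language-level threads to OS threads can be observed directly. The sketch below uses Python rather than Java, purely as an analogy: CPython's `threading.Thread` is also backed by a native OS thread, and `threading.get_native_id()` (Python 3.8+) returns the OS-assigned id. Note that, unlike the JVM, standard CPython builds do not run Python bytecode in parallel because of the global interpreter lock. `collect_native_ids` is a name invented for this example.

```python
import os
import threading


def collect_native_ids(n_threads):
    """Start n_threads threads and return the OS thread id each one ran on."""
    barrier = threading.Barrier(n_threads)
    lock = threading.Lock()
    native_ids = []

    def worker():
        barrier.wait()  # keep all threads alive together so the OS cannot reuse an id
        with lock:
            native_ids.append(threading.get_native_id())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return native_ids


ids = collect_native_ids(4)
print(f"{len(set(ids))} distinct OS threads on {os.cpu_count()} logical CPUs")
```

Each of the four threads reports a different OS thread id. Which core each one runs on is left to the OS scheduler.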
Software threads are threads of execution managed by the operating system.
Hardware threads are a feature of some processors that allow better utilisation of the processor under some circumstances. They may be exposed to/by the operating system as appearing to be additional cores ("hyperthreading").
In Java, the threads you create maintain the software thread abstraction, where the JVM is the "operating system". Whether the JVM then maps Java threads to OS threads is the JVM's business (but it almost certainly does). And then the OS will be using hardware threads if they are available.
Hardware threads (e.g. Intel Hyperthreading) are a cheaper and slower alternative to having multiple cores.
Software threads are a software abstraction implemented by the (Linux) kernel:
either the kernel runs one software thread per CPU (or hyperthread)
or it fakes it with the scheduler by running a process for a bit, then a timer interrupt comes, then it switches to another process, and so on
Key to their implementation is the hardware-provided and kernel-configured separation between userland and kernel land: What are Ring 0 and Ring 3 in the context of operating systems?
I will now focus on hardware threads, which is the more obscure hardware question, with a focus on Intel's implementation which it calls Hyperthreading.
The Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015, section 8.7 "INTEL HYPER-THREADING TECHNOLOGY ARCHITECTURE", describes HT briefly and includes a diagram of the architecture.
TODO it is slower by how much percent in average in real applications?
Hyperthreading is possible because modern single CPU cores already execute multiple instructions at once with the instruction pipeline: https://en.wikipedia.org/wiki/Instruction_pipelining
The instruction pipeline is a separation of functions inside of a single core to ensure that each part of the circuit is used at any given time: reading memory, decoding instructions, executing instructions, etc.
Hyperthreading separates functions further by using:
a single backend, which actually runs the instructions with its pipeline.
Dual core has two backends, which explains the greater cost and performance.
two front-ends, which take two streams of instructions and order them in a way to maximize pipelining usage of the single backend by avoiding hazards.
Dual core would also have 2 front-ends, one for each backend.
There are edge cases where instruction reordering produces no benefit, making hyperthreading useless. But it produces a significant improvement on average.
Two hyperthreads in a single core share further cache levels (TODO how many? L1?) than two different cores, which share only L3, see:
Multiple threads and CPU cache
How are cache memories shared in multicore Intel CPUs?
The interface that each hyperthread exposes to the operating system is similar to that of an actual core, and both can be controlled separately. Thus cat /proc/cpuinfo shows me 4 processors, even though I only have 2 cores with 2 hyperthreads each.
Operating systems can however take advantage of knowing which hyperthreads are on the same core to run multiple threads of a given program on a single core, which might improve cache usage.
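The difference between the 4 processors reported by /proc/cpuinfo and the 2 physical cores can be recovered from the file itself. The sketch below parses an embedded, made-up excerpt rather than the real file so that it runs anywhere. It assumes the x86 Linux layout, where each logical processor has `physical id` and `core id` fields, and other architectures may not provide them. `count_cpus` and `SAMPLE_CPUINFO` are illustrative names.

```python
# Hypothetical, heavily trimmed /proc/cpuinfo for 2 cores with 2 hyperthreads each.
SAMPLE_CPUINFO = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""


def count_cpus(cpuinfo_text):
    """Return (logical_processors, physical_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "processor" in fields:
            logical += 1
            # Hyperthreads of one core share the same (socket, core) pair.
            cores.add((fields.get("physical id"), fields.get("core id")))
    return logical, len(cores)


print(count_cpus(SAMPLE_CPUINFO))  # (4, 2)
```

On a Linux machine the same function could be fed the contents of /proc/cpuinfo. Logical processors that share a `(physical id, core id)` pair are hyperthreads of one core.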
This LinusTechTips video contains a light-hearted non-technical explanation: https://www.youtube.com/watch?v=wnS50lJicXc
Hardware threads can be thought of as the CPU cores, although each core can run multiple threads. Most CPUs document how many threads can run on each core (on Linux, the lscpu command reports this as "Thread(s) per core"). The number of cores times the threads per core is the number of threads that can run in parallel.
Software threads are an abstraction over the hardware that makes multi-processing possible. If you have multiple software threads but not enough hardware resources, the software threads share those resources, each being allocated them for a limited time (or by some other strategy), so that all threads appear to run in parallel. They are managed by the operating system. A Java thread is an abstraction at the JVM level.
I think you are mistaken. I have never heard of hardware threads (unless you mean Hyper-Threading on certain Intel machines). Every process is a running instance of a program. Threads are simultaneous execution flows within a process. Java threads are mapped to system threads by the JVM. Java used to have a concept of green threads, which is no longer the case.
Can we take full benefits of Multi core architecture without Multi threading.?
For conventional environments, you can take some of the benefits of multi-CPU without multi-threading (e.g. if you've got 8 CPUs and you're running 8 separate single-threaded processes, then...).
For non-conventional environments, who knows? For example, maybe the entire system uses the actor model (software divided into separate/independent objects where each object is an event handler), where the OS has a queue of pending events, and each CPU does "get event from queue, execute the corresponding object's event handler for that event" in a loop. In this case you can say that there are no threads at all (just CPUs and events) and therefore there is no multi-threading.
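The "get event from queue, execute the corresponding object's event handler" loop can be sketched in a few lines. This is only an illustration of the idea: it runs a single loop, whereas in the system described above every CPU would run the same loop against a shared, synchronized queue. `Counter` and `cpu_loop` are hypothetical names.

```python
from collections import deque


class Counter:
    """An independent object whose only entry point is its event handler."""

    def __init__(self):
        self.total = 0

    def handle(self, amount):
        self.total += amount


def cpu_loop(pending):
    """What each CPU would run: take an event, call its object's handler."""
    handled = 0
    while pending:
        target, payload = pending.popleft()
        target.handle(payload)
        handled += 1
    return handled


a, b = Counter(), Counter()
pending = deque([(a, 1), (b, 10), (a, 2)])
cpu_loop(pending)
print(a.total, b.total)  # 3 10
```

Nothing here is a thread: there is only a queue of events and a loop that dispatches them.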
Can we take full benefit of multicores without multithreading? Definitely not. But we can still have some parallelism.
As already answered, we can have several independent processes running on different processors to improve overall computer performance.
And it is still possible to do parallel processing by means of interprocess communication (IPC) such as pipes or shared memory. For instance, if doing
taskset 0x01 sort | taskset 0x02 uniq
you will run two processes, sort on core 0 and uniq on core 1, and these processes will communicate through a pipe (a buffer held in kernel memory). Note that this is just an example and that OSes do run new processes on different cores without the taskset directive.
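The same two-process pipeline can be driven from a program. The sketch below assumes a POSIX system where the `sort` and `uniq` commands are on the PATH. It does not pin the processes to cores and leaves placement to the OS scheduler, as noted above. `sort_then_uniq` is a name made up for this example.

```python
import subprocess


def sort_then_uniq(lines):
    """Run `sort | uniq` as two separate OS processes connected by a pipe."""
    sorter = subprocess.Popen(
        ["sort"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    uniq = subprocess.Popen(
        ["uniq"], stdin=sorter.stdout, stdout=subprocess.PIPE, text=True
    )
    sorter.stdout.close()  # uniq now holds the only read end of the pipe
    sorter.stdin.write("\n".join(lines) + "\n")
    sorter.stdin.close()   # signal end of input to sort
    output, _ = uniq.communicate()
    sorter.wait()
    return output.splitlines()


print(sort_then_uniq(["b", "a", "b", "c", "a"]))  # ['a', 'b', 'c']
```

The two commands are independent single-threaded processes. The only thing they share is the pipe.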
With POSIX shared memory IPC, you can have parallel processes running on different cores and exchanging data in a dedicated memory zone.
And you can use Open MPI to run multiprocess parallel programs on a multicore machine. Shared memory will be used to implement MPI message passing.
But in either case, compared to multithreading, the programming burden will be higher and performance much lower.
Given a cluster of several nodes, each of which hosts multiple-core processor, is there any advantage of using MPI between nodes and OpenMP/pthreads within nodes over using pure all-MPI? If I understand correctly, if I run an MPI-program on a single node and indicate the number of processes equal to the number of cores, then I will have an honest parallel MPI-job of several processes running on separate cores. So why bother about hybrid parallelization using threads within nodes and MPI only between nodes? I have no question in case of MPI+CUDA hybrid, as MPI cannot employ GPUs, but it can employ CPU cores, so why use threads?
Using a combination of OpenMP/pthread threads and MPI processes is known as Hybrid Programming. It is tougher to program than pure MPI but with the recent reduction in latencies with OpenMP, it makes a lot of sense to use Hybrid MPI. Some advantages are:
Avoiding data replication: Since threads can share data within a node, if any data needs to be replicated between processes, we can avoid this.
Light-weight : Threads are lightweight and thus you reduce the meta-data associated with processes.
Reduction in number of messages : A single process within a node can communicate with other processes, reducing number of messages between nodes (and thus reducing pressure on the Network Interface Card). The number of messages involved in collective communication is notable.
Faster communication: As pointed out by #user3528438 above, since threads communicate using shared memory, you can avoid using point-to-point MPI communication within a node. A recent approach (2012) recommends using RMA shared memory instead of threads within a node - this model is called MPI+MPI (search Google Scholar for "MPI plus MPI").
Hybrid MPI has its disadvantages as well, but you asked only about the advantages.
This is in fact a much more complex question than it looks.
It depends on a lot of factors. From experience I would say: you are always happy to avoid hybrid OpenMP-MPI, which is a mess to optimise. But there are moments when you cannot avoid it, depending mainly on the problem you are solving and the cluster you have access to.
Let's say you are solving a highly parallelizable problem and you have a small cluster; then hybrid will probably be useless.
But suppose you have a problem which scales well up to N processes but starts to show very bad efficiency at 4N, and you have access to a cluster with 10N cores... Then hybridization will be a solution. You will use a small number of threads per MPI process, something like 4 (it is known that >8 is not efficient).
(It's fun to think that on KNL most people I know use 4 to 8 threads per MPI process even though one chip has 68 cores.)
Then what about hybrid accelerator/OpenMP/MPI?
You are wrong about accelerator + MPI. As soon as you start to use a cluster which has accelerators, you will need something like OpenMP/MPI, CUDA/MPI or OpenACC/MPI, as you will need to communicate between devices. Nowadays you can bypass the CPU using GPUDirect (at least for Nvidia; no clue for other vendors, but I expect it would be the case). Then usually you will use 1 MPI process per GPU. Most clusters with GPUs will have 1 socket and N accelerators (N
I have been reading lately about system architecture and the topic of multi-threading has not been covered in detail with latest improvements in technology. I did my part of search, but could not find answers for the following:
The questions I have are:
1) Is multi-threading dependent on the system architecture (CPU)? Do all CPUs (single core) support multi-threading? If not, what happens to multi-threaded applications when they are run on those machines?
It is cited here that
Intel CPUs support multithreading, but only two threads per CPU.
AMD CPUs do not support multithreading and AMD often sites Microsoft's
recommendations to turn off Hyperthreading on Intel CPUs when running applications
like peoplesoft and Exchange.
2) So what does it mean to say only two threads per CPU here? At any given time, a CPU (single core) can process only one thread, and the other thread is waiting to be processed, correct?
3) How is it different from an application that spawns, say, 10 threads and waits for them to be executed? If the CPU can tackle at most two threads, shouldn't the programmer keep that fact in mind when writing multi-threaded applications?
Even with multi-core processors (say quad-core), at most 8 threads can be queued, but only 4 threads can be processed at the same time.
P.S.: I have read a little about hyper-threading, but I am not sure if that is relevant here and if all processors support hyper-threading.
1) It depends on the operating system more than anything. Even for single core architectures, multi-threading can be supported, but the threads are not executing in parallel - The OS will context-switch between them.
2) Intel usually supports two-way hardware threading (also called simultaneous multi-threading), where each hardware thread has its own architectural state (registers) while sharing the core's execution resources. So if you have a process with two threads they can both execute on the same core simultaneously.
3) See 1. Basically the operating system is going to allocate as many threads as it can to hardware before it plans to context-switch between the threads it couldn't allocate. This process is dependent on the OS's scheduler, and you can read about the Linux one to get a good idea of what's going on.
Edit: Hyperthreading is basically the hardware threading feature I mentioned.
In your question CPU means core.
1) It does. I believe memory access on ARM is done in words, so a write to a char is not atomic. Memory ordering also differs.
Modern OSes (anything but DOS) support context switching: while one thread executes, others wait. The total number of threads across all Windows processes is about 1000. A common time quantum (the time a thread holds the CPU) is 1-10 ms. Multithreading on one core doesn't improve computational power but allows asynchronous tasks. For example, the GUI doesn't freeze during network activity: one thread waits on the network, another responds to user activity.
2) Yes
3) It is common practice to spawn a number of threads equal to the number of (virtual) cores, i.e. the number of physical cores on CPUs without hyper-threading and twice that on CPUs with it. This holds only for computational threads. Web server threads usually wait on the network and don't load the CPU much, so it can be better to spawn thousands of threads.
Hyperthreading is good for tasks that wait on RAM: while one thread waits for data, another one executes. For math it usually does not increase performance. It is good for working with data that is not cache-friendly: lists, trees, hash tables that don't fit into cache.
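The point that waiting threads barely load the CPU, so far more of them than cores can be useful, can be checked with a small experiment. In this sketch `time.sleep` stands in for waiting on the network, an assumption made so that the example needs no real server. `fake_request` is a made-up name and the numbers are arbitrary.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

N_TASKS = 20


def fake_request(i):
    """Simulated network wait: the thread is blocked and uses no CPU."""
    time.sleep(0.1)
    return i


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_TASKS) as pool:
    results = list(pool.map(fake_request, range(N_TASKS)))
elapsed = time.perf_counter() - start

print(f"{N_TASKS} waits of 0.1 s on {os.cpu_count()} logical CPUs took {elapsed:.2f} s")
```

Run one after another, the 20 waits would take about 2 seconds. With one thread per wait the total stays close to a single wait, even on a machine with few cores. For computational work the same trick would not help, which is why the thread count is usually tied to the core count there.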
Can anyone help me out? I am working on a presentation and would like to include a bit about 'The difference between multicore and concurrent programming'. I have googled a bit but am not turning up many good descriptions. Any help appreciated! :)
Thanks,
Eamonn
Concurrent (occurring or existing simultaneously) implies that different code MAY execute at the exact same cycle. It means that things can possibly happen in parallel if multiple processors or a processor with multiple cores is available and the program is crafted correctly. Just adding threads does not imply concurrent execution.
The reason I say MAY and possibly is that any time the program's separate threads need to share volatile/mutable state, other threads that need access to that state cannot continue executing and will have to wait their turn to access that state, and things start happening serially again.
Typically this is implemented in a single program as more than one thread executing code concurrently at the same exact cycle as another thread, given that there is no resource contention as described above. This requires multiple physical processors or cores. Other models run multiple heavyweight OS processes that can execute concurrently.
Concurrent programming is very hard to do correctly with mutable shared state.
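The "wait their turn" behaviour around shared mutable state is what a lock enforces. A minimal sketch, with the hypothetical names `SharedCounter` and `hammer`: four threads increment one counter, and the lock makes each increment happen serially even though the threads themselves run concurrently.

```python
import threading


class SharedCounter:
    """Mutable state shared between threads, guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # only one thread at a time gets past this point
            self.value += 1


def hammer(counter, n_threads=4, per_thread=10_000):
    """Increment the counter from several threads and return the final value."""

    def work():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(hammer(SharedCounter()))  # 40000
```

The result is correct because access to the shared value is serialized, which is also why heavy sharing removes much of the benefit of running on several cores.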
You can write a concurrent program that runs serially on a single-core processor, but scales up to execute more things at the same time when more processors or cores, or even multiple processors with multiple cores, are present.
You can also cause single-threaded programs to appear concurrent on a multi-core / multi-processor system if they can operate on independent ranges of input data at the same time. Example: on a dual-core machine, a single-threaded 3D rendering program can run as 2 separate instances, the first rendering all the odd frames and the second rendering all the even frames, as long as they don't try to share any mutable resources.
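The odd/even split in the rendering example amounts to giving each instance a disjoint slice of the frame numbers. A small sketch, assuming frames are numbered from 1 and instances from 1. `frames_for_instance` is a hypothetical helper, not part of any renderer.

```python
def frames_for_instance(instance, n_instances, total_frames):
    """Frames (numbered from 1) rendered by one of n independent instances."""
    return [
        frame
        for frame in range(1, total_frames + 1)
        if frame % n_instances == instance % n_instances
    ]


print(frames_for_instance(1, 2, 6))  # [1, 3, 5]
print(frames_for_instance(2, 2, 6))  # [2, 4, 6]
```

Because the two lists never overlap, the instances need no shared mutable state and can each stay single-threaded.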
Multi-core means that a single CPU has multiple Processor cores that can execute threads or processes concurrently and typically appears as multiple processors to mainstream operating systems.
It does NOT imply that programs that are single threaded gain any concurrency behaviors or benefits from the additional processor cores available.
Concurrent programming is broader - it just refers to writing software that will run "concurrently" - i.e. more than one thing will happen at a time.
"Multi-core" programming really refers to a specific subset of concurrent programming, in which you are targeting multiple available CPU cores on a specific machine. This is the most common form of concurrent programming (typically a single process running on a single computer), but still only one form of concurrent programming.
You can do concurrent programming on a machine that has only a single CPU core. The operating system provides the illusion that more than one thread is running at the same time, it rapidly switches back-and-forth between them.
A machine with multiple cores simply needs to do this context switching less often, since two threads can run at the same time on two cores. It is only a bit special because threading bugs can make your life difficult much quicker: the odds that two threads try to access a shared memory location at the same time are much higher.
At a high level, multi-core is an attribute of the processor chip in your computer. Multi-core means it has multiple processing cores. There are several types of multi-processor computers: the old-style supercomputers with thousands of computers connected via Ethernet, systems with more than one processor (like 2 Pentium 4s), and contemporary multi-core systems where every processor package has multiple processing cores (like Intel i7). The third type is often called multi-core or Chip Multiprocessor (CMP).
Concurrent programming is an attribute of software. Concurrent programming is about writing code which is split into multiple tasks that can execute concurrently if processors are available. While concurrent programs do leverage multi-core, concurrent programming is broader in two dimensions:
Concurrent programs can run on a single core or multiple cores.
Concurrent programs can be used on any type of multi-processors I mentioned above.
Thus, to summarize:
Concurrent programming is about software that can use multiple processors if available. Those processors can be on the same chip (multi-core or Chip Multiprocessor) or on different chips (often known as SMP). You can have systems where you can put two multi-core chips in the same system, making it a CMP and an SMP at the same time. Concurrent programming will work for that as well.
Concurrent programming regards operations that appear to overlap and is primarily concerned with the complexity that arises due to non-deterministic control flow. The quantitative costs associated with concurrent programs are typically both throughput and latency. Concurrent programs are often IO bound but not always, e.g. concurrent garbage collectors are entirely on-CPU. The pedagogical example of a concurrent program is a web crawler. This program initiates requests for web pages and accepts the responses concurrently as the results of the downloads become available, accumulating a set of pages that have already been visited. Control flow is non-deterministic because the responses are not necessarily received in the same order each time the program is run. This characteristic can make it very hard to debug concurrent programs. Some applications are fundamentally concurrent, e.g. web servers must handle client connections concurrently. Erlang, F# asynchronous workflows and Scala's Akka library are perhaps the most promising approaches to highly concurrent programming.
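The crawler described above can be sketched with a thread pool. This is a simulation under stated assumptions: `LINKS` is a made-up link graph standing in for real pages, and `fetch` sleeps for a random time instead of doing network I/O. The order in which pages complete varies from run to run, which is the non-determinism the answer refers to, while the set of visited pages stays the same.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Hypothetical link graph standing in for real web pages.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def fetch(url):
    """Simulated download with a variable delay; returns the page's links."""
    time.sleep(random.uniform(0.01, 0.05))
    return url, LINKS[url]


def crawl(start):
    """Return (set of visited pages, order in which responses arrived)."""
    visited = {start}
    arrival_order = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        pending = {pool.submit(fetch, start)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                url, links = future.result()
                arrival_order.append(url)
                for link in links:
                    if link not in visited:  # request each page only once
                        visited.add(link)
                        pending.add(pool.submit(fetch, link))
    return visited, arrival_order


pages, order = crawl("/")
print(sorted(pages), order)
```

The `visited` set is only touched by the coordinating thread, so this particular sketch needs no lock. A design where workers update it directly would.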
Multicore programming is a special case of parallel programming. Parallel programming concerns operations that are overlapped for the specific goal of improving throughput. The difficulties of concurrent programming are evaded by making control flow deterministic. Typically, programs spawn sets of child tasks that run in parallel and the parent task only continues once every subtask has finished. This makes parallel programs much easier to debug than concurrent programs. The hard part of parallel programming is performance optimization with respect to issues such as granularity and communication. The latter is still an issue in the context of multicores because there is a considerable cost associated with transferring data from one cache to another. Dense matrix-matrix multiply is a pedagogical example of parallel programming and it can be solved efficiently by using Strassen's divide-and-conquer algorithm and attacking the sub-problems in parallel. Cilk is perhaps the most promising approach for high-performance parallel programming on multicores and it has been adopted in both Intel's Threading Building Blocks and Microsoft's Task Parallel Library (in .NET 4).
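The "spawn child tasks, continue only once every subtask has finished" structure can be sketched on the matrix-multiply example. To keep it short, this uses a plain split of the rows of A into blocks, not Strassen's algorithm, and it uses threads, which in standard CPython will not speed up pure-Python arithmetic because of the global interpreter lock. The point is the deterministic fork-join shape only. `multiply_rows` and `parallel_matmul` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a_rows, b):
    """Child task: multiply a block of rows of A by the whole of B."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a_rows]


def parallel_matmul(a, b, n_tasks=2):
    """Fork one task per row block, then join: continue only when all are done."""
    step = max(1, -(-len(a) // n_tasks))  # ceiling division
    blocks = [a[i:i + step] for i in range(0, len(a), step)]
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        # pool.map returns results in submission order, so output is deterministic.
        parts = list(pool.map(multiply_rows, blocks, [b] * len(blocks)))
    return [row for part in parts for row in part]


a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]
print(parallel_matmul(a, b))  # [[25, 28], [57, 64], [89, 100]]
```

Whatever order the child tasks finish in, the parent assembles the same result, which is what makes this style easier to debug than the crawler-style concurrency discussed before it.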