I have written software that takes advantage of POSIX threads so that I can use shared memory within the process. My question: if I have a machine running Ubuntu with 4 processors, each with 16 cores, is it more efficient to run 4 processes each with 16 threads, or 1 process with 64 threads? Each processor has a dedicated 32 GB of RAM.
My main worry is that there will be a lot of memory copying happening behind the scenes with 1 process.
In summary:
On a 4-processor (16 cores each) machine:
1 process with 64 threads, or 4 processes with 16 threads each?
If the process requires more than 32 GB of RAM (the amount dedicated to one processor), does the answer differ?
Thanks for your help
Depends on what your application does.
A thread in a single-threaded process runs faster than a thread in a multi-threaded process, since the latter requires synchronization between threads in library functions like malloc(), fprintf(), etc. Also, more threads in a multi-threaded process are likely to cause more lock contention, slowing each other down. If threads don't need to communicate and don't share data, they don't need to be in the same process.
In your case, you may get better parallelism with 4 processes of 16 threads each rather than 1 process with 64 threads.
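Whichever you choose, note that "each processor with its own dedicated 32 GB" describes a NUMA machine: keeping a process's threads on one socket keeps its memory accesses local. A minimal sketch of pinning 16 threads (Linux/glibc; the assumption that CPUs 0-15 all belong to one socket is mine, check the real numbering with lscpu):

    /* Pin each of this process's 16 threads to CPUs 0-15.
       Assumes (!) CPUs 0-15 sit on one socket; verify with lscpu.
       Compile with: gcc -pthread pin.c */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 16

    static void *worker(void *arg) {
        long id = (long)arg;
        /* ... CPU-bound work on memory this thread allocates itself,
           so the pages land on the local NUMA node ... */
        printf("thread %ld on cpu %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++) {
            pthread_create(&tid[i], NULL, worker, (void *)i);
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)i, &set);          /* one thread per core 0..15 */
            pthread_setaffinity_np(tid[i], sizeof(set), &set);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }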
From what I understand from the top answers of this post (https://stackoverflow.com/questions/16116952/can-multithreading-be-implemented-on-a-single-processor-system#:~:text=Yes%2C%20you%20can%20have%20multiple,one%20thing%20at%20a%20time.),
if I am only running one multithreaded program that creates 4 threads on a multicore CPU system with 4 cores, there is no need for scheduling, as all 4 threads of my program will be running on individual cores (or microprocessors). But there may be a need for synchronization, since all 4 threads access the memory of the program (or process), which is stored in the same address space in main memory.
On the other hand,
on a single-core CPU computer, if I run the same program that creates 4 threads, I will need both synchronization and scheduling, since all threads must use the same core (or microprocessor).
Please correct my understanding if it is wrong.
there is no need for scheduling, as all 4 threads of my program will be running on individual cores
This is not true in practice. The OS scheduler operates in both cases. Unless you pin threads to cores, threads can migrate from one core to another. In fact, even if you pin them, there are generally a few other threads that can be ready on the machine (e.g. the ssh daemon, tty sessions, graphical programs, kernel threads, etc.), so the OS has to schedule them. There will be context switches, though the number will be much lower than on a single processor.
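You can watch these migrations happen; a minimal sketch using the Linux-specific sched_getcpu():

    /* Each thread reports whenever the scheduler moves it to a
       different core (Linux-only: sched_getcpu). */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        long id = (long)arg;
        int last = -1;
        for (long i = 0; i < 100000000L; i++) {
            int cpu = sched_getcpu();
            if (cpu != last) {              /* migrated (or just started) */
                printf("thread %ld now on cpu %d\n", id, cpu);
                last = cpu;
            }
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }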
there may be a need for synchronization, since all 4 threads access the memory of the program (or process), which is stored in the same address space in main memory
This is true. Note that threads can also work on different memory areas (so there is no need for synchronization except when they are joined). Note also that "main memory" includes the CPU caches here.
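For example, a minimal sketch where threads work on disjoint slices of one array, so the joins are the only synchronization:

    /* Four threads sum disjoint quarters of one array; no locking is
       needed because each thread writes only its own result slot. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 4
    #define LEN 1000000

    static long data[LEN];
    static long partial[N];                 /* one slot per thread */

    static void *sum_slice(void *arg) {
        long id = (long)arg;
        long lo = id * (LEN / N), hi = lo + LEN / N, s = 0;
        for (long i = lo; i < hi; i++) s += data[i];
        partial[id] = s;                    /* private slot: no contention */
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < LEN; i++) data[i] = 1;
        pthread_t t[N];
        for (long i = 0; i < N; i++)
            pthread_create(&t[i], NULL, sum_slice, (void *)i);
        long total = 0;
        for (long i = 0; i < N; i++) {
            pthread_join(t[i], NULL);       /* the join is the sync point */
            total += partial[i];
        }
        printf("total = %ld\n", total);
        return 0;
    }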
on a single-core CPU computer, if I run the same program that creates 4 threads, I will need both synchronization and scheduling, since all threads must use the same core (or microprocessor)
Overall, yes. That being said, the term "scheduling" is unclear: there are multiple kinds of scheduling, e.g. preemptive vs. cooperative scheduling. Here, as a programmer, you do not need to do anything special, since the scheduling is done by the OS. Thus, it is a bit unexpected to say that you "need" scheduling. The OS will schedule the threads on the same core using preemption (by allocating different time slices to each thread on the same core).
People always describe a CPU as having 4 cores & 8 threads, or 2 cores & 2 threads, etc.
But in a thread pool, quite a lot of tiny threads are generated; are these related to the hardware threads?
I am wondering whether the CPU threads are actually processes.
Also, I think the actual thread is just a block of code that runs a while loop and executes an available task, otherwise sleeping. Is this statement correct?
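What I have in mind is roughly the following sketch (all the names are made up for illustration):

    /* One pool worker: pop a task when one is queued, otherwise sleep
       on a condition variable instead of spinning. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    typedef struct task {
        struct task *next;
        void (*run)(void *);
        void *arg;
    } task_t;

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
    static task_t *queue = NULL;                 /* shared task queue */

    static void *pool_worker(void *unused) {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (queue == NULL)                /* nothing to do: sleep */
                pthread_cond_wait(&wake, &lock);
            task_t *t = queue;                   /* pop one task */
            queue = t->next;
            pthread_mutex_unlock(&lock);
            t->run(t->arg);                      /* run it outside the lock */
            free(t);
        }
        return NULL;
    }

    static void submit(void (*run)(void *), void *arg) {
        task_t *t = malloc(sizeof *t);
        t->run = run;
        t->arg = arg;
        pthread_mutex_lock(&lock);
        t->next = queue;                         /* push onto the queue */
        queue = t;
        pthread_mutex_unlock(&lock);
        pthread_cond_signal(&wake);              /* wake one sleeping worker */
    }

    static void hello(void *arg) { printf("task: %s\n", (char *)arg); }

    int main(void) {
        pthread_t w;
        pthread_create(&w, NULL, pool_worker, NULL);
        submit(hello, "hi from the pool");
        sleep(1);   /* crude: give the worker time to run before exiting */
        return 0;
    }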
On the hardware side, the CPU has cores, and each core has 1-8 schedulable threads: new POWER CPUs have up to 8 threads per core, Knightbridge(?) has 4, most other desktop CPUs have 2, and older and/or smaller CPUs have 1.
On the software side, a program can have multiple processes (each with a different virtual memory mapping), a process can have multiple software threads (sharing the process's memory map), and a software thread is the scheduling partner for a hardware thread.
Then again, you can have a logical thread in software, often called a fiber, which is a user-scheduled mini-thread run by a software thread.
For multicore computing, one thing that has confused me from the beginning is that the model of multicore hardware is too abstracted from the real machine.
I worked on a laptop with a single Intel processor containing 4 cores and supporting hyper-threading, which makes the number of logical cores 8.
Suppose I have a Java program implementing a concurrent algorithm (it is said that Java follows the OS's thread-scheduling rules, so the JVM won't affect the scheduling), and the program is purely CPU-bound.
My observations:
if the program (process) runs fewer than 8 threads, the parallelism of the work increases as the number of threads increases;
when the total number of threads is larger than 8, the performance becomes complicated, but usually there is no more improvement than with 8 threads; and for certain algorithms it is much worse, i.e. time consumption increases hugely compared to running 8 threads.
My knowledge of this:
As far as I know, the program I run is treated as a user process by the OS, and if the program creates threads to try to gain parallelism, the OS will try to schedule these threads among the available cores.
Any threads of the process on the same core share the total execution time of the process on that core.
My questions:
Suppose the CPU is only running my program, i.e. there are no other user processes.
if there is only 1 core in the CPU, the process will get no benefit of parallelism from multi-threading, since the total execution time of the process will not change. Is it true?
if there is more than one core available, the OS will try to schedule the threads of the process evenly and fairly on different cores, and the process's threads on different cores get their own (extra) execution time, therefore speeding things up. Is it true?
if there are n threads and m cores, where n > m, then some cores may run more than 1 thread of the process, which may even harm the speed-up of parallelism because of context switches among threads on the same core and, potentially, the side effect of the process's threads running at different speeds. Is this true?
Thanks very much!
if there is only 1 core in the CPU, the process will get no benefit of parallelism from multi-threading, since the total execution time of the process will not change. Is it true?
Only if you are 100% CPU-bound. If you have I/O waits, multiple threads can help a lot even on a single core.
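A minimal sketch that shows this (Linux-specific: sched_setaffinity pins the whole process to one CPU to simulate a single core, and sleep() stands in for a blocking I/O call). Four threads each "wait on I/O" for a second, yet the program finishes in about one second, not four:

    /* Simulate a single-core machine, then overlap four 1-second
       "I/O waits" with threads. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static void *io_bound(void *arg) {
        (void)arg;
        sleep(1);               /* stands in for a blocking read/recv */
        return NULL;
    }

    int main(void) {
        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(0, &one);
        sched_setaffinity(0, sizeof(one), &one);   /* pin main; new
                                                      threads inherit it */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, io_bound, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("elapsed: %.2fs\n", (t1.tv_sec - t0.tv_sec)
                                 + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }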
if there is more than one core available, the OS will try to schedule the threads of the process evenly and fairly on different cores
That seems to be up to the discretion of the OS. There could be all kinds of quota and priorities involved.
may even harm the speed-up of parallelism because of context switches among threads on the same core
It is true that there is overhead in managing extra threads (not just at the scheduling level, but also in synchronizing and communicating within your application), and if these threads cannot make productive use of otherwise idle CPU cores, then having fewer threads can actually improve performance.
Let's say we have a CPU with 20 cores and a process with 20 CPU-intensive threads that are independent of each other: one thread per CPU core.
I'm trying to figure out whether context switching happens in this case. I believe it does, because there are system processes in the operating system that need CPU time too.
I understand that there are different CPU architectures and some answers may vary, but can you please explain:
How does context switching happen, e.g. on Linux or Windows and some known CPU architectures? And what happens under the hood on modern hardware?
What if we have 10 cores and 20 threads, or the other way around?
How to calculate how many threads we need if we have n CPUs?
Does the CPU cache (L1/L2) get emptied after a context switch?
Thank you
How does context switching happen, e.g. on Linux or Windows and some known CPU architectures? And what happens under the hood on modern hardware?
A context switch happens when an interrupt occurs and that interrupt, together with the kernel's thread and process state data, specifies a set of running threads that is different from the set running before the interrupt. Note that, in OS terms, an interrupt may be either a 'real' hardware interrupt that causes a driver to run, with that driver requesting a scheduling run, or a syscall from a thread that is already running. In either case, the OS scheduling state machine decides whether to change the set of threads running on the available cores.
The kernel can change the set of running threads by stopping some threads and running others. It can stop any thread running on any core by queueing up a preemption request and generating a hardware interrupt on that core, forcing the core to run its interprocessor driver to handle the request.
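On Linux you can watch this happen per process; a minimal sketch reading the context-switch counters the kernel exposes in /proc/self/status:

    /* Print this process's voluntary/nonvoluntary context-switch
       counters before and after a stretch of busy work (Linux-only). */
    #include <stdio.h>
    #include <string.h>

    static void show_ctxt_switches(void) {
        FILE *f = fopen("/proc/self/status", "r");
        char line[256];
        while (f && fgets(line, sizeof line, f))
            if (strstr(line, "ctxt_switches"))   /* both counter lines */
                fputs(line, stdout);
        if (f) fclose(f);
    }

    int main(void) {
        show_ctxt_switches();
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 1000000000UL; i++)
            x += i;                              /* CPU-bound busy work */
        show_ctxt_switches();                    /* nonvoluntary count has
                                                    likely grown: preemption */
        return 0;
    }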
What if we have 10 cores and 20 threads?
Depends on what the threads are doing. If they are in any state other than ready/running (e.g. blocked on I/O or inter-thread comms), there will be no context switching between them because nothing is running. If they are all ready/running, 10 of them will run on the 10 cores until there is an interrupt. Most systems have a periodic timer interrupt that can have the effect of sharing the available cores among the threads.
or the other way around
10 threads run on 10 cores; the other 10 cores are halted. The OS may move the threads around the cores, e.g. to prevent uneven heat dissipation across the die.
How to calculate how many threads we need if we have n CPUs?
App-dependent. It would be nice if all cores were always 100% busy with exactly as many ready threads as there are cores, but since most threads are blocked for much more time than they are running, it's difficult, except in some end cases (e.g. your '20 CPU-intensive threads on 20 cores'), to come up with any optimal number.
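For CPU-bound work, a common starting point is one worker thread per online core; a minimal sketch (POSIX sysconf):

    /* Size a worker pool from the number of online cores; a reasonable
       default for CPU-bound work, while I/O-bound work often wants more. */
    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);  /* cores visible
                                                        to this process */
        if (ncpus < 1) ncpus = 1;
        printf("starting %ld worker threads\n", ncpus);
        /* ... create ncpus pthreads here ... */
        return 0;
    }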
Does the CPU cache (L1/L2) get emptied after a context switch?
Maybe - it depends entirely on the data usage of the threads. The caches will get reloaded on demand, as usual. There is no 'context-switch total cache reload', but if the threads access different, large arrays of data while running, then the cache (L1 at least) will indeed get fully reloaded during the thread's run.
Say I have a processor like this, which says # cores = 4, # threads = 4, and has no Hyper-Threading support.
Does that mean I can run 4 simultaneous programs/processes (since a core is capable of running only one thread)?
Or does that mean I can run 4 x 4 = 16 programs/processes simultaneously?
From my digging, if there is no Hyper-Threading, there will be only 1 thread (process) per core. Correct me if I am wrong.
A thread differs from a process. A process can have many threads. A thread is a sequence of commands that have a certain order. A logical core can execute one sequence of commands. The operating system distributes all the threads among all the logical cores available, and if there are more threads than cores, the threads are processed in a fast queue, with each core switching from one to another very fast.
It will look like all the threads run simultaneously, when actually the OS distributes CPU time among them.
Having multiple cores gives the advantage that fewer concurrent threads are placed on any single core; less switching between threads = greater speed.
Hyper-threading creates 2 logical cores on 1 physical core and makes switching between threads much faster.
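You can see that time slicing in action; a minimal sketch starting more CPU-bound threads than a typical machine has cores, all of which still finish:

    /* Start 16 CPU-bound threads; even on a 4-core machine the OS
       time-slices them, so all 16 make progress and complete. */
    #include <pthread.h>
    #include <stdio.h>

    static void *burn(void *arg) {
        long id = (long)arg;
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 100000000UL; i++)
            x += i;                          /* busy work */
        printf("thread %ld done\n", id);
        return NULL;
    }

    int main(void) {
        enum { NTHREADS = 16 };              /* more than typical cores */
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, burn, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }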
That's basically correct, with the obvious qualifier that most operating systems let you execute far more tasks simultaneously than there are cores or threads, which they accomplish by interleaving the execution of instructions.
A system with hyperthreading generally has twice as many hardware threads as physical cores.
The term thread is generally used as a description of an operating system concept that has the potential to execute independently of other threads. Whether it does so depends on whether it is stuck waiting for some event (disk or screen I/O, message queue), or if there are enough physical CPUs (hyperthreaded or not) to allow it to run in the face of other non-waiting threads.
Hyperthreading is a CPU-vendor term that means a single core that can multiplex its attention between two computations. The easy way to think about a hyperthreaded core is as if you had two real CPUs, both slightly slower than what the manufacturer says the core can actually do.
Basically this is up to the OS. A thread is a high-level construct holding an instruction pointer, and the OS places a thread's execution on a suitable logical processor. So with 4 cores you can basically execute 4 instructions in parallel, whereas a thread simply contains information about which instructions to execute and the instructions' placement in memory.
An application normally uses a single process during execution, and the OS switches between processes to give all processes "equal" processing time. When an application deploys multiple threads, the process allocates more than one slot for execution but shares memory between threads.
Normally you make a distinction between concurrent and parallel execution, where parallel execution is when you actually physically execute instructions on more than one logical processor, and concurrent execution is the frequent switching of a single logical processor, giving the appearance of parallel execution.