How are two processes allocated to the same logical CPU executed?

I'm on a 12th-gen Intel CPU with hyperthreading enabled. From my lscpu output, I can see that logical cores 0 and 1 both map to physical core 0. I understand that if I allocate processes P0 and P1 to logical cores 0 and 1, respectively, they'll essentially multiplex physical core 0's resources via SMT. If I allocate both P0 and P1 to logical core 0, however, how is this execution handled? Does the OS just continually context-switch them?
I'm seeing some interesting behavior: P0 and P1 run faster when they're both allocated to the same logical core (note that P0 and P1 communicate) than when they run as SMT siblings on the same physical core, or when they're allocated to two different physical cores altogether. I can't explain this.

If I allocate both P0 and P1 to logical core 0, however, how is this execution handled?
It depends on the scheduler and what the scheduler is told about P0 and P1. If one process has a higher priority or an earlier deadline, then it may be given 100% of CPU time (until it terminates, or unless it blocks temporarily). Regularly switching between the processes is a generic option that's often used when the scheduler can't distinguish between the processes (e.g. they have equal priority).
Note that this regular switching is mostly bad - e.g. if both processes would each take 1 hour of CPU time to complete, then after 1 hour you end up with no completed processes (2 "not quite half-completed, due to the extra overhead of context switches" processes) instead of 1 completed process (and 1 process that's made no progress); and after 2 hours you end up with no completed processes (2 "almost completed, due to the extra overhead of context switches" processes) instead of 2 completed processes.
I'm seeing some interesting behavior: P0 and P1 run faster when they're both allocated to the same logical core (note that P0 and P1 communicate) than when they run as SMT siblings on the same physical core, or when they're allocated to two different physical cores altogether. I can't explain this.
I can't explain it either - it depends on how the processes communicate (in addition to depending on the scheduler and what the scheduler is told).
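If you want to reproduce the comparison in a controlled way, a minimal Linux-only Python sketch is below. It assumes, as in your lscpu output, that logical CPUs 0 and 1 are SMT siblings of physical core 0, and that CPU 2 sits on a different physical core; the ping-pong worker is only a stand-in for whatever P0 and P1 actually do.

    # Minimal sketch (Linux-only): pin two communicating processes to chosen
    # logical CPUs and time them. Assumes logical CPUs 0 and 1 are SMT
    # siblings of physical core 0 (per the lscpu output above) and that
    # CPU 2 is on a different physical core.
    import os
    import time
    from multiprocessing import Process, Pipe

    def worker(conn, cpus, n=200_000):
        os.sched_setaffinity(0, cpus)   # restrict this process to the given logical CPUs
        for i in range(n):
            conn.send(i)                # ping-pong with the peer process
            conn.recv()
        conn.close()

    def run(cpus_a, cpus_b):
        a, b = Pipe()
        p0 = Process(target=worker, args=(a, cpus_a))
        p1 = Process(target=worker, args=(b, cpus_b))
        t = time.perf_counter()
        p0.start(); p1.start()
        p0.join(); p1.join()
        return time.perf_counter() - t

    if __name__ == "__main__":
        print("same logical core    :", run({0}, {0}))
        print("same physical core   :", run({0}, {1}))   # SMT siblings
        print("different phys. cores:", run({0}, {2}))   # assumes CPU 2 is another core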

Related

Purpose of multiprocessors and multi-core processor

I want to clarify things in my head and build concrete knowledge. In a dual-core, single-processor system, two threads within one process can be executed concurrently, one by each core. In a uni-core, two-processor system, two different processes can be executed, one by each CPU.
So can we say that each processor can execute processes concurrently, while a multi-core processor executes threads within a process concurrently?
I think you have a fundamental misunderstanding of what a process and thread are and how they relate to the hardware itself.
As a simplified model, a CPU core executes 1 machine-level instruction per clock cycle (so essentially, just 1 assembly instruction). CPUs are typically rated by the number of clock cycles they go through in a second, so in this model a 2.5 GHz core executes about 2.5 billion instructions per second.
The OS (the operating system, like Windows, Linux, macOS, Android, iOS, etc.) is responsible for launching programs and giving them access to the hardware resources. Each program can be considered a "process".
Each process can launch multiple threads.
To ensure that multiple processes can share the same hardware resources, the idea of pre-emptive multitasking came about over 40 years ago.
In a nutshell, pre-emptive multitasking, or time-slicing, is a function of the OS. It basically gives a few milliseconds to each thread that is running, regardless of which process that thread is a part of, and keeps the "context" of each thread so that the thread's state can be restored when it's that thread's turn to run again; switching from one thread to another is known as a context switch.
A dual, quad, or even 128-core CPU does not change that, nor does the number of CPUs in the system (e.g. 4 CPUs each with 128 cores). Each core can only execute 1 instruction per clock cycle.
What changes is how many instructions can be run in true parallel. If my CPU has 16 cores, then that means it can execute 16 instructions per clock cycle, and thus run 16 separate threads of execution without any context switching being necessary (though it does still happen, but that's a different issue).
This doesn't cover hyper-threading, in which 1 core presents itself as 2 logical cores and interleaves 2 hardware threads (sharing the core's execution resources rather than doubling them), and it doesn't cover cache misses or other low-level effects that can cost a thread extra cycles, but it covers the general idea of CPU scheduling.
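As a rough demonstration of time-slicing, the sketch below starts twice as many CPU-bound workers as there are logical CPUs and lets the OS scheduler interleave them; it uses processes rather than threads so Python's GIL doesn't serialize the work, and the iteration count is an arbitrary placeholder.

    # Rough sketch: start more CPU-bound workers than there are logical CPUs
    # and let the OS time-slice them; they all still finish. Processes are
    # used instead of threads so Python's GIL doesn't serialize the work.
    import os
    import time
    from multiprocessing import Process

    def burn(n=20_000_000):
        s = 0
        for i in range(n):      # pure CPU work, no I/O
            s += i

    if __name__ == "__main__":
        cpus = os.cpu_count()
        workers = [Process(target=burn) for _ in range(cpus * 2)]  # 2x oversubscription
        t = time.perf_counter()
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print(f"{len(workers)} workers on {cpus} logical CPUs took "
              f"{time.perf_counter() - t:.2f}s")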

Do different threads of a process running on different physical cores of a multi-core processor need separate contexts?

A process is the smallest unit for allocating resources. The thread is the smallest scheduling unit.
Does this mean that a process contains at least one thread? Is the thread equal to the process when there is only one thread in the process?
Many processors today are multi-core. Suppose I have a process P. There are two threads in this process P, A and B. I want A and B to run on core 0 and core 1 of a CPU, respectively.
But we know that the process needs to be allocated resources, and the process has a context. Is the context generally stored in registers? If so, different physical cores use different physical registers. Then what happens when thread A and thread B run on core 0 and core 1, respectively?
So do these two cores each need to be allocated resources? In that case, how do these two threads maintain consistency? If each thread has its own resources, hasn't this become two processes? Does this mean that different threads of a process running on different cores are the same as different processes running on different cores?
The vast majority of resources in an SMP system, with the exception of registers and processing capacity, are shared across all cores. This includes memory. So the operating system can schedule multiple threads on different cores, all pointing to a shared set of process resources in memory.
CPU caches are handled by the cores using a cache-coherency protocol. As long as the threads follow the memory model, with correct use of memory barriers/atomic instructions, the memory visible through the caches should appear the same to all cores.
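In Python terms, a small sketch of "threads of one process share its memory" looks like the following; the Lock stands in for the correct synchronization mentioned above (in C you would use atomics or barriers), and the counter is just a placeholder for shared process data.

    # Sketch: two threads of the same process update the same counter, because
    # they share the process's memory. The Lock stands in for the "correct use
    # of memory barriers/atomic instructions" above; even under CPython's GIL,
    # "counter += 1" is not atomic, so unsynchronized updates can be lost.
    import threading

    counter = 0
    lock = threading.Lock()

    def bump(n=100_000):
        global counter
        for _ in range(n):
            with lock:          # synchronized read-modify-write
                counter += 1

    threads = [threading.Thread(target=bump) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)              # 200000: both threads touched the same memory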

What is the difference in scheduling threads?

I am currently learning about simultaneous multi-threading, multi-core and multi-processor scheduling of threads. I checked some information, and my understanding is:
If there is a processor that supports simultaneous multi-threading, it turns one physical core into two logical cores. There are two processes, P1 and P2.
My understanding: In Linux, each process is composed of at least one thread? So scheduling is based on thread scheduling?
P1 and P2 are scheduled to the two logical cores, respectively, and they operate independently. This is the first situation. Now suppose there is a process P3 that consists of two threads, t1 and t2, and t1 and t2 are scheduled to different logical cores. What is the difference between scheduling two different processes to separate logical cores and scheduling two threads of the same process to separate logical cores?
My understanding: The process is the smallest unit of the system to allocate resources, and threads share the resources of the process. Threads in a process share virtual memory and the PCB, and can access the same data. Therefore, when scheduling different threads in a process and scheduling threads in different processes, there is no difference for the processor. The difference lies in the address translation of the page table and whether the cache can be shared. For a multi-core processor, the processor does not care whether the threads belong to the same process. The consistency of the data is guaranteed by MESI. The physical location of the data is guaranteed by the page table. Is my understanding correct?
Right, there's no difference. The kernel just schedules tasks; each user task refers to a page table (whether that's shared with any other task or not).
Each logical CPU core has its own page-table pointer (e.g. x86 CR3).
And yes, cache coherency is maintained by hardware. The Linux kernel's hand-rolled atomics (using volatile, and inline asm for RMWs and barriers) depend on that.
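As a small illustration of the shared vs. separate address space point, here is a hedged Python sketch: a write made by a thread is visible afterwards because the threads share one set of page tables, while a write made by a child process only changes that process's own copy.

    # Sketch: threads of one process share an address space (one set of page
    # tables); separate processes each have their own. A write made by a
    # thread is visible afterwards; a write made by a child process is not.
    import threading
    from multiprocessing import Process

    value = "original"

    def mutate():
        global value
        value = "changed"

    if __name__ == "__main__":
        t = threading.Thread(target=mutate)
        t.start(); t.join()
        print("after thread  :", value)    # "changed"  - same address space

        value = "original"
        p = Process(target=mutate)
        p.start(); p.join()
        print("after process :", value)    # "original" - separate address space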

Can this be viewed as an I/O bound task?

Let's say you have 4 physical cores on your computer, and let's assume there is no hyperthreading and that the Python version is 3.2+ (although I am not sure whether this extra information matters for my question).
Suppose I open a pool of 3 subprocesses, so that each subprocess occupies one physical core while doing CPU-bound work, and I also open 3 threads in the current process (which occupies the one remaining core, where the OS is also running), and I send CPU-bound tasks through multiprocessing to each of the 3 subprocesses. Then the question is this:
From the perspective of the current process that is managing the threads (these threads are pushing tasks out to each subprocess and waiting for the results to come back), can these CPU-bound tasks be viewed as I/O-bound tasks (from the perspective of the current process), since the current process is not actually doing any work? Equivalently, will the 3 threads go to sleep while the 3 subprocesses are crunching away at the numbers and occupying 3 of the cores, letting the last core sit there idling?
will the 3 threads go to sleep while the 3 subprocesses are crunching away at the numbers and occupying 3 of the cores, letting the last core sit there idling?
Yes. I can't imagine what else could happen; do you have another possibility in mind? As you say, the threads are waiting.
In this situation you can probably use 4 processes to work on the CPU-bound tasks.
It sounds like your problem is well suited for multiprocessing.Pool. In that case note that if you don't specify the number of processes to use, it uses the number of CPU cores by default:
processes is the number of worker processes to use. If processes is None then the number returned by os.cpu_count() is used.
which is an official sign that using as many processes as cores is a normal practice.
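A minimal sketch of that suggestion is below: hand the CPU-bound tasks straight to a multiprocessing.Pool and let it default to os.cpu_count() workers; the work function here is just a placeholder for your real tasks.

    # Minimal sketch: give the CPU-bound tasks directly to a Pool. With
    # processes=None the pool defaults to os.cpu_count() workers; the main
    # process just waits, which is effectively I/O-bound behaviour.
    from multiprocessing import Pool

    def cpu_bound(n):
        return sum(i * i for i in range(n))    # placeholder for the real work

    if __name__ == "__main__":
        with Pool() as pool:                   # processes=None -> os.cpu_count()
            results = pool.map(cpu_bound, [2_000_000] * 8)
        print(results[:2])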

Threads vs Cores

Say I have a processor like this, which says # cores = 4, # threads = 4, and has no Hyper-Threading support.
Does that mean I can run 4 programs/processes simultaneously (since a core is capable of running only one thread at a time)?
Or does that mean I can run 4 x 4 = 16 programs/processes simultaneously?
From my digging, if there is no Hyper-Threading, there will be only 1 thread (process) per core. Correct me if I am wrong.
A thread differs from a process. A process can have many threads. A thread is a sequence of commands that have a certain order, and a logical core can execute one such sequence at a time. The operating system distributes all the threads across the available logical cores, and if there are more threads than cores, threads are placed in a queue and each core switches from one to another very fast.
It will look like all the threads run simultaneously, when actually the OS distributes CPU time among them.
Having multiple cores gives the advantage that fewer concurrent threads are placed on any single core; less switching between threads means greater speed.
Hyper-threading creates 2 logical cores on 1 physical core, and makes switching between threads much faster.
That's basically correct, with the obvious qualifier that most operating systems let you execute far more tasks simultaneously than there are cores or threads, which they accomplish by interleaving the execution of instructions.
A system with hyperthreading generally has twice as many hardware threads as physical cores.
The term thread is generally used as a description of an operating system concept that has the potential to execute independently of other threads. Whether it does so depends on whether it is stuck waiting for some event (disk or screen I/O, a message queue), or whether there are enough physical CPUs (hyperthreaded or not) to allow it to run in the face of other non-waiting threads.
Hyperthreading is a CPU vendor term that means a single core can multiplex its attention between two computations. The easy way to think about a hyperthreaded core is as if you had two real CPUs, both slightly slower than what the manufacturer says the core can actually do.
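If you want to check how many hardware threads map to each physical core on your own machine, a Linux-only sketch is below; it reads the standard sysfs topology files, so it won't work on other operating systems.

    # Linux-only sketch: group logical CPUs by physical core using sysfs.
    # On a hyperthreaded machine each sibling group typically lists two
    # logical CPUs; without SMT, just one.
    import glob

    sibling_groups = set()
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
        with open(path) as f:
            sibling_groups.add(f.read().strip())   # e.g. "0,1" or "0-1"

    print("physical cores:", len(sibling_groups))
    print("sibling groups:", sorted(sibling_groups))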
Basically this is up to the OS. A thread is a high-level construct holding an instruction pointer, and the OS places a thread's execution on a suitable logical processor. So with 4 cores you can basically execute 4 instructions in parallel, whereas a thread simply contains information about which instructions to execute and the instructions' placement in memory.
An application normally uses a single process during execution, and the OS switches between processes to give all processes "equal" processor time. When an application deploys multiple threads, the process allocates more than one slot for execution but shares memory between the threads.
Normally you distinguish between concurrent and parallel execution, where parallel execution is when you actually, physically execute instructions on more than one logical processor, and concurrent execution is the frequent switching of a single logical processor between tasks, giving the appearance of parallel execution.
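To make the concurrent vs. parallel distinction concrete in Python, the hedged sketch below runs the same CPU-bound work in a thread pool (concurrent, but serialized by CPython's GIL) and in a process pool (parallel on separate cores); the timings and iteration count are machine-dependent placeholders.

    # Sketch: the same CPU-bound work run concurrently in threads (serialized
    # by CPython's GIL) vs. in parallel in processes. Absolute times vary by
    # machine; the point is the relative difference.
    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def burn(n):
        s = 0
        for i in range(n):
            s += i
        return s

    def timed(executor_cls, workers=4):
        t = time.perf_counter()
        with executor_cls(max_workers=workers) as ex:
            list(ex.map(burn, [5_000_000] * workers))
        return time.perf_counter() - t

    if __name__ == "__main__":
        print("threads  :", timed(ThreadPoolExecutor), "s")
        print("processes:", timed(ProcessPoolExecutor), "s")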

Resources