Multicore CPUs, multithreading and context switching?

Let's say we have a CPU with 20 cores and a process with 20 CPU-intensive threads that are independent of each other: one thread per CPU core.
I'm trying to figure out whether context switching happens in this case. I believe it does, because there are system processes in the operating system that need CPU time too.
I understand that there are different CPU architectures and some answers may vary, but can you please explain:
How does context switching happen, e.g. on Linux or Windows and on some common CPU architectures? And what happens under the hood on modern hardware?
What if we have 10 cores and 20 threads, or the other way around?
How to calculate how many threads we need if we have n CPUs?
Does the CPU cache (L1/L2) get emptied after a context switch?
Thank you

How does context switching happen, e.g. on Linux or Windows and on some common
CPU architectures? And what happens under the hood on modern hardware?
A context switch happens when an interrupt occurs and that interrupt, together with the kernel's thread and process state data, specifies a set of running threads that is different from the set running before the interrupt. Note that, in OS terms, an interrupt may be either a 'real' hardware interrupt that causes a driver to run, with that driver requesting a scheduling run, or a syscall from a thread that is already running. In either case, the OS scheduling state machine decides whether to change the set of threads running on the available cores.
The kernel can change the set of running threads by stopping some and running others. It can stop a thread running on any core by queueing up a preemption request and generating a hardware interrupt on that core, forcing the core to run its inter-processor interrupt handler to service the request.
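As a concrete way to observe this on Linux, the kernel exposes per-process counters for voluntary context switches (the thread blocked or yielded) and involuntary ones (the thread was preempted). A minimal C++ sketch, assuming a Linux /proc filesystem:

    #include <fstream>
    #include <iostream>
    #include <string>

    // Print this process's context-switch counters (Linux-specific).
    // voluntary_ctxt_switches    = times the thread blocked or yielded
    // nonvoluntary_ctxt_switches = times the scheduler preempted it
    int main() {
        std::ifstream status("/proc/self/status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.find("ctxt_switches") != std::string::npos)
                std::cout << line << '\n';
        }
    }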
What if we have 10 cores and 20 threads?
It depends on what the threads are doing. If they are in any state other than ready/running (e.g. blocked on I/O or inter-thread communication), there will be no context switching between them because nothing is running. If they are all ready/running, 10 of them will run on the 10 cores until an interrupt occurs. Most systems have a periodic timer interrupt whose effect is to share the available cores among the threads.
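To make that concrete, here is a minimal C++ sketch that deliberately oversubscribes the machine with 20 CPU-bound threads; on a 10-core machine the timer interrupt will time-slice them, so each thread ends up with roughly half a core. The thread count and loop bound are arbitrary illustration values (build with g++ -pthread):

    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        std::cout << std::thread::hardware_concurrency()
                  << " hardware threads available\n";
        std::vector<std::thread> pool;
        for (int i = 0; i < 20; ++i)          // 20 ready/running threads
            pool.emplace_back([] {
                volatile unsigned long x = 0; // volatile: keep the loop alive
                for (unsigned long j = 0; j < 2'000'000'000UL; ++j) x += j;
            });
        for (auto& t : pool) t.join();        // on 10 cores: ~50% of a core each
    }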
or the other way around
10 threads run on 10 cores; the other 10 cores are halted. The OS may still move the threads around the cores, e.g. to prevent uneven heat dissipation across the die.
How to calculate how many threads we need if we have n CPUs?
App-dependent. It would be nice if all cores were always 100% busy running exactly as many ready threads as there are cores but, since most threads spend much more time blocked than running, it's difficult, except in some edge cases (e.g. your '20 CPU-intensive threads on 20 cores'), to come up with any optimal number.
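One widely quoted rule of thumb (from Goetz et al., Java Concurrency in Practice) sizes a thread pool as cores × target utilization × (1 + wait time / compute time). A hedged C++ sketch; the wait_ratio value is an assumption you would have to measure for your own workload:

    #include <cmath>
    #include <iostream>
    #include <thread>

    int main() {
        unsigned cores = std::thread::hardware_concurrency();
        if (cores == 0) cores = 1;       // hardware_concurrency() may return 0
        double target_utilization = 1.0; // aim to keep every core busy
        double wait_ratio = 0.0;         // pure CPU-bound => threads == cores
        unsigned threads = static_cast<unsigned>(
            std::ceil(cores * target_utilization * (1.0 + wait_ratio)));
        std::cout << "suggested thread count: " << threads << '\n';
    }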
Does the CPU cache (L1/L2) get emptied after a context switch?
Maybe; it depends entirely on the data usage of the threads. The caches get reloaded on demand, as usual. There is no 'context-switch total cache reload' but, if the threads access different, large arrays of data while running, then the cache (L1 at least) will indeed get fully reloaded during the thread's run.

Related

Scheduling and Synchronization in Multicore CPU and in Single core CPU

From what I understand from the top answers of this post (
https://stackoverflow.com/questions/16116952/can-multithreading-be-implemented-on-a-single-processor-system#:~:text=Yes%2C%20you%20can%20have%20multiple,one%20thing%20at%20a%20time.),
if I am only running one multithreaded program that creates 4 threads on a multicore CPU system with 4 cores, there is no need for scheduling, as all 4 threads of my program will be running on individual cores (or microprocessors). But there may be a need for synchronization, since all 4 threads access the memory of the program (or process) that is stored in the same address space in main memory.
On the other hand,
on a single-core CPU computer, if I run the same program that creates 4 threads, I will need both synchronization and scheduling, since all threads must utilize the same core (or microprocessor).
Please correct my understanding if it is wrong.
there is no need for scheduling as all 4 threads of my program will be running on individual cores
This is not true in practice. The OS scheduler operates in both cases. Unless you pin threads to cores, threads can migrate from one core to another. In fact, even if you pin them, there are generally a few other threads that can be ready on the machine (e.g. the ssh daemon, tty sessions, graphical programs, kernel threads, etc.), so the OS has to schedule them. There will be context switches, though the number will be much lower than with a single processor.
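For reference, pinning is done through the OS's affinity API. A minimal sketch, assuming Linux/glibc (pthread_setaffinity_np is a non-portable extension; Windows has SetThreadAffinityMask instead). Build with g++ -pthread:

    #include <pthread.h>
    #include <sched.h>
    #include <iostream>
    #include <thread>

    // Pin a std::thread to one core (Linux/glibc only).
    bool pin_to_core(std::thread& t, int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(t.native_handle(),
                                      sizeof(set), &set) == 0;
    }

    int main() {
        std::thread worker([] {
            volatile unsigned long x = 0;    // CPU-bound busy loop
            for (unsigned long i = 0; i < 1'000'000'000UL; ++i) x += i;
        });
        if (!pin_to_core(worker, 0))         // stop migration off core 0
            std::cerr << "could not set affinity\n";
        worker.join();
    }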
there may be a need for synchronization since all 4 threads access the memory of the program (or process) that is stored in the same address space in main memory
This is true. Note that threads can also work on different memory areas (so that there is no need for synchronization except when they are joined). Note also that "main memory" includes the CPU caches here.
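A small C++ sketch of that second pattern: each thread writes only to its own slice of the data and its own slot of the results vector, so no locking is needed while they run, and join() is the only synchronization point:

    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<long> data(1'000'000, 1);
        const int n_threads = 4;
        const std::size_t chunk = data.size() / n_threads; // divides evenly here
        std::vector<long> partial(n_threads, 0);
        std::vector<std::thread> threads;
        for (int i = 0; i < n_threads; ++i)
            threads.emplace_back([&, i] {
                auto begin = data.begin() + i * chunk;     // disjoint slice
                partial[i] = std::accumulate(begin, begin + chunk, 0L);
            });
        for (auto& t : threads) t.join();                  // only sync point
        std::cout << std::accumulate(partial.begin(), partial.end(), 0L)
                  << '\n';
    }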
In a single-core CPU computer, if I run the same program that creates 4 threads, I will need both synchronization and scheduling since all threads must utilize the same core (or microprocessor).
Overall, yes. That being said, the term "scheduling" is unclear. There are multiple kinds of scheduling: preemptive vs. cooperative. Here, as a programmer, you do not need to do anything special, since the scheduling is done by the OS; thus, it is a bit unexpected to say that you "need" scheduling. The OS will schedule the threads on the same core using preemption (by allocating different time slices to each thread on the same core).

Purpose of multiprocessors and multi-core processor

I want to clarify things in my head and build concrete knowledge. On a dual-core, single-processor system, only two threads within the one process can be executed concurrently, one by each core. On a uni-core, two-processor system, two different processes can be executed, one by each CPU.
So can we say that each processor can execute processes concurrently, while a multi-core processor executes threads within a process concurrently?
I think you have a fundamental misunderstanding of what a process and thread are and how they relate to the hardware itself.
To a first approximation, a CPU core executes 1 machine-level instruction per clock cycle (so essentially, just 1 assembly instruction; modern superscalar cores can in fact retire several instructions per cycle, but the simplification is useful here). CPUs are typically measured by the number of clock cycles they go through in a second, so a 2.5 GHz core can execute roughly 2.5 billion instructions per second.
The OS (the operating system, like Windows, Linux, macOS, Android, iOS, etc.) is responsible for launching programs and giving them access to the hardware resources. Each program can be considered a "process".
Each process can launch multiple threads.
To ensure that multiple processes can share the same hardware resources, the idea of pre-emptive computing came about over 40 years ago.
In a nutshell, pre-emptive computing, or time-slicing, is a function of the OS. It basically gives a few milliseconds to each thread that is running, regardless of which process that thread is a part of, and keeps the "context" of each thread so that the state of each thread can be restored appropriately when it's time for that thread to run again; that's also known as a context switch.
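For a rough feel of what that saved "context" actually is, here is a hypothetical, heavily simplified C++ sketch of the per-thread record a kernel keeps (real ones, such as Linux's task_struct, hold far more):

    #include <cstdint>

    // Hypothetical, heavily simplified per-thread context record.
    struct ThreadContext {
        std::uint64_t program_counter; // where to resume execution
        std::uint64_t stack_pointer;   // top of this thread's stack
        std::uint64_t gp_regs[16];     // saved general-purpose registers
        std::uint64_t flags;           // saved status/flags register
        int           state;           // ready, running, blocked, ...
    };

    // On a context switch, the kernel saves the outgoing thread's CPU
    // registers into its ThreadContext and loads the incoming thread's.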
A dual-, quad-, or even 128-core CPU does not change that basic picture, nor does the number of CPUs in the system (e.g. 4 CPUs, each with 128 cores). Each core can still only execute 1 instruction per clock cycle.
What changes is how many instructions can be run in true parallel. If my CPU has 16 cores, then that means it can execute 16 instructions per clock cycle, and thus run 16 separate threads of execution without any context switching being necessary (though it does still happen, but that's a different issue).
This doesn't cover hyper-threading, in which 1 core presents 2 hardware threads that share its execution resources (essentially doubling your logical CPU count, though not your throughput), and it doesn't cover cache misses or other low-level effects in which extra cycles are spent on a thread, but it covers the general idea of CPU scheduling.

Threads vs cores when threads are asleep

I am looking to confirm my assumptions about threads and CPU cores.
All the threads are the same. No disk I/O is used, threads do not share memory, and each thread does CPU-bound work only.
If I have a CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
A thread that is asleep only costs memory, not CPU time, while it remains asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you get to a state where all the cores are busy 100% of the time.
Are any of my assumptions incorrect? If so, why?
Edit
When I say the thread is asleep, I mean that the thread is blocked for a specific amount of time. In C++ I would use std::this_thread::sleep_for, which "blocks the execution of the current thread for at least the specified sleep_duration".
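For illustration, a minimal sketch of that kind of sleep; the sleeping thread is descheduled by the OS and consumes no CPU until the duration elapses:

    #include <chrono>
    #include <iostream>
    #include <thread>

    int main() {
        using namespace std::chrono_literals;
        std::thread sleeper([] {
            std::this_thread::sleep_for(2s); // blocked: costs memory, not CPU
            std::cout << "woke up\n";
        });
        sleeper.join();
    }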
If we assume that you are talking about threads that are implemented using native thread support in a modern OS, then your statements are more or less correct.
There are a few factors that could cause the behavior to deviate from the "ideal".
If there are other user-space processes, they may compete for resources (CPU, memory, etcetera) with your application. That will reduce (for example) the CPU available to your application. Note that this will include things like the user-space processes responsible for running your desktop environment etc.
There are various overheads that will be incurred by the operating system kernel. There are many places where this happens including:
Managing the file system.
Managing physical / virtual memory system.
Dealing with network traffic.
Scheduling processes and threads.
That will reduce the CPU available to your application.
The thread scheduler typically doesn't do entirely fair scheduling. So one thread may get a larger percentage of the CPU than another.
There are some complicated interactions with the hardware when the application has a large memory footprint, and threads don't have good memory locality. For various reasons, memory intensive threads compete with each other and can slow each other down. These interactions are all accounted as "user process" time, but they result in threads being able to do less actual work.
So:
1) If I have a CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
Probably not all of the time, due to other user processes and OS overheads.
2) If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
Approximately. There are the overheads (see above). There is also the issue that time slicing between different threads of the same priority is fairly coarse grained, and not necessarily fair.
3) If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
Approximately: see above.
4) A thread that is asleep only costs memory, not CPU time, while it remains asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
There is also the issue that the OS consumes CPU to manage the sleeping threads; e.g. putting them to sleep, deciding when to wake them, rescheduling.
Another issue is that the memory used by the threads may also come at a cost. For instance, if the sum of the memory used by all processes (including all of the 10,000 threads' stacks) is larger than the available physical RAM, then there is likely to be paging, and that also uses CPU resources. (For example, 10,000 threads with a typical 8 MB default stack reservation amounts to ~80 GB of virtual address space, though only the pages actually touched consume physical RAM.)
5) In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you get to a state where all the cores are busy 100% of the time.
Not necessarily. If the virtual memory usage is out of whack (i.e. you are paging heavily), the system may have to idle some of the CPU while waiting for memory pages to be read from and written to the paging device. In short, you need to take account of memory utilization, or it will impact CPU utilization.
This also doesn't take account of thread scheduling and context switching between threads. Each time the OS switches a core from one thread to another it has to:
Save the old thread's registers.
Flush or partially invalidate the processor's memory caches (how much depends on the hardware).
Invalidate the VM mapping registers, etcetera. This includes the TLBs that @bazza mentioned.
Load the new thread's registers.
Take performance hits from the extra main-memory reads and VM page translations caused by the previous cache invalidations.
These overheads can be significant. According to https://unix.stackexchange.com/questions/506564/ this is typically around 1.2 microseconds per context switch. That may not sound like much, but it adds up: for example, 10,000 switches per second at 1.2 microseconds each is 12 milliseconds of pure switching overhead every second.
As already mentioned in the comments, it depends on a number of factors. But in a general sense your assumptions are correct.
Sleep
In the bad old days, a sleep() might have been implemented by the C library as a loop doing pointless work (e.g. multiplying 1 by 1 until the required time had elapsed), in which case the CPU would still be 100% busy. Platforms such as MS-DOS worked this way. Nowadays a sleep() will actually result in the thread being descheduled for the requisite time; any multitasking OS has had a proper implementation for decades.
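For contrast, a C++ sketch of the two styles; the busy-wait version keeps a core at 100% while "sleeping", while the descheduled version uses no CPU (the durations are arbitrary):

    #include <chrono>
    #include <thread>

    // Bad old way: burn CPU until the deadline passes.
    void busy_sleep(std::chrono::milliseconds d) {
        auto deadline = std::chrono::steady_clock::now() + d;
        while (std::chrono::steady_clock::now() < deadline) {
            // spin: the core stays 100% busy
        }
    }

    // Modern way: the OS deschedules the thread.
    void real_sleep(std::chrono::milliseconds d) {
        std::this_thread::sleep_for(d); // no CPU used while blocked
    }

    int main() {
        using namespace std::chrono_literals;
        busy_sleep(100ms);
        real_sleep(100ms);
    }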
10,000 sleeping threads will take up more CPU time, because the OS has to make scheduling judgements every timeslice tick (every few milliseconds to tens of milliseconds, depending on the OS). The more threads it has to check for being ready to run, the more CPU time that checking takes.
Translation Lookaside Buffers
Adding more threads than cores is generally seen as OK. But you can run into a problem with Translation Lookaside Buffers (or their equivalents on other CPUs). These are part of the virtual memory management side of the CPU, and they are effectively content-addressable memory. That is really hard to implement, so there's never very much of it. Thus the more memory allocations there are (and there will be more if you add more and more threads), the more this resource is eaten up, to the point where the OS may have to start swapping in and out different loadings of the TLB in order for all the virtual memory allocations to be accessible. If this starts happening, everything in the process becomes really, really slow. This is likely less of a problem these days than it was, say, 20 years ago.
Also, modern memory allocators in C libraries (and hence everything built on top, e.g. Java, C#, the lot) are actually quite careful about how requests for virtual memory are managed, minimising the times they actually have to ask the OS for more virtual memory. Basically, they seek to satisfy requested allocations out of pools they've already got, rather than each malloc() resulting in a call to the OS. This takes the pressure off the TLBs.

Multi-thread program(process) on multicore-core processor(s) with hyperthreading

For multicore computing, one thing that has confused me from the beginning is that the usual model of multicore hardware is quite abstracted from the real machine.
I worked on a laptop with a single Intel processor containing 4 cores with hyperthreading support, which makes the number of logical cores 8.
Suppose I have a Java program implementing a concurrent algorithm (Java is said to follow the OS's thread-scheduling rules, so the JVM won't affect the scheduling), and the program is purely CPU-bound.
My observations:
if the program (process) runs fewer than 8 threads, the parallelism of the work increases as the number of threads increases
when the total number of threads is larger than 8, the performance becomes complicated, but usually there is no more improvement than with 8 threads; and for certain algorithms it is much worse, i.e. time consumption increases hugely compared to running 8 threads.
My knowledge of this:
As far as I know, the program I run is treated as a user process by the OS, and if the program creates threads to try to gain parallelism, the OS will try to schedule these threads among the available cores.
Any threads of the process running on the same core share the execution time the process gets on that core.
My questions:
Suppose the CPU is only running my program, i.e. no other user processes.
if there is only 1 core in the CPU, the process will get no benefit of parallelism from multi-threading, since the total execution time of the process will not change. Is that true?
if there is more than one core available, the OS will try to schedule the threads of the process evenly and fairly on different cores, and the process's threads on different cores get their own (extra) execution time, thereby speeding things up. Is that true?
if there are n threads and m cores, where n > m, then some cores may run more than 1 thread of the process, which may even harm the speed-up from parallelism because of "context switches" among threads on the same core, and potentially because threads of the process run at different speeds. Is this true?
Thanks very much!
if there is only 1 core in the CPU, the process will get no benefit of parallelism from multi-threading, since the total execution time of the process will not change. Is that true?
Only if you are 100% CPU-bound. If you have I/O waits, multiple threads can help a lot even on a single core.
if there is more than one core available, the OS will schedule the threads of the process evenly and fairly on different cores
That seems to be up to the discretion of the OS. There could be all kinds of quota and priorities involved.
may even harm the speed-up from parallelism because of "context switches" among threads on the same core
It is true that there is overhead in managing extra threads (not just at the scheduling level, but also in synchronizing and communicating within your application), and if those threads cannot make productive use of otherwise idle CPU cores, then having fewer threads can actually improve performance.

What is the relationship between threads (in a Java or a C++ program) and number of cores in the CPU?

Can someone shed some light on it?
An i7 processor can run 8 threads, but I am pretty sure we can create more than 8 threads in a Java or C++ program (not sure though). I have an i5 processor and, while studying concurrency, I have created 10 threads for assignments. I am just trying to understand how the core count of a CPU is related to threads.
The thread you are referring to is called a software thread; you can create as many software threads as you need, as long as your operating system allows it. Each software thread, or code snippet, can run concurrently with the others.
For each core, there is at least one hardware thread to which the operating system can assign a software thread. If you have 8 cores, for example, then you have a hardware thread pool of capacity 8. You can map tens or hundreds of software threads to this 8-slot pool, where only 8 threads are actually running on hardware at the same time, i.e. in parallel.
Software threads are like people sharing the same computer. Each one can use the computer for some amount of time, not necessarily completing their task, then give it up to another.
Hardware threads are like people having a computer for each of them. All of them can proceed with their tasks at the same time.
Note: For the i7, there are two hardware threads (so-called hyper-threading) in each core, so you can have up to 16 threads running in parallel.
There are already a couple of good answers talking about the hardware side of things, but there isn't much talk about the software side of things.
The essential fact that I believe you're missing is that not all threads have to be executing all the time. When you have thousands of threads on an 8-core machine, only a few of them are actually running at any given time. The others are sitting around doing nothing until some processor time becomes free.
This has huge advantages because threads might be waiting on other resources, too. For example, if I have one thread trying to read a file from disk, then there's no reason for it to be taking up CPU time while it waits for the hard disk's data to load into RAM. Another example is when a thread is waiting for a response from some other machine (such as a web request over the internet).
When you have more threads than your processor can handle at once, the operating system and/or runtime (it depends on the OS and runtime implementation) is responsible for deciding which threads should get the available processor time. This kind of setup lets you maximize your machine's productivity because CPU cycles are doing something useful almost all the time.
A "thread" is a software abstraction which defines a single, self-consistent path of execution through the program: in most modern systems, the number of threads is basically only limited by memory. However, only a comparatively small number of threads can be run simultaneously by the CPU. Broadly speaking, the "core count" is how many threads the CPU can run truly in parallel: if there are more threads that want to run than there are cores available, the operating system will use time-slicing of some sort to let all the threads get some time to execute.
There are a whole bunch of terms which are thrown around when it comes to "cores:"
Processor count: the number of physical CPU chips on a system's motherboard. This used to be the only number that mattered, until CPUs with multiple cores became available.
Logical core count: the number of threads that the system hardware can run in parallel
Physical core count: the number of copies of the CPU execution hardware that the system has -- this is not always equal to the logical core count, due to features like SMT ("simultaneous multithreading") which use a single piece of hardware to run multiple threads in parallel
Module count: Recent (Bulldozer-derived) AMD processors have used an architecture which is a hybrid between SMT and the standard one-logical-core-per-physical-core model. On these CPUs, there is a separate copy of the integer execution hardware for each logical core, but two logical cores share a floating-point unit and the fetch-and-decode frontend; AMD calls the unit containing two logical cores a module.
It should also be briefly mentioned that GPUs (graphics cards) have enormous core counts and run huge numbers (thousands) of threads in parallel. The trade-off is that GPU cores have very little memory and often a substantially restricted programming model.
Threads are handled by the OS's scheduler. The number of cores in a CPU determines how many threads it can run at the same time.
Note that threads are constantly switched in and out by the scheduler to give the "illusion" that everything is running at the same time.
No, no, no... Your i7 has eight execution threads and can run 8 threads at once.
1000 threads or more can be waiting for processor time.
Calling thread.sleep moves a thread off the execution core and back into memory, where it waits until woken.
