Maximum number of threads and multithreading

I'm tripping up on the multithreading concept.
For example, my processor has 2 cores (with hyper-threading), 2 threads per core, totaling 4 threads. So does this mean my CPU can execute four separate instructions simultaneously? Is each thread capable of being multi-threaded?

So does this mean my CPU can execute four separate instructions simultaneously? Is each thread capable of being multi-threaded?
In short: yes to both.
A CPU can only execute a single instruction per phase of a clock cycle. Due to techniques like pipelining, a CPU might be able to pass multiple instructions through the different phases within a single clock cycle, and the clock frequency might be extremely fast, but it is still only one instruction at a time.
As an example, NOP is an x86 assembly instruction which the CPU interprets as "no operation this cycle". That is one instruction out of the hundreds, thousands, or more that are executed by something even as simple as:
int main(void)
{
    while (1) { /* eat CPU */ }
    return 0;
}
A CPU thread of execution is one in which a series of instructions (a thread of instructions) is being executed. It does not matter which "application" the instructions come from; a CPU does not know about high-level concepts (like applications). That is a function of the OS.
So if you have a computer with 2 (or 4/8/128/etc.) CPUs that share the same memory (cache/RAM), then you can have 2 (or more) CPUs running 2 (or more) instructions at (literally) the exact same time. Keep in mind that these are machine instructions that are running at the same time (i.e. the physical side of the software).
An OS-level thread is something a bit different. While the CPU handles the physical side of execution, the OS handles the logical side. The above code breaks down into more than one instruction and, when executed, may actually run on more than one CPU (in a multi-CPU-aware environment), even though it is a single "thread" (at the OS level). The OS schedules when to run the next instructions and on which CPU, based on the OS's thread-scheduling policy, which differs among the various OSes. So the above code will eat up 100% CPU usage for each given "time slice" on whatever CPU it is running on.
This "slicing" of "time" (also known as preemptive computing) is why an OS can run multiple applications "at the same time", it's not literally1 at the same time since a CPU can only handle 1 instruction at a time, but to a human (who can barely comprehend the length of 1 second), it appears "at the same time".
1) Except in the case of a multi-CPU setup; then it might be literally the same time.
When an application is run, the kernel (the OS) actually spawns a separate thread (a kernel thread) to run the application on. Additionally, the application can request another external thread of execution (i.e. by spawning another process or forking), or it can create an internal thread by calling the OS's (or programming language's) API, which in turn calls lower-level kernel routines that spawn the thread and maintain its context switching. Any created thread is also capable of calling the same APIs to spawn other separate threads (thus a thread is capable of being "multi-threaded").
Multi-threading (in the sense of applications and operating systems) is not necessarily portable. While you might learn Java or C# and use their APIs (i.e. Thread.Start or Runnable), utilizing the actual OS APIs as provided (i.e. CreateThread or pthread_create and the slew of other concurrency functions) opens a different door for design decisions (i.e. "does platform X support thread library Y?"); just something to keep in mind as you explore the different APIs.
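To make that last point concrete, here is a minimal sketch, assuming a POSIX system with pthreads (compile with cc -pthread): the main thread spawns a thread, and that created thread calls the very same API to spawn another.

#include <pthread.h>
#include <stdio.h>

/* Third-level thread: just proves it ran. */
static void *grandchild(void *arg)
{
    (void)arg;
    puts("grandchild thread running");
    return NULL;
}

/* Second-level thread: a created thread calling the same API to
   spawn yet another thread. */
static void *child(void *arg)
{
    (void)arg;
    pthread_t t;
    puts("child thread running; spawning another");
    if (pthread_create(&t, NULL, grandchild, NULL) == 0)
        pthread_join(t, NULL);
    return NULL;
}

int main(void)
{
    pthread_t t;
    if (pthread_create(&t, NULL, child, NULL) == 0)
        pthread_join(t, NULL);
    return 0;
}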
I hope that can help add some clarity.

I actually researched this very topic in my Operating Systems class.
When using threads, a good rule of thumb for increased performance of CPU-bound processes is to use a number of threads equal to the number of cores, except on a hyper-threaded system, in which case one should use twice as many threads as cores. The other rule of thumb applies to I/O-bound processes: because such threads spend most of their time waiting, use roughly four times as many threads as cores.
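As a rough sketch of those rules of thumb, assuming a POSIX system: note that sysconf(_SC_NPROCESSORS_ONLN) reports logical cores, so on a hyper-threaded machine the doubling is already included in the number it returns.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* _SC_NPROCESSORS_ONLN reports *logical* cores (hardware threads),
       so on a hyper-threaded machine the "2x physical cores" rule is
       already baked into this number. */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    if (logical < 1)
        return 1;

    printf("CPU-bound suggestion: %ld threads\n", logical);
    printf("I/O-bound suggestion: %ld threads\n", logical * 4);
    return 0;
}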

Related

Is a schedulable unit of CPU time slice process or thread?

I want to clarify whether "a schedulable unit of CPU time slice" is a "process" or a "thread" (a kernel-managed thread). What I mean by "schedulable unit of CPU time slice" is the unit to which the CPU scheduler of an operating system allocates CPU time slices.
According to "Short-term scheduling" in wikipedia, process is used to refer the schedulable unit.
"This scheduler can be preemptive, implying that it is capable of forcibly removing processes from a CPU when it decides to allocate that CPU to another process"
Also, according to "Time slice" in wikepedia,
"The scheduler is run once every time slice to choose the next process to run."
Also, according to "Thread" in wikepedia,
"a process is a unit of resources, while a thread is a unit of scheduling and execution"
According to "Processes and Threads" in microsoft docs,
"A thread is the basic unit to which the operating system allocates processor time."
According to "Is thread scheduling done by the CPU, kernel, or both?" in quora,
"The CPU (hardware) just carries out instructions. The CPU itself has no concept of threads or scheduling, although there may be features in the CPU that support them.
"The operating system kernel (a set of instructions, aka software) executes on the CPU (hardware). A scheduling algorithm in the kernel of the operating system chooses which thread to execute next, and directs the CPU to begin executing the next instruction in that chosen thread".
Clarification: my understanding of "a schedulable unit of CPU time slice" is "a unit that can be scheduled during a given CPU time slice" (since if the "schedulable unit" were a time, the question would not make much sense to me).
Put shortly, "a schedulable unit of CPU time slice" for a given logical core can be seen as a software thread (more specifically, its execution context, composed of registers and process information).
An operating system's scheduler operates on tasks. Tasks can be threads, processes, or other, more unusual structures (e.g. dataflows).
Modern mainstream operating systems mainly schedule threads on processing units (typically hardware threads, also called logical cores). You can get more information about how the Windows scheduler works in the Microsoft documentation. The documentation explicitly states:
A thread is the entity within a process that can be scheduled for execution
On Linux, the default scheduler, CFS, operates on tasks (i.e. the task_struct data structure). A task can be a thread, a group of threads, or a process. This was done to make the scheduler more generic, and also because this scheduler was designed long ago, when processors had only one core and people focused on processes rather than threads. The multi-core era has since caused applications to use a lot of threads in order to use the available cores. As a result, nowadays, it is generally threads that are actually scheduled, AFAIK. This is explained in the famous research paper The Linux Scheduler: a Decade of Wasted Cores (which also explains a bit how CFS operates regarding the target processor).
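One way to see that Linux schedules threads as individual tasks: every thread gets its own kernel task ID (TID), while all threads of a process share one PID. Here is a small Linux-only sketch (it uses the raw SYS_gettid syscall, since the gettid() wrapper only appeared in glibc 2.30):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Print this thread's process ID and its kernel task ID (TID). The
   TID is what the Linux scheduler actually operates on. */
static void *show_ids(void *arg)
{
    (void)arg;
    printf("pid=%d tid=%ld\n", (int)getpid(), (long)syscall(SYS_gettid));
    return NULL;
}

int main(void)
{
    pthread_t t;
    show_ids(NULL);                           /* main thread: tid == pid */
    pthread_create(&t, NULL, show_ids, NULL); /* second thread: new tid  */
    pthread_join(t, NULL);
    return 0;
}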
Note that the term "process" can sometimes refer to a thread, since threads are sometimes called "lightweight processes" and basic processes are sometimes called "heavy processes". "Processes" can even be a generic term for both heavy and lightweight processes (i.e. threads and actual processes). This is a very confusing terminology and a misuse of language (like the term "processors", sometimes used for cores). In practice, this is often not a problem in a specific context, since threads and processes may be used interchangeably (although in such cases people should use a generic term like "tasks").
As for "a schedulable unit of CPU time slice" this is a bit more complex. A simple and naive answer is: a thread (it is definitively not processes alone). That being said, a thread is a software-defined concept (like processes). It is basically a stack, few registers, and a parent process (with possibly some meta-information and a TLS space). CPUs does not operate directly on such data structure. CPU does not have a concept of thread stack for example (it is just a section of the virtual process memory like any other). They just need an execution context which is composed of registers and a process configuration (in protected mode). For sake of simplicity, we can say that they execute threads. Mainstream modern x86 processors are very complex, and each core is often able to run multiple threads at the same time. This is called simultaneous multithreading (aka. Hyper-Threading for Intel processors). x86 physical cores are typically composed of two logical threads (ie. hardware threads) that can each execute a software threads.
I think your misunderstanding is actually a misunderstanding of what the English words mean in this context.
A time slice is a period of time. Maybe it is a fraction of a second. Maybe a few seconds.
Threads and processes are effectively tasks that the computer is going to perform. (I am simplifying here. The notion of a task has multiple meanings, even in the IT context. And on a modern OS, a process is actually a collection of threads that share the same virtual memory address space.)
The CPU¹ or processor is hardware that will run a (native) thread. A typical computer will have multiple CPUs. However, each CPU in a computer can only run one thread at a time.
The operating system therefore needs to schedule each of the threads it knows about to run on a specific CPU. The part of the operating system that does this is called the scheduler.
If there are more threads to run than CPUs to run them, the scheduler will typically schedule a thread to a CPU for a fixed period of time; i.e. a time slice. When the thread's time slice has elapsed, the scheduler will suspend it, put it back into the queue, and then schedule a different thread to run on the CPU.
The metaphor is that we are "slicing up" the available compute time on the CPUs and sharing the slices between the threads that need it.
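To make the metaphor concrete, here is a toy simulation of round-robin time-slicing; the task lengths and slice size are made-up numbers, and this is only a sketch of the idea, not how a real scheduler is structured.

#include <stdio.h>

/* A toy simulation of round-robin time-slicing. Each "task" needs a
   made-up amount of work; the "scheduler" hands out fixed slices in
   turn until every task finishes. */
#define NTASKS 3
#define SLICE  10 /* arbitrary units of work per time slice */

int main(void)
{
    int remaining[NTASKS] = { 25, 40, 15 };
    int done = 0;

    while (done < NTASKS) {
        for (int t = 0; t < NTASKS; t++) {
            if (remaining[t] <= 0)
                continue; /* task already finished; skip it */
            int run = remaining[t] < SLICE ? remaining[t] : SLICE;
            remaining[t] -= run;
            printf("task %d ran for %2d units, %2d left\n",
                   t, run, remaining[t]);
            if (remaining[t] == 0)
                done++;
        }
    }
    return 0;
}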
1 - There is some disagreement over what "CPU" actually means. I am taking the view that it refers to what is commonly called a "core". Intel confusingly introduced the marketing term² "hyperthread", which refers to a feature in which the physical hardware of a "core" can run as two independent instruction executors. However, in hyperthread mode, the OS scheduler will typically treat the hyperthreads as if they are distinct cores, so this is not pertinent to your question.
2 - The actual concept behind hyperthreads goes back to the 1960s, before Intel even existed as a company; see https://en.wikipedia.org/wiki/Barrel_processor.
So the answer to your question:
I want to clarify whether "a schedulable unit of CPU time slice" is "process" or "thread" (kernel managed thread).
A time slice is a schedulable unit of CPU time.
A time slice is neither a process nor a thread. Indeed, that doesn't even make sense, because "process" and "thread" are not units of time.
I concur with @Solomon Slow's comment. Wikipedia is not authoritative. But the real problem is that different pages are written and edited by different people over time, and they often use IT terminology inconsistently.
My advice would be to find and read one good textbook on (modern) operating system design and architecture. A well-written textbook should be self-consistent in its use of terminology.
A time slice is a unit of time, for example 10 ms in a traditional Linux kernel built with HZ=100 (100 timer interrupts per second). After a task has been running for that long on a CPU core, the kernel regains control on that CPU and calls schedule() to decide which task this CPU core should run next: the task it was already running, or a different task.
The scheduler can also run earlier if an external interrupt comes in, especially near the end of the current timeslice: if there's a higher-priority task that's now waiting for a CPU, e.g. after waiting for I/O or after a sleep() system call ended, it makes sense for the OS to schedule it onto this core, instead of finishing the time-slice of whatever CPU-bound task was interrupted.
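On a POSIX system you can see the user-visible tick rate with sysconf(_SC_CLK_TCK); a small sketch follows. Note this reports the unit used for process-time accounting, which traditionally matched HZ=100 but can differ from the actual scheduler tick on modern kernels.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* _SC_CLK_TCK is the clock-tick unit used for process-time
       accounting; on a traditional HZ=100 kernel this prints 100,
       i.e. 10 ms ticks. */
    long hz = sysconf(_SC_CLK_TCK);
    if (hz <= 0)
        return 1;
    printf("%ld ticks per second (%.1f ms per tick)\n", hz, 1000.0 / hz);
    return 0;
}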
Task is a useful word for the things the scheduler has to pick from. Without implying separate process or threads within a process, and also leaving room for kernel tasks like a Linux interrupt handler "bottom half" that aren't threads or processes.
Every thread of a process needs to get scheduled separately to execute on a CPU core (if it's not blocked).
The articles you found about scheduling processes are using the simplifying assumption that each process is single-threaded.
Or they're assuming a 1:n threading model, where the OS is only aware of one task for the whole process and multithreading is done in user space ("green threads" instead of native threads). Most mainstream OSes these days use a 1:1 threading model, where every C++ or Java thread is a separately schedulable task visible to the OS, although n:m models are possible, where you have multiple OS-scheduled tasks but not as many as you have high-level-language threads.

Purpose of multiprocessors and multi-core processor

I do want to clarify things in my head and build concrete knowledge. In a dual-core system with one processor, only two threads within one process can be executed concurrently, one by each core. In a uni-core system with two processors, two different processes can be executed, one by each CPU.
So can we say that each processor can execute processes concurrently, while a multi-core processor executes threads within a process concurrently?
I think you have a fundamental misunderstanding of what a process and thread are and how they relate to the hardware itself.
A CPU core can only execute one machine-level instruction per clock cycle (so essentially, just one assembly instruction). CPUs are typically measured by the number of clock cycles they go through in a second. So a 2.5 GHz core can execute 2.5 billion instructions per second.
The OS (the operating system, like Windows, Linux, macOS, Android, iOS, etc.) is responsible for launching programs and giving them access to the hardware resources. Each program can be considered a "process".
Each process can launch multiple threads.
To ensure that multiple processes can share the same hardware resources, the idea of preemptive multitasking came about over 40 years ago.
In a nutshell, preemptive multitasking, or time-slicing, is a function of the OS. It basically gives a few milliseconds to each thread that is running, regardless of which process that thread is a part of, and keeps the "context" of each thread so that the state of each thread can be handled appropriately when it's time for that thread to run again; that's also known as a context switch.
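On POSIX systems you can watch this from user space: getrusage() reports how many voluntary and involuntary context switches a process has accumulated, via the widely supported ru_nvcsw and ru_nivcsw fields. A small sketch:

#include <stdio.h>
#include <sched.h>
#include <sys/resource.h>

int main(void)
{
    /* Voluntarily give up the CPU many times so there is something
       to count. */
    for (int i = 0; i < 1000; i++)
        sched_yield();

    /* ru_nvcsw: the thread blocked or yielded on its own.
       ru_nivcsw: the scheduler preempted it (e.g. time slice expired). */
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("voluntary: %ld, involuntary: %ld\n",
           ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}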
A dual-, quad-, or even 128-core CPU does not change that, nor does the number of CPUs in the system (e.g. 4 CPUs, each with 128 cores). Each core can only execute one instruction per clock cycle.
What changes is how many instructions can be run in true parallel. If my CPU has 16 cores, then that means it can execute 16 instructions per clock cycle, and thus run 16 separate threads of execution without any context switching being necessary (though it does still happen, but that's a different issue).
This doesn't cover hyper-threading, in which one core can interleave two instruction streams, essentially doubling the number of logical CPUs the OS sees, and it doesn't cover cache misses or other low-level concerns in which extra cycles could be spent on a thread, but it covers the general idea of CPU scheduling.

Threads vs processes: are the visualizations correct?

I have no background in Computer Science, but I have read some articles about multiprocessing and multi-threading, and would like to know if this is correct.
SCENARIO 1: HYPER-THREADING DISABLED
Let's say I have 2 cores and 3 threads 'running' (competing?) per core, as shown in the picture (HYPER-THREADING DISABLED). Then I take a snapshot at some moment and observe, for example, that:
Core 1 is running Thread 3.
Core 2 is running Thread 5.
Are these declarations (and the picture) correct?
A) There are 6 threads running in concurrency.
B) There are 2 threads (3 and 5) (and processes) running in parallel.
SCENARIO 2: HYPER-THREADING ENABLED
Let's say I have HYPER-THREADING ENABLED this time.
Are these declarations (and the picture) correct?
C) There are 12 threads running in concurrency.
D) There are 4 threads (3, 5, 7, 12) (and processes) running in 'almost' parallel, on the vCPUs?
E) There are 2 threads (5, 7) running 'strictly' in parallel?
A process is an instance of a program running on a computer. The OS uses processes to maximize utilization, support multi-tasking, protection, etc.
Processes are scheduled by the OS, time-sharing the CPU. All processes have resources like memory pages and open files, and information that defines the state of a process: program counter, registers, stacks.
In CS, concurrency is the ability of different parts or units of a program, algorithm or problem to be executed out-of-order or in a partial order, without affecting the final outcome.
A "traditional process" is when a process is an OS abstraction to present what is needed to run a single program. There is NO concurrency within a "traditional process" with a single thread of execution.
However, a "modern process" is one with multiple threads of execution. A thread is simply a sequential execution stream within a process. There is no protection between threads since they share the process resources.
Multithreading is when a single program is made up of a number of different concurrent activities (threads of execution).
There are a few concepts that need to be distinguished:
Multiprocessing is when we have multiple CPUs.
Multiprogramming is when the CPU executes multiple jobs or processes.
Multithreading is when the CPU executes multiple threads per process.
So what does it mean to run two threads concurrently?
The scheduler is free to run threads in any order and with any interleaving: FIFO, random, etc. It can choose to run each thread to completion, or to time-slice in big chunks or small chunks.
A concurrent system supports more than one task by allowing all tasks to make progress. A parallel system can perform more than one task simultaneously. It is possible though, to have concurrency without parallelism.
Uniprocessor systems provide the illusion of parallelism by rapidly switching between processes (well, actually, the CPU scheduler provides the illusion). Such processes are running concurrently, but not in parallel.
Hyperthreading is Intel’s name for simultaneous multithreading. It basically means that one CPU core can work on two problems at the same time. It doesn’t mean that the CPU can do twice as much work. Just that it can ensure all its capacity is used by dealing with multiple simpler problems at once.
To your OS, each real silicon CPU core looks like two, so it feeds each one work as if they were separate. Because much of what a CPU runs doesn't work it to its maximum, hyperthreading helps make sure you're getting your money's worth from that chip.
There are a couple of things that are wrong (or unrealistic) about your diagrams:
A typical desktop or laptop has one processor chipset on its motherboard. With Intel and similar, the chipset consists of a CPU chip together with a "northbridge" chip and a "southbridge" chip.
On a server class machine, the motherboard may actually have multiple CPU chips.
A typical modern CPU chip will have more than one core; e.g. 2 or 4 on low-end chips, and up to 28 (for Intel) or 64 (for AMD) on high-end chips.
Hyperthreading and VCPUs are different things.
Hyperthreading is Intel proprietary technology¹ which allows one physical core to act as two logical cores, running two independent instruction streams in parallel. Essentially, the physical core has two sets of registers: i.e. 2 program counters, 2 stack pointers, and so on. The instructions for both instruction streams share instruction execution pipelines, on-chip memory caches, and so on. The net result is that for some instruction mixes (non-memory-intensive) you get significantly better performance than if the instruction pipelines were dedicated to a single instruction stream. The operating system sees each hyperthread as if it were a dedicated core, albeit a bit slower.
VCPU or virtual CPU is terminology used in the cloud computing context. On a typical cloud computing server, the customer gets a virtual server that behaves like a regular single- or multi-core computer. In reality, there will typically be many of these virtual servers on a compute node. Some special software called a hypervisor mediates access to the hardware devices (network interfaces, disks, etc.) and allocates CPU resources according to demand. A VCPU is a virtual server's view of a core, and is mapped to a physical core by the hypervisor. (The accounting trick is that VCPUs are typically overcommitted; i.e. the sum of VCPUs is greater than the number of physical cores. This is fine ... unless the virtual servers all get busy at the same time.)
In your diagram, you are using the term VCPU where the correct term would be hyperthread.
Your diagram shows each core (or hyperthread) associated with a distinct group of threads. In reality, the mapping from cores to threads is more fluid. If a core is idle, the operating system is free to schedule any (runnable) thread to run on it. (Some operating systems allow you to tie a given thread to a specific core for performance reasons. It is rarely necessary to do this.)
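As an aside on that last point, pinning a thread to a core looks like this on Linux. This is only a sketch; pthread_setaffinity_np is a non-portable GNU extension, and other operating systems expose different calls for CPU affinity.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to the given core. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
}

int main(void)
{
    pin_to_core(0); /* this thread may now only be scheduled on core 0 */
    printf("now restricted to core 0\n");
    return 0;
}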
Your observations about the first diagram are correct.
Your observations about the second diagram are slightly incorrect. As stated above, the hyperthreads on a core share the execution pipelines. This means that they are effectively executing at the same time: there is no "almost parallel". As I said above, it is simplest to think of a hyperthread as a core "that runs a bit slower".
1 - Intel was not the first company to come up with this idea. For example, CDC mainframes used this idea in the 1960s to get 10 PPUs from a single core and 10 sets of registers. This was before the days of pipelined architectures.

Who schedules threads?

I have a question about scheduling threads. On the one hand, I learned that threads are scheduled and treated as processes in Linux, meaning they get scheduled like any other process using the conventional methods (for example, the Completely Fair Scheduler in Linux).
On the other hand, I also know that the CPU might also switch between threads using methods like switch-on-event or fine-grain multithreading; for example, on a cache-miss event the CPU switches threads. But what if the scheduler doesn't want to switch the thread? How do they agree on one action?
I'm really confused between the two: who schedules a thread? the OS or the CPU?
Thanks a lot :)
The answer is both.
What happens is really fairly simple: on a CPU that supports multiple threads per core (e.g., an Intel with Hyperthreading) the CPU appears to the OS as having some number of virtual cores. For example, an Intel i7 has 4 actual cores, but looks to the OS like 8 cores.
The OS schedules 8 threads onto those 8 (virtual) cores. When it's time to do a task switch, the OS's scheduler looks through the threads and finds the 8 that are...the most eligible to run (taking into account things like thread priority, time since they last ran, etc.)
The CPU has only 4 real cores, but those cores support executing multiple instructions simultaneously (and out of order, in the absence of dependencies). Incoming instructions get decoded and thrown into a "pool". Each clock cycle, the CPU tries to find some instructions in that pool that don't depend on the results of some previous instruction.
With multiple threads per core, what happens is basically that each actual core has two input streams to put into the "pool" of instructions it might be able to execute in a given clock cycle. Each clock cycle it still looks for instructions from that pool that don't depend on the results of previous instructions that haven't finished executing yet. If it finds some, it puts them into the execution units and executes them. The only major change is that each instruction now needs some sort of tag attached to designate which "virtual core" will be used to store results into--that is, each of the two threads has (for example) its own set of registers, and instructions from each thread have to write to the registers for that virtual core.
It is possible, however, for a CPU to support some degree of thread priority so that (for example) if the pool of available instructions includes some instructions from both input threads (or all N input threads, if there are more than two) it will prefer to choose instructions from one thread over instructions from another thread in any given clock cycle. This can be absolute, so it runs thread A as fast as possible, and thread B only with cycles A can't use, or it can be a "milder" preference, such as attempting to maintain a 2:1 ratio of instructions executed (or, of course, essentially any other ratio preferred).
Of course, there are other ways of setting up priorities as well (such as partitioning execution resources), but the general idea remains the same.
An OS that's aware of shared cores like this can also modify its scheduling to suit, such as scheduling only one thread on a pair of cores if that thread has higher priority.
The OS handles scheduling and dispatching of ready threads (those that require CPU) onto cores, managing CPU execution in a similar fashion as it manages other resources. A cache miss is no reason to swap out a thread. A page fault, where the desired page is not loaded into RAM at all, may cause a thread to be blocked until the page gets loaded from disk. The memory-management hardware does that by generating a hardware interrupt to an OS driver that handles the page fault.

What is the relationship between threads (in a Java or a C++ program) and number of cores in the CPU?

Can someone shed some light on it?
An i7 processor can run 8 threads, but I am pretty sure we can create more than 8 threads in a Java or C++ program (not sure though). I have an i5 processor, and while studying concurrency I created 10 threads for assignments. I am just trying to understand how the core rating of a CPU is related to threads.
The thread you are referring to is called a software thread; you can create as many software threads as you need, as long as your operating system allows it. Each software thread, or code snippet, can run concurrently with the others.
For each core, there is at least one hardware thread to which the operating system can assign a software thread. If you have 8 cores, for example, then you have a hardware thread pool of capacity 8. You can map tens or hundreds of software threads to this 8-slot pool, where only 8 threads are actually running on hardware at the same time, i.e. in parallel.
Software threads are like people sharing the same computer. Each one can use the computer for some time, without necessarily completing their task, and then give it up to another.
Hardware threads are like people having a computer for each of them. All of them can proceed with their tasks at the same time.
Note: For i7, there are two hardware threads (so called hyper-threading) in each core. So you can have up to 16 threads running in parallel.
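To tie the two together, here is a minimal pthreads sketch, assuming a POSIX system, that creates 10 software threads, more than many CPUs have hardware threads. The OS time-slices them onto the available cores, so all of them complete even on a dual-core machine (compile with cc -pthread).

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 10 /* more threads than many CPUs have hardware threads */

static void *work(void *arg)
{
    long id = (long)arg;
    printf("software thread %ld ran\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0; /* all 10 complete, even on a 2-core machine */
}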
There are already a couple of good answers talking about the hardware side of things, but there isn't much talk about the software side of things.
The essential fact that I believe you're missing is that not all threads have to be executing all the time. When you have thousands of threads on an 8 core machine, only a few of them are actually running at any given time. The others are sitting around doing nothing until some processor time becomes free. This has huge advantages because threads might be waiting on other resources, too. For example, if I have one thread trying to read a file from disk, then there's no reason for it to be taking up CPU time while it's waiting for the hard disk data to load into RAM. Another example is when the thread is waiting for a response from some other machine (such as a web request over the internet). When you have more threads than your processor can handle at once, the operating system and/or runtime (It depends on the OS and runtime implementation.) are responsible for deciding which threads should get available processor time. This kind of set up lets you maximize your machine's productivity because CPU cycles are doing something useful almost all the time.
A "thread" is a software abstraction which defines a single, self-consistent path of execution through the program: in most modern systems, the number of threads is basically only limited by memory. However, only a comparatively small number of threads can be run simultaneously by the CPU. Broadly speaking, the "core count" is how many threads the CPU can run truly in parallel: if there are more threads that want to run than there are cores available, the operating system will use time-slicing of some sort to let all the threads get some time to execute.
There are a whole bunch of terms which are thrown around when it comes to "cores":
Processor count: the number of physical CPU chips on a system's motherboard. This used to be the only number that mattered, until CPUs with multiple cores became available.
Logical core count: the number of threads that the system hardware can run in parallel
Physical core count: the number of copies of the CPU execution hardware that the system has -- this is not always equal to the logical core count, due to features like SMT ("simultaneous multithreading") which use a single piece of hardware to run multiple threads in parallel
Module count: Recent (Bulldozer-derived) AMD processors have used an architecture which is a hybrid between SMT and the standard one-logical-core-per-physical-core model. On these CPUs, there is a separate copy of the integer execution hardware for each logical core, but two logical cores share a floating-point unit and the fetch-and-decode frontend; AMD calls the unit containing two logical cores a module.
It should also be briefly mentioned that GPUs, graphics cards, have enormous core counts and run huge numbers (thousands) of threads in parallel. The trade-off is that GPU cores have very little memory and often a substantially restricted programming model.
Threads are handled by the OS's scheduler. The number of cores in a CPU determine how many threads it can run at the same time.
Note that threads are constantly switched in and out by the scheduler to give the "illusion" that everything is running at the same time.
No, no, no... Your i7 has eight execution threads and can run 8 threads at once.
1000 threads or more can be waiting for processor time.
Calling Thread.sleep moves a thread off the execution core and back into memory, where it waits until woken.
