Can multiple OS processes run in parallel on multicore CPU? - multithreading

So I got into a debate about whether a multicore CPU allows parallel execution of separate processes.
As far as I understand, each core allows executing different threads, but they all have to belong to one process. Or am I wrong?
My reasoning is that, while each core has a separate set of registers and L1/L2 cache (depending on hardware), they all have to share other stuff like L3 cache or TLB (I don't have a lot of knowledge about CPU architecture, so feel free to correct me).
I tried searching for an answer, but couldn't find any results (maybe the question is too dumb lol).
Thanks a lot in advance.

Multiple threads of multiple processes can be scheduled to run on a single core. Of course, at a given time only one thread runs on the core. The queue of processes to run on the core is managed by the scheduler. A good scheduler will provide to the core a good mix of CPU-bound and I/O-bound processes so that all of the components in the machine have well-balanced load.
So a multi-core CPU allows not only concurrent but also parallel execution of processes. A single-core CPU, on the other hand, allows only concurrent execution: processes take turns on the core, so there is no true parallelism on single-core machines.
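One way to see this for yourself is a minimal sketch along these lines, which starts several independent OS processes at once and leaves it to the scheduler to spread them over the available cores. The child snippet, the `run_children` helper and the amount of busy work are made up for illustration; only `subprocess.Popen` and `os.getpid` come from the standard library.

```python
import subprocess
import sys

# Hypothetical child workload: burn a little CPU, then report its own PID.
CHILD = "import os; sum(i * i for i in range(200_000)); print(os.getpid())"


def run_children(n):
    """Start n separate Python processes at once and return their PIDs."""
    procs = [
        subprocess.Popen([sys.executable, "-c", CHILD],
                         stdout=subprocess.PIPE, text=True)
        for _ in range(n)
    ]
    # All children have been started by now; collect what each one printed.
    return [int(p.communicate()[0]) for p in procs]


if __name__ == "__main__":
    pids = run_children(4)
    print("child PIDs:", pids)
```

Each PID belongs to a separate process with its own address space. With a longer-running workload, a monitor such as top should show several cores busy at the same time on a multi-core machine, although exactly how the work is spread is up to the scheduler.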
All the resources of a core are given to the thread/process currently running on it (not in Hyper Threading though). The first resource that is in possession of multiple processes at the same time, if I'm not wrong, is Main Memory or RAM. All processes use some part of the RAM even when they are not running on the core. To load the process to the core a Process Control Block (PCB) is loaded from RAM by setting the registers, address spaces and stack to the same state which the process was in, when it was unloaded from the core to give time to another process.
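The save/restore step described above can be sketched as a toy model. This is only an illustration under heavy assumptions: the `PCB` and `Core` classes and their fields are invented here, and a real kernel tracks far more state (address space, stack, open files and so on).

```python
from dataclasses import dataclass, field


@dataclass
class PCB:
    """Toy process control block: just enough state to resume a process."""
    pid: int
    program_counter: int = 0
    registers: dict = field(default_factory=dict)


@dataclass
class Core:
    """Toy core: holds the state of whichever process is currently loaded."""
    program_counter: int = 0
    registers: dict = field(default_factory=dict)


def context_switch(core, outgoing, incoming):
    """Save the core's state into `outgoing`, then load `incoming` onto the core."""
    outgoing.program_counter = core.program_counter
    outgoing.registers = dict(core.registers)
    core.program_counter = incoming.program_counter
    core.registers = dict(incoming.registers)


if __name__ == "__main__":
    core = Core(program_counter=120, registers={"r0": 7})
    a = PCB(pid=1)
    b = PCB(pid=2, program_counter=40, registers={"r0": 99})
    context_switch(core, outgoing=a, incoming=b)
    print(a)     # process 1's state is now parked in its PCB
    print(core)  # the core continues process 2 where it left off
```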
The time quantum for each process varies from a few ms to a few hundred ms. Compared to that, an L1/L2 cache access takes a few ns and a main memory access a few hundred ns. The image below should be interesting:

Two processes or threads can be run truly concurrently on separate cores provided they don't contend on a shared resource at the electronic level.
The most obvious thing to contend on in an Intel chip is the L3 cache and RAM. If you have two or more Intel chips they're talking to each other over QPI. Whilst this allows a cluster of CPUs each with their own memory controllers to operate in an SMP configuration, it becomes another thing to contend on if threads want data from another chip's memory.
In AMD chips each core has a memory controller, and HyperTransport does the job of synthesizing an SMP configuration. Pleasingly this makes all cores everywhere pretty much the same, even in multi-chip systems (it's HyperTransport inside and outside the chips).
Both Intel and AMD have done excellent jobs of creating architectures that minimise the memory contention that occurs in a multi-core, symmetric multi-processing system without us having to think too hard about how we write software. If you want the absolute most out of your hardware you can program taking into account the underlying NUMA hardware architecture, and you may (it's really hard) reduce some of the contention that's going on.
Another thing that might prevent truly concurrent execution is a specialised subsystem serving several cores. For example, the UltraSPARC T1 shared a floating point unit between 8 cores. Obviously they can't all use it at once!
FPGAs are often seen as great things for parallelisable computations such as FFTs. However, they have limited internal memory, and if the computation starts needing to store more data you have to use external RAM. That immediately limits the degree of parallelism that can be achieved, as different parts of the FPGA start contending for access to the external RAM. In such cases it is doubtful whether an FPGA is the right way to go; an FPGA clocked at 500MHz accessing RAM (which is also very slow, still) with no advanced onboard caching is not going to be as fast as a well-designed CPU with an advanced cache and multi memory controller subsystems.


Is synchronization faster on the same physical CPU core?

I have a question. If a thread modifies a variable, will a thread on the same physical core (on a different hyperthread) see the modification earlier than threads on other cores? Or does it have to wait until all the other cores see it?
I've been trying to pin two threads to the same physical core, but I get performance degradation. I know that's because the two logical cores share lots of resources. But in terms of synchronization, will it help to put the threads on the same physical core?
The answer depends on the platform (especially the underlying architecture). That being said, on the (mainstream) x86-64 architecture, threads sharing the same core communicate faster than threads on different cores or even different sockets. One main reason is that the two threads will often share the same L1 cache (and if not, the L2 cache). Thus, one thread can directly read what the other just wrote. Moreover, the threads can often run in parallel thanks to simultaneous multithreading (called Hyper-Threading on Intel CPUs), which reduces the communication latency (no scheduling quantum to wait for).
Meanwhile, threads on different cores will have to communicate through a (slow) bus or share data using the L3 cache (significantly slower than the L1/L2).
When your workload is bound by communication (latency or throughput), it is often better to put threads close to each other (i.e. on the same core). When the number of threads per core exceeds the number of hardware threads, performance decreases due to preemptive multitasking. When the workload is compute bound, it is better to put them on separate cores. Note that on modern x86 processors, threads working on the same core can even share the computing resources (ALUs) at the instruction level.

Threads vs processes: are the visualizations correct?

I have no background in Computer Science, but I have read some articles about multiprocessing and multi-threading, and would like to know if this is correct.
Let's say I have 2 cores, 3 threads 'running' (competing?) per core, as shown in the picture (HYPER-THREADING DISABLED). Then I take a snapshot at some moment, and I observe, for example, that:
Core 1 is running Thread 3.
Core 2 is running Thread 5.
Are these declarations (and the picture) correct?
A) There are 6 threads running in concurrency.
B) There are 2 threads (3 and 5) (and processes) running in parallel.
Let's say I have HYPER-THREADING ENABLED this time.
Are these declarations (and the picture) correct?
C) There are 12 threads running in concurrency.
D) There are 4 threads (3,5,7,12) (and processes) running in 'almost' parallel, in the vcpu?.
E) There are 2 threads (5,7) running 'strictly' in parallel?
A process is an instance of a program running on a computer. The OS uses processes to maximize utilization, support multi-tasking, protection, etc.
Processes are scheduled by the OS - time sharing the CPU. All processes have resources like memory pages, open files, and information that defines the state of a process - program counter, registers, stacks.
In CS, concurrency is the ability of different parts or units of a program, algorithm or problem to be executed out-of-order or in a partial order, without affecting the final outcome.
A "traditional process" is when a process is an OS abstraction to present what is needed to run a single program. There is NO concurrency within a "traditional process" with a single thread of execution.
However, a "modern process" is one with multiple threads of execution. A thread is simply a sequential execution stream within a process. There is no protection between threads since they share the process resources.
Multithreading is when a single program is made up of a number of different concurrent activities (threads of execution).
There are a few concepts that need to be distinguished:
Multiprocessing is when we have multiple CPUs.
Multiprogramming is when the CPU executes multiple jobs or processes.
Multithreading is when the CPU executes multiple threads per process.
So what does it mean to run two threads concurrently?
The scheduler is free to run threads in any order and with any interleaving (FIFO, random, and so on). It can choose to run each thread to completion, or to time-slice in big chunks or small chunks.
A concurrent system supports more than one task by allowing all tasks to make progress. A parallel system can perform more than one task simultaneously. It is possible though, to have concurrency without parallelism.
Uniprocessor systems provide the illusion of parallelism by rapidly switching between processes (well, actually, the CPU schedulers provide the illusion). Such processes are running concurrently, but not in parallel.
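Concurrency without parallelism can be illustrated with a small round-robin simulation of a single core. This is a sketch only: the `round_robin` function, the task names and the abstract "work units" are assumptions made for the example, not how any particular OS scheduler is implemented.

```python
from collections import deque


def round_robin(tasks, quantum):
    """Simulate one core time-slicing between tasks.

    tasks maps a task name to its remaining work units.
    Returns the timeline as a list of (name, units_run) slices.
    """
    queue = deque(tasks.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(quantum, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Not finished: go to the back of the queue and wait for another turn.
            queue.append((name, remaining - ran))
    return timeline


if __name__ == "__main__":
    for name, units in round_robin({"A": 3, "B": 2}, quantum=2):
        print(f"core runs {name} for {units} unit(s)")
```

Both tasks make progress over the same period, so they are concurrent, yet at any instant only one of them is on the core, so nothing runs in parallel.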
Hyperthreading is Intel’s name for simultaneous multithreading. It basically means that one CPU core can work on two problems at the same time. It doesn’t mean that the CPU can do twice as much work. Just that it can ensure all its capacity is used by dealing with multiple simpler problems at once.
To your OS, each real silicon CPU core looks like two, so it feeds each one work as if they were separate. Because so much of what a CPU does is not enough to work it to the maximum, hyperthreading makes sure you’re getting your money’s worth from that chip.
There are a couple of things that are wrong (or unrealistic) about your diagrams:
A typical desktop or laptop has one processor chipset on its motherboard. With Intel and similar, the chipset consists of a CPU chip together with a "northbridge" chip and a "southbridge" chip.
On a server class machine, the motherboard may actually have multiple CPU chips.
A typical modern CPU chip will have more than one core; e.g. 2 or 4 on low-end chips, and up to 28 (for Intel) or 64 (for AMD) on high-end chips.
Hyperthreading and VCPUs are different things.
Hyperthreading is Intel's proprietary technology[1], which allows one physical core to act as two logical cores running two independent instruction streams in parallel. Essentially, the physical core has two sets of registers; i.e. 2 program counters, 2 stack pointers and so on. The instructions for both instruction streams share instruction execution pipelines, on-chip memory caches and so on. The net result is that for some instruction mixes (non-memory intensive) you get significantly better performance than if the instruction pipelines are dedicated to a single instruction stream. The operating system sees each hyperthread as if it was a dedicated core, albeit a bit slower.
VCPU, or virtual CPU, is terminology used in a cloud computing context. On a typical cloud computing server, the customer gets a virtual server that behaves like a regular single or multi-core computer. In reality, there will typically be many of these virtual servers on a compute node. Some special software called a hypervisor mediates access to the hardware devices (network interfaces, disks, etc) and allocates CPU resources according to demand. A VCPU is a virtual server's view of a core, and is mapped to a physical core by the hypervisor. (The accounting trick is that VCPUs are typically overcommitted; i.e. the sum of VCPUs is greater than the number of physical cores. This is fine ... unless the virtual servers all get busy at the same time.)
In your diagram, you are using the term VCPU where the correct term would be hyperthread.
Your diagram shows each core (or hyperthread) associated with a distinct group of threads. In reality, the mapping from cores to threads is more fluid. If a core is idle, the operating system is free to schedule any (runnable) thread to run on it. (Some operating systems allow you to tie a given thread to a specific core for performance reasons. It is rarely necessary to do this.)
Your observations about the first diagram are correct.
Your observations about the second diagram are slightly incorrect. As stated above the hyperthreads on a core share the execution pipelines. This means that they are effectively executing at the same time. There is no "almost parallel". As I said, above, it is simplest to think of a hyperthread as a core "that runs a bit slower".
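The counts from the question can be worked through in a few lines. This is a simplified sketch: the `snapshot_counts` helper is made up for illustration, and it assumes every logical CPU has the same number of runnable threads competing for it, as in the question's diagrams.

```python
def snapshot_counts(cores, hw_threads_per_core, sw_threads_per_logical_cpu):
    """Return (concurrent, parallel) thread counts for one snapshot in time."""
    logical_cpus = cores * hw_threads_per_core
    concurrent = logical_cpus * sw_threads_per_logical_cpu
    # At any instant each logical CPU executes at most one thread.
    parallel = min(logical_cpus, concurrent)
    return concurrent, parallel


if __name__ == "__main__":
    print("hyperthreading off:", snapshot_counts(2, 1, 3))
    print("hyperthreading on: ", snapshot_counts(2, 2, 3))
```

With hyperthreading off this gives 6 concurrent and 2 parallel threads; with it on, 12 concurrent and 4 parallel, with no separate "almost parallel" category.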
1 - Intel was not the first company to come up with this idea. For example, CDC mainframes used this idea in the 1960s to get 10 PPUs from a single core and 10 sets of registers. This was before the days of pipelined architectures.

In what sense does each thread appear to the operating system as a separate CPU?

In the book Modern Operating Systems,
Multithreading has implications for the operating system because each thread
appears to the operating system as a separate CPU. E.g., Consider a system with two
actual CPUs, each with two threads. The operating system will see this as four CPUs.
I don't understand that. A thread is a lightweight process, which in turn is a running program. A CPU is hardware.
A thread runs on a CPU.
An OS manages the hardware directly, including the CPU, while processes (including threads) see the hardware indirectly via the abstraction provided by the OS. How can an OS not know how many CPUs there are?
In what sense does each thread appear to the operating system as a separate CPU?
That a CPU has multiple physical cores is one thing. A CPU can also have virtual cores, which are usually called "threads" in the hardware context (these are not the same as "threads" in the programming sense). The simple way to think about hardware "threads" is as the number of CPU cores the OS sees. This is only partially correct; to understand the difference I would recommend looking at Wikipedia.
The statement is not "silly" and is a decent first approximation to understanding tasking and threading.
A task/thread has a set of registers, a memory address space (populated with RAM and sometimes ROM), and the ability to execute instructions. This is the same as a basic CPU. So one can conceive of a multi-tasking or multi-threading system as being a collection of CPUs.
(And, to carry this analogy further, there are situations where multiple threads/tasks are simulated on a single thread/task.)
Yes, modern CPUs have a lot of extra registers and controls for, eg, controlling tasks and threads, but that's going beyond the basics.
And, from the standpoint of multi-threading within an OS, it is true that there are essentially the same concerns about synchronization and atomicity whether you have multiple streams of execution running in multiple threads on a single CPU, or the multiple streams each running on its own CPU. The only major difference is with regard to cache coherence, and even for that case there exist systems where each thread on a single CPU has its own cache.
The various threads of a SMT CPU appear as separate cores to the OS. For example on x86 with hyperthreading, interprocessor interrupts apply to virtual cores. For example a SIPI (start-up ipi) addressed to "all" will fire up all threads, not just all actual physical cores.
The OS can know how many APs (application processors, contrasted with the BSP, bootstrap processor, which is the one that starts when you turn the machine on) there are by observing how many times the code that you instructed them to start at (with the SIPI) runs (but this is better used as a check), or by parsing the MP tables (definitely do that, you have to anyway in order to detect memory layout and devices and so on).
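From user space you can see the same picture the OS ends up with. The sketch below assumes a Linux system that exposes CPU topology under sysfs; `os.cpu_count()` reports logical CPUs, and the `thread_siblings_list` files list which logical CPUs share a physical core. The helper names are invented for this example, and on other platforms the grouping simply comes back empty.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,4' or '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def physical_core_groups():
    """Group logical CPUs that share a physical core (Linux sysfs only)."""
    groups = set()
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                groups.add(frozenset(parse_cpu_list(f.read())))
        except OSError:
            continue  # e.g. an offline CPU; skip it
    return groups


if __name__ == "__main__":
    print("logical CPUs seen by the OS:", os.cpu_count())
    groups = physical_core_groups()
    if groups:
        print("physical cores (sibling groups):", len(groups))
    else:
        print("no sysfs topology information available on this platform")
```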

Pros and Cons of CPU affinity

Suppose I have a multi-threaded application (say ~40 threads) running on a multiprocessor system (say 8 cores) with Linux as the operating system, where the threads are essentially LWPs (Light Weight Processes) scheduled by the kernel.
What would be the benefits/drawbacks of using CPU affinity? Is CPU affinity going to help by localizing the threads to a subset of cores, thus minimizing cache sharing/misses?
If you use strict affinity, then a particular thread MUST run on that processor (or set of processors). If you have many threads that work completely independently, and they work on larger chunks of memory than a few kilobytes, then it's unlikely you'll benefit much from running on one particular core - since it's quite possible the other threads running on this particular CPU would have thrown out any L1 cache, and quite possibly L2 caches too. Which is more important for performance - cache content or "getting to run sooner"? Are some CPUs always idle, or is the CPU load 100% on every core?
However, only you know (until you tell us) what your threads are doing. How big is the "working set" (how much memory - code and data) are they touching each time they get to run? How long does each thread run when they are running? What is the interaction with other threads? Are other threads using shared data with "this" thread? How much and what is the pattern of sharing?
Finally, the ultimate answer is "What makes it run faster?" - an answer you can only find by having good (realistic) benchmarks and trying the different possible options. Even if you give us every single line of code, running time measurements for each thread, etc, etc, we could only make more or less sophisticated guesses - until these have been tried and tested (with VARYING usage patterns), it's almost impossible to know.
In general, I'd suggest that having many threads suggests either that each thread isn't very busy (CPU-wise), or that you are "doing it wrong"... More threads aren't better if they are all running flat out - better to have fewer threads in that case, because they are just going to fight each other.
The scheduler already tries to keep threads on the same cores, and to avoid migrations. This suggests that there's probably not a lot of mileage in managing thread affinity manually, unless:
you can demonstrate that for some reason the kernel is doing a bad job for your particular application; or
there's some specific knowledge about your application that you can exploit to good effect.
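If you do want to experiment with manual affinity, a minimal sketch from Python on Linux could look like this. It relies on `os.sched_setaffinity` and `os.sched_getaffinity`, which exist only on some Unix platforms, hence the `hasattr` guard. The `pin_to_cpus` helper is a made-up name, and which CPU ids are valid depends on your machine.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the caller to the given CPU ids.

    Returns the resulting affinity set, or None where the platform
    does not provide the sched_*affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)  # 0 means "the caller"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)
        print("allowed before:", sorted(allowed))
        print("after pinning: ", sorted(pin_to_cpus({min(allowed)})))
        pin_to_cpus(allowed)  # restore the original mask
    else:
        print("CPU affinity calls are not available on this platform")
```

Whether pinning helps is exactly the kind of thing to benchmark with and without it rather than assume.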
localizing the threads to a subset of cores thus minimizing cache
Not necessarily. You have to consider cache coherence too: if two or more threads access a shared memory buffer and each one is bound to a different CPU core, their caches have to be kept synchronized. If one thread writes to a shared cache line, there will be significant overhead to invalidate the copies in the other caches.

Dual-Core Hyperthreading: Should I use 4 threads or 3 or 2?

If you're spawning multiple threads (or processes) concurrently, is it better to spawn as many as the number of physical processors or the number of logical processors, assuming the task is CPU-bound? Or is it better to do something in between (say, 3 threads)?
Does the performance depend on the kind of instructions that are getting executed (say, would non-local memory access be much different from cache hits)? If so, in which cases is it better to take advantage of hyperthreading?
The reason I'm asking is, I remember reading somewhere that if you have as many tasks as the number of virtual processors, tasks on the same physical core can sometimes starve some CPU resources and prevent each other from getting as many resources as needed, possibly decreasing performance. That's why I'm wondering if having as many threads as virtual cores is a good idea.
The performance depends on a huge variety of factors. Most tasks are not strictly CPU bound, since even if all of the data is in memory it is usually not on-board in the processor cache. I have seen examples (like this one) where memory access patterns can dramatically change the performance profile of a given 'parallel' process.
In short, there is no perfect number for all situations.
Chances are pretty good that you will see a performance improvement running 2 threads per core with HyperThreading enabled. Jobs that appear to be entirely CPU bound usually aren't, and HyperThreading can extract a few "extra" cycles out of the occasional interrupt or context switch.
On the other hand, with a core iX processor that has Turbo Boost, you might actually do better running 1 thread per core to encourage the CPU to overclock itself.
At work, we routinely run many-core servers at full CPU doing various kinds of calculation for days at a time. A while back we measured the performance difference with and without HT. We found that on average, with HyperThreading, and running twice as many jobs at once, we could complete the same amount of jobs about 10% faster than without HyperThreading.
Assume that 2 × cores is a good place to start, but the bottom line is: measure!
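A measurement along these lines does not need much code. The sketch below is an assumption-laden stand-in for a real workload: it hashes a 4 MiB buffer repeatedly, using threads as the "jobs" because `hashlib` releases the GIL while hashing large inputs, so the threads can genuinely occupy several cores. The `job` and `time_jobs` names, the buffer size and the job count are all arbitrary choices for illustration.

```python
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor

BLOCK = bytes(4 * 1024 * 1024)  # 4 MiB of zeros as dummy input


def job(_):
    """One unit of CPU-heavy work (placeholder for your real task)."""
    return hashlib.sha256(BLOCK).hexdigest()


def time_jobs(workers, jobs=16):
    """Run `jobs` units of work on `workers` threads; return (seconds, digests)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(job, range(jobs)))
    return time.perf_counter() - start, digests


if __name__ == "__main__":
    logical = os.cpu_count() or 1
    # Compare 1 worker, roughly one per physical core (assuming 2 hardware
    # threads per core), and one per logical CPU.
    for workers in sorted({1, max(1, logical // 2), logical}):
        elapsed, _ = time_jobs(workers)
        print(f"{workers:3d} workers: {elapsed:.3f} s")
```

The numbers only mean something for this synthetic job; swap in your own workload and repeat the runs a few times before drawing conclusions.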
I remember reading that hyperthreading can give you up to a 30% performance boost. In general you'd do better to treat them as 4 different cores. Of course, in some specific circumstances (e.g. having the same long-running task bound to each core) you can divide your processing better by taking into account that some cores are just logical ones.
more info about hyperthreading itself here
Using Hyperthreading to run two threads on the same core, when both threads have similar memory access patterns but access disjoint data structures, would be very roughly equivalent to running them on two separate cores each with half the cache. If the memory-access patterns are such that half the cache would be sufficient to prevent thrashing, performance may be good. If the memory-access patterns are such that halving the cache induces thrashing, there may be a ten-fold performance hit (implying one would have been much better off without hyperthreading).
On the other hand, there are some situations where hyperthreading may be a huge win. If many threads will all be reading and writing the same shared data using lock-free data structures, and all threads must see a consistent view of the data, trying to run threads on disjoint processors may cause thrashing, since only one processor at a time may have read-write access to any given cache line; running such threads on two cores may take longer than running only one at a time. Such cache arbitration is not required, however, when a piece of data is accessed by multiple threads on a single core. In those cases, hyperthreading can be a huge win.
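The cache-line argument can be made concrete with a deliberately crude model. The sketch below assumes a single cache line, exclusive write ownership per physical core, and nothing else; it is not a real coherence protocol, and `ownership_transfers` is a name invented for this example. It just counts how often the line has to move between cores for a given sequence of writers.

```python
def ownership_transfers(writer_cores):
    """Count how often write ownership of one cache line changes core.

    writer_cores is the sequence of physical core ids performing the writes.
    """
    owner = None
    transfers = 0
    for core in writer_cores:
        if owner is not None and core != owner:
            transfers += 1  # the line must be taken away from the old owner
        owner = core
    return transfers


if __name__ == "__main__":
    # Two threads alternating writes from different physical cores...
    print("separate cores:", ownership_transfers([0, 1, 0, 1, 0, 1]))
    # ...versus two hyperthreads on the same physical core.
    print("same core:     ", ownership_transfers([0, 0, 0, 0, 0, 0]))
```

In this model alternating writers on different cores pay for a transfer on almost every write, while writers sharing a core pay for none, which is the effect described above.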
Unfortunately, I don't know any way to give the scheduler any "hints" to suggest that some threads should share a core when possible, while others should run separately when possible.
HT allows a boost of approximately 10-30% for mostly CPU-bound tasks that use the extra virtual cores. Although these tasks may seem CPU-bound, unless they are custom-made assembly, they will usually suffer from IO waits between RAM and local cache. This allows one thread running on a physical HT-enabled core to work while the other thread is waiting for IO. This does come with a disadvantage though, as two threads share the same cache/bus, which leaves fewer resources for each and may cause both threads to pause while waiting for IO.
In the last case, running a single thread will decrease the maximum simultaneous theoretical processing power (by 10-30%) in favor of running a single thread without the slowdown of cache thrashing, which may be very significant in some applications.
Choosing which cores to use is just as important as choosing how many threads to run. If each thread is CPU-bound for roughly the same duration, it is best to set the affinity such that threads using mostly different resources find themselves on different physical cores, and threads using common resources are grouped on the same physical core (different virtual cores), so that common resources can be used from the same cache without extra IO wait.
Since each program has different CPU-usage characteristics, and cache thrashing may or may not be a major slowdown (it usually is), it is impossible to determine what the ideal number of threads should be without profiling first. One last thing to note is that the OS/kernel will also require some CPU and cache space. It is usually ideal to keep a single (physical) core set aside for the OS if real-time latency is required on CPU-bound threads, so as to avoid sharing cache/CPU resources. If threads are often waiting for IO and cache thrashing is not an issue, or if running a real-time OS specifically designed for the application, you can skip this last step.
All of the other answers already give lots of excellent info. But one more point to consider is that the SIMD unit is shared between the logical cores of the same physical core. So, if you are running threads with SSE code, do you run them on all 4 logical cores, or just spawn 2 threads (assuming you have two physical cores)? For this odd case, it is best to profile with your app.
