Atomicity of taking multiple locks on x86 - multithreading

Let's consider a multicore x86 CPU. We have some computation that is being executed by 4 threads. To perform the computation it is necessary to take two locks (e.g. mutexes). So, clearly, to avoid deadlock we could take them atomically, meaning either all mutexes are taken or none are.
In C++11 it can be done with std::lock.
But how can it be ensured that this is done atomically? I cannot imagine how. I can see how to do it on UP (uniprocessor) systems: just disable interrupts.
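For reference, a minimal sketch of the situation described above in standard C++; the mutex names and the work are invented for illustration. Note that std::lock does not need hardware-level atomicity: it typically avoids deadlock with a lock/try_lock back-off algorithm, so other threads simply observe the two mutexes as either both held or both free.

#include <mutex>

std::mutex m1, m2;   // the two locks the computation needs (illustrative names)

void compute()
{
    // C++11: std::lock acquires both mutexes deadlock-free, in any order.
    std::lock(m1, m2);
    std::lock_guard<std::mutex> g1(m1, std::adopt_lock);
    std::lock_guard<std::mutex> g2(m2, std::adopt_lock);
    // (C++17 shorthand: std::scoped_lock guard(m1, m2);)

    // ... work on the shared data protected by both locks ...
}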

Related

What is the difference between an atomic operation and critical section?, which of the two prevents context switching?

A programming language or the processor already provides "default" atomic operations, and we can use them, as far as I understand.
https://en.wikipedia.org/wiki/Linearizability
What is the difference between an atomic operation and critical section?
Atomic operations are instructions that guarantee atomic accesses/updates of (small) shared variables. This generally includes operations like incrementation, decrementation, addition, subtraction, compare-and-swap (aka CAS), exchange, logical operations (and, or, xor) as well as basic loads/stores. If you want to perform a non-trivial operation that is not supported by the target platform (or one involving large variables), then you cannot use a single atomic operation. This means either multiple of them are required or another mechanism should be used instead (e.g. critical section, transactional memory). Note that using multiple atomic operations often makes things significantly more complex (see the ABA problem). On mainstream CPUs, atomic operations are generally implemented by locking cache lines of shared caches (e.g. L3) so that only one thread can access them at a time.
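To make this concrete, here is a small C++ sketch of the two most common flavours mentioned above, a simple atomic update and a CAS retry loop; the variable and function names are invented for the example.

#include <atomic>

std::atomic<int> counter{0};

void bump()
{
    counter.fetch_add(1);    // atomic increment of a small shared variable
}

// CAS loop: atomically replace counter with twice its current value.
void double_counter()
{
    int expected = counter.load();
    // compare_exchange_weak fails (and reloads 'expected') if another
    // thread changed the value in the meantime, so we simply retry.
    while (!counter.compare_exchange_weak(expected, expected * 2)) {
        // retry
    }
}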
Critical sections are meant to protect one or multiple instructions from being executed by multiple threads at the same time. They are generally protected using a system mutex. The thread entering the critical section locks the associated mutex and unlocks it when leaving the section. System mutexes cause a thread entering a critical section to wait if the associated mutex is already locked. This is generally done using a context switch (the thread is descheduled and rescheduled later).
Critical sections can be efficient when the lock is only very rarely already taken by another thread; context switches can significantly impact performance. Atomic operations are not great either when many threads operate on the same variable: contention effects can make atomic accesses significantly slower (e.g. spin locks). This is especially true for atomic CAS operations. Some platforms (e.g. GPUs) can execute atomic operations very quickly since they have dedicated units to execute them efficiently.
which of the two prevents context switching?
Neither of the two prevents context switching. Modern operating systems can perform a context switch at any time. That being said, critical sections generally cause context switches: a thread trying to enter a critical section already locked by another thread will typically go to sleep and be awakened by the OS scheduler when the other thread unlocks the section. Atomic operations do not impact the scheduling of the system (at least not on mainstream platforms).
Note that the above text is also true for processes.
Speaking only to the nomenclature question:
"Atomic" means "cannot be broken down into smaller parts." In programming, an operation performed by one thread is "atomic" (as seen from other threads) if there is no possible way for the other threads to see the operation in a half-way done state. From the point of view of other threads, it's as if the entire operation happened in a single instant. It either has already happened, or it hasn't happened yet. There is no in between.
As Jérôme Richard points out, modern computer hardware provides atomic operations on simple variables. We can use those to make more complex operations seem "atomic" from the point of view of other threads either by using the hardware atomics in tricky non-blocking algorithms, or by using the hardware atomics in the implementation of mutex locks.
"Critical section" comes from a time before multi-threading. In operating system kernel code, and in "bare metal" application code, there has always been a limited form of concurrency between the main body of code and the interrupt handlers. "Critical section," back in the day, referred to a routine in the main body of code that was protected from interference by the interrupt handlers by executing it with interrupts disabled.
Systems programmers today still use "critical section" with the original meaning, but now we also sometimes say it to talk about a routine that is executed by a thread while the thread has a mutex locked.
IMO, "critical section" encourages a somewhat less useful way of thinking about mutex locks though because it's never the code that needs protection from interference. It's always about protecting the integrity of shared data. Sometimes a programmer who worries about defining The critical section can lose sight of the fact that there may be multiple routines in the program that all access the same shared data.
IMO, this is one place where an object-oriented style of programming shines, because it's easier to keep track of what needs to be protected if it is encapsulated in private members of some object and can only be accessed through the object's thread-safe, public methods.

Multithreading on multiple cores/processors

I get the idea that if locking and unlocking a mutex is an atomic operation, it can protect the critical section of code in case of a single processor architecture.
Any thread, which would be scheduled first, would be able to "lock" the mutex in a single machine code operation.
But how are mutexes any good when the threads are running on multiple cores, where different threads could be running simultaneously on different "cores"?
I can't seem to grasp how a multithreaded program can work without any deadlock or race condition on multiple cores.
The general answer:
Mutexes are an operating system concept. An operating system offering mutexes has to ensure that these mutexes work correctly on all hardware that this operating system wants to support. If implementing a mutex is not possible for a specific piece of hardware, the operating system cannot offer mutexes on that hardware. If the operating system requires the existence of mutexes to work correctly, it cannot support that hardware at all. How the operating system implements mutexes for specific hardware is, unsurprisingly, very hardware dependent and varies a lot between operating systems and their supported hardware.
The detailed answer:
Most general purpose CPUs offer atomic operations. These operations are designed to be atomic across all CPU cores within a system, whether these cores are part of a single or multiple individual CPUs.
With as few as two atomic operations, atomic_or and atomic_and, it is possible to implement a lock. E.g. think of
int atomic_or ( int * addr, int val )
It atomically calculates *addr = *addr | val and returns the old value of *addr prior to performing the calculation. If *lock == 0 and multiple threads call atomic_or(lock, 1), then only one of them will get 0 as result; only the first thread to perform that operation. All other threads get 1 as result. The one thread that got 0 is the winner, it has the lock, all other threads register for an event and go to sleep.
The winner thread now has exclusive access to the section following the atomic_or, it can perform the desired work and once it is done, it just clears the lock again (atomic_and(lock, 0)) and generates a system event, that the lock is now available again.
The system will then wake up one, some, or all of the threads that registered for this event before going to sleep, and the race for the lock starts all over. Either one of the woken-up threads wins the race, or possibly none of them does, because yet another thread was even faster and grabbed the lock between the atomic_and and the wake-up. That is okay and still correct, as it is still only one thread that has access. All threads that failed to obtain the lock go back to sleep.
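A minimal C++ sketch of the atomic_or/atomic_and lock described above; for brevity it spins in the loser path instead of registering for a system event and sleeping, which is what a real mutex implementation would do.

#include <atomic>

std::atomic<int> lock_word{0};

void lock()
{
    // fetch_or returns the previous value: only the thread that sees 0 wins.
    while (lock_word.fetch_or(1, std::memory_order_acquire) != 0) {
        // loser: a real mutex would register for an event and sleep here
    }
}

void unlock()
{
    // the atomic_and(lock, 0) from the text: clear the lock word
    lock_word.fetch_and(0, std::memory_order_release);
    // a real mutex would now raise the "lock available" system event
}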
Of course, the actual implementations of modern systems are often much more complicated than that. They may take things like thread priorities into account (high-priority threads may be preferred in the lock race) or might ensure that every thread waiting for a mutex will eventually also get it (precautions exist that prevent a thread from always losing the lock race). Also, mutexes can be recursive, in which case the system ensures that the same thread can obtain the same mutex multiple times without deadlocking, and this requires some additional bookkeeping.
Probably needless to say, but atomic operations are more expensive as they require the cores within a system to synchronize their work, and this slows their processing throughput. They may be somewhat expensive if all cores run on a single CPU, but they may be very expensive if there are multiple CPUs, as the synchronization must take place over the bus system that connects the CPUs with each other, and this bus system usually does not operate at CPU speed.
On the other hand, using mutexes will always slow down processing to begin with as providing exclusive access to resources has to slow down processing if multiple threads ever require access at the same time to continue their work. So for implementing mutexes this is irrelevant. Actually, if you can implement a function in a thread-safe way using just atomic operations instead of full featured mutexes, you will quite often have a noticeable speed benefit, despite these operations being more expensive than normal operations.
Threads are managed by the operating system, which among other things, is responsible for scheduling threads to cores, so it can also avoid scheduling a specific thread onto a core.
A mutex is an operating-system concept. You're basically asking the OS to block a thread until some other thread tells the OS it's OK to continue.
On modern operating systems, threads are an abstraction over the physical hardware. A programmer targets the thread as an abstraction for code execution. There is no separate abstraction for working on a hardware core available. The operating system is responsible for mapping threads to physical cores.
A mutex is a data structure that lives in system memory. Any thread that has access can read that memory position, regardless of what thread or core it is running in. It doesn't matter whether your code is executing on core 1 or 20; it still has the ability to read the current state of the lock.
In other words, regardless of the number of threads or cores, there is only shared system memory for them to act on.

Benefits of user-level threads

I was looking at the differences between user-level threads and kernel-level threads, which I basically understood.
What's not clear to me is the point of implementing user-level threads at all.
If the kernel is unaware of the existence of multiple threads within a single process, then which benefits could I experience?
I have read a couple of articles that stated user-level implementation of threads is advisable only if such threads do not perform blocking operations (which would cause the entire process to block).
This being said, what's the difference between a sequential execution of all the threads and a "parallel" execution of them, considering they cannot take advantage of multiple processors and independent scheduling?
An answer to a previously asked question (similar to mine) was something like:
No modern operating system actually maps n user-level threads to 1
kernel-level thread.
But for some reason, many people on the Internet state that user-level threads can never take advantage of multiple processors.
Could you help me understand this, please?
I strongly recommend Modern Operating Systems 4th Edition by Andrew S. Tanenbaum (starring in shows such as the debate about Linux; also participating: Linus Torvalds). Costs a whole lot of bucks but it's definitely worth it if you really want to know stuff. For eager students and desperate enthusiasts it's great.
Your questions answered
[...] what's not clear to me is the point of implementing User-level threads at all.
Read my post. It is comprehensive, I daresay.
If the kernel is unaware of the existence of multiple threads within a single process, then which benefits could I experience?
Read the section "Disadvantages" below.
I have read a couple of articles that stated that user-level implementation of threads is advisable only if such threads do not perform blocking operations (which would cause the entire process to block).
Read the subsection "No coordination with system calls" in "Disadvantages."
All citations are from the book I recommended in the top of this answer, Chapter 2.2.4, "Implementing Threads in User Space."
Advantages
Enables threads on systems without threads
The first advantage is that user-level threads are a way to work with threads on a system without threads.
The first, and most obvious, advantage is that
a user-level threads package can be implemented on an operating system that does not support threads. All operating systems used to
fall into this category, and even now some still do.
No kernel interaction required
A further benefit is the light overhead when switching threads, as opposed to switching to the kernel mode, doing stuff, switching back, etc. The lighter thread switching is described like this in the book:
When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers (i.e., its own) [...] and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude—maybe more—faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
This efficiency is also nice because it spares us from incredibly heavy context switches and all that stuff.
Individually adjusted scheduling algorithms
Also, since there is no central scheduling algorithm, every process can have its own scheduling algorithm and is far more flexible in its choices. In addition, the "private" scheduling algorithm is far more flexible concerning the information it gets from the threads. The amount of information can be adjusted manually and per process, so it's very finely grained. This is because, again, there is no central scheduling algorithm that needs to fit the needs of every process, which would have to be very general and deliver adequate performance in every case. User-level threads allow an extremely specialized scheduling algorithm.
This is only restricted by the disadvantage "No automatic switching to the scheduler."
They [user-level threads] allow each process to have its own customized scheduling algorithm. For some applications, for example, those with a garbage-collector thread, not having to worry about a thread being stopped at an inconvenient moment is a plus. They also scale better, since kernel threads invariably require some table space and stack space in the kernel, which can be a problem if there are a very large number of threads.
Disadvantages
No coordination with system calls
The user-level scheduling algorithm has no idea if some thread has called a blocking read system call. OTOH, a kernel-level scheduling algorithm would've known because it can be notified by the system call; both belong to the kernel code base.
Suppose that a thread reads from the keyboard before any keys have been hit. Letting the thread actually make the system call is unacceptable, since this will stop all the threads. One of the main goals of having threads in the first place was to allow each one to use blocking calls, but to prevent one blocked thread from affecting the others. With blocking system calls, it is hard to see how this goal can be achieved readily.
He goes on to say that system calls could be made non-blocking, but that would be very inconvenient and compatibility with existing OSes would be drastically hurt.
Mr Tanenbaum also says that the library wrappers around the system calls (as found in glibc, for example) could be modified to predict when a system call blocks using select, but he remarks that this is inelegant.
Building upon that, he says that threads do block often. Often blocking requires many system calls. And many system calls are bad. And without blocking, threads become less useful:
For applications that are essentially entirely CPU bound and rarely block, what is the point of having threads at all? No one would seriously propose computing the first n prime numbers or playing chess using threads because there is nothing to be gained by doing it that way.
Page faults block per-process if unaware of threads
The OS has no notion of threads. Therefore, if a page fault occurs, the whole process will be blocked, effectively blocking all user-level threads.
Somewhat analogous to the problem of blocking system calls is the problem of page faults. [...] If the program calls or jumps to an instruction that is not in memory, a page fault occurs and the operating system will go and get the missing instruction (and its neighbors) from disk. [...] The process is blocked while the necessary instruction is being located and read in. If a thread causes a page fault, the kernel, unaware of even the existence of threads, naturally blocks the entire process until the disk I/O is complete, even though other threads might be runnable.
I think this can be generalized to all interrupts.
No automatic switching to the scheduler
Since there is no per-process clock interrupt, a thread acquires the CPU forever unless some OS-dependent mechanism (such as a context switch) occurs or it voluntarily releases the CPU.
This prevents usual scheduling algorithms from working, including the Round-Robin algorithm.
[...] if a thread starts running, no other thread in that process will ever run unless the first thread voluntarily gives up the CPU. Within a single process, there are no clock interrupts, making it impossible to schedule processes round-robin fashion (taking turns). Unless a thread enters the run-time system of its own free will, the scheduler will never get a chance.
He says that a possible solution would be
[...] to have the run-time system request a clock signal (interrupt) once a second to give it control, but this, too, is crude and messy to program.
I would go even further and say that such a "request" would require some system call to happen, whose drawback is already explained in "No coordination with system calls." Without a system call, the program would need free access to the timer, which is a security hole and unacceptable in modern OSes.
What's not clear to me is the point of implementing user-level threads at all.
User-level threads largely came into the mainstream due to Ada and its requirement for threads (tasks in Ada terminology). At the time, there were few multiprocessor systems and most multiprocessors were of the master/slave variety. Kernel threads simply did not exist. User threads had to be created to implement languages like Ada.
If the kernel is unaware of the existence of multiple threads within a single process, then which benefits could I experience?
If you have kernel threads, multiple threads within a single process can run simultaneously. With user threads, the threads always execute interleaved.
Using threads can simplify some types of programming.
I have read a couple of articles that stated user-level implementation of threads is advisable only if such threads do not perform blocking operations (which would cause the entire process to block).
That is true on Unix, though perhaps not on all Unix implementations. User threads on many operating systems function perfectly fine with blocking I/O.
This being said, what's the difference between a sequential execution of all the threads and a "parallel" execution of them, considering they cannot take advantage of multiple processors and independent scheduling?
With user threads, there is never parallel execution. With kernel threads, there can be parallel execution IF there are multiple processors. On a single-processor system, there is not much advantage to using kernel threads over user threads (contra: note the blocking-I/O issue on Unix and user threads).
But for some reason, many people on the Internet state that user-level threads can never take advantage of multiple processors.
In user threads, the process manages its own "threads" by interleaving execution within itself. The process can only have a thread run in the processor that the process is running in.
If the operating system provides system services to schedule code to run on a different processor, user threads could run on multiple processors.
I conclude by saying that for practical purposes there are no advantages to user threads over kernel threads. There are those who will assert that there are performance advantages, but any such advantage would be system dependent.

Linux thread synchronization

I am new to linux and linux threads. I have spent some time googling to try to understand the differences between all the functions available for thread synchronization. I still have some questions.
I have found all of these different types of synchronizations, each with a number of functions for locking, unlocking, testing the lock, etc.
gcc atomic operations
futexes
mutexes
spinlocks
seqlocks
rculocks
conditions
semaphores
My current (but probably flawed) understanding is this:
semaphores are process wide, involve the filesystem (virtually I assume), and are probably the slowest.
Futexes might be the base locking mechanism used by mutexes, spinlocks, seqlocks, and rculocks. Futexes might be faster than the locking mechanisms that are based on them.
Spinlocks don't block and thus avoid context switches. However, they avoid the context switch at the expense of consuming all the cycles on a CPU until the lock is released (spinning). They should only be used on multiprocessor systems for obvious reasons. Never sleep in a spinlock.
The seq lock just tells you when you finished your work if a writer changed the data the work was based on. You have to go back and repeat the work in this case.
Atomic operations are the fastest synchronization calls, and are probably used in all the above locking mechanisms. You do not want to use atomic operations on all the fields in your shared data. You want to use a lock (mutex, futex, spin, seq, rcu) or a single atomic operation on a lock flag when you are accessing multiple data fields.
My questions go like this:
Am I right so far with my assumptions?
Does anyone know the CPU cycle cost of the various options? I am adding parallelism to the app so we can get better wall-time response at the expense of running fewer app instances per box. Performance is the utmost consideration. I don't want to consume CPU with context switching, spinning, or lots of extra CPU cycles to read and write shared memory. I am absolutely concerned with the number of CPU cycles consumed.
Which (if any) of the locks prevent interruption of a thread by the scheduler or interrupt...or am I just an idiot and all synchronization mechanisms do this? What kinds of interruption are prevented? Can I block all threads, or just threads on the locking thread's CPU? This question stems from my fear of interrupting a thread holding a lock for a very commonly used function. I expect that the scheduler might schedule any number of other workers who will likely run into this function and then block because it was locked. A lot of context switching would be wasted until the thread with the lock gets rescheduled and finishes. I can rewrite this function to minimize lock time, but still, it is so commonly called that I would like to use a lock that prevents interruption...across all processors.
I am writing user code...so I get software interrupts, not hardware ones...right? I should stay away from any functions (spin/seq locks) that have the word "irq" in them.
Which locks are for writing kernel or driver code and which are meant for user mode?
Does anyone think using an atomic operation to have multiple threads move through a linked list is nuts? I am thinking to atomically change the current item pointer to the next item in the list. If the attempt works, then the thread can safely use the data the current item pointed to before it was moved. Other threads would now be moved along the list.
futexes? Any reason to use them instead of mutexes?
Is there a better way than using a condition to sleep a thread when there is no work?
When using gcc atomic ops, specifically test_and_set, can I get a performance increase by doing a non-atomic test first and then using test_and_set to confirm? I know this will be case specific, so here is the case. There is a large collection of work items, say thousands. Each work item has a flag that is initialized to 0. When a thread has exclusive access to the work item, the flag will be one. There will be lots of worker threads. Any time a thread is looking for work, it can non-atomically test for 1. If it reads a 1, we know for certain that the work is unavailable. If it reads a zero, it needs to perform the atomic test_and_set to confirm. So if the atomic test_and_set is 500 CPU cycles because it is disabling pipelining, causing CPUs to communicate and L2 caches to flush/fill... and a simple test is 1 cycle... then as long as I had a better ratio than 500 to 1 when it came to stumbling upon already completed work items, this would be a win.
I hope to use mutexes or spinlocks sparingly to protect sections of code that I want only one thread on the SYSTEM (not just the CPU) to access at a time. I hope to sparingly use gcc atomic ops to select work and minimize use of mutexes and spinlocks. For instance: a flag in a work item can be checked to see if a thread has worked it (0 = no, 1 = yes or in progress). A simple test_and_set tells the thread if it has work or needs to move on. I hope to use conditions to wake up threads when there is work.
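A hedged sketch of the "plain test first, then test_and_set" idea from the two paragraphs above, using the GCC builtins the question mentions; the work-item layout and the function name are invented for illustration.

struct work_item {
    int taken;          /* 0 = free, 1 = claimed or done */
    /* ... payload ... */
};

/* Returns 1 if the calling thread claimed the item, 0 otherwise. */
int try_claim(struct work_item *w)
{
    /* cheap pre-check: a relaxed atomic load keeps it as close to a plain
       read as possible while avoiding a formal data race */
    if (__atomic_load_n(&w->taken, __ATOMIC_RELAXED) != 0)
        return 0;
    /* __sync_lock_test_and_set atomically stores 1 and returns the old
       value, so only the one thread that sees 0 here has claimed the item. */
    return __sync_lock_test_and_set(&w->taken, 1) == 0;
}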
Thanks!
Application code should probably use POSIX thread functions. I assume you have man pages, so type:
man pthread_mutex_init
man pthread_rwlock_init
man pthread_spin_init
Read up on them and the functions that operate on them to figure out what you need.
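For instance, the basic pthread mutex pattern those pages describe looks like this; the shared counter is just a placeholder.

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
long shared_counter = 0;      /* placeholder shared data */

void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);   /* enter the critical section */
    shared_counter++;         /* only one thread at a time runs this */
    pthread_mutex_unlock(&m); /* leave the critical section */
    return 0;
}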
If you're doing kernel mode programming then it's a different story. You'll need to have a feel for what you are doing, how long it takes, and what context it gets called in to have any idea what you need to use.
Thanks to all who answered. We resorted to using gcc atomic operations to synchronize all of our threads. The atomic ops were about 2x slower than setting a value without synchronization, but magnitudes faster than locking a mutex, changing the value, and then unlocking the mutex (this becomes super slow once threads start banging into the locks...). We only use pthread_create, attr, cancel, and kill. We use pthread_kill to signal threads to wake up that we put to sleep. This method is 40x faster than cond_wait. So basically... use pthread mutexes if you have time to waste.
In addition, you should check these books:
Pthreads Programming: A POSIX Standard for Better Multiprocessing
Programming with POSIX(R) Threads
Regarding question #8:
Is there a better way than using a condition to sleep a thread when there is no work?
Yes, I think that the best approach, instead of using sleep, is to use functions like sem_post() and sem_wait() from "semaphore.h".
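A minimal sketch of that pattern with POSIX semaphores, assuming one work queue feeding several workers; the queue itself is omitted and would still need its own protection.

#include <semaphore.h>

sem_t work_available;   /* initialise once with sem_init(&work_available, 0, 0) */

void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&work_available);   /* sleeps until someone posts */
        /* ... take one item off the work queue and process it ... */
    }
    return 0;
}

void submit_work(void)
{
    /* ... put an item on the work queue ... */
    sem_post(&work_available);       /* wakes one sleeping worker */
}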
A note on futexes - they are more descriptively called fast userspace mutexes. With a futex, the kernel is involved only when arbitration is required, which is what provides the speed up and savings.
Implementing a futex can be extremely tricky (PDF); debugging them can lead to madness. Unless you really, really, really need the speed, it's usually best to use the pthread mutex implementation.
Synchronization is never exactly easy, but trying to implement your own in userspace makes it inordinately difficult.
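For illustration only, here is a deliberately simplified futex-based lock in the spirit described above; it always issues a wake syscall on unlock and ignores the subtleties (waiter counting, spurious wakeups, robustness) that the PDF linked above explains, so treat it as a sketch, not something to ship. Linux-only.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int futex_word = 0;   /* 0 = unlocked, 1 = locked */

static long sys_futex(int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, (void *)0, (void *)0, 0);
}

void futex_lock(void)
{
    /* Try to grab the lock; if it was already held, ask the kernel to put
       us to sleep while the word is still 1, then retry. */
    while (__atomic_exchange_n(&futex_word, 1, __ATOMIC_ACQUIRE) != 0)
        sys_futex(&futex_word, FUTEX_WAIT, 1);
}

void futex_unlock(void)
{
    __atomic_store_n(&futex_word, 0, __ATOMIC_RELEASE);
    /* Wake at most one waiter. A real implementation tracks whether any
       waiters exist so the uncontended path needs no syscall at all. */
    sys_futex(&futex_word, FUTEX_WAKE, 1);
}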

Critical sections with multicore processors

With a single-core processor, where all your threads are run from the one single CPU, the idea of implementing a critical section using an atomic test-and-set operation on some mutex (or semaphore or etc) in memory seems straightforward enough; because your processor is executing a test-and-set from one spot in your program, it necessarily can't be doing one from another spot in your program disguised as some other thread.
But what happens when you do actually have more than one physical processor? It seems that simple instruction-level atomicity wouldn't be sufficient, because with two processors potentially executing their test-and-set operations at the same time, what you really need to maintain atomicity on is access to the shared memory location of the mutex. (And if the shared memory location is loaded into cache, there's the whole cache-consistency thing to deal with, too.)
This seems like it would incur far more overhead than the single core case, so here's the meat of the question: How much worse is it? Is it worse? Do we just live with it? Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
Multi-core/SMP systems are not just several CPUs glued together. There's explicit support for doing things in parallel. All the synchronization primitives are implemented with the help of hardware along the lines of atomic CAS. The instruction either locks the bus shared by the CPUs and the memory controller (and devices that do DMA) and updates the memory, or just updates the memory relying on cache snooping. This in turn causes the cache-coherency algorithm to kick in, forcing all involved parties to flush their caches. Disclaimer: this is a very basic description; there are more interesting things here like virtual vs. physical caches, cache write-back policies, memory models, fences, etc.
If you want to know more about how OS might use these hardware facilities - here's an excellent book on the subject.
The vendors of multi-core CPUs have to take care that the different cores coordinate themselves when executing instructions which guarantee atomic memory access.
On Intel chips, for instance, you have the cmpxchg instruction. It compares the value stored at a memory location to an expected value and exchanges it for the new value if the two match. If you precede it with the lock prefix, it is guaranteed to be atomic with respect to all cores.
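For example, here is a minimal spin-wait sketch; on x86, compilers typically turn the compare-exchange below into a lock cmpxchg. The flag name is invented for illustration.

#include <atomic>

std::atomic<int> spin_flag{0};   // 0 = free, 1 = held

void acquire()
{
    int expected = 0;
    // atomically: "if spin_flag == 0, set it to 1", otherwise retry
    while (!spin_flag.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
        expected = 0;   // compare_exchange wrote the observed value; reset and retry
    }
}

void release()
{
    spin_flag.store(0, std::memory_order_release);
}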
You would need a test-and-set that forces the processor to notify all the other cores of the operation so that they are aware. Yes, that introduces an overhead and you have to live with it. It's a reason to design multithreaded applications in such a way that they don't wait for synchronization primitives too often.
Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
That would defeat the whole point of multithreading. When you are using a lock, semaphore, or other synchronization technique, you are relying on the OS to make sure that these operations are interlocked, no matter how many cores you are using.
The time to switch to a different thread after a lock has been released is mostly determined by the cost of a context switch. This SO thread deals with the context switching overhead, so you might want to check that.
There are some other interesting threads also:
What are the differences between various threading synchronization options in C#?
Threading best practices
You should read this MSDN article also: Understanding the Impact of Low-Lock Techniques in Multithreaded Apps.
Memory accesses are handled by the memory controller, which should take care of multi-core issues, i.e. it shouldn't allow simultaneous access to the same addresses (probably handled on either a memory-page or memory-line basis). So you can use a flag to indicate whether another processor is updating the memory contents of some block (to avoid a type of dirty read where part of the record is updated, but not all of it).
A more elegant solution is to use a HW semaphore block if the processor has such a feature. A HW semaphore is a simple queue which could be of size no_of_cores -1. This is how it is in TI's 6487/8 processor. You can either query the semaphore directly (and loop until it is released) or do an indirect query which will result in an interrupt once your core gets the resource. The requests are queued and served in the order they were made. A semaphore query is an atomic operation.
Cache consistency is another issue and you might need to do cache writebacks and refreshes in some cases. But this is a very cache implementation specific thing. With 6487/8 we needed to do that on a few operations.
Well, depending on what type of computers you have laying around the house, do the following: Write a simple multithreaded application. Run this application on a single core (Pentium 4 or Core Solo) and then run it on a multicore processor (Core 2 Duo or similar) and see how big the speed up is.
Granted these are unfair comparisons since Pentium 4 and Core Solo are much slower regardless of cores than a Core 2 Duo. Maybe compare between a Core 2 Duo and a Core 2 Quad with an application that can use 4 or more threads.
You raise a number of valid points. Multiple processors introduce a lot of headache and overhead. However, we just have to live with them, because the speed boost of parallelism can far outweigh them, if the critical sections are made long enough.
As for your final suggestion about having all threads on the same physical core, that completely defeats the point of a multi-core computer!
