rcu_read_lock and x86-64 memory ordering - linux

On a preemptible SMP kernel, rcu_read_lock compiles the following:
current->rcu_read_lock_nesting++;
barrier();
With barrier being a compiler directive that compiles to nothing.
So, according to Intel's X86-64 memory ordering white paper:
Loads may be reordered with older stores to different locations
why is the implementation actually OK?
Consider the following situation:
rcu_read_lock();
read_non_atomic_stuff();
rcu_read_unlock();
What prevents read_non_atomic_stuff from "leaking" forward past rcu_read_lock, causing it to run concurrently with the reclamation code running in another thread?

For observers on other CPUs, nothing prevents this. You're right, StoreLoad reordering of the store part of ++ can make it globally visible after some of your loads.
Thus we can conclude that current->rcu_read_lock_nesting is only ever observed by code running on this core, or that has remotely triggered a memory barrier on this core by getting scheduled here, or with a dedicated mechanism for getting all cores to execute a barrier in a handler for an inter-processor interrupt (IPI). e.g. similar to the membarrier() user-space system call.
If this core starts running another task, that task is guaranteed to see this task's operations in program order. (Because it's on the same core, and a core always sees its own operations in order.) Also, context switches might involve a full memory barrier so tasks can be resumed on another core without breaking single-threaded logic. (That would make it safe for any core to look at rcu_read_lock_nesting when this task / thread is not running anywhere.)
Notice that the kernel starts one RCU task per core of your machine; e.g. ps output shows [rcuc/0], [rcuc/1], ..., [rcu/7] on my 4c8t quad core. Presumably they're an important part of this design that lets readers be wait-free with no barriers.
I haven't looked into full details of RCU, but one of the "toy" examples in
https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt is "classic RCU" that implements synchronize_rcu() as for_each_possible_cpu(cpu) run_on(cpu);, to get the reclaimer to execute on every core that might have done an RCU operation (i.e. every core). Once that's done, we know that a full memory barrier must have happened in there somewhere as part of the switching.
So yes, RCU doesn't follow the classic method where you'd need a full memory barrier (including StoreLoad) to make the core wait until the first store was visible before doing any reads. RCU avoids the overhead of a full memory barrier in the read path. This is one of the major attractions for it, besides the avoidance of contention.

Related

What happens to expected memory semantics (such as read after write) when a thread is scheduled on a different CPU core?

Code within a single thread has certain memory guarantees, such as read after write (i.e. writing some value to a memory location, then reading it back should give the value you wrote).
What happens to such memory guarantees if a thread is rescheduled to execute on a different CPU core? Say a thread writes 10 to memory location X, then gets rescheduled to a different core. That core's L1 cache might have a different value for X (from another thread that was executing on that core previously), so now a read of X wouldn't return 10 as the thread expects. Is there some L1 cache synchronization that occurs when a thread is scheduled on a different core?
All that is required in this case is that the writes performed while on the first processor become globally visible before the process begins executing on the second processor. In the Intel 64 architecture this is accomplished by including one or more instructions with memory fence semantics in the code that the OS uses to transfer the process from one core to another. An example from the Linux kernel:
/*
* Make previous memory operations globally visible before
* sending the IPI through x2apic wrmsr. We need a serializing instruction or
* mfence for this.
*/
static inline void x2apic_wrmsr_fence(void)
{
asm volatile("mfence" : : : "memory");
}
This ensures that the stores from the original core are globally visible before execution of the inter-processor interrupt that will start the thread running on the new core.
Reference: Sections 8.2 and 8.3 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-071, October 2019).
TL;DR: It depends on the architecture and the OS. On x86, this type of read-after-write hazard is mostly not issue that has to be considered at the software level, except for the weakly-order WC stores which require a store fence to be executed in software on the same logical core before the thread is migrated.
Usually the thread migration operation includes at least one memory store. Consider an architecture with the following property:
The memory model is such that memory stores may not become globally observable in program order. This Wikipedia article has a not-accurate-but-good-enough table that shows examples of architectures that have this property (see the row "Stores can be reordered after stores ").
The ordering hazard you mentioned may be possible on such an architecture because even if the thread migration operation completes, it doesn't necessarily mean that all the stores that the thread has performed are globally observable. On architectures with strict sequential store ordering, this hazard cannot occur.
On a completely hypothetical architecture where it's possible to migrate a thread without doing a single memory store (e.g., by directly transferring the thread's context to another core), the hazard can occur even if all stores are sequential on an architecture with the following property:
There is a "window of vulnerability" between the time when a store retires and when it becomes globally observable. This can happen, for example, due to the presence of store buffers and/or MSHRs. Most modern processors have this property.
So even with sequential store ordering, it may be possible that the thread running on the new core may not see the last N stores.
Note that on an machine with in-order retirement, the window of vulnerability is a necessary but insufficient condition for a memory model that supports stores that may not be sequential.
Usually a thread is rescheduled to run on a different core using one of the following two methods:
A hardware interrupt, such as a timer interrupt, occurs that ultimately causes the thread to be rescheduled on a different logical core.
The thread itself performs a system call, such as sched_setaffinity, that ultimately causes it to run on a different core.
The question is at which point does the system guarantee that retired stores become globally observable? On Intel and AMD x86 processors, hardware interrupts are fully serializing events, so all user-mode stores (including cacheable and uncacheable) are guaranteed to be globally observable before the interrupt handler is executed, in which the thread may be rescheduled to run a different logical core.
On Intel and AMD x86 processors, there are multiple ways to perform system calls (i.e., change the privilege level) including INT, SYSCALL, SYSENTER, and far CALL. None of them guarantee that all previous stores become globally observable. Therefore, the OS is supposed to do this explicitly when scheduling a thread on a different core by executing a store fence operation. This is done as part of saving the thread context (architectural user-mode registers) to memory and adding the thread to the queue associated with the other core. These operations involve at least one store that is subject to the sequential ordering guarantee. When the scheduler runs on the target core, it would see the full register and memory architectural state (at the point of the last retired instruction) of the thread would be available on that core.
On x86, if the thread uses stores of type WC, which do not guarantee the sequential ordering, the OS may not guarantee in this case that it will make these stores globally observable. The x86 spec explicitly states that in order to make WC stores globally observable, a store fence has to be used (either in the thread on the same core or, much simpler, in the OS). An OS generally should do this, as mentioned in #JohnDMcCalpin's answer. Otherwise, if the OS doesn't provide the program order guarantee to software threads, then the user-mode programmer may need to take this into account. One way would be the following:
Save a copy of the current CPU mask and pin the thread to the current core (or any single core).
Execute the weakly-ordered stores.
Execute a store fence.
Restore the CPU mask.
This temporarily disables migration to ensure that the store fence is executed on the same core as the weakly-ordered stores. After executing the store fence, the thread can safely migrate without possibly violating program order.
Note that user-mode sleep instructions, such as UMWAIT, cannot cause the thread to be rescheduled on a different core because the OS does not take control in this case.
Thread Migration in the Linux Kernel
The code snippet from #JohnDMcCalpin's answer falls on the path to send an inter-processor interrupt, which is achieved using a WRMSR instruction to an APIC register. An IPI may be sent for many reasons. For example, to perform a TLB shootdown operation. In this case, it's important to ensure that the updated paging structures are globally observable before invaliding the TLB entries on the other cores. That's why x2apic_wrmsr_fence may be needed, which is invoked just before sending an IPI.
That said, I don't think thread migration requires sending an IPI. Essentially, a thread is migrated by removing it from some data structure that is associated with one core and add it to the one associated with the target core. A thread may be migrated for numerous reasons, such as when the affinity changes or when the scheduler decides to rebalance the load. As mentioned in the Linux source code, all paths of thread migration in the source code end up executing the following:
stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg)
where arg holds the task to be migrated and the destination core identifier. migration_cpu_stop is a function that does the actual migration. However, the task to be migrated may be currently running or waiting in some runqueue to run on the source core (i.e, the core on which the task is currently scheduled). It's required to stop the task before the migrating it. This is achieved by adding the call to the function migration_cpu_stop to the queue of the stopper task associated with the source core. stop_one_cpu then sets the stopper task as ready for execution. The stopper task has the highest priority. So on the next timer interrupt on the source core (Which could be the same as the current core), one of the tasks with the highest priority will be selected to run. Eventually, the stopper task will run and it will execute migration_cpu_stop, which in turn performs the migration. Since this process involves a hardware interrupt, all stores of the target task are guaranteed to be globally observable.
There appears to be a bug in x2apic_wrmsr_fence
The purpose of x2apic_wrmsr_fence is to make all previous stores globally observable before sending the IPI. As discussed in this thread, SFENCE is not sufficient here. To see why, consider the following sequence:
store
sfence
wrmsr
The store fence here can order the preceding store operation, but not the MSR write. The WRMSR instruction doesn't have any serializing properties when writing to an APIC register in x2APIC mode. This is mentioned in the Intel SDM volume 3 Section 10.12.3:
To allow for efficient access to the APIC registers in x2APIC mode,
the serializing semantics of WRMSR are relaxed when writing to the
APIC registers.
The problem here is that MFENCE is also not guaranteed to order the later WRMSR with respect to previous stores. On Intel processors, it's documented to only order memory operations. Only on AMD processors it's guaranteed to be fully serializing. So to make it work on Intel processors, there needs to be an LFENCE after the MFENCE (SFENCE is not ordered with LFENCE, so MFENCE must be used even though we don't need to order loads). Actually Section 10.12.3 mentions this.
If a platform is going to support moving a thread from one core to another, whatever code does that moving must respect whatever guarantees a thread is allowed to rely on. If a thread is allowed to rely on the guarantee that a read after a write will see the updated value, then whatever code migrates a thread from one core to another must ensure that guarantee is preserved.
Everything else is platform specific. If a platform has an L1 cache then hardware must make that cache fully coherent or some form of invalidation or flushing will be necessary. On most typical modern processors, hardware makes the cache only partially coherent because reads can also be prefetched and writes can be posted. On x86 CPUs, special hardware magic solves the prefetch problem (the prefetch is invalidated if the L1 cache line is invalidated). I believe the OS and/or scheduler has to specifically flush posted writes, but I'm not entirely sure and it may vary based on the exact CPU.
The CPU goes to great cost to ensure that a write will always see a previous read in the same instruction stream. For an OS to remove this guarantee and require all user-space code to work without it would be a complete non-starter since user-space code has no way to know where in its code it might get migrated.
Adding my two bits here. On first glance, a barrier seems like an overkill (answers above)
Consider this logic: when a thread wants to write to a cacheline, HW cache coherence kicks in and we need to invalidate all other copies of the cacheline that are present with other cores in the system; the write doesn't proceed without the invalidations. When a thread is re-scheduled to a different core then, it will have to fetch the cacheline from the L1-cache that has write permission thereby maintaining read-after-write sequential behavior.
The problem with this logic is that invalidations from cores aren't applied immediately, hence it is possible to read a stale value after being rescheduled (the read to the new L1-cache somehow beats the pending invalidation present in a queue with that core). This is ok for different threads because they are allowed to slip and slide, but with the same thread a barrier becomes essential.

Why threads implemented in kernel space are slow?

When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude-maybe more-faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
Source: Modern Operating Systems (Andrew S. Tanenbaum | Herbert Bos)
The above argument is made in favor of user-level threads. The user-level thread implementation is depicted as kernel managing all the processes, where individual processes can have their own run-time (made available by a library package) that manages all the threads in that process.
Of course, merely calling a function in the run-time than trapping to kernel might have a few less instructions to execute but why the difference is so huge?
For example, if threads are implemented in kernel space, every time a thread has to be created the program is required to make a system call. Yes. But the call only involves adding an entry to the thread table with certain attributes (which is also the case in user space threads). When a thread switch has to happen, kernel can simply do what the run-time (at user-space) would do. The only real difference I can see here is that the kernel is being involved in all this. How can the performance difference be so significant?
Threads implemented as a library package in user space perform significantly better. Why?
They're not.
The fact is that most task switches are caused by threads blocking (having to wait for IO from disk or network, or from user, or for time to pass, or for some kind of semaphore/mutex shared with a different process, or some kind of pipe/message/packet from a different process) or caused by threads unblocking (because whatever they were waiting for happened); and most reasons to block and unblock involve the kernel in some way (e.g. device drivers, networking stack, ...); so doing task switches in kernel when you're already in the kernel is faster (because it avoids the overhead of switching to user-space and back for no sane reason).
Where user-space task switching "works" is when kernel isn't involved at all. This mostly only happens when someone failed to do threads properly (e.g. they've got thousands of threads and coarse-grained locking and are constantly switching between threads due to lock contention, instead of something sensible like a "worker thread pool"). It also only works when all threads are the same priority - you don't want a situation where very important threads belonging to one process don't get CPU time because very unimportant threads belonging to a different process are hogging the CPU (but that's exactly what happens with user-space threading because one process has no idea about threads belonging to a different process).
Mostly; user-space threading is a silly broken mess. It's not faster or "significantly better"; it's worse.
When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude-maybe more-faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
This is talking about a situation where the CPU itself does the actual task switch (and either the kernel or a user-space library tells the CPU when to do a task switch to what). This has some relatively interesting history behind it...
In the 1980s Intel designed a CPU ("iAPX" - see https://en.wikipedia.org/wiki/Intel_iAPX_432 ) for "secure object oriented programming"; where each object has its own isolated memory segments and its own privilege level, and can transfer control directly to other objects. The general idea being that you'd have a single-tasking system consisting of global objects using cooperating flow control. This failed for multiple reasons, partly because all the protection checks ruined performance, and partly because the majority of software at the time was designed for "multi-process preemptive time sharing, with procedural programming".
When Intel designed protected mode (80286, 80386) they still had hopes for "single-tasking system consisting of global objects using cooperating flow control". They included hardware task/object switching, local descriptor table (so each task/object can have its own isolated segments), call gates (so tasks/objects can transfer control to each other directly), and modified a few control flow instructions (call far and jmp far) to support the new control flow. Of course this failed for the same reason iAPX failed; and (as far as I know) nobody has ever used these things for the "global objects using cooperative flow control" they were originally designed for. Some people (e.g. very early Linux) did try to use the hardware task switching for more traditional "multi-process preemptive time sharing, with procedural programming" systems; but found that it was slow because the hardware task switch did too many protection checks that could be avoided by software task switching and saved/reloaded too much state that could be avoided by a software task switching;p and didn't do any of the other stuff needed for a task switch (e.g. keeping statistics of CPU time used, saving/restoring debug registers, etc).
Now.. Andrew S. Tanenbaum is a micro-kernel advocate. His ideal system consists of isolated pieces in user-space (processes, services, drivers, ...) communicating via. synchronous messaging. In practice (ignoring superficial differences in terminology) this "isolated pieces in user-space communicating via. synchronous messaging" is almost entirely identical to Intel's twice failed "global objects using cooperative flow control".
Mostly; in theory (if you ignore all the practical problems, like CPU not saving all of the state, and wanting to do extra work on task switches like tracking statistics), for a specific type of OS that Andrew S. Tanenbaum prefers (micro-kernel with synchronous message passing, without any thread priorities), it's plausible that the paragraph quoted above is more than just wishful thinking.
I think the answer to this can use a lot of OS and parallel distributive computing knowledge (And I am not sure about the answer but I will try my best)
So if you think about it. The library package will have a greater amount of performance than you write in the kernel itself. In the package thing, interrupt given by this code will be held at once and al the execution will be done. While when you write in kernel different other interrupts can come before. Plus accessing threads again and again is harsh on the kernel since everytime there will be an interrupt. I hope it will be a better view.
it's not correct to say the user-space threads are better that the kernel-space threads since each one has its own pros and cons.
in terms of user-space threads, as the application is responsible for managing thread, its easier to implement such threads and that kind of threads have not much reliance on OS. however, you are not able to use the advantages of multi processing.
In contrary, the kernel space modules are handled by OS, so you need to implement them according to the OS that you use, and it would be a more complicated task. However, you have more control over your threads.
for more comprehensive tutorial, take a look here.

Why doesn't the instruction reorder issue occur on a single CPU core?

From this post:
Two threads being timesliced on a single CPU core won't run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
Why can't the instruction reorder issue occur on a single CPU core? This article doesn't explain it.
EXAMPLE:
The following pictures are picked from Memory Reordering Caught in the Act:
Below is recorded:
I think the recorded instructions can also cause issue on a single CPU, because both r1 and r2 aren't 1.
A single core always knows about its own reordering and will properly resolve all its own memory accesses.
A single CPU core does reorder, but it knows it's own reordering, and can do clever tricks to pretend it's not. Thus, things go faster, without weird side effects.
Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
When a CPU reorders, the other CPUs can't compensate for this. Imagine if CPU #1 is waiting for a write to variableA, then it reads from variableB. If CPU#2 wrotes to variableB, then variableA like the code says, no problems occur. If CPU#2 reorders to write to variableA first, then CPU#1 doesn't know and tries to read from variableB before it has a value. This can cause crashes or any "random" behavior. (Intel chips have more magic that makes this not happen)
Two threads being timesliced on a single CPU core won't run into a reordering problem.
If both threads are on the same CPU, then it doesn't matter which order the writes happen in, because if they're reordered, then they're both in progress, and the CPU won't really switch until both are written, in which case they're safe to read from the other thread.
Example
For the code to have a problem on a single core, it would have to rearrange the two instructions from process 1 and be interrupted by process 2 and execute that between the two instructions. But if interrupted between them, it knows it has to abort both of them since it knows about it's own reordering, and knows it's in a dangerous state. So it will either do them in order, or do both before switching to process 2, or do neither before switching to process 2. All of which avoid the reordering problem.
There are multiple effects at work, but they are modeled as just one effect. Makes it easier to reason about them. Yes, a modern core already re-orders instructions by itself. But it maintains logical flow between them, if two instructions have an inter-dependency between them then they stay ordered so the logic of the program does not change. Discovering these inter-dependencies and preventing an instruction from being issued too early is the job of the reorder buffer in the execution engine.
This logic is solid and can be relied upon, it would be next to impossible to write a program if that wasn't the case. But that same guarantee cannot be provided by the memory controller. It has the un-enviable job of giving multiple processors access to the same shared memory.
First is the prefetcher, it reads data from memory ahead of time to ensure the data is available by the time a read instruction executes. Ensures the core won't stall waiting for the read to complete. With the problem that, because memory was read early, it might be a stale value that was changed by another core between the time the prefetch was done and the read instruction executes. To an outside observer it looks like the instruction executed early.
And the store buffer, it takes the data of a write instruction and writes it lazily to memory. Later, after the instruction executed. Ensures the core won't stall waiting on the memory bus write cycle to complete. To an outside observer, it just looks like the instruction executed late.
Modeling the effects of the prefetcher and store buffer as instruction reordering effects is very convenient. You can write that down on a piece of paper easily and reason about the side-effects.
To the core itself, the effects of the prefetcher and store buffer are entirely benign and it is oblivious to them. As long as there isn't another core that's also changing memory content. A machine with a single core always has that guarantee.

Memory barrier in a single thread

I have this question related to memory barriers.
In a multi-threaded applications a memory barrier must be used if data is shared between them , because a write in a thread that is runing on one core , may not be seen by another thread on an another core.
From what I read from other explanations of memory barriers, it was said that if you have a single thread working with some data you don't need a memory barrier.
And here is my question: it could be the case that a thread modifies some data on a specific core, and then after some time the scheduler decides to migrate that thread to another core.
Is it possible that this thread will not see its modifications done on the other core?
In principle: Yes, if program execution moves from one core to the next, it might not see all writes that occurred on the previous core.
Keep in mind though that processes don't switch cores by themselves. It is the operating system that preempts execution and moves the thread to a new core. Thus it is also the operating system's responsibility to ensure that memory operations are properly synchronized when performing a context switch.
For you as a programmer this means that, as long as you are not trying work on a level where there is no SMP-aware OS (for instance, when you are trying to write your own OS or when working on an embedded platform without a fully-fledged OS), you do not need to worry about synchronization issues for this case.
The OS is responsible of memory coherency or visibility in additonal to memory ordering after a thread migration. a.k.a, below test always passes:
int a = A
/* migration here */
assert(a == A)

Critical sections with multicore processors

With a single-core processor, where all your threads are run from the one single CPU, the idea of implementing a critical section using an atomic test-and-set operation on some mutex (or semaphore or etc) in memory seems straightforward enough; because your processor is executing a test-and-set from one spot in your program, it necessarily can't be doing one from another spot in your program disguised as some other thread.
But what happens when you do actually have more than one physical processor? It seems that simple instruction level atomicity wouldn't be sufficient, b/c with two processors potentially executing their test-and-set operations at the same time, what you really need to maintain atomicity on is access to the shared memory location of the mutex. (And if the shared memory location is loaded into cache, there's the whole cache consistency thing to deal with, too..)
This seems like it would incur far more overhead than the single core case, so here's the meat of the question: How much worse is it? Is it worse? Do we just live with it? Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
Multi-core/SMP systems are not just several CPUs glued together. There's explicit support for doing things in parallel. All the synchronization primitives are implemented with the help of hardware along the lines of atomic CAS. The instruction either locks the bus shared by CPUs and memory controller (and devices that do DMA) and updates the memory, or just updates the memory relying on cache snooping. This in turn causes cache coherency algorithm to kick in forcing all involved parties to flush their caches.Disclaimer - this is very basic description, there are more interesting things here like virtual vs. physical caches, cache write-back policies, memory models, fences, etc. etc.
If you want to know more about how OS might use these hardware facilities - here's an excellent book on the subject.
The vendor of multi-core cpus has to take care that the different cores coordinate themselves when executing instructions which guarantee atomic memory access.
On intel chips for instance you have the 'cmpxchg' instruction. It compares the value stored at a memory location to an expected value and exchanges it for the new value if the two match. If you precede it with the 'lock' instruction, it is guaranteed to be atomic with respect to all cores.
You would need a test-and-set that forces the processor to notify all the other cores of the operation so that they are aware. Yes, that introduces an overhead and you have to live with it. It's a reason to design multithreaded applications in such a way that they don't wait for synchronization primitives too often.
Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
That would cancel the whole point of multithreading. When you are using a lock, semaphore, or other syncronization techniques, you are relying on OS to make sure that these operations are interlocked, no matter how many cores you are using.
The time to switch to a different thread after a lock has been released is mostly determined by the cost of a context switch. This SO thread deals with the context switching overhead, so you might want to check that.
There are some other interesting threads also:
What are the differences between various threading synchronization options in C#?
Threading best practices
You should read this MSDN article also: Understanding the Impact of Low-Lock Techniques in Multithreaded Apps.
Memory accesses are handled by the memory controller which should take care of multi-core issues, i.e. it shouldn't allow simultaneous access to same addresses (probably handled either by memory page or memory line basis). So you can use a flag to indicate whether another processor is updating the memory contents of some block (this to avoid a type of dirty read where part of the record is updated, but not all).
A more elegant solution is to use a HW semaphore block if the processor has such a feature. A HW semaphore is a simple queue which could be of size no_of_cores -1. This is how it is in TI's 6487/8 processor. You can either query the semaphore directly (and loop until it is released) or do an indirect query which will result in an interrupt once your core gets the resource. The requests are queued and served in the order they were made. A semaphore query is an atomic operation.
Cache consistency is another issue and you might need to do cache writebacks and refreshes in some cases. But this is a very cache implementation specific thing. With 6487/8 we needed to do that on a few operations.
Well, depending on what type of computers you have laying around the house, do the following: Write a simple multithreaded application. Run this application on a single core (Pentium 4 or Core Solo) and then run it on a multicore processor (Core 2 Duo or similar) and see how big the speed up is.
Granted these are unfair comparisons since Pentium 4 and Core Solo are much slower regardless of cores than a Core 2 Duo. Maybe compare between a Core 2 Duo and a Core 2 Quad with an application that can use 4 or more threads.
You raise a number of valid points. Muliple processors introduce a lot of headache and overhead. However, we just have to live with them, because the speed boost of parallelism can far outweigh them, if the critical sections are made long enough.
As for your final suggestion about having all threads on the same physical core, that completely defeats the point of a multi-core computer!

Resources