What's the point of cache coherency? - multithreading

On CPUs like x86, which provide cache coherency, how is this useful from a practical perspective? I understand that the idea is to make memory updates done on one core immediately visible on all other cores. This is a useful property. However, one can't rely too heavily on it if not writing in assembly language, because the compiler can store variable assignments in registers and never write them to memory. This means that one must still take explicit steps to make sure that stuff done in other threads is visible in the current thread. Therefore, from a practical perspective, what has cache coherency achieved?

The short story is that non-cache-coherent systems are exceptionally difficult to program, especially if you want to maintain efficiency - which is also the main reason even most NUMA systems today are cache-coherent.
If the caches weren't coherent, the "explicit steps" would have to enforce the coherency themselves - explicit steps are usually things like critical sections/mutexes (e.g. volatile in C/C++ is rarely enough). It's quite hard, if not impossible, for services such as mutexes to keep track of only the memory that has changed and needs to be updated in all the caches - they would probably have to update all the memory, and that is assuming they could even track which cores have which pieces of that memory in their caches.
Presumably the hardware can do a much better and more efficient job of tracking which memory addresses/ranges have changed, and of keeping them in sync.
Also, imagine a process running on core 1 that gets preempted and, when it is scheduled again, ends up on core 2.
This would be pretty fatal if the caches weren't coherent, as there might be remnants of the process's data in core 1's cache that don't exist in core 2's cache. For systems working that way, the OS would have to enforce cache coherency as threads are scheduled - probably an "update all the memory in all the cores' caches" operation, or perhaps it could track dirty pages with the help of the MMU and only sync the memory pages that have changed. Either way, the hardware likely keeps the caches coherent in a more fine-grained and efficient way.

There are some nuances not covered by the great responses from the other authors.
First off, consider that a CPU doesn't deal with memory byte-by-byte, but with cache lines. A line might have 64 bytes. Now, if I allocate a 2-byte piece of memory at location P, and another CPU allocates an 8-byte piece of memory at location P + 8, and both P and P + 8 live on the same cache line, observe that without cache coherence the two CPUs can't concurrently update P and P + 8 without clobbering each other's changes! Because each CPU does a read-modify-write on the cache line, they might both write out a copy of the line that doesn't include the other CPU's changes. The last writer would win, and one of your modifications to memory would have "disappeared"!
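To make that layout concrete, here is a minimal C++ sketch (my own, not from the original answer; it assumes 64-byte cache lines and that the struct happens to sit within one line): two threads hammer adjacent fields that will typically share a single line. On coherent hardware both threads' final values survive, because the line is owned by one core at a time; the cost is only performance (false sharing), never a lost update.

// Sketch: two fields that (by assumption) share one 64-byte cache line.
#include <cstdint>
#include <iostream>
#include <thread>

struct Shared {
    std::uint16_t a;   // the 2-byte allocation at "P"
    char pad[6];
    std::uint64_t b;   // the 8-byte allocation at "P + 8", same line as a
};

int main() {
    Shared s{};
    // Each thread repeatedly stores into its own field only.
    std::thread t1([&] { for (int i = 0; i < 1000000; ++i) s.a = static_cast<std::uint16_t>(i); });
    std::thread t2([&] { for (int i = 0; i < 1000000; ++i) s.b = static_cast<std::uint64_t>(i); });
    t1.join();
    t2.join();
    // With cache coherence, each field holds its own thread's final value;
    // without it, a whole-line write-back could have clobbered the other field.
    std::cout << s.a << " " << s.b << "\n";
}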
The other thing to bear in mind is the distinction between coherency and consistency. Because even x86-derived CPUs use store buffers, there aren't the guarantees you might expect that instructions that have already finished have modified memory in such a way that other CPUs can see those modifications, even if the compiler has decided to write the value back to memory (maybe because of volatile?). Instead the modifications may be sitting around in store buffers. Pretty much all CPUs in general use are cache coherent, but very few CPUs have a consistency model that is as forgiving as the x86's. Check out, for example, http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html for more information on this topic.
Hope this helps, and BTW, I work at Corensic, a company that's building a concurrency debugger that you may want to check out. It helps pick up the pieces when assumptions about concurrency, coherence, and consistency prove unfounded :)

Imagine you do this:
lock(); //some synchronization primitive e.g. a semaphore/mutex
globalint = somevalue;
unlock();
If there were no cache coherence, that last unlock() would have to ensure that globalint is now visible everywhere; with cache coherence, all you need to do is write it to memory and let the hardware do the magic. A software solution would have to keep track of which memory exists in which caches, on which cores, and somehow make sure it's all atomically in sync.
You'd win an award if you could find a software solution that keeps track of all the pieces of memory in the caches that need to be kept in sync and is more efficient than the current hardware solutions.
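For concreteness, here is a hedged C++ rendering of the pseudocode above (names like global_int and publish are mine, not from the original): the mutex release plays the role of unlock(), issuing whatever barrier is needed, and cache coherence takes care of propagating the new value to the other cores.

// A concrete sketch of the lock/store/unlock pattern above, using C++11.
#include <mutex>

static std::mutex m;
static int global_int = 0;

void publish(int some_value) {
    std::lock_guard<std::mutex> guard(m); // lock()
    global_int = some_value;              // globalint = somevalue;
}                                         // unlock() on scope exit

int read_back() {
    std::lock_guard<std::mutex> guard(m); // any later reader that takes the
    return global_int;                    // lock is guaranteed to see the value
}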

Cache coherency becomes extremely important when multiple threads access the same variable. In that particular case, you have to ensure that all processors/cores see the same value if they access the variable at the same time; otherwise you'll get wonderfully non-deterministic behaviour.

It's not needed for locking. The locking code would include cache flushing if that was needed. It's mainly needed to ensure that concurrent updates by different processors to different variables in the same cache line aren't lost.

Cache coherency is implemented in hardware so that the programmer doesn't have to worry about making sure all threads see the latest value of a memory location in a multicore/multiprocessor environment. Cache coherence gives the abstraction that all cores/processors operate on a single unified cache, even though every core/processor has its own individual cache.
It also makes sure that legacy multi-threaded code works as-is on new processor models/multiprocessor systems, without any code changes to ensure data consistency.

Related

Does an x86 CPU reorder instructions?

I have read that some CPUs reorder instructions, but this is not a problem for single-threaded programs (the instructions would still be reordered, but it would appear as if they were executed in order); it is only a problem for multithreaded programs.
To solve the problem of instructions reordering, we can insert memory barriers in the appropriate places in the code.
But does an x86 CPU reorder instructions? If it does not, then there is no need to use memory barriers, right?
Reordering
Yes, all modern x86 chips from Intel and AMD aggressively reorder instructions across a window which is around 200 instructions deep on recent CPUs from both manufacturers (i.e. a new instruction may execute while an older instruction more than 200 instructions "in the past" is still waiting). This is generally all invisible to a single thread since the CPU still maintains the illusion of serial execution1 by the current thread by respecting dependencies, so from the point of view of the current thread of execution it is as-if the instructions were executed serially.
Memory Barriers
That should answer the titular question, but then your second question is about memory barriers. It contains, however, an incorrect assumption that instruction reordering necessarily causes (and is the only cause of) visible memory reordering. In fact, instruction reordering is neither sufficient nor necessary for cross-thread memory re-ordering.
Now it is definitely true that out-of-order execution is a primary driver of out-of-order memory access capabilities, or perhaps it is the quest for MLP (Memory Level Parallelism) that drives the increasingly powerful out-of-order abilities of modern CPUs. In fact, both are probably true at once: increasing out-of-order capabilities benefit a lot from strong memory reordering capabilities, and at the same time aggressive memory reordering and overlapping isn't possible without good out-of-order capabilities, so they reinforce each other in a sum-greater-than-its-parts kind of loop.
So yes, out-of-order execution and memory reordering certainly have a relationship; however, you can easily get re-ordering without out-of-order execution! For example, a core-local store buffer often causes apparent reordering: at the point of execution the store isn't written directly to the cache (and hence isn't visible at the coherency point), which delays local stores with respect to local loads which need to read their values at the point of execution.
As Peter also points out in the comment thread, you can also get a type of load-load reordering when loads are allowed to overlap in an in-order design: load 1 may start, but in the absence of an instruction consuming its result, a pipelined in-order design may proceed to the following instructions, which might include another load 2. If load 2 is a cache hit and load 1 was a cache miss, load 2 might be satisfied earlier in time than load 1, and hence the apparent order of the loads may be swapped.
So we see that not all cross-thread memory re-ordering is caused by instruction re-ordering - but surely instruction re-ordering implies out-of-order memory access, right? Not so fast! There are two different contexts here: what happens at the hardware level (i.e., whether memory access instructions can, as a practical matter, execute out-of-order), and what is guaranteed by the ISA and platform documentation (often called the memory model applicable to the hardware).
x86 re-ordering
In the case of x86, for example, modern chips will freely re-order more or less any stream of loads and stores with respect to each other: if a load or store is ready to execute, the CPU will usually attempt it, despite the existence of earlier uncompleted load and store operations.
At the same time, x86 defines quite a strict memory model, which bans most possible reorderings, roughly summarized as follows:
Stores have a single global order of visibility, observed consistently by all CPUs, subject to one loosening of this rule below.
Local load operations are never reordered with respect to other local load operations.
Local store operations are never reordered with respect to other local store operations (i.e., a store that appears earlier in the instruction stream always appears earlier in the global order).
Local load operations may be reordered with respect to earlier local store operations, such that the load appears to execute earlier, with respect to the global store order, than the local store; the reverse (an earlier load being reordered after a later store) is not allowed.
So actually most memory re-orderings are not allowed: loads with respect to each other, stores with respect to each other, and loads with respect to later stores. Yet I said above that x86 pretty much freely executes all memory access instructions out of order - how can you reconcile these two facts?
Well, x86 does a bunch of extra work to track exactly the original order of loads and stores, and makes sure no memory re-ordering that breaks the rules is ever visible. For example, let's say load 2 executes before load 1 (load 1 appears earlier in program order), but that both involved cache lines were in the "exclusively owned" state during the period that load 1 and load 2 executed: there has been reordering, but the local core knows that it cannot be observed, because no other core was able to peek into this local operation.
In concert with the above optimizations, CPUs also use speculative execution: execute everything out of order, even if it is possible that at some later point some core could observe the difference, but don't actually commit the instructions until such an observation is impossible. If such an observation does occur, the CPU rolls back to an earlier state and tries again. This is the cause of the "memory ordering machine clear" event on Intel.
So it is possible to define an ISA that doesn't allow any re-ordering at all, while the hardware under the covers does re-order but carefully checks that the re-ordering is never observed. PA-RISC is an example of such a sequentially consistent architecture. Intel defines a strong memory model that allows one type of reordering but disallows many others, and each chip internally may do more (or less) re-ordering as long as it can guarantee to play by the rules in an observable sense (in this way, it is somewhat related to the "as-if" rule that compilers play by when it comes to optimizations).
The upshot of all that is that yes, x86 requires memory barriers to prevent specifically the so-called StoreLoad re-ordering (for algorithms that require this guarantee). You don't find many standalone memory barriers in practice in x86, because most concurrent algorithms also need atomic operations, such as atomic add, test-and-set or compare-and-exchange, and on x86 those all come with full barriers for free. So the use of explicit memory barrier instructions like mfence is limited to cases where you aren't also doing an atomic read-modify-write operation.
Jeff Preshing's Memory Reordering Caught in the Act has an example that shows memory reordering happening on real x86 CPUs, and that mfence prevents it.
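Here is a sketch in the same spirit (not Preshing's actual code, just a reconstruction of the classic store-buffer litmus test using C++11 atomics): without the fences, x86 may produce r1 == 0 && r2 == 0, because each store is still sitting in its core's store buffer when the other core's load executes; with the seq_cst fences (compiled to mfence or a locked instruction on x86) that outcome is impossible.

// StoreLoad litmus test sketch: the full fences forbid r1 == 0 && r2 == 0.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier (mfence on x86)
    r1 = y.load(std::memory_order_relaxed);
}

void t2() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    assert(!(r1 == 0 && r2 == 0)); // cannot fire with the fences in place
}

Drop the two fences (or weaken them) and, run in a loop, the "impossible" r1 == r2 == 0 outcome shows up on real x86 hardware, which is exactly the reordering the article demonstrates.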
1 Of course, if you try hard enough, such reordering is visible! A high-impact recent example would be the Spectre and Meltdown exploits, which exploited speculative out-of-order execution and a cache side channel to violate memory-protection security boundaries.

Threads on same core accessing same cache line

I am working in a bare-metal environment and thus evaluating performance at a low-level. How should I expect two threads on the same core to perform when writing to different sections of the same cache line?
I am somewhat new to multicore/multithread architectures. I understand that when different cores write to the same cache line locks or atomic operations are required to ensure race conditions are avoided. At the same time sharing a cache line between cores also sets one up for performance issues such as false sharing.
However, do I need to worry about similar things when the two threads are on the same core? I'm unsure, seeing as they share the same cache and there are multiple load-store units. For example, say thread1 writes to section1 of the cache line at the same time that thread2 wants to write to section2 of the cache line. Does each thread just modify its own section of the cache line, or do they read the full line, modify their section, and write the full line back into the cache? If it's the latter, do I need to worry about race conditions or performance delays?
You are over-complicating this.
There are different layers of caches; the details depend very specifically on the CPU you are using - not just generically x86 or ARM, but which architecture version/generation. Typically, though, you have an L1 cache intimately connected to each individual core, and L2 is where the cores come together on the way to the shared memory/address space.
All a cache does, at whatever layer, is sit on the main memory (address-space) bus and watch things go by: if a transaction is tagged as cacheable, it examines its tags to see if there is a hit or miss and acts accordingly. The cache does not know, cannot know, and does not care who or what caused that transaction - which instruction it came from, which task/program/thread issued it, whether it is a prefetch, or whether it came from a DMA engine. It just sees a transaction like any other and follows the rules: pass it through if not cacheable; if cacheable, look up the tags and deal with the hit or miss.
So, if you have more than one core/CPU hitting a shared cache, and for some reason they happen to be accessing memory so close together that it is in the same cache line, then the cache simply reacts accordingly.
If you have two threads on the same CPU, then the whole "at the same time" notion doesn't really apply - and in fact it doesn't truly apply with separate cores either: the accesses might be one clock apart, but it is a shared bus, generally not dual-/multi-ported at this level. Regardless, the cache will act per its design: ignore and pass through a transaction marked as non-cacheable, or search for a hit if it is cacheable and act accordingly.

Guaranteed CPU cache update after a certain time

Let's say I have a variable var located somewhere in memory and that an arbitrary number of processors/threads could read and modify it at any given time. But it's guaranteed that at least n seconds will have elapsed between a processor modifying var and any other one reading var. Is there a value of n that guarantees that the processor reading var will read the updated value?
If your concern really is cache coherence, you should generally be safe1.
Specifically, however, you may not be.
Cache coherence is usually handled by the hardware2 without the help of the software.
However, this is very implementation specific: a NUMA system may be non-cache-coherent, a compute shader may need specific built-in functions, while IA-32e and ARM generally hide cache coherence from the programmer.
To answer your question directly: no, you have no guarantees whatsoever.
The point is that cache coherence is something you deal with in clustered and parallel non-uniform architectures.
While in these situations the programming model is inherently multi-threaded, the two concepts3 are separate, and what really should bug you is how to properly handle multi-threading, specifically synchronization and memory ordering.
Your question seems to suggest a simple case, where the readers are executed long after the writer is done.
If this property is really enforced, you don't need any synchronization or memory barriers. Beware, however, that sleep functions don't qualify as a valid enforcement.
If you instead need to synchronize (and so to order the memory accesses), then you need to use language-specific constructs, for example volatile in C# and Java, atomics in C and C++, or specific instructions in assembly.
You may need to implement Critical sections too.
If you actually need to manually control cache coherence for your architecture, then you have to check the specifications of interest (usually datasheets and formal papers), because there is no uniform way to deal with it; the compiler should provide some intrinsics, or the runtime should provide a library.
So, to add something to the direct answer above: no, you have no guarantees whatsoever, but when a usual CPU, in a usual architecture, needs that data, it will be able to use the most up-to-date value anyway. So you don't need to worry about that aspect.
Please note the deliberate use of hedging words like "usual" and "generally" above.
1 For example if you use an Intel/AMD/ARM CPU, don't even think about cache coherence.
2 Either the CPU itself, a local monitor, a system monitor or a specific device.
3 Multi-threading and cache-coherence.
The cache will tend to get flushed on operating system tick interrupts when it goes into the scheduler to see if there's a different task to run.
However, as operating systems get smarter with things such as tickless operation (NO_HZ) and as CPU core counts go up, this gets less and less likely, and you shouldn't count on it.
Supercomputer clusters may not task switch for minutes at a time, because they're using customized operating system code that never interrupts the running jobs. Compute jobs are assigned to cores 1-7 with no interrupts, and all of the other work runs on core 0.
There are two concepts mixed together in your question: software synchronization and hardware coherency. Hardware coherency has already been covered by Margaret, so I won't cover it here.
Software Synchronization
x86 guarantees that a quadword access is carried out atomically if it is aligned on a 64-bit boundary. But this only guarantees that another processor won't read a partial result (e.g. a weird [32-bit new][32-bit old] mixture). It does not guarantee a hard deadline before which another processor will see the newly assigned value. Making another thread wait for some amount of time is not an elegant solution, because the two threads would first need their starting times synchronized. So, if you need such a guarantee, you need a condition variable to make the other thread wait.
https://en.wikipedia.org/wiki/Monitor_(synchronization)
In a word, use a condition variable if you need a sequencing effect, and use locks/transactional memory, etc. to protect variables longer than a quadword or not 64-bit aligned.
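As a hedged illustration (the names writer, reader and ready are mine, not from the question), here is what the condition-variable approach looks like in C++; the reader never relies on elapsed time, only on the signal and the mutex-enforced ordering:

// Sketch: sequencing with a condition variable instead of "wait n seconds".
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool ready = false;
int var = 0;

void writer() {
    {
        std::lock_guard<std::mutex> lk(m);
        var = 42;      // the update we want others to observe
        ready = true;  // the "it is published" flag
    }
    cv.notify_one();
}

void reader() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready; }); // no timing assumptions needed
    std::cout << var << "\n";          // guaranteed to print 42
}

int main() {
    std::thread r(reader), w(writer);
    r.join(); w.join();
}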
Btw, here is some useful material on cache coherency if you are interested.
http://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/10_coherence.pdf

Memory barriers force cache coherency?

I was reading this question about using a bool for thread control and got intrigued by this answer by @eran:
Using volatile is enough only on single cores, where all threads use the same cache. On multi-cores, if stop() is called on one core and run() is executing on another, it might take some time for the CPU caches to synchronize, which means two cores might see two different views of isRunning_.
If you use synchronization mechanisms, they will ensure all caches get the same values, in the price of stalling the program for a while. Whether performance or correctness is more important to you depends on your actual needs.
I have spent over an hour searching for some statement that says synchronization primitives force cache coherency but have failed. The closest I have come is Wikipedia:
The keyword volatile does not guarantee a memory barrier to enforce cache-consistency.
Which suggests that memory barriers do force cache consistency, and since some synchronization primitives are implemented using memory barriers (again from Wikipedia) this is some "evidence".
But I don't know enough to be certain whether to believe this or not, and be sure that I'm not misinterpreting it.
Can someone please clarify this?
Short answer: Cache coherency works most of the time, but not always. You can still read stale data. If you don't want to take chances, then just use a memory barrier.
Long answer: The CPU core is no longer directly connected to the main memory. All loads and stores have to go through the cache. The fact that each CPU has its own private cache causes new problems. If more than one CPU is accessing the same memory, it must still be ensured that both processors see the same memory content at all times. If a cache line is dirty on one processor (i.e., it has not been written back yet to main memory) and a second processor tries to read the same memory location, the read operation cannot just go out to main memory. Instead, the content of the first processor's cache line is needed. The question now is when this cache-line transfer has to happen. That question is pretty easy to answer: when one processor needs a cache line which is dirty in another processor's cache for reading or writing. But how can a processor determine whether a cache line is dirty in another processor's cache? Assuming it is, just because the cache line is loaded by another processor, would be suboptimal (at best). Usually the majority of memory accesses are read accesses and the resulting cache lines are not dirty. This is where cache coherency protocols come in: CPUs maintain data consistency across their caches via MESI or some other cache coherence protocol.
With cache coherency in place, shouldn't we always see the latest value of a cache line, even if it was modified by another CPU? After all, that is the whole purpose of cache coherency protocols. Usually, when a cache line is modified, the corresponding CPU sends an "invalidate cache line" request to all other CPUs. It turns out that CPUs can acknowledge the invalidate requests immediately but defer the actual invalidation of the cache line to a later point in time. This is done via invalidation queues. Now, if we get unlucky enough to read the cache line within this short window (between the CPU acknowledging an invalidation request and actually invalidating the cache line), then we can read a stale value. Why would a CPU do such a horrible thing? The simple answer is PERFORMANCE. So let's look at different scenarios where invalidation queues can improve performance.
Scenario 1: CPU1 receives an invalidation request from CPU2. CPU1 also has a lot of stores and loads queued up for its cache. This means that the invalidation of the requested cache line takes time, and CPU2 would get stalled waiting for the acknowledgment.
Scenario 2: CPU1 receives a lot of invalidation requests in a short amount of time. Now it takes CPU1 time to invalidate all of those cache lines.
Placing an entry into the invalidate queue is essentially a promise by the CPU to process that entry before transmitting any MESI protocol messages regarding that cache line. So invalidation queues are the reason why we may not see the latest value even when doing a simple read of a single variable.
Now the keen reader might be thinking: when the CPU wants to read a cache line, it could scan the invalidation queue first before reading from the cache, which should avoid the problem. However, the CPU and the invalidation queue are physically placed on opposite sides of the cache, which prevents the CPU from directly accessing the invalidation queue. (The invalidation queues of one CPU's cache are populated by cache coherency messages from other CPUs via the system bus, so it makes sense for the invalidation queues to be placed between the cache and the system bus.) So in order to actually see the latest value of a shared variable, we should empty the invalidation queue, and a read memory barrier usually does exactly that.
I just talked about invalidation queues and read memory barriers. [1] is a good reference for understanding the need for read and write memory barriers and the details of the MESI cache coherency protocol.
[1] http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf
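To tie this back to code, here is a small sketch (my own example, not taken from [1]) of the message-passing pattern those barriers exist for: the release store supplies the write barrier on the producer side, and the acquire load supplies the read barrier that - on architectures with invalidation queues - drains the queue before the data is read.

// Sketch: release/acquire pairing guarantees the consumer sees fresh data.
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> flag{false};

void producer() {
    data = 42;
    flag.store(true, std::memory_order_release); // write barrier before publishing
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) { /* spin */ } // read barrier
    assert(data == 42); // cannot observe a stale value of data
}

int main() {
    std::thread c(consumer), p(producer);
    c.join(); p.join();
}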
As I understand it, synchronization primitives won't affect cache coherency at all. Cache comes from the French for "hidden": it's not supposed to be visible to the user. A cache coherency protocol should work without the programmer's involvement.
Synchronization primitives will affect the memory ordering, which is well defined and visible to the user through the processor's ISA.
A good source with detailed information is A Primer on Memory Consistency and Cache Coherence from the Synthesis Lectures on Computer Architecture collection.
EDIT: To clarify your doubt
The Wikipedia statement is slightly wrong. I think the confusion might come from the terms memory consistency and cache coherency. They don't mean the same thing.
The volatile keyword in C means that the variable is always read from memory (as opposed to a register) and that the compiler won't reorder loads/stores around it. It doesn't mean the hardware won't reorder the loads/stores. This is a memory consistency problem. When using weaker consistency models the programmer is required to use synchronization primitives to enforce a specific ordering. This is not the same as cache coherency. For example, if thread 1 modifies location A, then after this event thread 2 loads location A, it will receive an updated (consistent) value. This should happen automatically if cache coherency is used. Memory ordering is a different problem. You can check out the famous paper Shared Memory Consistency Models: A Tutorial for more information. One of the better known examples is Dekker's Algorithm which requires sequential consistency or synchronization primitives.
EDIT2: I would like to clarify one thing. While my cache coherency example is correct, there is a situation where memory consistency might seem to overlap with it. This is when stores are executed in the processor but delayed on their way to the cache (they sit in a store queue/buffer). Since the processor's own cache hasn't received the updated value, the other caches won't either. This may seem like a cache coherency problem, but in reality it is not; it is actually part of the memory consistency model of the ISA. In this case synchronization primitives can be used to flush the store queue to the cache. With this in mind, the Wikipedia text that you highlighted in bold is correct, but this other part is still slightly wrong: The keyword volatile does not guarantee a memory barrier to enforce cache-consistency. It should say: The keyword volatile does not guarantee a memory barrier to enforce memory consistency.
What Wikipedia tells you is that volatile does not mean that a memory barrier will be inserted to enforce cache consistency. A proper memory barrier will, however, enforce that memory accesses between multiple CPU cores are consistent; you may find reading the std::memory_order documentation helpful.

Critical sections with multicore processors

With a single-core processor, where all your threads are run from the one single CPU, the idea of implementing a critical section using an atomic test-and-set operation on some mutex (or semaphore or etc) in memory seems straightforward enough; because your processor is executing a test-and-set from one spot in your program, it necessarily can't be doing one from another spot in your program disguised as some other thread.
But what happens when you do actually have more than one physical processor? It seems that simple instruction-level atomicity wouldn't be sufficient, because with two processors potentially executing their test-and-set operations at the same time, what you really need to maintain atomicity on is access to the shared memory location of the mutex. (And if the shared memory location is loaded into cache, there's the whole cache consistency thing to deal with, too..)
This seems like it would incur far more overhead than the single core case, so here's the meat of the question: How much worse is it? Is it worse? Do we just live with it? Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
Multi-core/SMP systems are not just several CPUs glued together. There's explicit support for doing things in parallel. All the synchronization primitives are implemented with the help of hardware, along the lines of an atomic CAS. The instruction either locks the bus shared by the CPUs and the memory controller (and devices that do DMA) and updates memory, or just updates memory relying on cache snooping. This in turn causes the cache coherency algorithm to kick in, forcing all involved parties to flush their caches. Disclaimer - this is a very basic description; there are more interesting things here, like virtual vs. physical caches, cache write-back policies, memory models, fences, etc.
If you want to know more about how OS might use these hardware facilities - here's an excellent book on the subject.
The vendors of multi-core CPUs have to take care that the different cores coordinate with each other when executing instructions which guarantee atomic memory access.
On Intel chips, for instance, you have the 'cmpxchg' instruction. It compares the value stored at a memory location to an expected value and exchanges it for the new value if the two match. If you precede it with the 'lock' prefix, it is guaranteed to be atomic with respect to all cores.
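As a rough sketch (not production code), this is what a minimal spinlock built on compare-and-exchange looks like in C++; on x86 the compare_exchange call compiles down to a lock cmpxchg:

// Tiny spinlock sketch on top of an atomic compare-and-exchange.
#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        // Retry until we atomically flip false -> true (lock cmpxchg on x86).
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire)) {
            expected = false; // compare_exchange overwrote it with the current value
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};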
You would need a test-and-set that forces the processor to notify all the other cores of the operation so that they are aware. Yes, that introduces an overhead and you have to live with it. It's a reason to design multithreaded applications in such a way that they don't wait for synchronization primitives too often.
Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
That would defeat the whole point of multithreading. When you use a lock, semaphore, or other synchronization technique, you are relying on the OS to make sure these operations are interlocked, no matter how many cores you are using.
The time to switch to a different thread after a lock has been released is mostly determined by the cost of a context switch. This SO thread deals with the context switching overhead, so you might want to check that.
There are some other interesting threads also:
What are the differences between various threading synchronization options in C#?
Threading best practices
You should read this MSDN article also: Understanding the Impact of Low-Lock Techniques in Multithreaded Apps.
Memory accesses are handled by the memory controller, which should take care of multi-core issues, i.e. it shouldn't allow simultaneous access to the same addresses (probably handled on either a memory-page or memory-line basis). So you can use a flag to indicate whether another processor is updating the memory contents of some block (to avoid a type of dirty read where part of the record is updated, but not all of it).
A more elegant solution is to use a HW semaphore block if the processor has such a feature. A HW semaphore is a simple queue which could be of size no_of_cores -1. This is how it is in TI's 6487/8 processor. You can either query the semaphore directly (and loop until it is released) or do an indirect query which will result in an interrupt once your core gets the resource. The requests are queued and served in the order they were made. A semaphore query is an atomic operation.
Cache consistency is another issue and you might need to do cache writebacks and refreshes in some cases. But this is a very cache implementation specific thing. With 6487/8 we needed to do that on a few operations.
Well, depending on what type of computers you have laying around the house, do the following: Write a simple multithreaded application. Run this application on a single core (Pentium 4 or Core Solo) and then run it on a multicore processor (Core 2 Duo or similar) and see how big the speed up is.
Granted these are unfair comparisons since Pentium 4 and Core Solo are much slower regardless of cores than a Core 2 Duo. Maybe compare between a Core 2 Duo and a Core 2 Quad with an application that can use 4 or more threads.
You raise a number of valid points. Multiple processors introduce a lot of headaches and overhead. However, we just have to live with them, because the speed boost of parallelism can far outweigh them, if the critical sections are made long enough.
As for your final suggestion about having all threads on the same physical core, that completely defeats the point of a multi-core computer!
