I'm trying to understand ccNUMA systems, but I'm a little confused about how OpenMP's scheduling can hurt performance. Let's say we have the code below. What happens if c1 is smaller or bigger than c0? I understand the general idea that different chunk sizes lead to remote accesses, but I read somewhere that for small chunk sizes something happens with cache lines, and I got really confused.
#pragma omp parallel for schedule(static,c0)
for(int i=0;i<N;i++)
A[i]=0;
#pragma omp parallel for schedule(static,c1)
for(int i=0;i<N;i++)
B[i]=A[i]*i;
When A[] has been allocated using malloc, the OS has only promised that you will get the memory the pointer points to. No actual memory allocation has been performed yet; that is, the physical memory pages have not been assigned. That happens when you execute the first parallel region and touch the data for the first time (see also "first-touch policy"). When the first access happens, the OS creates the physical page on the same NUMA domain as the thread that touches it.
So, depending on how you choose c0, you get a certain distribution of the memory pages across the system. With a bit of math you can determine which value of c0 leads to which distribution of the memory pages.
In the second loop, you're using a c1 that is potentially different from c0. For certain values of c1 (especially c1 equal to c0) you should see almost no NUMA traffic on the system, while for others you'll see a lot. Again, it's simple to determine those values mathematically.
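To make "determine those values mathematically" concrete, here is a minimal sketch (the N, c0, c1 and thread count are hypothetical; it also assumes threads are bound to the same cores/NUMA domains in both parallel regions, e.g. via OMP_PROC_BIND=true). It computes which thread owns each iteration under schedule(static, chunk) and counts how many elements of A[] end up being read by a different thread than the one that first-touched them:
#include <stdio.h>

/* For schedule(static, chunk), chunk number i / chunk is handed out round-robin,
   so iteration i belongs to thread (i / chunk) % nthreads. */
static int owner(long i, long chunk, int nthreads)
{
    return (int)((i / chunk) % nthreads);
}

int main(void)
{
    const long N = 1000000, c0 = 1000, c1 = 4000;   /* hypothetical values */
    const int nthreads = 8;                         /* assumed thread count */
    long remote = 0;

    for (long i = 0; i < N; i++)
        if (owner(i, c0, nthreads) != owner(i, c1, nthreads))
            remote++;   /* first-touched by one thread, read by another */

    printf("%ld of %ld elements are read by a different thread than the one "
           "that first-touched them\n", remote, N);
    return 0;
}
Every such element is a candidate for a remote (cross-NUMA-domain) access in the second loop; with c1 == c0 the count is zero. Note this is an upper-bound approximation: page placement actually happens at page granularity, and threads that live on the same NUMA domain don't cause remote traffic.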
Another thing you might be facing is false sharing. If c0 and c1 are chosen such that the data processed by a chunk is less than the size of a cache line, you'll see that a cache line is shared across multiple threads and thus is bouncing between the different caches of the system.
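And a minimal sketch of the false-sharing case (not from the original question; it assumes 64-byte cache lines, 8-byte doubles, and a 64-byte-aligned A): with schedule(static,1), neighbouring iterations go to different threads but write into the same cache line, while a chunk of at least 8 doubles keeps each line in the hands of a single thread.
#include <omp.h>

#define N 1000000

static double A[N];   /* assume A is 64-byte aligned for the comments below to hold exactly */

void touch_with_false_sharing(void)
{
    /* chunk size 1: iterations i and i+1 run on different threads, but
       A[i] and A[i+1] (8 bytes each) sit in the same 64-byte cache line,
       so the line ping-pongs between the cores' caches. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; i++)
        A[i] = 0.0;
}

void touch_without_false_sharing(void)
{
    /* chunk of 8 doubles = 64 bytes = one full cache line per chunk,
       so no line is written by two different threads. */
    #pragma omp parallel for schedule(static, 8)
    for (int i = 0; i < N; i++)
        A[i] = 0.0;
}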
Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed to execute out of order, it only does so when it doesn't change the result. If it were allowed to do that, virtually every sequence of instructions could fail; it would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU itself can see, it is not re-ordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, the store may not have been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side-effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. a single-core machine or green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure the store is globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 ISAs that had cache-control instructions designed in from the start do require the OS to be aware of them, i.e. to make sure the cache is invalidated before reading in new contents from disk, and to make sure data is at least written back to somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
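If you want to see that StoreLoad reordering for yourself, here is a hypothetical C11 litmus-test sketch (not part of the original answer; it uses pthreads and relaxed atomics). If both r1 and r2 ever read 0 in the same run, a store was delayed past a later load. In practice the thread-creation overhead makes the window small, so you may need many iterations or a version with the threads already running:
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Store-buffering litmus test: each thread stores to one variable and then
   loads the other. Seeing r1 == 0 && r2 == 0 is only possible if at least
   one store was reordered after the later load (StoreLoad reordering).
   With memory_order_seq_cst on every access, that outcome is forbidden. */
static atomic_int x, y;
static int r1, r2;

static void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            printf("StoreLoad reordering observed on iteration %d\n", i);
    }
    return 0;
}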
so what makes sure the program order is preserved?
Hardware dependency tracking: loads snoop the store buffer for older stores to the location they are loading from. This makes sure loads take data from the last program-order write to any given memory location¹.
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-data part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...
From Understanding the Linux Kernel, 3rd edition, chapter 8.2.10, "Slab coloring":
We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this chapter, we have also seen that objects of the same size end up being stored at the same offset within a cache. Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations, while other cache lines go underutilized. The slab allocator tries to reduce this unpleasant cache behavior by a policy called slab coloring: different arbitrary values called colors are assigned to the slabs.
(1) I am unable to understand the issue that slab coloring tries to solve. When a normal process accesses data, if it is not in the cache and a cache miss occurs, the data is fetched into the cache along with data from the surrounding addresses to boost performance. How can a situation occur in which the same specific cache lines keep getting swapped? The probability that a process keeps accessing two different data addresses at the same offset within two different memory areas is very low. And even if it does happen, cache policies usually choose which lines to evict according to some scheme such as LRU, random, etc. No policy exists that evicts lines based on a match in the least significant bits of the addresses being accessed.
(2) I am unable to understand how slab coloring, which moves free bytes from the end of the slab to the beginning so that different slabs have different offsets for their first objects, solves this cache-swapping issue.
[SOLVED] After a small investigation I believe I found an answer to my question. The answer has been posted.
After more studying and thinking, I have arrived at an explanation that seems more reasonable and does not rely only on specific address examples.
First, you need some basic knowledge of caches: tags, sets, and line allocation.
From the Linux kernel code it is clear that colour_off is measured in units of cache_line_size: colour_off is the basic offset unit, and colour is the number of such units, both stored in struct kmem_cache.
int __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
    ...
    cachep->align = ralign;
    cachep->colour_off = cache_line_size(); // colour_off's unit is cache_line_size
    /* Offset must be a multiple of the alignment. */
    if (cachep->colour_off < cachep->align)
        cachep->colour_off = cachep->align;
    ...
    err = setup_cpu_cache(cachep, gfp);
https://elixir.bootlin.com/linux/v4.6/source/mm/slab.c#L2056
So we can analyse it in two cases.
The first case is cache size > slab size.
Here slab 1, slab 2, slab 3, ... mostly cannot collide with each other, because the cache is big enough; only slabs far enough apart to wrap around the cache (say slab 1 vs slab 5) can collide. So in this case the benefit of the colouring mechanism is less obvious. I will skip explaining the slab 1 vs slab 5 case here; I am sure you will work it out after reading the following.
The second case is slab size > cache size.
Picture each slab drawn one cache line (one colour_off) per row. Clearly, slab 1 and slab 2 cannot collide on the lines at their boundary, and likewise slab 2 and slab 3.
So the colouring mechanism de-conflicts two lines between each pair of adjacent slabs, and more between slabs further apart: slab 1 vs slab 3 gains 2 + 2 = 4 lines, as you can count.
To summarize, the colouring mechanism improves cache performance (in detail, it only de-conflicts the few colour_off-sized lines at the beginning and end of each slab; the other lines can still collide) by putting otherwise unused memory to work as much as it can.
I think I got it, the answer is related to Associativity.
A cache is divided into sets, and each set can only hold certain memory blocks. For example, with 8 sets, set 0 holds blocks whose block address modulo 8 is 0, set 1 those whose block address modulo 8 is 1, and so on. The reason for this is to boost cache performance and avoid having to search the whole cache for every address; only one set of the cache needs to be searched.
Now, from the link Understanding CPU Caching and performance
From page 377 of Hennessy and Patterson, the cache placement formula is as follows:
(Block address) MOD (Number of sets in cache)
Let's take memory block address 0x10000008 (from slab X with color C) and memory block address 0x20000009 (from slab Y with color Z). For most values of N (the number of sets in the cache), <address> MOD N yields a different value for the two, hence a different set in which to cache the data. If the addresses had the same least-significant bits (for example 0x10000008 and 0x20000008), then for most values of N the calculation would yield the same value, and the blocks would collide in the same cache set.
So, by keeping a different offset (color) for the objects in different slabs, objects in different slabs will tend to map to different cache sets rather than colliding in the same one, and overall cache performance is improved.
EDIT: Furthermore, if the cache is direct-mapped, then according to the Wikipedia article on CPU caches, no cache replacement policy exists and the modulo calculation yields the cache block in which the memory block will be stored:
Direct-mapped cache
In this cache organization, each location in main memory can go in only one entry in the cache. Therefore, a direct-mapped cache can also be called a "one-way set associative" cache. It does not have a replacement policy as such, since there is no choice of which cache entry's contents to evict. This means that if two locations map to the same entry, they may continually knock each other out. Although simpler, a direct-mapped cache needs to be much larger than an associative one to give comparable performance, and it is more unpredictable. Let x be the block number in the cache, y be the block number of memory, and n be the number of blocks in the cache; then mapping is done with the help of the equation x = y mod n.
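To see the modulo argument with numbers, here is a small sketch with hypothetical cache parameters (64-byte lines, 512 sets) and hypothetical slab addresses: two objects at the same offset in set-aligned slabs land in the same set, while giving the second slab a one-line color offset moves it to a different set.
#include <stdio.h>

#define LINE_SIZE 64UL    /* hypothetical cache line size */
#define NUM_SETS  512UL   /* hypothetical number of sets  */

/* (Block address) MOD (Number of sets in cache) */
static unsigned long set_index(unsigned long addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}

int main(void)
{
    unsigned long slab_a = 0x100000, slab_b = 0x200000;  /* hypothetical slab start addresses */
    unsigned long obj_off = 0x80;                        /* same object offset in both slabs  */

    printf("no coloring:   set %lu vs set %lu\n",
           set_index(slab_a + obj_off),
           set_index(slab_b + obj_off));                 /* same set -> the objects collide   */

    unsigned long color_b = 1 * LINE_SIZE;               /* give slab B color 1               */
    printf("with coloring: set %lu vs set %lu\n",
           set_index(slab_a + obj_off),
           set_index(slab_b + color_b + obj_off));       /* different sets -> no collision    */
    return 0;
}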
Say you have a 256 KB cache and it uses a super-simple algorithm where it does cache line = (real address AND 0x3FFFF).
Now if you have slabs starting on each megabyte boundary then item 20 in Slab 1 will kick Item 20 of Slab 2 out of cache because they use the same cache line tag.
By offsetting the slabs it becomes less likely that different slabs will share the same cache line tag. If Slab 1 and Slab 2 both hold 32 byte objects and Slab 2 is offset 8 bytes, its cache tags will never be exactly equal to Slab 1's.
I'm sure I have some details wrong, but take it for what it's worth.
I was asked this question on an exam. We have two CPUs, or two cores in the same CPU, that share a common cache (for example, L3). On each CPU there is an MPI process (or a thread of one common process). How can we ensure that these two processes don't interfere, meaning that they don't push each other's entries out, or that they each use half of the cache, or something similar? The goal is to improve the speed of memory access here.
The OS is some sort of Unix, if that is important.
Based on your comments, it seems that a "textbook answer" is expected, so I would suggest partitioning the cache between the processes. This way you guarantee that they don't compete over the same cache sets and thrash each other. This is assuming you don't want to actually share anything between the 2 processes, in which case this approach would fail (although a possible fix would be to split the cache space in 3 - one range for each process, and one for shared data).
Since you're probably not expected to redesign the cache and provide a HW partitioning scheme (unless the question comes in the scope of a computer architecture course), the simplest way to achieve this is simply by inspecting the cache size and associativity, figuring out the number of sets, and aligning the data sets of each process/thread to a different part.
For example, if your shared cache is 2MB big, and has 16 ways and 64B lines, you would have 2k sets. In such a case, each process would want to confine its physical addresses (assuming the cache is physically mapped) to a different half of the sets (1k sets each), i.e. a different 0x10000 out of each 0x20000. In other words, P0 would be free to use any physical address with bit 16 equal to 0, and P1 would use the addresses with bit 16 equal to 1.
Note, that since that exceeds the size of a basic 4k page (alignment of 0x1000), you would either need to hack your OS to assign your pages to the appropriate physical addresses for each process, or simply use larger pages (2M would be enough).
Also note that by keeping a contiguous 0x10000 per allocation, we still enjoy spatial locality and efficient hardware prefetching (otherwise you could simply pick any other split, even an even/odd set split using bit 6, but that would leave your data fractured).
The last issue is data sets larger than this 0x10000 quota - to make them align you'd simply have to break them into chunks of up to 0x10000 and align each separately. There's also the issue of code/stack/pagemap and other types of OS/system data which you have less control over (actually code can also be aligned, or more likely in this case - shared) - I'm assuming this has negligible impact on thrashing.
Again - this attempts to answer without knowing what system you work with, what you need to achieve, or even what is the context of the course. With more context we can probably focus this to a simpler solution.
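As a sanity check of the arithmetic above, here is a small sketch for the hypothetical 2 MB / 16-way / 64 B cache: the set index covers address bits [16:6], so bit 16 decides which half of the 2k sets an address falls into, and partitioning just means keeping each process's (physical) data on one side of that bit:
#include <stdio.h>

#define LINE_SIZE   64UL
#define NUM_WAYS    16UL
#define CACHE_SIZE  (2UL * 1024 * 1024)
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))   /* 2048 sets */

static unsigned long set_of(unsigned long paddr)
{
    return (paddr / LINE_SIZE) % NUM_SETS;    /* uses address bits [16:6] */
}

static int partition_of(unsigned long paddr)
{
    return (int)((paddr >> 16) & 1);          /* top set-index bit: 0 -> P0, 1 -> P1 */
}

int main(void)
{
    unsigned long a = 0x100040;               /* bit 16 clear -> P0's half of the sets */
    unsigned long b = 0x110040;               /* bit 16 set   -> P1's half of the sets */
    printf("addr 0x%lx: set %lu, partition P%d\n", a, set_of(a), partition_of(a));
    printf("addr 0x%lx: set %lu, partition P%d\n", b, set_of(b), partition_of(b));
    return 0;
}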
How large is a way in the cache?
For example, if you have a cache where each way is 128KiB in size, you partition your memory in such a way that for each address modulo 128KiB, process A uses the 0-64KiB region and process B uses the upper 64KiB-128KiB region. (This assumes a private L1 per core.)
If your physical page size is 4KiB (and your CPU uses physical addresses for caching, not virtual - which does occur on some CPUs), you can make this much nicer. Let's say you're mapping the same amount of memory into virtual address space for each core - 16KiB. Pages 0, 2, 4, 6 go to process A's memory map, and pages 1, 3, 5, 7 go to process B's memory map. As long as you only address memory in that carefully laid out region, the caches should never fight. Of course, you've effectively halved the size of your cache-ways by doing so, but you have multiple ways...
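A small sketch of the page-interleaving arithmetic under those assumptions (4 KiB pages, 128 KiB per way, so 32 page-sized "colors" per way): the even/odd split of physical page numbers is exactly the 0, 2, 4, 6 vs 1, 3, 5, 7 layout described above and gives each process a disjoint half of the sets.
#include <stdio.h>

#define PAGE_SIZE   4096UL
#define WAY_SIZE    (128UL * 1024)              /* assumed per-way cache footprint    */
#define PAGE_COLORS (WAY_SIZE / PAGE_SIZE)      /* 32 page-sized "colors" per way     */

/* Which process a physical page is reserved for under an even/odd color split. */
static int owner_of_page(unsigned long pfn)
{
    unsigned long color = pfn % PAGE_COLORS;    /* position of the page within a way  */
    return (int)(color % 2);                    /* even colors -> A (0), odd -> B (1) */
}

int main(void)
{
    for (unsigned long pfn = 0; pfn < 8; pfn++)
        printf("physical page %lu -> process %c\n",
               pfn, owner_of_page(pfn) ? 'B' : 'A');
    return 0;
}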
You'll want to use a lock for this kind of multi-threaded programming. It's hard to provide an example without knowing your specific situation.
When one process has access, lock all other processes out until the 'accessing' process is finished with the resource.
Hyper-Threading Technology is a form of simultaneous multithreading technology introduced by Intel.
These resources include the execution engine, caches, and system bus interface; the sharing of resources allows two logical processors to work with each other more efficiently, and allows a stalled logical processor to borrow resources from the other one.
In an Intel CPU with Hyper-Threading, one CPU core (with several ALUs) can execute instructions from two threads in the same clock cycle, and the two threads share the store buffer, the L1/L2 caches, and the system bus.
But if two threads execute simultaneously on one core, and thread 1 stores an atomic value while thread 2 loads this value, what will be used for the exchange: the shared store buffer, the shared L1/L2 cache, or the L3 cache as usual?
And what happens if both threads belong to the same process (the same virtual address space), versus two different processes (different virtual address spaces)?
Sandy Bridge Intel CPU - L1 cache:
32 KB - cache size
64 B - cache line size
512 - lines (512 = 32 KB / 64 B)
8-way
64 - number of sets (64 = 512 lines / 8 ways)
bits [11:6] of the virtual address - the index, which selects the set
4 KB - addresses that are equal modulo 4 KB compete for the same set (32 KB / 8 ways)
low 12 bits - significant for determining the current set number
4 KB - standard page size
low 12 bits - the same in virtual and physical addresses, for every address
I think you'll get a round-trip to L1. (Not the same thing as store->load forwarding within a single thread, which is even faster than that.)
Intel's optimization manual says that store and load buffers are statically partitioned between threads, which tells us a lot about how this will work. I haven't tested most of this, so please let me know if my predictions aren't matching up with experiment.
Update: See this Q&A for some experimental testing of throughput and latency.
A store has to retire in the writing thread, and then commit to L1 from the store buffer/queue some time after that. At that point it will be visible to the other thread, and a load to that address from either thread should hit in L1. Before that, the other thread should get an L1 hit with the old data, and the storing thread should get the stored data via store->load forwarding.
Store data enters the store buffer when the store uop executes, but it can't commit to L1 until it's known to be non-speculative, i.e. it retires. But the store buffer also de-couples retirement from the ROB (the ReOrder Buffer in the out-of-order core) vs. commitment to L1, which is great for stores that miss in cache. The out-of-order core can keep working until the store buffer fills up.
Two threads running on the same core with hyperthreading can see StoreLoad re-ordering if they don't use memory fences, because store-forwarding doesn't happen between threads. Jeff Preshing's Memory Reordering Caught in the Act code could be used to test for it in practice, using CPU affinity to run the threads on different logical CPUs of the same physical core.
An atomic read-modify-write operation has to make its store globally visible (commit to L1) as part of its execution, otherwise it wouldn't be atomic. As long as the data doesn't cross a boundary between cache lines, it can just lock that cache line. (AFAIK this is how CPUs typically implement atomic RMW operations like lock add [mem], 1 or lock cmpxchg [mem], rax.)
Either way, once it's done the data will be hot in the core's L1 cache, where either thread can get a cache hit from loading it.
I suspect that two hyperthreads doing atomic increments to a shared counter (or any other locked operation, like xchg [mem], eax) would achieve about the same throughput as a single thread. This is much higher than for two threads running on separate physical cores, where the cache line has to bounce between the L1 caches of the two cores (via L3).
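A hypothetical micro-benchmark sketch of that shared-counter case (none of this is from the original answer): it pins two threads to chosen logical CPUs with pthread affinity (Linux, GCC/Clang builtins) and hammers a shared counter with lock-prefixed increments. Run it once with the two logical CPUs of one physical core and once with CPUs on different cores, and compare the wall-clock time. Which CPU numbers are hyperthread siblings depends on your machine, so check lscpu or /proc/cpuinfo first:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define ITERS 10000000L

static volatile long counter;

static void *bump(void *arg)
{
    int cpu = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  /* pin to one logical CPU */

    for (long i = 0; i < ITERS; i++)
        __sync_fetch_and_add(&counter, 1);      /* compiles to a lock-prefixed add/xadd */
    return NULL;
}

int main(void)
{
    int cpus[2] = {0, 4};   /* assumption: set these to sibling or non-sibling logical CPUs */
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, bump, &cpus[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}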
movNT (Non-Temporal) weakly-ordered stores bypass the cache, and put their data into a line-fill buffer. They also evict the line from L1 if it was hot in cache to start with. They probably have to retire before the data goes into a fill buffer, so a load from the other thread probably won't see it at all until it enters a fill-buffer. Then probably it's the same as an movnt store followed by a load inside a single thread. (i.e. a round-trip to DRAM, a few hundred cycles of latency). Don't use NT stores for a small piece of data you expect another thread to read right away.
L1 hits are possible because of the way Intel CPUs share the L1 cache. Intel uses virtually indexed, physically tagged (VIPT) L1 caches in most (all?) of their designs. (e.g. the Sandybridge family.) But since the index bits (which select a set of 8 tags) are below the page-offset, it behaves exactly like a PIPT cache (think of it as translation of the low 12 bits being a no-op), but with the speed advantage of a VIPT cache: it can fetch the tags from a set in parallel with the TLB lookup to translate the upper bits. See the "L1 also uses speed tricks that wouldn't work if it was larger" paragraph in this answer.
Since L1d cache behaves like PIPT, and the same physical address really means the same memory, it doesn't matter whether it's 2 threads of the same process with the same virtual address for a cache line, or whether it's two separate processes mapping a block of shared memory to different addresses in each process. This is why L1d can be (and is) competitively shared by both hyperthreads without risk of false-positive cache hits. Unlike the dTLB, which needs to tag its entries with a core ID.
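The "index bits below the page offset" claim is easy to check with the numbers from the question (32 KiB, 8 ways, 64 B lines, 4 KiB pages); a small sketch of the arithmetic:
#include <assert.h>
#include <stdio.h>

#define CACHE_SIZE (32UL * 1024)
#define NUM_WAYS   8UL
#define LINE_SIZE  64UL
#define PAGE_SIZE  4096UL

int main(void)
{
    unsigned long num_sets   = CACHE_SIZE / (NUM_WAYS * LINE_SIZE);  /* 64 sets */
    unsigned long index_span = num_sets * LINE_SIZE;                 /* bytes covered by offset + index bits */

    /* 64 sets * 64 B lines = 4096 B = exactly one page: the index bits [11:6]
       sit entirely inside the page offset, so translating the upper bits
       cannot change which set is selected -- the VIPT cache behaves like PIPT. */
    assert(index_span <= PAGE_SIZE);
    printf("sets: %lu, offset+index span: %lu bytes, page: %lu bytes\n",
           num_sets, index_span, PAGE_SIZE);
    return 0;
}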
A previous version of this answer had a paragraph here based on the incorrect idea that Skylake had reduced L1 associativity. It's Skylake's L2 that's 4-way, vs. 8-way in Broadwell and earlier. Still, the discussion on a more recent answer might be of interest.
Intel's x86 manual vol3, chapter 11.5.6 documents that Netburst (P4) has an option to not work this way. The default is "Adaptive mode", which lets logical processors within a core share data.
There is a "shared mode":
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the logical processors use identical CR3 registers and paging modes. In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst microarchitecture that support Intel Hyper-Threading Technology.
It doesn't say anything about this for hyperthreading in Nehalem / SnB uarches, so I assume they didn't include "slow mode" support when they introduced HT support in another uarch, since they knew they'd gotten "fast mode" to work correctly in netburst. I kinda wonder if this mode bit only existed in case they discovered a bug and had to disable it with microcode updates.
The rest of this answer only addresses the normal setting for P4, which I'm pretty sure is also the way Nehalem and SnB-family CPUs work.
It would be possible in theory to build an OoO SMT CPU core that made stores from one thread visible to the other as soon as they retired, but before they leave the store buffer and commit to L1d (i.e. before they become globally visible). This is not how Intel's designs work, since they statically partition the store queue instead of competitively sharing it.
Even if the threads shared one store-buffer, store forwarding between threads for stores that haven't retired yet couldn't be allowed because they're still speculative at that point. That would tie the two threads together for branch mispredicts and other rollbacks.
Using a shared store queue for multiple hardware threads would take extra logic to always forward to loads from the same thread, but only forward retired stores to loads from the other thread(s). Besides transistor count, this would probably have a significant power cost. You couldn't just omit store-forwarding entirely for non-retired stores, because that would break single-threaded code.
Some POWER CPUs may actually do this; it seems like the most likely explanation for not all threads agreeing on a single global order for stores. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
As @BeeOnRope points out, this wouldn't work for an x86 CPU, only for an ISA that doesn't guarantee a Total Store Order, because this would let the SMT sibling(s) see your store before it becomes globally visible to other cores.
TSO could maybe be preserved by treating data from sibling store-buffers as speculative, or not able to happen before any cache-miss loads (because lines that stay hot in your L1D cache can't contain new stores from other cores). IDK, I haven't thought this through fully. It seems way overcomplicated and probably not able to do useful forwarding while maintaining TSO, even beyond the complications of having a shared store-buffer or probing sibling store-buffers.
We are testing our software for the first time on a machine with > 12 cores for scalability and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple days on this, we are stumped regarding what to try next.
The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.
Basically, performance peaks from 10 - 12 threads, then drops off a cliff and is soon performing work at about the same rate it was with about 4 threads. The drop-off is fairly steep and by 16 - 20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and as multiple processes running single threads-- the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.
We are fairly certain this is a memory bottleneck, but we don't believe it's a cache issue. The evidence is as follows:
CPU usage continues to climb from 50% to 100% when scaling from 12 to 24 threads. If we were having synchronization/deadlock issues, we would have expected CPU usage to top out before reaching 100%.
Testing while copying a large amount of files in the background had very little impact on the processing rates. We think this rules out disk i/o as the bottleneck.
The commit charge is only about 4 GB, so we should be well below the threshold at which paging would become an issue.
The best data comes from using AMD's CodeAnalyst tool. CodeAnalyst shows the windows kernel goes from taking about 6% of the cpu time with 12 threads to 80-90% of CPU time when using 24 threads. A vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of the kernel's factor change when going from running with 12 threads to running with 24:
Instructions: 5.56 (times more)
Clock cycles: 10.39
Memory operations: 4.58
Cache miss ratio: 0.25 (actual cache miss ratio is 0.1, 4 times smaller than with 12 threads)
Avg cache miss latency: 8.92
Total cache miss latency: 6.69
Mem bank load conflict: 11.32
Mem bank store conflict: 2.73
Mem forwarded: 7.42
We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).
So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?
I'm not sure that I understand the issues completely such that I can offer you a solution but from what you've explained I may have some alternative view points which may be of help.
I program in C so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 which is big but in my view they're seldom big enough!
You're probably using rdtsc for timing individual sections. When I use it I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious, but standard deviation also has its place, in that it can help you decide whether a large maximum value should be investigated or not. Standard deviation only needs to be calculated when it needs to be read out: until then it can be stored in its components (n, sum x, sum x^2). Unless you're timing very short sequences you can omit the preceding synchronizing instruction. Make sure you quantify the timing overhead, if only to be able to rule it out as insignificant.
When I program multi-threaded I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things which require unnecessary memory accesses. Unnecessary memory access usually means as much inline code as possible and as little OS access as possible. To me the OS is a great unknown in terms of how much memory work a call to it will generate, so I try to keep calls to it to a minimum. In the same manner, but usually to a lesser performance-impacting extent, I try to avoid calling application functions: if they must be called I'd rather they didn't call a lot of other stuff.
In the same manner I minimize memory allocations: if I need several I add them together into one and then subdivide that one big allocation into smaller ones. This will help later allocations in that they will need to loop through fewer blocks before finding the block returned. I only block initialize when absolutely necessary.
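A minimal sketch of that "one big allocation, subdivided" idea as a bump/arena allocator (the names and the 16-byte alignment here are my own, not from the answer):
#include <stdlib.h>
#include <stddef.h>

/* One malloc up front, then cheap pointer-bump sub-allocations; a single
   free releases everything at once. */
typedef struct {
    char  *base;
    size_t used;
    size_t size;
} arena_t;

static int arena_init(arena_t *a, size_t size)
{
    a->base = malloc(size);
    a->used = 0;
    a->size = size;
    return a->base != NULL;
}

static void *arena_alloc(arena_t *a, size_t n)
{
    n = (n + 15) & ~(size_t)15;       /* keep 16-byte alignment */
    if (a->used + n > a->size)
        return NULL;                  /* arena exhausted */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

static void arena_destroy(arena_t *a)
{
    free(a->base);
}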
I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer using intrinsics based on rep movsb and rep stosb rather than calling memcpy/memset, which are usually optimized for larger blocks and not especially limited in size.
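For what it's worth, on the Windows toolchain mentioned above MSVC exposes those as the __movsb/__stosb intrinsics in <intrin.h>; a small sketch (whether they actually beat memcpy/memset for your block sizes is something to measure, not assume):
#include <stddef.h>
#include <intrin.h>

/* Small fixed-purpose copy/fill wrappers around rep movsb / rep stosb. */
static void copy_small(unsigned char *dst, const unsigned char *src, size_t n)
{
    __movsb(dst, src, n);    /* rep movsb */
}

static void zero_small(unsigned char *dst, size_t n)
{
    __stosb(dst, 0, n);      /* rep stosb */
}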
I've only recently begun using spinlocks, but I implement them such that they become inline (anything is better than calling the OS!). I guess the OS alternative is critical sections, and though they are fast, local spinlocks are faster. Since they perform additional processing, they prevent application processing from being performed during that time. This is the implementation:
inline void spinlock_init (SPINLOCK *slp)
{
    slp->lock_part=0;
}
inline char spinlock_failed (SPINLOCK *slp)
{
    /* __xchg atomically swaps 1 into the lock word and returns the old value:
       non-zero means another thread already holds the lock. */
    return (char) __xchg (&slp->lock_part,1);
}
Or more elaborate (but not overly so):
inline char spinlock_failed (SPINLOCK *slp)
{
    if (__xchg (&slp->lock_part,1)==1) return 1;
    slp->count_part=1;                 /* lock acquired: start the nesting count at 1 */
    return 0;
}
And to release
inline void spinlock_leave (SPINLOCK *slp)
{
    slp->lock_part=0;
}
Or
inline void spinlock_leave (SPINLOCK *slp)
{
    if (slp->count_part==0) __breakpoint ();      /* releasing a lock that isn't held */
    if (--slp->count_part==0) slp->lock_part=0;   /* only the outermost release unlocks */
}
The count part is something I've brought along from embedded (and other programming) where it is used for handling nested interrupts.
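A minimal usage sketch of those primitives, assuming the SPINLOCK type and __xchg atomic-exchange intrinsic behind the snippets above (the shared counter and function names here are hypothetical):
/* SPINLOCK, spinlock_init, spinlock_failed and spinlock_leave as defined above. */
static SPINLOCK queue_lock;
static long queue_depth;

void queue_setup(void)
{
    spinlock_init(&queue_lock);
    queue_depth = 0;
}

void queue_push_one(void)
{
    while (spinlock_failed(&queue_lock))
        ;                            /* spin until the exchange returns 0 (lock was free) */
    queue_depth++;                   /* critical section */
    spinlock_leave(&queue_lock);
}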
I'm also a big fan of IOCPs for their efficiency in handling IO events and threads but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.
To address your bullet points:
1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-esque, then your threads would still be saturating their CPUs even while not accomplishing useful work.
2) skipped
3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.
4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.
Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.