how does the diagram for this statement look like in multithreading

how does the diagram for this statement look like in multithreading - multithreading

I have come across this statement and I am wondering how it should be interpreted, please be patient with me thank you.
"Given a multi core system Pr1 and Pr2. The addresses Add1 and Add2 are mapped to the same cache block but A1 is not equal to A2. The cache state is initially invalid."
Would the cache blocks and processor look like the diagram I have drawn in A or in B? I'm confused about what it means when Add1 and Add2 are mapped to the same cache but does that mean that Pr1 and Pr2 access the same single block? Or do they each have their own blocks ?
I came across this diagram hence why I am confused how the architecture in this statement looks like.
Any kind explanation is appreciated, thank you!

Think of a cache as a bounded hash table, as used by software. The addresses are hashed to the same location and that causes a collision.
A direct mapped cache maps to a array element (replacing on collision)
A set-associative cache will have a fixed sized list at the array slot (replacing the LRU within the list)
A fully-associative cache is a open addressed hash table (LRU across all elements).
In your example the cache starts out as empty (invalid blocks). When loading the memory location the two addresses map to the same cache line. If we assume direct mapped cache then the loads conflict, causing the other item to be evicted when loaded into the block. This conflict could be a performance problem by causing thrashing if both are hot entries being frequently used, thereby causing stalls due to memory fetches. An associative cache would support some collisions, e.g. 2-way might be enough in practice to avoid this problem in most cases. This might mean that a L1 direct mapped + 2-way L2 may be acceptable, where the per-core cache suffers from collisions but the shared cache does not and the penalty to main memory is avoided.

Related

How to put data in L2 cache with A72 Core?

I have an array of data that looks like this :
uint32_t data[128]; //Could be more than L1D Cache size
In order to do computation on it, I want to put the data as close as possible to my computing unit so in the L2 Cache.
My target runs with a linux kernel and some additionnal apps
I know that I can get an access to a certain area of the memory with mmap and I have succesfully done it in some part of my available memory shared between cores.
How to do the same thing but in L2 Cache area ?
I've read part of gcc documentation and AArch64 assembly instruction set but cannot figure out the way to achieve this.

How to do the same thing but in L2 Cache area ?
Your hardware doesn't support that.
In general, the ARMv8 architecture doesn't make any guarantees about the contents of caches and does not provide any means to explicitly manipulate or query them - it only makes guarantees and provides tools for dealing with coherency.
Specifically, from section D4.4.1 "General behavior of the caches" of the spec:
[...] the architecture cannot guarantee whether:
• A memory location present in the cache remains in the cache.
• A memory location not present in the cache is brought into the cache.
Instead, the following principles apply to the behavior of caches:
• The architecture has a concept of an entry locked down in the cache.
How lockdown is achieved is IMPLEMENTATION DEFINED, and lockdown might
not be supported by:
— A particular implementation.
— Some memory attributes.
• An unlocked entry in a cache might not remain in that cache. The
architecture does not guarantee that an unlocked cache entry remains in
the cache or remains incoherent with the rest of memory. Software must
not assume that an unlocked item that remains in the cache remains dirty.
• A locked entry in a cache is guaranteed to remain in that cache. The
architecture does not guarantee that a locked cache entry remains
incoherent with the rest of memory, that is, it might not remain dirty.
[...]
• Any memory location is not guaranteed to remain incoherent with the rest of memory.
So basically you want cache lockdown. Consulting the manual of your CPU though:
• The Cortex-A72 processor does not support TLB or cache lockdown.
So you can't put something in cache on purpose. Now, you might be able to tell whether something has been cached by trying to observe side effects. The two common side effects of caches are latency and coherency. So you could try and time access times or modify the contents of DRAM and check whether you see that change in your cached mapping... but that's still a terrible idea.
For one, both of these are destructive operations, meaning they will change the property you're measuring, by measuring it. And for another, just because you observe them once does not mean you can rely on that happening.
Bottom line: you cannot guarantee that something is held in any particular cache by the time you use it.

Cache - is not a place where data should be stored, it's just... cache? :)
I mean, your processor decide which data it should cache and where (L1/L2/L3) and logic depends on CPU implementation.
If you wanted to, you could try to find the algorithm of placing and replacing data in cache and play with this (without guaranties, of course) by using dedicated instructions to prefetch your data and then maintain the cache with non-caching instructions for your other program.
Maybe for modern ARM there are easier ways, I spoke from x86/x64 perspective, but my whole point is "are you really sure that you need this"?
CPU's smart enough to cache the data which they need and they do it better and better year by year.
I'd recommend you to use any profiler that can show you cache misses to be sure than your data is not presented in the cache already.
If it don't, the first thing to optimize - is an algorithm. Try to figure out why there was a cache miss - maybe you should load less data in loop by using temp variables, for example, or even unroll the loop manually to control where and what being accessed.

Can one CPU core observe others' modification immediately? [duplicate]

Under the total store order(TSO) memory consistency model, a x86 cpu will have a write buffer to buffer write requests and can serve reordered read requests from the write buffer. And it says that the write requests in the write buffer will exit and be issued toward cache hierarchy in FIFO order, which is the same as program order.
I am curious about:
To serve the write requests issued from the write buffer, does L1 cache controller handle the write requests, finish the cache coherence of the write requests and insert data into L1 cache in the same order as the issue order?

Your terminology is unusual. You say "finish the cache coherence"; what actually happens is that the core has to get (exclusive) ownership of the cache line before it can modify it. At the instant/cycle that modification happens, it becomes part of the view of memory contents shared by all participants in the cache-coherency protocol.
So yes, you do "finish the cache coherence" = get exclusive ownership before the store can even enter cache and become globally visible = available for requests to share that cache line. Cache always maintains coherence (that's the point of MESI), not gets out of sync and then wait for coherence. I think your confusion stems from your mental model not matching that reality.
(Weakly-ordered architectures have mind-bending possibilities like not all cores seeing the stores from two other cores in the same order; that can happen by private store-forwarding between SMT threads on one physical core letting another logical core see a store ahead of commit to L1d = global visibility.)
I think you know some of this, but let me start from the basics.
L1 cache in each core participates in the cache-coherency protocol that keeps its cache coherent with the other caches in the coherency domain (e.g. L2 and L3, and L1 in other cores, but not video-RAM caches inside the GPU).
Loads become globally visible at the instant their data is read from L1 cache (or from the store buffer or from uncacheable RAM or MMIO). MFENCE can force them to wait for earlier stores to become globally visible before sampling L1, to avoid StoreLoad reordering.
Stores become globally visible at the instant their data is committed to L1 cache. The conditions required before this can happen are:
It's finished executing: the data+address is in a store buffer entry. (i.e. the store-address and store-data uops executed on the appropriate ports once their inputs were ready, writing the address and data into the store buffer, aka Memory Order Buffer on Intel CPUs).
It's retired from the out-of-order part of the core, and thus known to be non-speculative. Before retirement, we don't know that it and all preceding instructions won't fault, or that it wasn't in the shadow of a branch mispredict or other mis-speculation.
Retirement can only happen after it's finished executing, but isn't tied to commitment to L1d. The store buffer can continue to track a non-speculative store that will definitely happen eventually even after the ROB (out-of-order execution ReOrder Buffer) has forgotten about the store instruction.
All preceding loads/stores/fences are already globally visible (because of x86's memory ordering rules). This excludes weakly-ordered ops (NT stores); other loads/stores can pass them.
The cache line is in the Exclusive or Modified state of the MESI/MESIF/MOESI cache-coherence protocol, in the L1d cache of the current core. This can take a long time if the RFO (read for ownership) encounters a cache miss in outer levels of cache, or contention with other cores that also want exclusive access to write, or atomically RMW, a cache line.
See wikipedia's MESI article for diagrams of allowed state transitions, and details. The key point is that coherency is achieved by only allowing a core to modify its copy of a cache line when it's sure that no other caches contain that line, so that it's impossible for two conflicting copies of the same line to exist.
Intel CPUs actually use MESIF, while AMD CPUs actually use MOESI to allow cache->cache data transfer of dirty data instead of write-back to a shared outer cache like the basic MESI protocol requires.
Also note that modern Intel designs (before Skylake-AVX512) implement use a large shared inclusive L3 cache as a backstop for cache-coherency, so snoop requests don't actually have to be broadcast to all cores; they just check L3 tags (which contain extra metadata to track which core is caching what.
Intel's L3 is tag-inclusive even for lines that inner caches have in Exclusive or Modified state and thus are Invalid in L3. See this paper for more details of a simplified version of what Intel does).
Also related: I wrote an answer recently about why we have small/fast L1 + larger L2/L3, instead of one big cache, including some links to other cache-related stuff.
Back to the actual question:
Yes, stores are committed to L1 in program order, because that's the order that x86 requires them to become globally visible. L1-commit order is the same thing as global-visibility order.
Instead of "finish the cache coherence", instead you should say "get ownership of the cache line". This involves communicating with other caches using the cache coherency protocol, so I guess you probably meant "finish getting exclusive ownership using the cache coherency protocl".
The memory ordering part of the MESI wiki article points out that buffering stores in a store queue is separate from out-of-order execution in general.
The store buffer decouples commit to L1d from OoO exec retirement. This can potentially hide a lot more store latency than the regular out-of-order window size. However, retired stores must eventually happen (in the right order) even if an interrupt arrives, so allowing lots of retired but not committed stores can increase interrupt latency.
The store buffer tries to commit retired stores to L1d as quickly as it can, but it's restricted by the memory ordering rules. (i.e. other cores will see stores soon; you don't need a fence to flush the store buffer unless you need the current thread to wait for that to happen before a later load in this thread. e.g. for sequentially-consistent stores.)
On a weakly-ordered ISA, later stores can commit to L1d while an earlier store is still waiting for a cache miss. (But you'd still need a memory order buffer to preserve the illusion of a single core running instructions in program order.)
The store buffer can have multiple cache misses in flight at once, because even on strongly-ordered x86 it can send an RFO for a cache line before that store is the oldest one in the buffer.

Yes in a model like x86-TSO stores are likely committed to the L1 in program order, and Peter's answer covers it well. That is, the store buffer is maintained in program order, and the core will commit only the oldest store (or perhaps several consecutive oldest stores if they are all going to the same cache line) to L1 before moving on.1
However, you mention in the comments your concern that this might impact performance by essentially making the store buffer commit a blocking (serialized) process:
And why I am confused about this problem is that cache controller
could handle the requests in a non-blocking way. But, to conform to
the TSO and make sure data globally visible on a multi-core system,
should cache controller follow the store ordering? Because if there
are two variable A and B being updated sequentially on core 1 and core
2 get the updated B from core 1, then core 2 must also can see the
updated A. And to achieve this, I think the private cache hierarchy on
core 1 have to finishes the cache coherence of the variable A and B in
order and make them globally visible. Am I right?
The good news is that even though the store buffer might commit in a ordered way only the oldest store to L1, it can still get plenty of parallelism with respect to the rest of the memory subsystem by looking ahead in the store buffer and making prefetch RFO requests: trying to get the line in the E state in the local core even before the store first in line to commit to L1.
This approach doesn't violate ordering, since the stores are still written in program order, but it allows full parallelism when resolving L1 store misses. It is L1 store misses that really matter anyways: stores hits in L1 can commit rapidly, at least 1 per cycle, so committing a bunch of hits doens't help much: but getting MLP on store misses is very important, especially for scattered stores the prefetcher can't deal with.
Do x86 chips actually use a technique like this? Almost certainly. Most convincingly, tests of a long series of random writes show a much better average latency than the full memory latency, implying MLP significantly better than one. You can also find patents like this one or this one where Intel describes pretty much exactly this method.
Still, nothing is perfect. There is some evidence that ordering concerns causes weird performance hiccups when stores are missing L1, even if they hit in L2.
1 It is certainly possible that it can commit stores out of order if in maintains the illusion of in-order commit, e.g., by not relinquishing ownership of cache lines written out of order until order is restored, but this is prone to deadlocks and other complicated cases, and I have no evidence that x86 does so.

what is sequential store buffer structure in gc specifically?

I have read garbage collection book, it mentioned a sort of data structure ,sequential store buffer , could anyone help me to explain how it works? or the principle? or where i can find the thesis about it ?

For generational collectors, different regions of the heap get collected at different times (minor for young gen., major for old gen.). To ensure consistency of collection a remembered set is typically used that records links from objects in the old generation to the young generation.
There are different ways of recording the remembered set, as described in the GC book you mention. A common way is the use of a card table, which is how the G1 collector does it.
An alternative is the sequential store buffer. This is an area of memory that is treated roughly like a stack, i.e. there is a pointer to where the next piece of data can be stored. Once the data is saved the pointer is bumped by the size of the data. This is very efficient (and is also the way space is allocated in the young generation). For a GC algorithm that uses a write-barrier (most) this is a good way of reducing the load created by the write-barrier. It is also very efficient on pipelined architectures with branch prediction.

Linux slab allocator and cache performance

From the guide understanding linux kernel 3rd edition, chapter 8.2.10, Slab coloring-
We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this
chapter, we have also seen that objects of the same size end up being stored at the same offset within a cache.
Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped
in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects
from the same cache line back and forth to different RAM locations, while other cache lines go underutilized.
The slab allocator tries to reduce this unpleasant cache behavior by a policy called slab coloring : different
arbitrary values called colors are assigned to the slabs.
(1) I am unable to understand the issue that the slab coloring tries to solve. When a normal proccess accesses data, if it is not in the cache and a cache miss is encountered, the data is fetched into the cache along with data from the surounding address of the data the process tries to access to boost performance. How can a situation occur such that same specific cache lines keeps getting swapped? the probability that a process keeps accessing two different data addresses in same offset inside a memory area of two different memory areas is very low. And even if it does happen, cache policies usually choose lines to be swapped according to some agenda such as LRU, Random, etc. No policy exist such that chooses to evict lines according to a match in the least significant bits of the addresses being accessed.
(2) I am unable to understand how the slab coloring, which takes free bytes from end of slab to the beginning and results with different slabs with different offsets for the first objects, solve the cache-swapping issue?
[SOLVED] after a small investigation I believe I found an answer to my question. Answer been posted.

After many studying and thinking, I have got explanation seemingly more reasonable, not only by specific address examples.
Firstly, you must learn basics knowledge such as cache , tag, sets , line allocation.
It is certain that colour_off's unit is cache_line_size from linux kernel code. colour_off is the basic offset unit and colour is the number of colour_off in struct kmem_cache.
int __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
cachep->align = ralign;
cachep->colour_off = cache_line_size(); // colour_off's unit is cache_line_size
/* Offset must be a multiple of the alignment. */
if (cachep->colour_off < cachep->align)
cachep->colour_off = cachep->align;
.....
err = setup_cpu_cache(cachep, gfp);
https://elixir.bootlin.com/linux/v4.6/source/mm/slab.c#L2056
So we can analyse it in two cases.
The first is cache > slab.
You see slab 1 slab2 slab3 ... has no possibility to collide mostly because cache is big enough except slab1 vs slab5 which can collide. So colouring mechanism is not so clear to improve performance in the case. But with slab1 and slab5 we just ignore to explain it why, I am sure you will work it out after reading the following.
The second is slab > cache.
A blank line means a color_off or cache line. Clearly, slab1 and slab2 has no possibility to collide on the lines signed by tick as well as slab2 slab3.
We make sure colouring mechanism optimize two lines between two adjacent slabs, much less slab1 vs slab3 which optimize more lines, 2+2 = 4 lines, you can count it.
To summarize, colouring mechanism optimize cache performance (detailly just optimize some lines of colour_off at the beginning and end, not other lines which can still collide ) by using originally useless memory as possible as it can.

I think I got it, the answer is related to Associativity.
A cache can be divided to certain sets, each set can only cache a certain memory blocks type in it. For example, set0 will contain memory blocks with addresses of multiple of 8, set1 will contain memory blocks with addresses of multiple of 12. The reason for that is to boost cache performance, to avoid the situation where every address is searched throught the whole cache. This way only a certain set of the cache needs to be searched.
Now, from the link Understanding CPU Caching and performance
From page 377 of Henessey and Patterson, the cache placement formula is as follows:
(Block address) MOD (Number of sets in cache)
Lets take memory block address 0x10000008 (from slabX with color C) and memory block address 0x20000009 (from slabY with color Z). For most N (number of sets in cache), the calculation for <address> MOD <N> will yield a different value, hence a different set to cache the data. If the addresses were with same least significant bits values (for example 0x10000008 and 0x20000008) then for most of N the calculation will yield same value, hence the blocks will collide to the same cache set.
So, by keeping an a different offset (colors) for the objects in different slabs, the slabs objects will potentially reach different sets in cache and will not collide to the same set, and overall cache performance is increased.
EDIT: Furthermore, if the cache is a direct mapped one, then according to wikipedia, CPU Cache, no cache replacement policy exist and the modulu calculation yields the cache block to which the memory block will be stored:
Direct-mapped cache
In this cache organization, each location in main memory can go in only one entry in the cache. Therefore, a direct-mapped cache can also be called a "one-way set associative" cache. It does not have a replacement policy as such, since there is no choice of which cache entry's contents to evict. This means that if two locations map to the same entry, they may continually knock each other out. Although simpler, a direct-mapped cache needs to be much larger than an associative one to give comparable performance, and it is more unpredictable. Let x be block number in cache, y be block number of memory, and nbe number of blocks in cache, then mapping is done with the help of the equation x = y mod n.

Say you have a 256 KB cache and it uses a super-simple algorithm where it does cache line = (real address AND 0x3FFFFF).
Now if you have slabs starting on each megabyte boundary then item 20 in Slab 1 will kick Item 20 of Slab 2 out of cache because they use the same cache line tag.
By offsetting the slabs it becomes less likely that different slabs will share the same cache line tag. If Slab 1 and Slab 2 both hold 32 byte objects and Slab 2 is offset 8 bytes, its cache tags will never be exactly equal to Slab 1's.
I'm sure I have some details wrong, but take it for what it's worth.

How to prevent two processess from fighting for a common cache?

I was asked this question on an exam. We have two CPUs, or two cores in the same CPU, that share a common cache (for example, L3). On each CPU there is an MPI process (or a thread of one common process). How can we assure that these two processes don't interfere, meaning that they don't push each others entries out or use a half of the cache each or something similar. The goal is to improve the speed of memory access here.
The OS is some sort of Unix, if that is important.

Based on your comments, it seems that a "textbook answer" is expected, so I would suggest partitioning the cache between the processes. This way you guarantee that they don't compete over the same cache sets and thrash each other. This is assuming you don't want to actually share anything between the 2 processes, in which case this approach would fail (although a possible fix would be to split the cache space in 3 - one range for each process, and one for shared data).
Since you're probably not expected to redesign the cache and provide HW partitioning scheme (unless the question comes in the scope of computer architecture course), the simplest way to achieve this is simply by inspecting the cache size and associativity, figuring our the number of sets, and aligning the data sets of each process/thread to a different part.
For example, if your shared cache is 2MB big, and has 16 ways and 64B lines, you would have 2k sets. In such case, each process would want to align its physical addresses (assuming the cache is physically mapped) to a different half 1k sets, or a different 0x10000 out of each 0x20000. In other words, P0 would be free to use any physical address with bit 16 equals 0 , and P1 would use the addresses with bit 16 equals 1.
Note, that since that exceeds the size of a basic 4k page (alignment of 0x1000), you would either need to hack your OS to assign your pages to the appropriate physical addresses for each process, or simply use larger pages (2M would be enough).
Also note that by keeping a contiguous 0x10000 per allocation, we still enjoy spatial locality and efficient hardware prefetching (otherwise you could simply pick any other split, even even/odd sets by using bit 6, but that would leave your data fractured.
Last issue is for data sets larger than this 0x10000 quota - to make then align you'd simply have to break them into chunks up to 0x10000, and align each separately. There's also the issue of code/stack/pagemap and other types of OS/system data which you have less control over (actually code can also be aligned, or more likely in this case - shared) - I'm assuming this has negligible impact on thrashing.
Again - this attempts to answer without knowing what system you work with, what you need to achieve, or even what is the context of the course. With more context we can probably focus this to a simpler solution.

How large is a way in the cache?
For example, if you have a cache where each way is 128KiB in size, you partition your memory in such a way that for each address modulo 128KiB, process A uses the 0-64KiB region, and process B uses the lower 64KiB-128KiB region. (This assumes private L1-per-core).
If your physical page size is 4KiB (and your CPU uses physical addresses for caching, not virtual - which does occur on some CPUs), you can make this much nicer. Let's say you're mapping the same amount of memory into virtual address space for each core - 16KiB. Pages 0, 2, 4, 6 go to process A's memory map, and pages 1, 3, 5, 7 go to process B's memory map. As long as you only address memory in that carefully laid out region, the caches should never fight. Of course, you've effectively halved the size of your cache-ways by doing so, but you have multiple ways...

You'll want to utilize a lock in regards to multi-thread programming. It's hard to provide an example due to not knowing your specific situation.
When one process has access, lock all other processes out until the 'accessing' process is finished with the resource.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string