I was asked this question on an exam. We have two CPUs, or two cores in the same CPU, that share a common cache (for example, L3). On each CPU there is an MPI process (or a thread of one common process). How can we ensure that these two processes don't interfere, meaning that they don't push each other's entries out, or that each uses half of the cache, or something similar? The goal is to improve the speed of memory access here.
The OS is some sort of Unix, if that is important.
Based on your comments, it seems that a "textbook answer" is expected, so I would suggest partitioning the cache between the processes. This way you guarantee that they don't compete over the same cache sets and thrash each other. This is assuming you don't want to actually share anything between the 2 processes, in which case this approach would fail (although a possible fix would be to split the cache space in 3 - one range for each process, and one for shared data).
Since you're probably not expected to redesign the cache and provide a HW partitioning scheme (unless the question comes within the scope of a computer architecture course), the simplest way to achieve this is simply by inspecting the cache size and associativity, figuring out the number of sets, and aligning the data sets of each process/thread to a different part.
For example, if your shared cache is 2MB, with 16 ways and 64B lines, you would have 2k sets. In that case, each process would want to align its physical addresses (assuming the cache is physically mapped) to a different half of those sets, 1k sets each, or a different 0x10000 out of every 0x20000. In other words, P0 would be free to use any physical address where bit 16 equals 0, and P1 would use the addresses where bit 16 equals 1.
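A minimal sketch of that split, assuming the example geometry above (2MB, 16 ways, 64B lines; all values are assumptions from the example, not from a real machine):

#include <stdint.h>
#include <stdio.h>

/* Minimal sketch for the 2MB / 16-way / 64B-line example above.
 * All values are assumptions taken from that example, not from a real machine. */
#define CACHE_SIZE (2u * 1024 * 1024)
#define WAYS       16u
#define LINE_SIZE  64u
#define NUM_SETS   (CACHE_SIZE / (WAYS * LINE_SIZE))    /* 2048 sets */

static unsigned set_index(uintptr_t paddr)
{
    return (unsigned)((paddr / LINE_SIZE) % NUM_SETS);  /* set index lives in bits 6..16 */
}

static int partition(uintptr_t paddr)
{
    return (int)((paddr >> 16) & 1);                    /* 0 -> P0's half, 1 -> P1's half */
}

int main(void)
{
    uintptr_t a = 0x12340040, b = 0x12350040;           /* differ only in bit 16 */
    printf("a: set %u, partition %d\n", set_index(a), partition(a));
    printf("b: set %u, partition %d\n", set_index(b), partition(b));
    return 0;
}

Here bit 16 happens to be the highest set-index bit for the assumed geometry; a different cache size or associativity would move it.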
Note that since this exceeds the size of a basic 4k page (alignment of 0x1000), you would either need to hack your OS to assign pages to the appropriate physical addresses for each process, or simply use larger pages (2M would be enough).
Also note that by keeping a contiguous 0x10000 per allocation, we still enjoy spatial locality and efficient hardware prefetching (otherwise you could simply pick any other split, even an even/odd split of the sets using bit 6, but that would leave your data fragmented).
The last issue is data sets larger than this 0x10000 quota - to make them align you'd simply have to break them into chunks of up to 0x10000 and align each separately. There's also the issue of code/stack/pagemap and other types of OS/system data which you have less control over (actually code can also be aligned, or more likely in this case - shared); I'm assuming this has negligible impact on thrashing.
Again - this attempts to answer without knowing what system you work with, what you need to achieve, or even what is the context of the course. With more context we can probably focus this to a simpler solution.
How large is a way in the cache?
For example, if you have a cache where each way is 128KiB in size, you partition your memory in such a way that for each address modulo 128KiB, process A uses the 0-64KiB region and process B uses the upper 64KiB-128KiB region. (This assumes a private L1 per core.)
If your physical page size is 4KiB (and your CPU uses physical addresses for caching, not virtual - which does occur on some CPUs), you can make this much nicer. Let's say you're mapping the same amount of memory into virtual address space for each core - 16KiB. Pages 0, 2, 4, 6 go to process A's memory map, and pages 1, 3, 5, 7 go to process B's memory map. As long as you only address memory in that carefully laid out region, the caches should never fight. Of course, you've effectively halved the size of your cache-ways by doing so, but you have multiple ways...
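A minimal sketch of that even/odd page split, assuming 4 KiB pages (the rule and sizes are just the example values above):

#include <stdint.h>
#include <stdio.h>

/* Sketch of the even/odd page interleaving described above: 4 KiB pages,
 * even-numbered pages for process A, odd-numbered pages for process B.
 * Purely illustrative. */
#define PAGE_SIZE 4096u

static char owner_of(uintptr_t addr)
{
    return ((addr / PAGE_SIZE) % 2 == 0) ? 'A' : 'B';
}

int main(void)
{
    for (uintptr_t a = 0; a < 8 * PAGE_SIZE; a += PAGE_SIZE)
        printf("page at 0x%05lx -> process %c\n", (unsigned long)a, owner_of(a));
    return 0;
}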
You'll want to use a lock for multi-threaded programming. It's hard to provide an example without knowing your specific situation.
When one process has access, lock all other processes out until the 'accessing' process is finished with the resource.
Let's assume that 2 cores are trying to write different values to the same RAM address (1 byte), at the same moment in time (plus or minus some small eta), and without using any interlocked instructions or memory barriers. What happens in this case, and what value will be written to main RAM? Does the first one win? The last one? Is the behavior undetermined?
x86 (like every other mainstream SMP CPU architecture) has coherent data caches. It's impossible for two different caches (e.g. L1D of 2 different cores) to hold conflicting data for the same cache line.
The hardware imposes an order (by some implementation-specific mechanism to break ties in case two requests for ownership arrive in the same clock cycle from different cores). In most modern x86 CPUs, the first store won't be written to RAM, because there's a shared write-back L3 cache to absorb coherency traffic without a round-trip to memory.
Loads that appear after both the stores in the global order will see the value stored by whichever store went second.
(I'm assuming we're talking about normal (not NT) stores to cacheable memory regions (WB, not USWC, UC, or even WT). The basic idea would be the same in either case, though; one store would go first, the next would step on it. The data from the first store could be observed temporarily if a load happened to get between them in the global order, but otherwise the data from the store that the hardware chose to do 2nd would be the long-term effect.)
We're talking about a single byte, so the store can't be split across two cache lines, and thus every address is naturally aligned, so everything in "Why is integer assignment on a naturally aligned variable atomic on x86?" applies.
Coherency is maintained by requiring a core to acquire exclusive access to that cache line before it can modify it (i.e. make a store globally visible by committing it from the store queue to L1D cache).
This "acquiring exclusive access" stuff is done using (a variant of) the MESI protocol. Any given line in a cache can be Modified (dirty), Exclusive (owned by not yet written), Shared (clean copy; other caches may also have copies so an RFO (Read / Request For Ownership) is required before write), or Invalid. MESIF (Intel) / MOESI (AMD) add extra states to optimize the protocol, but don't change the fundamental logic that only one core can change a line at any one time.
If we cared about the ordering of multiple changes to two different lines, then memory ordering and memory barriers would come into play. But none of that matters for this question about "which store wins" when the stores execute or retire in the same clock cycle.
When a store executes, it goes into the store queue. It can commit to L1D and become globally visible at any time after it retires, but not before; unretired instructions are treated as speculative and thus their architectural effects must not be visible outside the CPU core. Speculative loads have no architectural effect, only a microarchitectural one.[1]
So if both stores become ready to commit at "the same time" (clocks are not necessarily synchronized between cores), one or the other will have its RFO succeed first and gain exclusive access, and make its store data globally visible. Then, soon after, the other core's RFO will succeed and update the cache line with its data, so its store comes second in the global store order observed by all other cores.
x86 has a total-store-order memory model where all cores observe the same order even for stores to different cache lines (except for always seeing their own stores in program order). Some weakly-ordered architectures like PowerPC would allow some cores to see a different total order from other cores, but this reordering can only happen between stores to different lines. There is always a single modification order for a single cache line. (Reordering of loads with respect to each other and other stores means that you have to be careful how you go about observing things on a weakly ordered ISA, but there is a single order of modification for a cache line, imposed by MESI).
Which one wins the race might depend on something as prosaic as the layout of the cores on the ring bus relative to which slice of shared L3 cache that line maps to. (Note the use of the word "race": this is the kind of race which "race condition" bugs describe. It's not always wrong to write code where two unsynchronized stores update the same location and you don't care which one wins, but it's rare.)
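To make the scenario concrete, here is a minimal sketch of the race described in the question: two threads each store a different value to the same byte with no synchronization (a deliberate data race, shown only for illustration):

#include <pthread.h>
#include <stdio.h>

/* Two threads store different values to the same byte with no
 * synchronization. Coherency guarantees that a single modification order
 * exists for the line, not which store ends up second in that order. */
static volatile unsigned char shared_byte;

static void *store_a(void *arg) { (void)arg; shared_byte = 0xAA; return NULL; }
static void *store_b(void *arg) { (void)arg; shared_byte = 0xBB; return NULL; }

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, store_a, NULL);
    pthread_create(&b, NULL, store_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("final value: 0x%02X\n", (unsigned)shared_byte);  /* 0xAA or 0xBB */
    return 0;
}

Built with -pthread, repeated runs may print either value depending on which core's store commits last.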
BTW, modern x86 CPUs have hardware arbitration for the case when multiple cores contend for atomic read-modify-write to the same cache line (and thus are holding onto it for multiple clock cycles to make lock add byte [rdi], 1 atomic), but regular loads/stores only need to own a cache line for a single cycle to execute a load or commit a store. I think the arbitration for locked instructions is a different thing from which core wins when multiple cores are trying to commit stores to the same cache line. Unless you use a pause instruction, cores assume that other cores aren't modifying the same cache line, and speculatively load early, and thus will suffer memory-ordering mis-speculation if it does happen. (What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)
IDK if anything similar happens when two threads are both just storing without loading, but probably not, because stores aren't speculatively reordered and are decoupled from out-of-order execution by the store queue. Once a store instruction retires, the store is definitely going to happen, so OoO exec doesn't have to wait for it to actually commit. (And in fact it has to retire from the OoO core before it can commit, because that's how the CPU knows it's non-speculative; i.e. that no earlier instruction faulted or was a mispredicted branch.)
Footnotes:
1. Spectre blurs that line by using a cache-timing attack to read microarchitectural state into the architectural state.
They will wind up being sequenced, likely between the L1 caches. One write will come first and the other will come second. Whichever one comes second will be the result that subsequent reads will see.
From the guide Understanding the Linux Kernel, 3rd Edition, chapter 8.2.10, "Slab coloring":
We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this chapter, we have also seen that objects of the same size end up being stored at the same offset within a cache. Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations, while other cache lines go underutilized. The slab allocator tries to reduce this unpleasant cache behavior by a policy called slab coloring: different arbitrary values called colors are assigned to the slabs.
(1) I am unable to understand the issue that slab coloring tries to solve. When a normal process accesses data that is not in the cache, a cache miss occurs and the data is fetched into the cache along with data from the surrounding addresses to boost performance. How can a situation arise in which the same specific cache lines keep getting swapped? The probability that a process keeps accessing two different data addresses at the same offset within two different memory areas is very low. And even if it does happen, cache policies usually choose which lines to evict according to some scheme such as LRU, random, etc. No policy exists that chooses to evict lines based on a match in the least significant bits of the addresses being accessed.
(2) I am unable to understand how slab coloring, which moves the free bytes from the end of the slab to the beginning, giving different slabs different offsets for their first objects, solves the cache-swapping issue.
[SOLVED] After a small investigation I believe I found an answer to my question. The answer has been posted.
After much studying and thinking, I have arrived at an explanation that seems more reasonable, and not only based on specific address examples.
First, you should learn the basics, such as caches, tags, sets, and line allocation.
From the Linux kernel code it is clear that colour_off's unit is cache_line_size(). colour_off is the basic offset unit, and colour is the number of colour_off units in struct kmem_cache.
int __kmem_cache_create(struct kmem_cache *cachep, unsigned long flags)
{
    ...
    cachep->align = ralign;
    cachep->colour_off = cache_line_size(); // colour_off's unit is cache_line_size
    /* Offset must be a multiple of the alignment. */
    if (cachep->colour_off < cachep->align)
        cachep->colour_off = cachep->align;
    ...
    err = setup_cpu_cache(cachep, gfp);
    ...
}
https://elixir.bootlin.com/linux/v4.6/source/mm/slab.c#L2056
So we can analyse it in two cases.
The first case is cache size > slab size.
Looking at slab 1, slab 2, slab 3, ..., there is mostly no possibility of collision because the cache is big enough, except for pairs like slab 1 vs slab 5, which can collide. So the benefit of the colouring mechanism is not so clear in this case. As for slab 1 vs slab 5, I will skip the explanation; I am sure you will work it out after reading the following.
The second case is slab size > cache size.
A blank line represents one colour_off (one cache line). Clearly, slab 1 and slab 2 have no possibility of colliding on the lines marked with a tick, and the same goes for slab 2 and slab 3.
So the colouring mechanism optimizes two lines between two adjacent slabs, and even more between slab 1 and slab 3, where 2 + 2 = 4 lines are optimized, as you can count.
To summarize, the colouring mechanism optimizes cache performance (specifically, it only optimizes a few colour_off lines at the beginning and end of a slab; the other lines can still collide) by putting otherwise unused memory to work as much as it can.
I think I got it; the answer is related to associativity.
A cache is divided into a certain number of sets, and each set can only cache memory blocks of a certain kind. For example, set 0 might contain memory blocks whose block address is congruent to 0 modulo the number of sets, set 1 those congruent to 1, and so on. The reason for this is to boost cache performance and avoid the situation where every address is searched through the whole cache; this way only a certain set of the cache needs to be searched.
Now, from the link Understanding CPU Caching and Performance:
From page 377 of Hennessy and Patterson, the cache placement formula is as follows:
(Block address) MOD (Number of sets in cache)
Let's take memory block address 0x10000008 (from slab X with color C) and memory block address 0x20000009 (from slab Y with color Z). For most values of N (the number of sets in the cache), the calculation <address> MOD <N> will yield a different value, hence a different set in which to cache the data. If the addresses had the same least significant bits (for example 0x10000008 and 0x20000008), then for most values of N the calculation would yield the same value, and the blocks would collide in the same cache set.
So, by keeping a different offset (color) for the objects in different slabs, objects in different slabs will tend to reach different sets in the cache and not collide in the same set, and overall cache performance is increased.
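A small sketch of that effect, using made-up parameters (64 B lines, 1024 sets) and addresses chosen so that one slab is coloured by exactly one cache line:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: line size, set count, and addresses are assumptions. */
#define LINE_SIZE 64u
#define NUM_SETS  1024u

static unsigned set_of(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);  /* (block address) MOD (number of sets) */
}

int main(void)
{
    uintptr_t objA     = 0x10000000;        /* object at offset 0 in slab X             */
    uintptr_t objB     = 0x20000000;        /* same offset in slab Y, without colouring */
    uintptr_t objB_col = 0x20000000 + 64;   /* slab Y coloured by one cache line        */

    printf("A: set %u, B: set %u, B coloured: set %u\n",
           set_of(objA), set_of(objB), set_of(objB_col));
    return 0;
}

The uncoloured objects at the same slab offset index into the same set; offsetting one slab by a single line (its colour) moves its objects to a different set.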
EDIT: Furthermore, if the cache is direct mapped, then according to the Wikipedia article CPU Cache, there is no cache replacement policy and the modulo calculation determines the cache block in which the memory block will be stored:
Direct-mapped cache
In this cache organization, each location in main memory can go in only one entry in the cache. Therefore, a direct-mapped cache can also be called a "one-way set associative" cache. It does not have a replacement policy as such, since there is no choice of which cache entry's contents to evict. This means that if two locations map to the same entry, they may continually knock each other out. Although simpler, a direct-mapped cache needs to be much larger than an associative one to give comparable performance, and it is more unpredictable. Let x be the block number in the cache, y be the block number of memory, and n be the number of blocks in the cache; then the mapping is done with the help of the equation x = y mod n.
Say you have a 256 KB cache and it uses a super-simple algorithm where it does cache line = (real address AND 0x3FFFF).
Now if you have slabs starting on each megabyte boundary then item 20 in Slab 1 will kick Item 20 of Slab 2 out of cache because they use the same cache line tag.
By offsetting the slabs it becomes less likely that different slabs will share the same cache line tag. If Slab 1 and Slab 2 both hold 32 byte objects and Slab 2 is offset 8 bytes, its cache tags will never be exactly equal to Slab 1's.
I'm sure I have some details wrong, but take it for what it's worth.
While reading and understanding linux kernel using the guide-
http://www.johnchukwuma.com/training/UnderstandingTheLinuxKernel3rdEdition.pdf
I have something I'm trying to understand in the buddy system for page allocation and freeing.
The technique adopted by Linux to solve the external fragmentation problem is based on the well-known buddy system algorithm. All free page frames are grouped into 11 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames, respectively. [chapter 8.1.7]
This makes perfect sense, as Linux can now serve allocation requests quickly since chunks of different sizes are kept ready for requests of different sizes.
Now, say the system starts up, and all available pages are free and grouped as mentioned above into those 11 groups. Now let's consider a scenario in which a process requests a chunk of order 1 and then frees it. According to the free algorithm:
while (order < 10) {
    buddy_idx = page_idx ^ (1 << order);
    buddy = base + buddy_idx;
    if (!page_is_buddy(buddy, order))
        break;
    list_del(&buddy->lru);
    zone->free_area[order].nr_free--;
    ClearPagePrivate(buddy);
    buddy->private = 0;
    page_idx &= buddy_idx;
    order++;
}
So according to this and my scenario, the order-1 chunk (the first one ever allocated) will be merged with another order-1 chunk into a chunk of order 2, even though the two order-1 chunks were not split from an order-2 chunk during allocation.
This way, if I keep allocating and then freeing a single chunk, pretty quickly the system will reach a state in which all memory chunks are of the biggest order, which seems inefficient. I would have expected two buddies to be merged only if they had previously been split from a bigger-order chunk; that way the initial default state would be preserved as much as possible and the whole system would remain efficient.
Am I missing something? Is it possible that this code is wrong? Am I not aware of another advantage this code provides?
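For illustration, here is a minimal standalone sketch (ordinary user-space C, not kernel code) of the buddy-index computation from the excerpt, with made-up values:

#include <stdio.h>

/* Standalone illustration of the buddy-index computation used above.
 * page_idx and order are example values, not taken from a real zone. */
int main(void)
{
    unsigned long page_idx = 0;     /* first page of the zone   */
    unsigned int  order    = 1;     /* freeing an order-1 chunk */

    unsigned long buddy_idx = page_idx ^ (1UL << order);
    printf("buddy of page %lu at order %u is page %lu\n",
           page_idx, order, buddy_idx);                  /* prints 2 */

    /* If page_is_buddy() approved that buddy, the two chunks would merge
     * into an order-2 chunk starting at page_idx & buddy_idx, and the
     * loop would repeat one order higher. */
    printf("merged chunk would start at page %lu, order %u\n",
           page_idx & buddy_idx, order + 1);             /* prints 0, 2 */
    return 0;
}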
The assumption of such an initial state when plenty of smallest order blocks are available might be slightly dubious. If I recall correctly, allocation of a memory block should commence with looking at the smallest (the same size) group and then looking at greater-order groups to find a free block. If the block found is larger than needed, it will be split, and the corresponding groups will be updated. It's not so obvious, but this whole process might commence with the initial state when only highest-order blocks are available.
A handful of examples which I can meet in the literature draw almost the same picture of the initial state. An eloquent example may be found on Wikipedia: https://en.wikipedia.org/wiki/Buddy_memory_allocation#In_practice . The diagram and the description shed some light on a typical situation.
All in all, I can't find any proof for the assumption of the "initial default state". The idea of decrease in efficiency when splitting a smaller chunk from a larger one is the most murky and probably deserves a separate talk.
EDIT:
The initial state from the kernel's point of view might not be the same as the initial state seen by your hypothetical process. Say the system starts, and at a certain point there is one chunk of memory. However, your hypothetical process will not be alone, for sure. The memory distribution will likely change a lot before the process is able to begin allocating and freeing any chunks of memory. The kernel or, to be precise, most of its subsystems will request plenty of memory chunks of different sizes during initialisation, and a significant number of those chunks will be owned by the kernel for a considerable amount of time (perhaps throughout the whole uptime). So, the point is that by the time your process starts, the buddy system will likely be "warmed up" and, indeed, enough small chunks will be available. However, the buddies of the pages acquired by your process will still be owned by various subsystems, and, as soon as your process decides to free its pages, those buddies will not be approved as ready for merging, i.e. page_is_buddy() from the excerpt will return false. Of course, this whole scenario only holds provided that your process indeed succeeds in taking existing free chunks without splitting higher-order blocks.
So, the point is that your assumed initial buddy distribution might not be the true initial state. It could only be a "warmed up" state in which you indeed may have small chunks, but their buddies will be busy and will thus prevent the uncontrolled merging described in your hypothetical scenario.
P.S. Here is the description of what page_is_buddy() is for.
This is a noob question about computer science: how is RAM allocated?
For example, I use Windows. Can I know which addresses are used by a program? How does Windows allocate memory? Contiguously or non-contiguously?
Is it the same thing on Linux OSes?
And can I have access to the whole RAM with a program? (I don't believe so, but...)
Do you know any good lectures/documentation on this?
First, when you think you are allocating RAM, you really are not. This is confusing, I know, but it's really not complicated once you understand how it works. Keep reading.
RAM is allocated by the operating systems in units called "pages". Usually, this means contiguous regions of 4kiB, but other sizes are possible (to complicate things further, there exists support for "large pages" (usually on the order of 1-4MiB) on modern processors, and the operating system may have an allocation granularity different from the page size, for example Windows has a page size of 4kiB with a granularity of 64kiB).
Let's ignore those additional details and just think of "pages" that have one particular size (4KiB).
If you allocate and use areas that are greater than the system's page size, you will usually not have contiguous memory, but you will nevertheless see it as contiguous, since your program can only "think" in virtual addresses. In reality you may be using two (or more) pages that are not contiguous at all, but they appear to be. These virtual addresses are transparently translated to the actual addresses by the MMU.
Further, not all memory that you believe to have allocated necessarily exists in RAM at all times, and the same virtual address may correspond to entirely different pieces of RAM at different times (for example when a page is swapped out and is later swapped in again -- your program will see it at the same address, but in reality it is most likely in a different piece of RAM).
Virtual memory is a very powerful instrument. While one address in your program can only refer to [at most] one physical address (in a particular page) in RAM, one physical page of RAM can be mapped to several different addresses in your program, and even in several independent programs.
It is for example possible to create "circular" memory regions, and code from shared libraries is often loaded into one memory location, but used by many programs at the same time (and it will have different addresses in those different programs). Or, you can share memory between programs with that technique so when one program writes to some address, the value in the other program's memory location changes (because it is the exact same memory!).
On a high level, you ask your standard library for memory (e.g. malloc), and the standard library manages a pool of regions that it has reserved in a more or less unspecified way (there are many different allocator implementations, they all have in common that you can ask them for memory, and they give back an address -- this is where you think that you are allocating RAM when you are not).
When the allocator needs more memory, it asks the operating system to reserve another block. Under Linux, this might be sbrk and mmap, under Windows, this would for example be VirtualAlloc.
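As a rough sketch of that step on Linux (illustrative only, with minimal error handling), an allocator might obtain another block of anonymous, demand-paged memory like this:

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Ask the OS for an anonymous block via mmap; the 1 MiB size is arbitrary. */
int main(void)
{
    size_t len = 1 << 20;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    ((char *)p)[0] = 42;   /* physical RAM is only faulted in on first access */
    printf("first byte at %p is now %d\n", p, ((char *)p)[0]);
    munmap(p, len);
    return 0;
}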
Generally, there are 3 things you can do with memory, and it generally works the same under Linux and Windows (and every other modern OS), although the API functions used are different, and there are a few more minor differences.
You can reserve it, this does more or less nothing, apart from logically dividing up your address space (only your process cares about that).
Next, you can commit it, this again doesn't do much, but it somewhat influences other processes. The system has a total limit of how much memory it can commit for all processes (physical RAM plus page file size), and it keeps track of that. Which means that memory you commit counts against the same limit that another process could commit. Otherwise, again, not much happens.
Last, you can access memory. This, finally, has a noticeable effect. Upon first accessing a page, a fault occurs (because the page does not exist at all!), and the operating system either fetches some data from a file (if the page belongs to a mapping) or it clears some page (possibly after first saving it to disk). The OS then adjusts the structures in the virtual memory system so you see this physical page of RAM at the address you accessed.
From your point of view, none of that is visible. It just works as if by magic.
It is possible to inspect processes for what areas in their address space are used, and it is possible (but kind of meaningless) to translate this to physical addresses. Note that the same program run at different times might store e.g. one particular variable at a different address. Under Windows, you can for example use the VMMap tool to inspect process memory allocation.
You can only use all RAM if you write your own operating system, since there is always a little memory which the OS reserves that user processes cannot use.
Otherwise you can in principle use [almost] all memory. However, whether or not you can directly use that much depends on whether your process is 32-bit or 64-bit. Computers nowadays typically have more RAM than you can address with 32 bits, so either you need to use address windowing extensions or your process must be 64-bit. Also, even given an amount of RAM that is in principle addressable using 32 bits, some address space factors (e.g. fragmentation, kernel reserve) may prevent you from directly using all memory.
Windows has VirtualAlloc, which allows you to reserve a contiguous region of address space, but not actually use any physical memory. Later when you want to use it (or part of it) you call VirtualAlloc again to commit the region of previously reserved pages.
This is actually really useful, but I want to eventually port my application to Linux, so I don't want to use it if I can't port it later. Does Linux have a way to do this?
EDIT - Use Case
I'm thinking of allocating 4 GB or so of virtual address space, but only committing it 64K at a time. This would give me a zero-copy way to grow an array up to 4 GB, which is important because the typical double-the-size-and-copy approach introduces seemingly random, unacceptable latency for very large arrays.
mmap a special file like /dev/zero (or use MAP_ANONYMOUS) with PROT_NONE, and later use mprotect to commit.
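A minimal sketch of that reserve-then-commit pattern, assuming a 64-bit Linux build (the sizes are arbitrary example values matching the use case above):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define RESERVE_SIZE (1ULL << 32)   /* reserve 4 GiB of address space */
#define CHUNK_SIZE   (64 * 1024)    /* commit 64 KiB at a time        */

int main(void)
{
    /* "Reserve": map anonymous memory with no access rights. */
    char *base = mmap(NULL, RESERVE_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* "Commit": make the first chunk readable/writable only when needed. */
    if (mprotect(base, CHUNK_SIZE, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        return 1;
    }
    memset(base, 0xAB, CHUNK_SIZE);   /* safe to touch the committed chunk */
    printf("committed %d bytes at %p\n", CHUNK_SIZE, (void *)base);

    munmap(base, RESERVE_SIZE);
    return 0;
}

Growing the array is then just further mprotect calls on the next chunks; the reserved region stays contiguous in virtual address space, so no copying is needed.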
You can turn this functionality on system-wide by using kernel overcommit. This is usually default setting on many distributions.
Here is the explanation http://www.mjmwired.net/kernel/Documentation/vm/overcommit-accounting
The Linux equivalent of VirtualAlloc() is mmap(), which provides the same behaviour. However, as a commenter points out, reserving contiguous memory is also the behaviour of calls to malloc() as long as the memory is not initialized (such as by calloc() or user code).
"seemingly random unacceptable latency
for very large arrays
You could also consider mlock() or mmap() + MAP_LOCKED to mitigate the impact of paging. Many CPUs support huge (aka large) pages, i.e. pages larger than 4 KiB. These larger pages can mitigate the impact of the TLB on streaming reads/writes.
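A minimal sketch of those mitigations on Linux: MAP_LOCKED keeps the mapping in RAM, and MAP_HUGETLB requests a huge page (which requires hugepages to be configured on the system; the 2 MiB size is just an example):

#define _GNU_SOURCE                 /* for MAP_LOCKED / MAP_HUGETLB on some libcs */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;   /* one 2 MiB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED | MAP_HUGETLB,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap (try again without MAP_HUGETLB if hugepages are unavailable)");
        return 1;
    }
    printf("locked huge-page mapping at %p\n", p);
    munmap(p, len);
    return 0;
}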