Linux Buddy page frames allocation and freeing

Linux Buddy page frames allocation and freeing - linux

While reading and understanding linux kernel using the guide-
http://www.johnchukwuma.com/training/UnderstandingTheLinuxKernel3rdEdition.pdf
I have something I'm trying to understand in the Buddy system for page allocaion and freeing.
The technique adopted by Linux to solve the external fragmentation
problem is based on the well-known buddy system algorithm. All free
page frames are grouped into 11 lists of blocks that contain groups of
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page
frames, respectively. [chapter 8.1.7]
This makes perfectly sense as now Linux can serve allocation request quickly as there are different chunk sizes ready for different chunk sizes requests.
Now, say the system starts up, and with all the available pages are free and grouped as mentioned above to those 11 groups. Now lets consider a scenario in which a process requires one page of order 1, then free it. According to the free algorithm-
while (order < 10)
{
buddy_idx = page_idx ^ (1 << order);
buddy = base + buddy_idx;
if (!page_is_buddy(buddy, order))
break;
list_del(&buddy->lru);
zone->free_area[order].nr_free--;
ClearPagePrivate(buddy);
buddy->private = 0;
page_idx &= buddy_idx;
order++;
}
so according to this and my scenario, the order 1 chunk (the first ever allocated) will be merged with another order 1 chunk into a chunk of order 2, though the two order 1 chunks have not been splitted from an order 2 chunk in allocation stage.
This way, if I keep on allocate and then free a single chunk, pretty quickly the system will reach a state in which all memory chunks are of the biggest order, which seems not efficient. I would have expected that merging two buddies will be made only when those buddies were previously splitted from a bigger order chunk, that way the initial default state will be preserved as much as possible and the whole system will be kept efficient.
Am I missing something? is it possible that this code is wrong? am I not aware of another advantage this code provides?

The assumption of such an initial state when plenty of smallest order blocks are available might be slightly dubious. If I recall correctly, allocation of a memory block should commence with looking at the smallest (the same size) group and then looking at greater-order groups to find a free block. If the block found is larger than needed, it will be split, and the corresponding groups will be updated. It's not so obvious, but this whole process might commence with the initial state when only highest-order blocks are available.
A handful of examples which I can meet in the literature draw almost the same picture of the initial state. An eloquent example may be found on Wikipedia: https://en.wikipedia.org/wiki/Buddy_memory_allocation#In_practice . The diagram and the description shed some light on a typical situation.
All in all, I can't find any proof for the assumption of the "initial default state". The idea of decrease in efficiency when splitting a smaller chunk from a larger one is the most murky and probably deserves a separate talk.
EDIT:
The initial state from the kernel's point of view might not be the same as the initial state seen by your hypothetical process. Say, the system starts, and at certain point there is one chunk of memory. However, your hypothetical process will not be alone, - for sure. The memory distribution will likely change a lot before the process will be able to begin allocating and freeing any chunks of memory. For sure, the kernel or, to be precise, most of its subsystems will request plenty of memory chunks of different sizes during initialisation, and significant amount of the chunks will be owned by the kernel for considerable amount of time (perhaps, throughout the whole uptime). So, the point is that by the time your process starts, the buddy system will likely be "warmed up" and, indeed, enough small chunks will be available. However, the buddies for the pages (acquired by your process) will still be owned by various subsystems, and, as soon as your process decides to free its pages, those buddies will not be approved as ready for merge, i.e. page_is_buddy() from the excerpt will yield false. Of course, this whole scenario might be true provided that your process indeed succeeds to take existing free chunks without splitting higher-order blocks.
So, the point is that your assumption of the initial buddy distribution might not be the true initial state. It could only be a "warmed up" state where you indeed may have small chunks but their buddies will be busy and thus shall prevent the described hypothetical process of uncontrollable merge.
P.S. Here is the description of what page_is_buddy() is for.

Related

How many ABA tag bits are needed in lock-free data structures?

One popular solution to the ABA problem in lock-free data structures is to tag pointers with an additional monotonically incrementing tag.
struct aba {
void *ptr;
uint32_t tag;
};
However, this approach has a problem. It is really slow and has huge cache problems. I can obtain a speed-up of twice as much if I ditch the tag field. But this is unsafe?
So my next attempt stuff for 64 bit platforms stuffs bits in the ptr field.
struct aba {
uintptr __ptr;
};
uint32_t get_tag(struct aba aba) { return aba.__ptr >> 48U; }
But someone said to me that only 16 bits for the tag is unsafe. My new plan is to use pointer alignment to cache-lines to stuff more tag bits in but I want to know if that'll work.
If that fails to work my next plan is to use Linux's MAP_32BIT mmap flag to allocated data so I only need 32 bits of pointer space.
How many bits do I need for the ABA tag in lock-free data-structures?

The amount of tag bits that is practically safe can be estimated based on the preemption time and the frequency of pointer modifications.
To remind, the ABA problem happens when a thread reads the value it wants to change with compare-and-swap, gets preempted, and when it resumes the actual value of the pointer happens to be equal to what the thread read before. Therefore the compare-and-swap operation may succeed despite data structure modifications possibly done by other threads during the preemption time.
The idea of adding the monotonically incremented tag is to make each modification of the pointer unique. For it to succeed, increments must produce unique tag values during the time when a modifying thread might be preempted; i.e. for guaranteed correctness the tag may not wraparound during the whole preemption time.
Let's assume that preemption lasts a single OS scheduling time slice, which is typically tens to hundreds of milliseconds. The latency of CAS on modern systems is tens to hundreds of nanoseconds. So rough worst-case estimate is that there might be millions of pointer modifications while a thread is preempted, and so there should be 20+ bits in the tag in order for it to not wraparound.
In practice it can be possible to make a better estimate for a particular real use case, based on known frequency of CAS operations. One also need to estimate the worst-case preemption time more accurately; for example, a low-priority thread preempted by a higher-priority job might end up with much longer preemption time.

According to the paper
http://web.cecs.pdx.edu/~walpole/class/cs510/papers/11.pdf
Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 6, JUNE 2004 p. 491) by PhD Maged M. Michael
tag bits should be sized to make wraparound impossible in real lockfree scenarios (I can read this as if you may have N threads running and each may access the structure, you should have N+1 different states for tags at least):
6.1.1 IBM ABA-Prevention Tags
The earliest and simplest lock-free method for node reuse is
the tag (update counter) method introduced with the
documentation of CAS on the IBM System 370 [11]. It
requires associating a tag with each location that is the
target of ABA-prone comparison operations. By incrementing
the tag when the value of the associated location is
written, comparison operations (e.g., CAS) can determine if
the location was written since it was last accessed by the
same thread, thus preventing the ABA problem.
The method requires that the tag contains enough bits to make
full wraparound impossible during the execution of any
single lock-free attempt. This method is very efficient and
allows the immediate reuse of retired nodes.

Depending on your data structure you could be able to steal some extra bits from the pointers. For example if the objects are 64 bytes and always aligned on 64 byte boundaries, the lower 6 bits of each pointer could be used for the tags (but that's probably what you already suggested for your new plan).
Another option would be to use an index into your objects instead of pointers.
In case of contiguous objects that would of course simply be an index into an array or vector. In case of lists or trees with objects allocated on the heap, you could use a custom allocator and use an index into your allocated block(s).
For say 17M objects you would only need 24 bits, leaving 40 bits for the tags.
This would need some (small and fast) extra calculation to get the address, but if the alignment is a power of 2 only a shift and an addition are needed.

How to prevent two processess from fighting for a common cache?

I was asked this question on an exam. We have two CPUs, or two cores in the same CPU, that share a common cache (for example, L3). On each CPU there is an MPI process (or a thread of one common process). How can we assure that these two processes don't interfere, meaning that they don't push each others entries out or use a half of the cache each or something similar. The goal is to improve the speed of memory access here.
The OS is some sort of Unix, if that is important.

Based on your comments, it seems that a "textbook answer" is expected, so I would suggest partitioning the cache between the processes. This way you guarantee that they don't compete over the same cache sets and thrash each other. This is assuming you don't want to actually share anything between the 2 processes, in which case this approach would fail (although a possible fix would be to split the cache space in 3 - one range for each process, and one for shared data).
Since you're probably not expected to redesign the cache and provide HW partitioning scheme (unless the question comes in the scope of computer architecture course), the simplest way to achieve this is simply by inspecting the cache size and associativity, figuring our the number of sets, and aligning the data sets of each process/thread to a different part.
For example, if your shared cache is 2MB big, and has 16 ways and 64B lines, you would have 2k sets. In such case, each process would want to align its physical addresses (assuming the cache is physically mapped) to a different half 1k sets, or a different 0x10000 out of each 0x20000. In other words, P0 would be free to use any physical address with bit 16 equals 0 , and P1 would use the addresses with bit 16 equals 1.
Note, that since that exceeds the size of a basic 4k page (alignment of 0x1000), you would either need to hack your OS to assign your pages to the appropriate physical addresses for each process, or simply use larger pages (2M would be enough).
Also note that by keeping a contiguous 0x10000 per allocation, we still enjoy spatial locality and efficient hardware prefetching (otherwise you could simply pick any other split, even even/odd sets by using bit 6, but that would leave your data fractured.
Last issue is for data sets larger than this 0x10000 quota - to make then align you'd simply have to break them into chunks up to 0x10000, and align each separately. There's also the issue of code/stack/pagemap and other types of OS/system data which you have less control over (actually code can also be aligned, or more likely in this case - shared) - I'm assuming this has negligible impact on thrashing.
Again - this attempts to answer without knowing what system you work with, what you need to achieve, or even what is the context of the course. With more context we can probably focus this to a simpler solution.

How large is a way in the cache?
For example, if you have a cache where each way is 128KiB in size, you partition your memory in such a way that for each address modulo 128KiB, process A uses the 0-64KiB region, and process B uses the lower 64KiB-128KiB region. (This assumes private L1-per-core).
If your physical page size is 4KiB (and your CPU uses physical addresses for caching, not virtual - which does occur on some CPUs), you can make this much nicer. Let's say you're mapping the same amount of memory into virtual address space for each core - 16KiB. Pages 0, 2, 4, 6 go to process A's memory map, and pages 1, 3, 5, 7 go to process B's memory map. As long as you only address memory in that carefully laid out region, the caches should never fight. Of course, you've effectively halved the size of your cache-ways by doing so, but you have multiple ways...

You'll want to utilize a lock in regards to multi-thread programming. It's hard to provide an example due to not knowing your specific situation.
When one process has access, lock all other processes out until the 'accessing' process is finished with the resource.

Does the full 64K get used for every pipe created?

How are pipes implemented re buffering? I might be creating many pipes but only ever sending/receiving a few bytes through them at a time, so don't want to waste memory unnecessarily.
Edit: I understand what buffering is, I am asking how the buffering is implemented in Linux pipes specifically, ie does the full 64K get allocated regardless of highwatermark?

Buffers are used to equal out the difference in speed between producer and consumer. If you didn't have a buffer, you would have to switch tasks after every byte produced, which would be very inefficient due to the cost of context switches, data and code caches never becoming hot etc. If your consumer can produce data about as fast as the producer consumes it, your buffer use will usually be low (but read on). If the producer is much faster than the consumer, the buffer will fill up completely and the producer will be forced to wait until more space becomes available. The reversed case of slow producer and fast consumer will use a very small part of the buffer for most of the time.
The usage also depends on whether your both processes actually run in parallel (e.g. on separate cores) or if they share a core and only due to the OS's process management are fooled into thinking that they are concurrent. If you have real concurrency (separate core/CPU), your buffer will usually be used less.
Any way, if your applications are not producing much data and their speeds are similar, the buffer will not be very full most of the time. However, I wouldn't be surprised if at OS level, the full 64 kB were allocated any way. But unless you are using an embedded device, 64 kB is not much, so even if always the maximum size is alloctaed, I wouldn't worry about it.
By the way, it is not easy to modify the size of the pipe buffer, for example in this discussion a number of tricks are suggested but they are actually workarounds which modify the way data from the buffer is consumed, not modifying the actual buffer size. You could check ulimit -p but I'm not 100% sure it will give you the control you need.
EDIT: Looking at fs/pipe.c and include/linux/pipe_fs_i.h in Linux code, it looks like the buffers do change their size. Minimum size of the buffer is a full page, though, so if you only need a few bytes, there will be waste. I'm not sure at this point, but some code that uses PIPE_DEF_BUFFERS, which is 16, giving 64 kB with 4 kB pages, makes me wonder if the buffer can fall below 64 kB (the 1 page minimum could be just an additional restriction).

efficiency issue - searching an array on parallel threads

i came across an interview question which asks
while searching a value in an array using 2 perallel threads
which method would be more efficent
(1) read each half of the array on a different thread (spliting it in half)
(2) reading the array on odd and even places (a thread which reads the odd places
and one which reads the even places in the array ).
i don't understand why one would be more efficent then the other
appricate it if someone would clearify this for me
thanks in advance.

Splitting the array in half is almost certainly the way to go. It will almost never be slower, and may be substantially faster.
The reason is fairly simple: when you're reading data from memory, the processor will normally read an entire cache line at a time. The exact size varies between processors, but doesn't matter a whole lot (though, in case you care, something like 64 bytes would be in the ballpark) -- the point is that it reads a contiguous chunk of several bytes at a time.
That means with the odd/even version, both processors running both threads will have to read all the data. By splitting the data in half, each core will read only half the data. If your split doesn't happen to be at a cache line boundary, each will read a little extra (what it needs rounded up to the size of a cache line). On average that will add half a cache line to what each needs to read though.
If the "processors" involved are really two cores on the same processor die, chances are that it won't make a whole lot of difference either way though. In this case, the bottleneck will normally be reading the data from main memory into the lowest-level processor cache. Even with only one thread, you'll (probably) be able to search through the data as fast as you can read it from memory, and adding more threads (no matter how you arrange their use of the data) isn't going to improve things much (if at all).

The difference is that in the case of half split, the memory is accessed linearly by each thread from left to right, searching from index 0 -> N/2 and N/2 -> N respectively, which maximizes the cache usage, since prefetching of memory is done linearly ahead.
In the second case (even-odd) the cache performance would be worse, not only because you would be prefetching items that you are not using (thread 0 takes element 0, 1, etc. but only uses half of them), but also because of cache ping-pong effects (in case of writing, but this is not done in your example).

Block based storage

I would like to store a couple of entries to a file (optimized for reading) and a good data structure for that seems to be a B+ tree. It offers a O(log(n)/log(b)) access time where b is the number of entries in one block.
There are many papers etc. describing B+ trees, but I still have some troubles understaning block based storage systems in general. Maybe someone can point me to the right direction or answer a couple of questions:
Do (all common) file systems create new files at the beginning of a new block? So, can I be sure that seek(0) will set the read/write head to a multiply of the device's block size?
Is it right that I only should use calls like pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) (with n, p being integers) to ensure that I always read full blocks?
Is it better to read() BLOCK_SIZE bytes into an array or mmap() those instead? Or is there only a difference if I mmap many blocks and access only a few? What is better?
Should I try to avoid keys spawning multiple blocks by adding padding bytes at the end of each block? Should I do the same for the leaf nodes by adding padding bytes between the data too?
Many thanks,
Christoph

In general, file systems create new files at the beginning of a new block because that is how the underlying device works. Hard disks are block devices and thus cannot handle anything less than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are often 512 or 1024 bytes, pages usually 4096 bytes).
One exception to this rule that comes to mind would be ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentially a B+ tree!). For very small files this can actually a viable optimization since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation.
It does not really matter, because the operating system will read data in units of full pages (normally 4kB) into the page cache anyway. Reading one byte will transfer 4kB and return a byte, reading another byte will serve you from the page cache (if it's the same page or one that was within the readahead range).
read is implemented by copying data from the page cache whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch more pages that are not in RAM (though, that is generally true for any application or data that is not locked into memory). readon the other hand blocks when you tell it, not otherwise.
The same is true under Windows with the exception that memory mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything.
Generally one tries to keep data compact, because less data means fewer pages, and fewer pages means higher likelihood they're in the page cache and fit within the readahead range. Therefore I would not add padding, unless it is necessary for other reasons (alignment).

Filesystems which support delayed allocation don't create new files anywhere on disc. Lots of newer filesystems support packing very small files into their own pages or sharing them with metadata (For example, reiser puts very tiny files into the inode?). But for larger files, mostly, yes.
You can do this, but the OS page cache will always read an entire block in, and just copy the bits you requested into your app's memory.
It depends on whether you're using direct IO or non-direct IO.
If you're using direct IO, which bypasses the OS's cache, you don't use mmap. Most databases do not use mmap and use direct IO.
Direct IO means that the pages don't go through the OS's page cache, they don't get cached at all by the OS and don't push other blocks out of the OS cache. It also means that all reads and writes need to be done on block boundaries. Block boundaries can sometimes be determined by a statfs call on the filesystem.
Most databases seem to take the view that they should manage their own page cache themselves, and use the OS only for physical reads/writes. Therefore they typically use direct and synchronous IO.
Linus Torvalds famously disagrees with this approach. I think the vendors really do it to achieve better consistency of behaviour across different OSs.

Yes. Doing otherwise would cause unnecessary complications in FS design.
And the options (as an alternative to "only") are ...?
In Windows memory-mapped files work faster than file API (ReadFile). I guess on Linux it's the same, but you can conduct your own measurements

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string