Understanding how bootmem works - linux

I have been studying OS concepts and decided to look at how these things are actually implemented in Linux. But I am having trouble understanding some things about memory management during the boot process, before the page allocator is turned on; more precisely, how bootmem works. I do not need the exact workings of it, just an understanding of how some things are/can be solved.
So obviously, bootmem cannot use dynamic memory, meaning that the size it has must be known before runtime so that appropriate steps can be taken, i.e. the maximum size of its bitmap must be known in advance. From what I understand, this is most likely solved by simply mapping enough memory during kernel initialization; if the architecture changes, simply change the size of the mapped memory. Obviously, there is probably a lot more going on, but I guess I got the general idea? However, what really makes no sense to me is the NUMA architecture. Everywhere I read, it says that a pg_data_t is created for each memory node. This pg_data_t is put into a list (how can it know the size of the list? Or is the size fixed for a specific arch?) and for each node, a bitmap is allocated. So, basically, it sounds like it can create an undefined number of these pg_data_t structures, each of which has its own memory bitmap of arbitrary size. How? What am I missing?
EDIT: Sorry for not including reference. Here is bootmem code, it can also be found in mm/bootmem.c: http://lxr.free-electrons.com/source/mm/bootmem.c

It is architecture-dependent. On the x86 architecture, early on in the boot process the kernel does issue one BIOS call: the 0xe820 function of the trap at Interrupt Vector 0x15. This returns a memory map that the kernel can use to build its memory tables, including holes for non-memory (PCI or ISA) devices, etc. Bootloaders (before the kernel) will do the same.
See: Detecting Memory
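As a rough sketch of what the kernel does with such a map, the C below walks a table of ranges and adds up the usable RAM. The struct and function names (mem_range, usable_bytes) are made up for illustration; only the idea of a (base, length, type) entry with type 1 meaning usable RAM comes from the E820 convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative entry layout; the names are assumptions, not kernel types. */
enum { MEM_USABLE = 1, MEM_RESERVED = 2 };

struct mem_range {
    uint64_t base;   /* physical start address */
    uint64_t length; /* size in bytes */
    uint32_t type;   /* MEM_USABLE, MEM_RESERVED, ... */
};

/* Sum the bytes that early boot code may treat as ordinary RAM. */
uint64_t usable_bytes(const struct mem_range *map, size_t count)
{
    assert(map != NULL || count == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        if (map[i].type == MEM_USABLE)
            total += map[i].length;
    }
    return total;
}
```

Everything not marked usable (holes, device regions) is simply skipped, which is how the holes end up excluded from the boot-time bitmaps.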

After looking into this more, I think it works this way: basically, all the necessary things are statically allocated, i.e. preprocessor DEFINES ensure that certain sections of the bootmem code (as well as other parts of the kernel) either exist or do not exist in the compiled code for a specific architecture (even though the code itself is architecture-independent). These DEFINES are specified in the architecture-dependent source code found under arch/ (e.g. arch/i386, arch/arm/, etc.). For NUMA architectures there is a define called MAX_NUMNODES, ensuring that the list of structs representing nodes (more specifically, the list of pg_data_t structures) is allocated as a static array (which is then treated as a list). The bitmaps representing the memory map are obviously relatively small, since each page is represented by only one bit, so they take up KBs, or maybe MBs. Whatever the case, the architecture-dependent head.S sets up all the structures needed for the system to function (like page tables) and ensures enough physical memory is mapped to virtual addresses so that these bitmaps fit without causing a page fault (in the case of the x86 arch, the initial 8MB of RAM is mapped, which is more than enough for both the kernel and additional structures, like bitmaps).
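To make the "static array plus small bitmap" idea concrete, here is a minimal sketch in C. DEMO_MAX_NODES, struct demo_node and the helper functions are invented stand-ins for MAX_NUMNODES and pg_data_t; this is not the kernel's code, just the shape of the approach.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_NODES 8   /* stand-in for the kernel's MAX_NUMNODES */

struct demo_node {
    unsigned long start_pfn;  /* first page frame of the node */
    unsigned long nr_pages;   /* number of pages the node covers */
    unsigned char *bitmap;    /* one bit per page, placed in early-mapped memory */
};

/* Fixed-size array: no dynamic allocation is needed for the descriptors. */
static struct demo_node demo_nodes[DEMO_MAX_NODES];

/* Bytes needed for a bitmap with one bit per page. */
size_t bitmap_bytes(unsigned long nr_pages)
{
    return (nr_pages + 7) / 8;
}

/* Register a node in the next free static slot; returns its index or -1. */
int demo_add_node(unsigned long start_pfn, unsigned long nr_pages)
{
    static int used;
    assert(nr_pages > 0);
    if (used == DEMO_MAX_NODES)
        return -1;
    demo_nodes[used].start_pfn = start_pfn;
    demo_nodes[used].nr_pages = nr_pages;
    demo_nodes[used].bitmap = NULL; /* would point into already-mapped memory */
    return used++;
}
```

For scale: 4 GiB of RAM in 4 KiB pages is 1,048,576 pages, so bitmap_bytes gives 131,072 bytes (128 KiB), which fits comfortably in the initially mapped region.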

Related

FreeBSD ftruncate()+mmap() big hole warning

FreeBSD 13.1 manual page mmap() has the following warning:
WARNING! Extending a file with ftruncate(2), thus creating a big hole, and then filling the hole by modifying a shared mmap() can lead to severe file fragmentation. In order to avoid such fragmentation you should always pre-allocate the file's backing store by write()ing zero's into the newly extended area prior to modifying the area via your mmap(). The fragmentation problem is especially sensitive to MAP_NOSYNC pages, because pages may be flushed to disk in a totally random order.
Questions:
What is this big hole; why does ftruncate() have the effect of creating it; and why is write() a proposed solution to the problem?
How does one efficiently zero-fill the hole after ftruncate() and before mmap()? Repeatedly calling write() sounds like more system calls than may be necessary.
Does the same problem exist on other operating systems? I find no such warning on macOS or Linux.
You can avoid this problem by using posix_fallocate to preallocate desired areas.
The hole is created because files can be sparse, taking up only the space required for the actually used areas, so when using just ftruncate, the backing for the new area is virtual, it isn't reserved on disk until you allocate it or write to it.
It applies to Linux as well; it's just not mentioned. You can expect most filesystem implementations to behave similarly; they often try to be smart and will lay your writes out contiguously if they arrive close together in time, but they can't do magic.
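A minimal sketch of the posix_fallocate approach, assuming an already-open file descriptor; the function name extend_and_map is made up for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve backing store for [0, len), then modify it through a shared map. */
int extend_and_map(int fd, off_t len)
{
    assert(len > 0);
    /* posix_fallocate returns an error number directly; 0 means success. */
    if (posix_fallocate(fd, 0, len) != 0)
        return -1;

    char *p = mmap(NULL, (size_t)len, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    p[len - 1] = 'x';  /* the blocks behind this page are already allocated */
    return munmap(p, (size_t)len);
}
```

This is one call instead of a loop of write()s, and it extends the file size as well, so a separate ftruncate() is not needed for growing the file.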

Why can't we have a safe ISA?

According to this paper: https://doi.org/10.1109/SP.2013.13, memory corruption bugs are one of the oldest problems in computer security. The lack of memory safety and type safety has caused countless bugs, costing billions of dollars and huge efforts to fix them.
But the root of C/C++'s memory vulnerability can be traced down to the ISA level. At the ISA level, every instruction can access any memory address without any fine-grained safety check (only coarse-grained checks like page faults). Sure, we can implement memory safety at a higher software level, like Java (JVM), but this comes at a significant performance cost. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is, why can't we implement the safety at the hardware level? If the CPU has a safe ISA, which ensures memory safety by, I don't know, taking over the responsibilities of malloc and free, then maybe we can get rid of the performance decline of software safety checking. If anyone professional in microelectronics can tell me, is this idea realistic?
Depending on what you mean, it could make it impossible to implement memory-unsafe languages like C in a normal way. e.g. every memory access would have to be to some object that has a known size? I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing in the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project which is an attempt at this idea, involving "... revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory-safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking a TLB as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic. Especially when the ranges don't have power-of-2 sizes and natural alignment, the way pages do, which makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits. (e.g. a page is 4k in size, always starting at a 4k alignment boundary.)
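To illustrate the difference in software terms, here is a small C sketch; the function names are made up. A page match is an equality test on the high bits (which is what a CAM does in parallel across entries), while an arbitrary object needs a subtract-and-compare against per-object bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4k pages, as in the example above */

/* Page match: compare only the high bits; the low 12 bits are ignored. */
int same_page(uintptr_t addr, uintptr_t page_base)
{
    return (addr >> PAGE_SHIFT) == (page_base >> PAGE_SHIFT);
}

/* Object match: arbitrary base and size need arithmetic, not a tag compare.
 * Unsigned wraparound makes addr < base produce a huge value that fails. */
int in_object(uintptr_t addr, uintptr_t base, size_t size)
{
    assert(size > 0);
    return (addr - base) < size;
}
```

The second check is cheap for one known object, but hardware would also have to find the right (base, size) pair for every access, which is the hard part.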
You mean it may cost too much at hardware level, like the die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or cost latency in very important critical paths such as L1d hit load-use latency. Even if you could come up with some plausible way for software to make tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build in to a CPU where memory addressing works anything like existing CPUs, or like it does in C. Software needs to manage memory, it doesn't make sense to try to build that in to a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes for load or store that takes a max?
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
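In software, the "extra operand that takes a max" idea amounts to something like the sketch below. load_checked is a hypothetical helper, not an existing instruction or API; hardware support would fold the compare into the load itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Load base[index] only if index < max; report failure instead of faulting. */
bool load_checked(const int *base, size_t index, size_t max, int *out)
{
    assert(base != NULL && out != NULL);
    if (index >= max)
        return false;  /* where a hypothetical checked load would trap */
    *out = base[index];
    return true;
}
```

Note that the caller still has to supply the right max, which is exactly the part hardware can't know on its own.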
Intel's MPX ISA extension to x86 was an attempt to let software set up bound ranges, but it's been mostly abandoned due to lower performance than pure software. Intel even dropped it from their recent CPUs (Not present in 10th Gen CPUs using 10nm lithography, or later.)
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.
Have a look at "Code for malloc and free" on SO. Those commands are very, very far away from even being defined within an instruction set.

Split stacks unnecessary on amd64

There seems to be an opinion out there that using a "split stack" runtime model is unnecessary on 64-bit architectures. I say seems to be, because I haven't seen anyone actually say that, only dance around it:
The memory usage of a typical multi-threaded program can decrease
significantly, as each thread does not require a worst-case stack
size. It becomes possible to run millions of threads (either full NPTL
threads or co-routines) in a 32-bit address space.
-- Ian Lance Taylor
...implying that a 64-bit address space can already handle it.
And...
... the constant overhead of split stacks and the narrow use case
(spawning enormous numbers of I/O-bound tasks on 32-bit architectures)
isn't acceptable...
-- bstrie
Two questions: First, is this what they are saying? Second, if so, why are they unnecessary on 64-bit architectures?
Yes, that's what they are saying.
Split stacks are (currently) unnecessary on 64bit architectures because the 64bit virtual address space is so large it can contain millions of stack address ranges, each as large as an entire 32bit address space, if needed.
In the flat memory model in use nowadays, the translation from virtual addresses to physical memory locations is done with the support of the hardware MMU. On amd64 it turns out it's better (meaning, overall faster) to reserve a big chunk of the 64bit virtual address space for each new stack you create, while only mapping the first page (4kB) to actual RAM. This way, the stack will be able to grow and shrink as needed, over contiguous virtual addresses (meaning less code in each function prologue, a big optimization) while the OS re-configures the MMU to map each page of virtual addresses to an actual free page of RAM, whenever the stack grows or shrinks above/below some configurable thresholds.
By choosing the thresholds smartly (see for example the theory of dynamic arrays) you can achieve O(1) complexity on the average stack operation, while retaining the benefits of millions of stacks that can grow as much as you need and only consume the memory they use.
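A minimal sketch of the "reserve a lot, back only what is touched" idea on Linux, using an anonymous mapping. The function name is made up, and MAP_NORESERVE plus lazy page allocation are Linux behaviours assumed here; a real runtime would also add guard pages.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve a large range of virtual addresses and touch only the top page. */
int reserve_and_touch(size_t reserve_bytes)
{
    assert(reserve_bytes >= 4096);
    char *base = mmap(NULL, reserve_bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Stacks grow down on amd64, so the first page used is the highest one.
     * Only this page gets a physical frame; the rest stays unbacked. */
    base[reserve_bytes - 1] = 1;

    return munmap(base, reserve_bytes);
}
```

Calling reserve_and_touch((size_t)256 << 20) reserves 256 MiB of address space per "stack" while consuming roughly one page of RAM, which is why millions of such reservations are only plausible in a 64-bit address space.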
PS: the current Go implementation is far behind any of this :-)
The Go core team is currently discussing the possibility of using contiguous stacks in a future Go version.
The split stack approach is useful because stacks can grow more flexibly but it also requires that the runtime allocates a relatively big chunk of memory to distribute these stacks across. There has been a lot of confusion about Go's memory usage, in part because of this.
Making contiguous but growable (relocatable) stacks is an option that would provide the same flexibility and maybe reduce the confusion about Go's memory usage, as well as remedying some ill corner cases on low-memory machines (see linked thread).
As to advantages/disadvantages on 32-bit vs. 64-bit architectures, I don't think there are any directly associated solely with the use of segmented stacks.
Update Go 1.4 (Q4 2014)
Change to the runtime:
Up to Go 1.4, the runtime (garbage collector, concurrency support, interface management, maps, slices, strings, ...) was mostly written in C, with some assembler support.
In 1.4, much of the code has been translated to Go so that the garbage collector can scan the stacks of programs in the runtime and get accurate information about what variables are active.
This rewrite allows the garbage collector in 1.4 to be fully precise, meaning that it is aware of the location of all active pointers in the program. This means the heap will be smaller as there will be no false positives keeping non-pointers alive. Other related changes also reduce the heap size, which is smaller by 10%-30% overall relative to the previous release.
A consequence is that stacks are no longer segmented, eliminating the "hot split" problem. When a stack limit is reached, a new, larger stack is allocated, all active frames for the goroutine are copied there, and any pointers into the stack are updated.
Initial answer (March 2014)
The article "Contiguous stacks in Go" by Agis Anastasopoulos also addresses this issue:
In such cases where the stack boundary happens to fall in a tight loop, the overhead of creating and destroying segments repeatedly becomes significant.
This is called the “hot split” problem inside the Go community.
The “hot split” will be addressed in Go 1.3 by implementing contiguous stacks.
Now when a stack needs to grow, instead of allocating a new segment the following happens:
Create a new, somewhat larger stack
Copy the contents of the old stack to the new stack
Re-adjust every copied pointer to point to the new addresses
Destroy the old stack
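The four steps above can be sketched in C on a toy "stack" of machine words. This is not the Go runtime's code: toy_stack, the is_ptr map and toy_stack_grow are invented for illustration, and the sketch assumes the runtime knows exactly which slots hold pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uintptr_t *words;       /* stack contents, one machine word per slot */
    unsigned char *is_ptr;  /* 1 if the slot is known to hold a pointer */
    size_t len;             /* number of slots */
} toy_stack;

int toy_stack_grow(toy_stack *s, size_t new_len)
{
    assert(new_len >= s->len && new_len > 0);
    /* 1. Create a new, somewhat larger stack. */
    uintptr_t *nw = calloc(new_len, sizeof *nw);
    unsigned char *np = calloc(new_len, 1);
    if (!nw || !np) { free(nw); free(np); return -1; }

    /* 2. Copy the contents of the old stack to the new stack. */
    memcpy(nw, s->words, s->len * sizeof *nw);
    memcpy(np, s->is_ptr, s->len);

    /* 3. Re-adjust every copied pointer that pointed into the old stack. */
    uintptr_t old_lo = (uintptr_t)s->words;
    uintptr_t old_hi = old_lo + s->len * sizeof *nw;
    for (size_t i = 0; i < s->len; i++) {
        if (np[i] && nw[i] >= old_lo && nw[i] < old_hi)
            nw[i] = (uintptr_t)nw + (nw[i] - old_lo);
    }

    /* 4. Destroy the old stack. */
    free(s->words);
    free(s->is_ptr);
    s->words = nw;
    s->is_ptr = np;
    s->len = new_len;
    return 0;
}
```

A slot that merely looks like a stack address but is not flagged in is_ptr is left alone, which is why step 3 needs precise pointer information rather than a conservative guess.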
The following passage mentions one problem seen mainly on 32-bit architectures:
There is a certain challenge though.
The 1.2 runtime doesn’t know if a pointer-sized word in the stack is an actual pointer or not. There may be floats and most rarely integers that if interpreted as pointers, would actually point to data.
Due to the lack of such knowledge the garbage collector has to conservatively consider all the locations in the stack frames to be roots. This leaves the possibility for memory leaks especially on 32-bit architectures since their address pool is much smaller.
When copying stacks however, such cases have to be avoided and only real pointers should be taken into account when re-adjusting.
Work was done though and information about live stack pointers is now embedded in the binaries and is available to the runtime.
This means not only that the collector in 1.3 can precisely collect stack data but re-adjusting stack pointers is now possible.

Hazards of not protecting shared variables in a threaded environment

I'm trying to understand the hazards of not locking shared variables in a threaded (or shared memory) environment. It is easy to argue that if you are doing two or more dependent operations on a variable it is important to hold some lock first. The typical example is the increment operation, which first reads the current value before adding one and writing back.
But what if you only have one writer (and lots of readers) and the write is not dependent on the previous value? So I have one thread storing a timestamp offset once every second. The offset holds the difference between local time and some other time base. A lot of readers use this offset to timestamp events, and getting a read lock each time is a little expensive. In this situation I don't care if the reader gets the value just before the write or just after, as long as the reader doesn't get garbage (that is, an offset that was never set).
Say that the variable is a 32-bit integer. Is it possible to get a garbage read of the variable in the middle of a write? Or is writing a 32-bit integer an atomic operation? Will it depend on the OS or hardware? What about a 64-bit integer on a 32-bit system?
What about shared memory instead of threading?
Writing a 64-bit integer on a 32-bit system is not atomic, and you could have incorrect data if you don't take a lock.
As an example, if your integer is
0x00000000 0xFFFFFFFF
and you are going to write the next int in sequence, you want to write:
0x00000001 0x00000000
But if you read the value after one of the ints is written and before the other is, then you could read
0x00000000 0x00000000
or
0x00000001 0xFFFFFFFF
which are wildly different than the correct value.
If you want to work without locks, you have to be very certain what constitutes an atomic operation on your OS/CPU/compiler combination.
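The torn values above can be reproduced deterministically with a small C model that treats the 64-bit write as two 32-bit stores. torn_read is a made-up helper that returns what a reader would see between the two stores; it models the hazard and does not itself involve threads.

```c
#include <assert.h>
#include <stdint.h>

/* Value observed after only one 32-bit half of new_value has been stored.
 * low_half_first selects which half the writer stores first. */
uint64_t torn_read(uint64_t old_value, uint64_t new_value, int low_half_first)
{
    assert(low_half_first == 0 || low_half_first == 1);
    const uint64_t lo_mask = 0xFFFFFFFFu;

    if (low_half_first)  /* new low half, old high half */
        return (old_value & ~lo_mask) | (new_value & lo_mask);
    /* new high half, old low half */
    return (new_value & ~lo_mask) | (old_value & lo_mask);
}
```

With old = 0x00000000FFFFFFFF and new = 0x0000000100000000, storing the low half first exposes 0x0000000000000000 and storing the high half first exposes 0x00000001FFFFFFFF, matching the two bad reads listed above.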
In addition to the above comments, beware the register bank in a slightly more general setting. You may end up updating only the CPU register and not really writing it back to main memory right away. Or the other way around, where you use a cached register copy while the original value in memory has been updated. Some languages have a volatile keyword to mark a variable as "read-always-and-never-locally-register-cache".
The memory model of your language is important. It describes exactly under what conditions a given value is shared among several threads. Either this is the rules of the CPU architecture you are executing on, or it is determined by a virtual machine in which the language is running. Java for instance has a separate memory model you can look at to figure out what exactly to expect.
An 8-bit, 16-bit or 32-bit read/write is guaranteed to be atomic if it is aligned to its size (on 486 and later) and unaligned but within a cache line (on P6 and later). Most compilers will guarantee stack (local, assuming C/C++) variables are aligned.
A 64-bit read/write is guaranteed to be atomic if it is aligned (on Pentium and later), however, this relies on the compiler generating a single instruction (for example, popping a 64-bit float from the FPU or using MMX). I expect most compilers will use two 32-bit accesses for compatibility, though it is certainly possible to check (the disassembly) and it may be possible to coerce different handling.
The next issue is caching and memory fencing. However, the effect of ignoring these is that some threads may see the old value even though it has been updated. The value won't be invalid, simply out of date (by microseconds, probably). If this is critical to your application, you will have to dig deeper, but I doubt it is.
(Source: Intel Software Developer Manual Volume 3A)
It very much depends on hardware and how you are talking to it. If you are writing assembler, you will know exactly what you get as processor manuals will tell you which operations are atomic and under what conditions. For example, in the Intel Pentium, 32-bit reads are atomic if the address is aligned, but not otherwise.
If you are working on any level above that, it will depend on how that ultimately gets translated into machine code. Be that a compiler, interpreter, or virtual machine.
The platform you run on determines the size of atomic reads/writes. Generally, a 32-bit (register) platform only supports 32-bit atomic operations. So, if you are writing more than 32-bits, you will probably have to use some other mechanism to coordinate access to that shared data.
One mechanism is to double or triple buffer the actual data and use a shared index to determine the "latest" version:
write(blah)
{
    new_index= ...; // find a free entry in the global_data array.
    global_data[new_index]= blah;
    WriteBarrier(); // write-release
    global_index= new_index;
}
read()
{
    read_index= global_index;
    ReadBarrier(); // read-acquire
    return global_data[read_index];
}
You need the memory barriers to ensure that you don't read from global_data[...] until after you read global_index and you don't write to global_index until after you write to global_data[...].
This is a little awful since you can also run into the ABA issue with preemption, so don't use this directly.
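For reference, here is roughly how the same publish-by-index pattern looks with C11 atomics, assuming a single writer. SLOTS, publish and read_latest are names chosen for this sketch, and the "find a free entry" step is replaced by simple round-robin, which is an assumption and not part of the pseudocode above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SLOTS 4  /* assumption: round-robin over a few slots */

static int64_t global_data[SLOTS];
static atomic_uint global_index;  /* index of the latest published slot */

/* Single writer only: fill the next slot, then publish its index. */
void publish(int64_t value)
{
    unsigned cur = atomic_load_explicit(&global_index, memory_order_relaxed);
    unsigned next = (cur + 1) % SLOTS;
    assert(next < SLOTS);
    global_data[next] = value;
    /* Release: the data store above becomes visible before the new index. */
    atomic_store_explicit(&global_index, next, memory_order_release);
}

/* Any number of readers: read the index, then the slot it names. */
int64_t read_latest(void)
{
    /* Acquire: pairs with the release store in publish(). */
    unsigned idx = atomic_load_explicit(&global_index, memory_order_acquire);
    return global_data[idx];
}
```

The caveat from above still applies: a reader that is preempted after loading the index can race with the writer once the writer has wrapped around to that slot again, so treat this as an illustration of the barriers, not as a finished lock-free structure. For a single 32-bit or 64-bit value, a plain atomic load/store of the value itself is simpler.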
Platforms often provide atomic read/write access (enforced at the hardware level) to primitive values (32-bit or 64-bit, as in your example); see the Interlocked* APIs on Windows.
This can avoid the use of a heavier weight lock for threadsafe variable or member access, but should not be mixed up with other types of lock on the same instance or member. In other words, don't use a Mutex to mediate access in one place and use Interlocked* to modify or read it in another.

Is fork() copy-on-write a stable exposed behavior that can be used to implement read-only shared memory?

The man page on fork() states that it does not copy data pages, it maps them into the child process and puts a copy-on-write flag. Is that behavior:
consistent between flavors of Linux?
considered an implementation detail and therefore likely to change?
I'm wondering if I can use fork() as a means to get a shared read-only memory block on the cheap. If the memory is physically copied, it would be rather expensive - there's a lot of forking going on, and the data area is big enough - but I'm hoping not...
Linux running on machines without a MMU (memory management unit) will copy all process memory on fork().
However, those systems are usually very small and embedded and you probably don't have to worry about them.
Many services such as Apache's fork model, use the initialize and fork() method to share initialized data structures.
You should be aware that if you are using languages like Perl and Python that use reference-counted variables, or C++ shared_ptr's, this model will not work. It will not work because as the reference counts are adjusted up and down, the memory becomes unshared and gets copied.
This causes huge amounts of memory usage in Perl daemons like SpamAssassin that attempt to use an initialize and fork model.
Yes you can certainly rely on it on MMU-Linux kernels; this is almost everything.
However, the page size isn't the same everywhere.
It is possible to explicitly make a shared memory area for forked process, by using mmap() to create an anonymous map - one which is not backed by a physical file. On fork, this area will always remain shared (provided the child doesn't unmap it, or map something else in at the same address). You can mprotect it to be readonly if you want.
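A minimal sketch of that approach on Linux: the parent fills an anonymous shared mapping, makes it read-only, and forks a child that reads it. The function name and the fixed 4096-byte size are choices made for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define AREA_SIZE 4096

/* Returns 0 if the child saw the parent's data through the shared area. */
int share_readonly_with_child(const char *text)
{
    size_t len = strlen(text) + 1;
    assert(len <= AREA_SIZE);

    char *area = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED)
        return -1;
    memcpy(area, text, len);

    /* From here on, a stray write in parent or child faults instead of copying. */
    if (mprotect(area, AREA_SIZE, PROT_READ) != 0) {
        munmap(area, AREA_SIZE);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(area, AREA_SIZE);
        return -1;
    }
    if (pid == 0)  /* child: the mapping is inherited, not copied */
        _exit(strcmp(area, text) == 0 ? 0 : 1);

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(area, AREA_SIZE);
    if (waited != pid)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Because the area comes from its own mmap() call, nothing else (including malloc's bookkeeping) shares its pages.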
Memory allocated with (for example) malloc can easily end up sharing a page with something that isn't readonly, which means it gets copied anyway when another structure is modified. This includes internal structures used by the malloc implementation. So you might want to mmap a specific area for this purpose and allocate from that.
Can you rely on the fact that all Linux flavors do it this way? No. But you can rely on the fact that those who don't use an even faster method.
Therefore you should use the feature and rely on it and revisit your decision if you get a performance problem.
The success of this approach depends on how well you stick to your self-imposed "read-only" limitation. Both parent and child have to obey this stricture, else the memory gets copied.
This may not be the catastrophe you're envisioning, however. The kernel can copy as little as a single page (typically 4 KB) to implement CoW semantics. A typical Linux server will use something more complex, some sort of slab allocator, so the copied region could be much larger.
The main point is that this is decoupled from your program's conception of its memory use. If you malloc() 1 GB of RAM, fork off a child, and the child changes just the first byte of that memory block, the entire 1 GB block isn't copied. Perhaps as little as one page is copied, up to the slab size containing that first byte.
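The visible half of this behaviour is easy to check: a write in the child never shows up in the parent. The sketch below (function name and 16 MiB size chosen for the example) demonstrates only that isolation; how many pages the kernel actually copies is not observable from this code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if the child's write stayed private to the child. */
int child_write_is_private(void)
{
    size_t size = (size_t)16 << 20;  /* 16 MiB block */
    char *block = malloc(size);
    if (block == NULL)
        return -1;
    memset(block, 'A', size);

    pid_t pid = fork();
    if (pid < 0) {
        free(block);
        return -1;
    }
    if (pid == 0) {
        block[0] = 'B';  /* triggers a copy of the affected page only */
        _exit(block[0] == 'B' ? 0 : 1);
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    int ok = waited == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0
             && block[0] == 'A';  /* parent still sees its own data */
    assert(block[size - 1] == 'A');
    free(block);
    return ok ? 0 : -1;
}
```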
Yes
All the Linux distros use the same kernel, albeit with slightly different versions and releases of it.
It's unlikely that another underlying fork(2) implementation will be faster any time soon, so it's a safe bet that copy-on-write will continue to be the mechanism. Perhaps it won't be forever, but for years, definitely.
Certainly some major software systems (for example, Phusion Passenger) use fork(2) in the same way that you want to, so you would not be the only one taking advantage of CoW.
