What are the alignment requirements for sys_brk - linux

I'm using the sys_brk syscall to dynamically allocate memory in the heap. I noticed that when acquiring the current break location, I usually get a value like this:
mov rax, 0x0C
mov rdi, 0x00
syscall
results in
rax 0x401000
The value is usually 512-byte aligned. So I would like to ask: are there any alignment requirements on the break value, or can we misalign it any way we want?

The kernel does track the break with byte granularity. But don't use it directly for small allocations if you care at all about performance.
There was some discussion in comments about the kernel rounding the break to a page boundary, but that's not the case. The implementation of sys_brk uses this (with my comments added so it makes sense out of context)
newbrk = PAGE_ALIGN(brk); // the syscall arg
oldbrk = PAGE_ALIGN(mm->brk); // the current break
if (oldbrk == newbrk)
    goto set_brk; // no need to map / unmap any pages, just update mm->brk
This checks if the break moved to a different page, but eventually mm->brk = brk; sets the current break to the exact arg passed to the system call (if it's valid). If the current break was always page aligned, the kernel wouldn't need PAGE_ALIGN() on it.
Of course, memory protection has at least page granularity (and maybe hugepage, if the kernel chooses to use anonymous hugepages for this mapping). So you can access memory out to the end of the page containing the break without faulting. This is why the kernel code is just checking if the break moved to a different page to skip the map / unmap logic, but still updates the actual brk.
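To see both granularities yourself, here's a minimal sketch using the glibc brk()/sbrk() wrappers (the 0x123 offset is arbitrary, and 4 KiB pages are assumed):
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char *cur = sbrk(0);                  // read the current break
    char *want = cur + 0x123;             // a deliberately odd address
    if (brk(want) != 0)
        return 1;
    printf("old break: %p\n", (void *)cur);
    printf("new break: %p\n", sbrk(0));   // reads back the exact, odd value
    if ((uintptr_t)want % 4096)           // unless the break landed exactly on
        want[0] = 42;                     // a page boundary, the byte at (above)
    return 0;                             // the break is still mapped: no fault
}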
AFAIK, nothing will ever use that mapped memory above the break as scratch space, so it's not like memory below the stack pointer that can be clobbered asynchronously.
brk is just a simple memory-management system built into the kernel. System calls are expensive, so if you care about performance you should keep track of things in user-space and only make a system call at all when you need a new page. Using sys_brk directly for tiny allocations is terrible for performance, especially in kernels with Meltdown + Spectre mitigations enabled (making system calls much more expensive, like tens of thousands of clock cycles + TLB and branch-prediction invalidation, instead of hundreds of clock cycles).
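To sketch what that user-space bookkeeping might look like, here is a toy bump allocator (bump_alloc is a made-up name, not a real API; it assumes 4 KiB pages, hands out 16-byte-aligned chunks, never frees, and assumes nothing else such as malloc moves the break):
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <unistd.h>

static char *next;   // next free byte, tracked entirely in user-space
static char *limit;  // end of the memory we've asked the kernel for

static void *bump_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;            // keep 16-byte alignment for callers
    if (next == NULL)
        next = limit = sbrk(0);            // lazy init: find the current break
    if (limit - next < (ptrdiff_t)n) {     // slow path: grow by whole pages
        size_t grow = (n + 4095) & ~(size_t)4095;
        if (sbrk(grow) == (void *)-1)      // the only place we enter the kernel
            return NULL;
        limit += grow;
    }
    void *p = next;                        // fast path: pure pointer arithmetic
    next += n;
    return p;
}
The fast path never enters the kernel; sys_brk is only reached when an allocation crosses into a page we don't have yet.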

Do the instruction cache and data cache sync with each other? [duplicate]

I like examples, so I wrote a bit of self-modifying code in C...
#include <stdio.h>
#include <sys/mman.h> // linux
int main(void) {
    unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC,
                            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // get executable memory
    c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
    c[1] = 0b11000000; // to register rax (000) which holds the return value
                       // according to linux x86_64 calling convention
    c[6] = 0b11000011; // return
    for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run
        // rest of immediate data (c[3:6]) are already set to 0 by MAP_ANONYMOUS
        printf("%d ", ((int (*)(void)) c)()); // cast c to func ptr, call ptr
    }
    putchar('\n');
    return 0;
}
...which works, apparently:
>>> gcc -Wall -Wextra -std=c11 -D_GNU_SOURCE -o test test.c; ./test
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
But honestly, I didn't expect it to work at all. I expected the instruction containing c[2] = 0 to be cached upon the first call to c, after which all consecutive calls to c would ignore the repeated changes made to c (unless I somehow explicitly invalidated the cache). Luckily, my CPU appears to be smarter than that.
I guess the cpu compares RAM (assuming c even resides in RAM) with the instruction cache whenever the instruction pointer makes a large-ish jump (as with the call to the mmapped memory above), and invalidates the cache when it doesn't match (all of it?), but I'm hoping to get more precise information on that. In particular, I'd like to know if this behavior can be considered predictable (barring any differences of hardware and os), and relied on?
(I probably should refer to the Intel manual, but that thing is thousands of pages long and I tend to get lost in it...)
What you are doing is usually referred to as self-modifying code. Intel's platforms (and probably AMD's too) maintain i/d cache coherency for you, as the manual points out (Manual 3A, System Programming):
11.6 SELF-MODIFYING CODE
A write to a memory location in a code segment that is currently cached in the
processor causes the associated cache line (or lines) to be invalidated.
But this assertion is valid only as long as the same linear address is used for modifying and fetching, which is not the case for debuggers and binary loaders, since they don't run in the same address space:
Applications that include self-modifying code use the same
linear address for modifying and fetching the instruction. Systems software, such as
a debugger, that might possibly modify an instruction using a different linear address
than that used to fetch the instruction, will execute a serializing operation, such as a
CPUID instruction, before the modified instruction is executed, which will automatically
resynchronize the instruction cache and prefetch queue.
For instance, serializing operations are always required by many other architectures, such as PowerPC, where cache synchronization must be done explicitly (E500 Core Manual):
3.3.1.2.1 Self-Modifying Code
When a processor modifies any memory location that can contain an instruction, software must
ensure that the instruction cache is made consistent with data memory and that the modifications
are made visible to the instruction fetching mechanism. This must be done even if the cache is
disabled or if the page is marked caching-inhibited.
It is interesting to note that PowerPC requires a context-synchronizing instruction even when caches are disabled; I suspect it forces a flush of deeper data-processing units such as the load/store buffers.
The code you proposed is unreliable on architectures without snooping or advanced cache-coherency facilities, and therefore likely to fail.
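If you need the question's trick to work on such architectures too, GCC and Clang expose the required maintenance as a builtin; here is a minimal sketch (the helper name is made up; on x86 the builtin expands to nothing, on ARM or PowerPC it emits the dcache-flush/icache-invalidate/sync sequence):
#include <stddef.h>

// Make freshly written machine code visible to instruction fetch before
// jumping to it. On x86 this compiles to nothing (HW keeps i/d coherent);
// on ARM/PowerPC it emits the required cache-maintenance + sync instructions.
static void finish_code_write(void *buf, size_t len) {
    __builtin___clear_cache((char *)buf, (char *)buf + len);
}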
Hope this helps.
It's pretty simple: a write to an address that's in one of the cache lines in the instruction cache invalidates it from the instruction cache. No "synchronization" is involved.
The CPU handles cache invalidation automatically; you don't have to do anything manually. Software can't reasonably predict what will or will not be in the CPU cache at any point in time, so it's up to the hardware to take care of this. When the CPU sees that you've modified data, it updates its various caches accordingly.
By the way, many x86 processors (that I have worked on) snoop not only the instruction cache but also the pipeline and the instruction window - the instructions that are currently in flight. So self-modifying code will take effect as soon as the very next instruction. But you are encouraged to use a serializing instruction, like CPUID, to ensure that your newly written code will be executed.
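In GNU C, that advice could look like this (serialize_cpu is a made-up helper name):
static inline void serialize_cpu(void) {
    unsigned int eax = 0;
    // CPUID is architecturally serializing: everything before it completes
    // (including draining buffered stores) before anything after is fetched.
    __asm__ volatile("cpuid"
                     : "+a"(eax)
                     :
                     : "ebx", "ecx", "edx", "memory");
}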
I just reached this page in one of my searches and want to share my knowledge of this area of the Linux kernel!
Your code executes as expected and there are no surprises for me here. The mmap() syscall and the processor's cache-coherency protocol do this trick for you. The flags PROT_READ|PROT_WRITE|PROT_EXEC ask mmap() to set up the page mapping (from which the iTLB and dTLB of the L1 cache and the TLB of the L2 cache get filled) for this physical page correctly. This low-level, architecture-specific kernel code does it differently depending on the processor architecture (x86, AMD, ARM, SPARC, etc.). Any kernel bug here will mess up your program!
This is just for explanation purposes.
Assume that your system is not doing much and there are no process switches between "a[0]=0b01000000;" and the start of printf("\n")...
Also, assume that you have 1K of L1 iCache and 1K of dCache in your processor, plus some L2 cache in the core. (Nowadays these are on the order of a few MBs.)
mmap() sets up your virtual address space, from which the iTLB1, dTLB1 and TLB2 entries get filled.
"a[0]=0b01000000;" will actually Trap(H/W magic) into kernel code and your physical address will be setup and all Processor TLBs will be loaded by the kernel. Then, You will be back into user mode and your processor will actually Load 16 bytes(H/W magic a[0] to a[3]) into L1 dCache and L2 Cache. Processor will really go into Memory again, only when you refer a[4] and and so on(Ignore the prediction loading for now!). By the time you complete "a[7]=0b11000011;", Your processor had done 2 burst READs of 16 bytes each on the eternal Bus. Still no actual WRITEs into physical memory. All WRITEs are happening within L1 dCache(H/W magic, Processor knows) and L2 cache so for and the DIRTY bit is set for the Cache-line.
"a[3]++;" will have STORE Instruction in the Assembly code, but the Processor will store that only in L1 dCache&L2 and it will not go to Physical Memory.
Let's come to the function call "a()". Again, the processor does the instruction fetch from the L2 cache into the L1 iCache, and so on.
The result of this user-mode program will be the same on any Linux on any processor, thanks to the correct implementation of the low-level mmap() syscall and the cache-coherency protocol!
If you were writing this code in an embedded-processor environment without OS assistance (no mmap() syscall), you would find the problem you were expecting, because you would be using neither the H/W mechanism (TLBs) nor a software mechanism (memory-barrier instructions).

Where do FS and GS registers get mapped to in the linear address space?

I understand that in 32-bit mode you have segments, where each segment maps to a base and limit. Therefore, a segment wouldn't be able to access another segment's data.
With 64-bit mode, we throw away most of the segments and have a base of 0 with no limit, thus accessing the entire 64-bit address space. But I get confused when they state we have FS and GS registers for thread-local storage and additional data.
If the default segment can access anything in the linear address space, then what is stopping the program from corrupting or accessing the FS/GS segments? The OS would have to keep track of FS/GS and make sure nothing else gets allocated there, right? How does this work?
Also, if the default area can access anything, then why do we even have FS/GS? I guess FS makes sense, because we can just switch the register during a thread switch. But why even use GS? Why not just malloc memory instead? Sorry, I am new to OS development.
In 64-bit mode, the FS and GS "segment registers" aren't really used, per se. But using an FS or GS segment override for a memory access causes the address to be offset by the value contained in a hidden FSBASE/GSBASE register which can be set by a special instruction (possibly privileged, in which case a system call can ask the kernel to do it). So for instance, if FSBASE = 0x12340000 and RDI = 0x56789, then MOV AL, FS:[RDI] will load from linear address 0x12396789.
This is still just a linear address - it's not separate from the rest of the process's address space in any way, and it's subject to all the same paging and memory protection as any other memory access. The program could get the exact same effect with MOV AL, [0x12396789] (since DS has a base of 0). It is up to the program's usual memory allocation mechanisms to allocate an appropriate block of memory and set FSBASE to point to that block, if it intends to use FS in this way. There are no special protections needed to avoid "corrupting" this memory, any more than they are needed to prevent a program from corrupting any other part of its own memory.
So it doesn't really provide new functionality - it's more a convenience for programmers and compilers. As you say, it's nice for something like a pointer to thread-local storage, but if we didn't have FS/GS, there are plenty of other ways we could keep such a pointer in the thread's state (say, reserving R15). It's true that there's not an obvious general need for two such registers; for the most part, the other one is just there in case someone comes up with a good way to use it in a particular program.
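To make that concrete on x86-64 Linux, here's a sketch that points GSBASE at an ordinary malloc'd block via the arch_prctl syscall, then reads it back through a GS override (error checks omitted):
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>   // ARCH_SET_GS

int main(void) {
    long *block = malloc(64);            // an ordinary heap allocation
    block[0] = 42;
    // ask the kernel to load the hidden GSBASE register with our pointer
    syscall(SYS_arch_prctl, ARCH_SET_GS, block);
    long v;
    // mov %gs:0, %rax -- loads from linear address GSBASE + 0
    __asm__ volatile("movq %%gs:0, %0" : "=r"(v));
    printf("%ld %ld\n", v, block[0]);    // prints "42 42": same memory, two names
    return 0;
}
It prints 42 either way you read the memory; the GS override is just base+offset address arithmetic, with no extra protection.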
See also:
How are the fs/gs registers used in Linux AMD64?
What is the "FS"/"GS" register intended for?

How is the System V ABI's red zone implemented?

How does the compiler make sure that the red zone is not clobbered? Is there any overallocation of space?
And what factors led to choosing 128 bytes as the size of the red zone?
The compiler doesn't; it just takes advantage of the guarantee that space below RSP won't be asynchronously clobbered (e.g. by signal handlers). Making a function call will of course synchronously clobber it.
In fact, on Linux only signal handlers run asynchronously in user-space code. (The kernel stack gets interrupts: Why can't kernel code use a Red Zone)
The kernel implements the red-zone when delivering signals to user-space. I think that's about it; it's really pretty easy to implement.
The other thing that's relevant is when a debugger evaluates a function call, like print foo(123) in GDB. GDB will actually run that function using the stack of the current thread. In an ABI with a red zone, GDB (or any other debugger) has to respect it when invoking that function, by doing rsp -= 128 after saving the register state it will restore when the user does continue or single-step.
In i386 System V, print foo(123) will use space right below the current ESP, stepping on whatever was below ESP. (I think; not tested).
And what factors led to choosing 128 bytes as the size of the red zone?
A signed byte displacement in an addressing mode like [rsp - 128] can reach that far. IIRC, the amd64.org mailing archive I was looking through while answering Why does Windows64 use a different calling convention from all other OSes on x86-64? actually included a message citing that as the reason for that specific choice.
You want it to be large enough that many simple leaf functions don't need to move RSP. e.g. at least 16 or 32 bytes, like the 32-byte shadow space in MS's Windows x64 calling convention.
You want it to be small enough that skipping over it to invoke a signal handler doesn't need to touch huge amounts more space, like new pages. So much less than 4kB.
A leaf function that needs more than 128 bytes of locals is probably big enough that moving RSP is a drop in the bucket. And then the +-disp8 addressing mode benefit comes into play, giving access to a whole 256 bytes of space with compact addressing modes from byte [rsp+127] to byte [rsp-128] or in dword/qword chunks.
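As an illustration, here's a leaf function whose locals fit in the red zone; compilers targeting x86-64 System V can address buf below RSP without ever emitting sub rsp, N (a sketch: exact code generation varies by compiler and options):
// A leaf function: it calls nothing, so nothing will synchronously clobber
// the space below RSP, and signal handlers are required to skip the red zone.
int sum_squares(const int *in) {
    int buf[16];                  // 64 bytes of locals: fits in the red zone
    for (int i = 0; i < 16; i++)
        buf[i] = in[i] * in[i];
    int total = 0;
    for (int i = 0; i < 16; i++)
        total += buf[i];
    return total;                 // buf can live at [rsp-64 .. rsp-4], RSP untouched
}
If buf grew past 128 bytes, the compiler would have to emit sub rsp, N anyway, at which point the red zone saves nothing.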
Further reading
Reading about why it's not safe to use space below ESP on Windows, or on Linux without a red zone, is illuminating.
Raymond Chen's blog: Why do we even need to define a red zone? Can’t I just use my stack for anything?
Also my SO answer covers some of the same ground: Is it valid to write below ESP? (but with more guesswork and less interesting Windows details than Raymond.)

Memory Protection Keys and Memory Reordering

Reading Intel's SDM about Memory Protection Keys (MPK) doesn't suggest that the wrpkru instruction is serializing or that it implicitly enforces memory ordering.
First, it would be surprising if it did not enforce some sort of ordering; one would suspect the programmer doesn't want memory accesses around a wrpkru to be executed out of order.
Second, does that mean wrpkru needs to be surrounded by lfence?
Linux and glibc don't use any sort of fence after the write. But shouldn't that be included in the SDM?
I'd assume that the CPU preserves the illusion of running a single thread in program order, as always. That's the cardinal rule of out-of-order execution. Accesses before wrpkru are done with the old PKRU, accesses after are done with the new PKRU.
Just like how modifying the MXCSR affects later FP instructions but not earlier instructions, or modifying a segment register affects later but not earlier loads/stores.
It's up to the implementation whether it wants to rename the PKRU, the MXCSR, or segment registers. If it doesn't rename the PKRU, then it has to complete all pending loads/stores before changing the PKRU and allowing later loads/stores to execute. (i.e. the microcode for wrpkru could include the uops for lfence if that's how it's implemented.)
All memory accesses have a dependency on the last wrpkru instruction, and the last write to the relevant segment register, and the last write to cr3 (the top-level page table), and the last change of privilege level (syscall / iret / whatever). Also on the last store to that location, and you never need a fence to see your own most recent stores. It's up to the CPU architects to build hardware that runs fast while preserving the illusion of program order.
e.g. Intel CPUs since at least Core2 have renamed the x87 FP control word, so old binaries that implement (int)fp_var by changing the x87 rounding mode to truncate and then back to nearest don't serialize the FPU. Some CPUs do rename segment registers according to Agner Fog's testing, but my testing shows that Skylake doesn't: Is a mov to a segmentation register slower than a mov to a general purpose register?.
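That historical pattern, written against the portable <fenv.h> interface (a sketch; whether the two control-word writes stall the FP pipeline is exactly the renaming question above):
#include <fenv.h>
#include <math.h>   // lrint; link with -lm

// The historical (int)x code path: x87 fistp rounds according to the FP
// control word, so compilers set it to truncate and then restored it.
static long trunc_like_old_binaries(double x) {
    int old = fegetround();
    fesetround(FE_TOWARDZERO);   // write the control word...
    long r = lrint(x);           // ...convert using the current rounding mode...
    fesetround(old);             // ...and write it back.
    return r;
}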
I'm not familiar with MPK, but why would it be a problem for memory accesses to happen out of order as long as they all use the correct PKRU value, and they don't violate any of x86's normal memory-ordering rules?
(Only StoreLoad reordering is allowed to be visible to other threads. Internally, a CPU can execute loads earlier than they're "supposed to", but verify that the cache line wasn't invalidated before the point where it was architecturally allowed to load. This is what the Memory Order Buffer does.)
In C/C++, of course you need some kind of barrier against compile-time reordering of accesses around the wrapper function. Normally a non-inline function call is sufficient, as for pthread_mutex_lock(). See How does a mutex lock and unlock functions prevents CPU reordering?.
The earlier part of this answer is about ordering in assembly.
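For completeness, here's the shape of the glibc 2.27+ wrappers (error checking omitted; it needs a CPU and kernel with pkeys support; pkey_set reads and rewrites PKRU with rdpkru/wrpkru, and notice that neither glibc nor this code puts any fence around it):
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int key = pkey_alloc(0, 0);                 // allocate a protection key
    pkey_mprotect(p, 4096, PROT_READ | PROT_WRITE, key);

    pkey_set(key, PKEY_DISABLE_ACCESS);         // wrpkru: later accesses fault
    // loading or storing p[0] here would SIGSEGV
    pkey_set(key, 0);                           // wrpkru: access allowed again

    strcpy(p, "ok");                            // ordered after the wrpkru by
    printf("%s\n", p);                          // normal program-order rules
    return 0;
}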

