I'm using the sys_brk syscall to dynamically allocate memory on the heap. I noticed that when acquiring the current break location I usually get a value similar to this:
mov rax, 0x0C    ; __NR_brk = 12 on x86-64 Linux
mov rdi, 0x00    ; brk(0): an invalid request, so the kernel just returns the current break
syscall
results in
rax = 0x401000
The value is usually 512-byte aligned. So I would like to ask: are there any alignment requirements on the break value, or can we misalign it any way we want?
The kernel does track the break with byte granularity. But don't use it directly for small allocations if you care at all about performance.
There was some discussion in comments about the kernel rounding the break to a page boundary, but that's not the case. The implementation of sys_brk does this (with my comments added so it makes sense out of context):
newbrk = PAGE_ALIGN(brk); // the syscall arg
oldbrk = PAGE_ALIGN(mm->brk); // the current break
if (oldbrk == newbrk)
goto set_brk; // no need to map / unmap any pages, just update mm->brk
This checks if the break moved to a different page, but eventually mm->brk = brk; sets the current break to the exact arg passed to the system call (if it's valid). If the current break was always page aligned, the kernel wouldn't need PAGE_ALIGN() on it.
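As a quick sanity check, here is a hedged sketch (my own demo, not from the kernel source; it assumes x86-64 Linux and glibc's syscall(2) wrapper): set the break to a deliberately unaligned value and read it back to see the byte granularity.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    // brk(0) is always an invalid request, so the kernel leaves the break
    // unchanged and sys_brk returns the current (old) break.
    unsigned long cur = syscall(SYS_brk, 0);
    unsigned long req = cur + 123;               // deliberately not page-aligned
    unsigned long new_brk = syscall(SYS_brk, req);
    printf("old break:       %#lx\n", cur);
    printf("requested break: %#lx\n", req);
    printf("new break:       %#lx\n", new_brk);  // the exact unaligned value comes back
    return 0;
}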
Of course, memory protection has at least page granularity (and maybe hugepage, if the kernel chooses to use anonymous hugepages for this mapping). So you can access memory out to the end of the page containing the break without faulting. This is why the kernel code is just checking if the break moved to a different page to skip the map / unmap logic, but still updates the actual brk.
AFAIK, nothing will ever use that mapped memory above the break as scratch space, so it's not like memory below the stack pointer that can be clobbered asynchronously.
brk is just a simple memory-management system built into the kernel. System calls are expensive, so if you care about performance you should keep track of things in user-space and only make a system call when you actually need a new page. Using sys_brk directly for tiny allocations is terrible for performance, especially on kernels with Meltdown + Spectre mitigations enabled (which make system calls much more expensive, like tens of thousands of clock cycles plus TLB and branch-prediction invalidation, instead of hundreds of clock cycles).
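To illustrate the keep-track-of-it-in-user-space advice, here is a minimal bump-allocator sketch (the name bump_alloc and the page-rounding policy are mine, not glibc's). It remembers the break in a variable and only calls sys_brk when an allocation runs past what the kernel has already granted:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static unsigned long next_alloc;   // where the next allocation starts
static unsigned long cur_break;    // what we last asked the kernel for

static void *bump_alloc(size_t size) {
    if (next_alloc == 0)                            // first call: query the break
        next_alloc = cur_break = syscall(SYS_brk, 0);

    if (next_alloc + size > cur_break) {            // need more space from the kernel
        // Round the request up to a whole page so tiny allocations
        // don't each pay for a system call.
        unsigned long want = (next_alloc + size + 4095) & ~4095UL;
        if ((unsigned long)syscall(SYS_brk, want) != want)
            return NULL;                            // kernel refused (e.g. ENOMEM)
        cur_break = want;
    }
    void *p = (void *)next_alloc;                   // note: only byte-aligned
    next_alloc += size;
    return p;
}

There's no free() here and no thread safety; it's just a sketch of making one system call per page (or fewer) instead of one per allocation.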
Related
I understand that in 32-bit you have segments, where each segment would map to a base and limit. Therefore, a segment wouldn't be able to access another segment's data.
With 64 bit, we throw away most of the segments and have a base of 0 with no limit, thus accessing the entire 64 bit address space. But I get confused when they state we have FS and GS registers for thread local storage and additional data.
If the default segment can access anything in the linear address space, then what is stopping the program from corrupting or accessing the FS/GS segments? The OS would have to keep track of FS/GS and make sure nothing else gets allocated there right? How does this work?
Also, if the default area can access anything, then why do we even have FS/GS? I guess FS makes sense because we can just switch the register during a thread switch. But why even use GS? Why not just malloc memory instead? Sorry, I am new to OS development.
In 64-bit mode, the FS and GS "segment registers" aren't really used, per se. But using an FS or GS segment override for a memory access causes the address to be offset by the value contained in a hidden FSBASE/GSBASE register which can be set by a special instruction (possibly privileged, in which case a system call can ask the kernel to do it). So for instance, if FSBASE = 0x12340000 and RDI = 0x56789, then MOV AL, FS:[RDI] will load from linear address 0x12396789.
This is still just a linear address - it's not separate from the rest of the process's address space in any way, and it's subject to all the same paging and memory protection as any other memory access. The program could get the exact same effect with MOV AL, [0x12396789] (since DS has a base of 0). It is up to the program's usual memory allocation mechanisms to allocate an appropriate block of memory and set FSBASE to point to that block, if it intends to use FS in this way. There are no special protections needed to avoid "corrupting" this memory, any more than they are needed to prevent a program from corrupting any other part of its own memory.
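As a concrete illustration, here is a sketch assuming x86-64 Linux, where arch_prctl(ARCH_SET_GS, ...) lets an unprivileged process set its own GS base (GS rather than FS, because glibc already uses FS for thread-local storage). The block being pointed to is ordinary heap memory:

#define _GNU_SOURCE
#include <asm/prctl.h>      // ARCH_SET_GS
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    unsigned long *block = malloc(64);   // ordinary allocation, nothing special
    block[0] = 0xdeadbeef;

    // Ask the kernel to set this thread's GS base to the block.
    if (syscall(SYS_arch_prctl, ARCH_SET_GS, block) != 0)
        return 1;

    unsigned long via_gs;
    // mov rax, gs:[0] -- the same linear address as block[0], just offset by GSBASE.
    asm volatile("movq %%gs:0, %0" : "=r"(via_gs));

    printf("via GS override: %#lx\n", via_gs);
    printf("via plain deref: %#lx\n", block[0]);   // identical
    return 0;
}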
So it doesn't really provide new functionality - it's more a convenience for programmers and compilers. As you say, it's nice for something like a pointer to thread-local storage, but if we didn't have FS/GS, there are plenty of other ways we could keep such a pointer in the thread's state (say, reserving R15). It's true that there's not an obvious general need for two such registers; for the most part, the other one is just there in case someone comes up with a good way to use it in a particular program.
See also:
How are the fs/gs registers used in Linux AMD64?
What is the "FS"/"GS" register intended for?
How does the compiler make sure that the red zone is not clobbered? Is there any over-allocation of space?
And what factors led to choosing 128 bytes as the size of the red zone?
The compiler doesn't; it just takes advantage of the guarantee that space below RSP won't be asynchronously clobbered (e.g. by signal handlers). Making a function call will of course synchronously clobber it.
In fact, on Linux only signal handlers run asynchronously in user-space code. (The kernel stack gets interrupts: Why can't kernel code use a Red Zone)
The kernel implements the red-zone when delivering signals to user-space. I think that's about it; it's really pretty easy to implement.
The other thing that's relevant is when a debugger runs a function, e.g. when you do something like print foo(123) in GDB. GDB will actually run that function using the stack of the current thread. In an ABI with a red zone, GDB (or any other debugger) has to respect it when invoking that function, by doing rsp -= 128 after saving the register state it will restore when the user does continue or single-step.
In i386 System V, print foo(123) will use space right below the current ESP, stepping on whatever was below ESP. (I think; not tested).
And what factors led to choosing 128 bytes as the size of the red zone?
A signed byte displacement in an addressing mode like [rsp - 128] can reach that far. IIRC, the amd64.org mailing archive I was looking through while answering Why does Windows64 use a different calling convention from all other OSes on x86-64? actually included a message citing that as the reason for that specific choice.
You want it to be large enough that many simple leaf functions don't need to move RSP. e.g. at least 16 or 32 bytes, like the 32-byte shadow space in MS's Windows x64 calling convention.
You want it to be small enough that skipping over it to invoke a signal handler doesn't need to touch huge amounts more space, like new pages. So much less than 4kB.
A leaf function that needs more than 128 bytes of locals is probably big enough that moving RSP is a drop in the bucket. And then the +-disp8 addressing mode benefit comes into play, giving access to a whole 256 bytes of space with compact addressing modes from byte [rsp+127] to byte [rsp-128] or in dword/qword chunks.
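As a hedged illustration (exact code generation depends on the compiler and options), a small leaf function like the one below typically keeps its locals at negative offsets from RSP, inside the red zone, when compiled with gcc -O2 for x86-64 System V, and only adjusts RSP when compiled with -mno-red-zone or when the locals outgrow 128 bytes. Compare the output of gcc -O2 -S with and without -mno-red-zone:

// Leaf function whose locals fit in the 128-byte red zone.
int checksum(const int *v, int n) {
    volatile int tmp[16];        // 64 bytes of locals; volatile so the stores
                                 // actually land on the stack
    int total = 0;
    for (int i = 0; i < n && i < 16; i++) {
        tmp[i] = v[i];           // with a red zone: typically [rsp-64] ... [rsp-4],
        total += tmp[i];         //   with no adjustment of RSP at all
    }                            // with -mno-red-zone: the compiler has to
    return total;                //   sub rsp, 64 (or push something) first
}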
Further reading
Reading why it's not safe to use space below ESP on Windows, or Linux without a red-zone, is illuminating.
Raymond Chen's blog: Why do we even need to define a red zone? Can’t I just use my stack for anything?
Also my SO answer covers some of the same ground: Is it valid to write below ESP? (but with more guesswork and less interesting Windows details than Raymond.)
Reading Intel's SDM about Memory Protection Keys (MPK), nothing suggests that the wrpkru instruction is serializing or implicitly enforces memory ordering.
First, it is surprising if it does not enforce some sort of ordering, as one would expect that the programmer doesn't want memory accesses around a wrpkru to be executed out of order.
Second, does that mean wrpkru needs to be surrounded by lfence?
Linux and glibc don't use any sort of fence after the write. But shouldn't that be included in the SDM?
I'd assume that the CPU preserves the illusion of running a single thread in program order, as always. That's the cardinal rule of out-of-order execution. Accesses before wrpkru are done with the old PKRU, accesses after are done with the new PKRU.
Just like how modifying the MXCSR affects later FP instructions but not earlier instructions, or modifying a segment register affects later but not earlier loads/stores.
It's up to the implementation whether it wants to rename the PKRU, the MXCSR, or segment registers. If it doesn't rename the PKRU, then it has to complete all pending loads/stores before changing the PKRU and allowing later loads/stores to execute. (i.e. the microcode for wrpkru could include the uops for lfence if that's how it's implemented.)
All memory accesses have a dependency on the last wrpkru instruction, and the last write to the relevant segment register, and the last write to cr3 (the top-level page table), and the last change of privilege level (syscall / iret / whatever). Also on the last store to that location, and you never need a fence to see your own most recent stores. It's up to the CPU architects to build hardware that runs fast while preserving the illusion of program order.
e.g. Intel CPUs since at least Core2 have renamed the x87 FP control word, so old binaries that implement (int)fp_var by changing the x87 rounding mode to truncate and then back to nearest don't serialize the FPU. Some CPUs do rename segment registers according to Agner Fog's testing, but my testing shows that Skylake doesn't: Is a mov to a segmentation register slower than a mov to a general purpose register?.
I'm not familiar with MPK, but why would it be a problem for memory accesses to happen out of order as long as they all use the correct PKRU value, and they don't violate any of x86's normal memory-ordering rules?
(Only StoreLoad reordering is allowed to be visible to other threads. Internally, a CPU can execute loads earlier than they're "supposed to", but it verifies that the cache line wasn't invalidated before the point where it was architecturally allowed to load. This is what the Memory Order Buffer does.)
In C/C++, of course you need some kind of barrier against compile-time reordering of accesses around the wrapper function. Normally a non-inline function call is sufficient, like for pthread_mutex_lock(). See How does a mutex lock and unlock functions prevents CPU reordering?
The earlier part of this answer is about ordering in assembly.
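For reference, the user-space pattern looks like this: a sketch assuming x86-64 Linux 4.9+, glibc 2.27+, and a CPU with protection keys (pkey_alloc fails otherwise). pkey_set() is the glibc wrapper that executes wrpkru, and, matching what Linux and glibc do, there are no explicit fences around it:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int pkey = pkey_alloc(0, 0);                 // new key, all access allowed for now
    if (buf == MAP_FAILED || pkey < 0)
        return 1;
    if (pkey_mprotect(buf, 4096, PROT_READ | PROT_WRITE, pkey) != 0)
        return 1;

    buf[0] = 'A';                                // runs with the permissive PKRU

    pkey_set(pkey, PKEY_DISABLE_ACCESS);         // wrpkru: later accesses to buf fault
    // buf[0] = 'B';                             // uncommenting this would SIGSEGV

    pkey_set(pkey, 0);                           // wrpkru again: access allowed
    printf("%c\n", buf[0]);                      // fine: this load is after the wrpkru
    pkey_free(pkey);
    return 0;
}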
I am trying to allocate some memory using sys_brk in NASM/x86 assembly. sys_brk returns the new address of the break, which is the end of the data segment, right? So where does my newly allocated memory reside? I assumed that it is between the old break value and the new break value. So if I allocate 64 bytes of memory with sys_brk, I can use the next 64 bytes starting from the old break value that I stored before calling sys_brk. Am I right?
My assembly code that will allocate memory will look something like this: https://gist.github.com/nikAizuddin/f4132721126257ec4345
And another side question:
I am supposed to write a function in assembly that returns a pointer to the dynamically allocated memory, and that function will be called from a C program. How can I free this block of memory from the C side of my program? Would just calling free() be enough?
The brk(2) man page (section: C library/kernel ABI differences) describes how the glibc wrapper is implemented on top of Linux's system call, which returns the new brk on success, or the old brk on failure.
As I understand it, memory beyond the current break is unmapped. Addresses below the current break are part of the data segment (in the sense of data+bss+heap). The docs aren't clear on whether the break has to be page-aligned. (i.e. can you sbrk(64), or only sbrk(4096)?) If ASLR is enabled, the initial break will be some random distance past the end of the BSS.
See: What does the brk() system call do? An answer on that question has an example of using sbrk to replace malloc for code-golf. So yes, the old break is the address to return. And apparently you can sbrk any increment you want, not just pages.
You're the one writing the memory allocator. sbrk just lets you get more from the OS, like mmap(MAP_ANONYMOUS) but less flexible. It doesn't help you keep track of free blocks so you can use them for future allocations instead of always getting more from the OS.
The way to give back memory you got with sbrk is by calling sbrk with a negative argument. Obviously this requires a last-in-first-out usage pattern, which is why glibc's malloc only uses sbrk for small allocations (that can be put on the free-list when freed, to be handed out for future mallocs). Big allocations are best returned to the OS right away, instead of being kept mapped, so glibc's malloc uses mmap for those.
Never call free(3) on memory you didn't get from malloc(3) (or an associated function, like strdup(3), whose docs say you can and should free(3) the memory). IDK what would happen if you called munmap on a page of memory below the program break. Probably it would just work, but then you'd have a hole in your data segment that could cause problems if the break ever decreased to there.
In assembly, the Linux brk system call takes an address where you want to set the break. As the man page notes, it either returns that for success, or returns the old break on failure, never a -errno code like -ENOMEM.
See Assembly x86 brk() call use for an x86-64 example.
The POSIX API, where you can use positive or negative integer offsets, is something you can implement by always calling brk twice, or, like glibc, by keeping track of the current break in a global variable. To initialize that variable, use brk once with a requested address of 0, which will fail, as shown in the strace output below.
This is similar to what you'd do with the POSIX API, calling sbrk with increment = 0.
This is what glibc's malloc(3) does internally:
$ strace -e brk ls 2>&1 | m
brk(0) = 0x650000
brk(0) = 0x650000
brk(0x671000) = 0x671000
The brk man page mentions end(3). Apparently there are globals which are located at the end of the text, data, and bss segments. However, &end is only "somewhere near" the program break, which is why malloc still has to make a system call to get the initial break. IDK why there's a redundant brk(0). These are raw system calls, not library function calls, so an sbrk(0) probably doesn't explain it.
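As a sketch of that keep-the-break-in-a-global approach (the helper name my_sbrk is hypothetical, and this uses glibc's syscall(2) wrapper on x86-64 Linux rather than showing what glibc's own sbrk does):

#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static uintptr_t cur_brk;                      // cached current break; 0 = not queried yet

void *my_sbrk(intptr_t increment) {
    if (cur_brk == 0)
        cur_brk = syscall(SYS_brk, 0);         // "fails", returning the current break

    uintptr_t old = cur_brk;
    uintptr_t want = old + increment;          // a negative increment shrinks the heap
    if ((uintptr_t)syscall(SYS_brk, want) != want) {
        errno = ENOMEM;
        return (void *)-1;                     // same failure convention as sbrk(2)
    }
    cur_brk = want;
    return (void *)old;                        // the old break is the start of the new block
}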