Where do FS and GS registers get mapped to in the linear address space? - linux

I understand that in 32-bit mode you have segments, where each segment maps to a base and limit; therefore, a segment wouldn't be able to access another segment's data.
With 64-bit, we throw away most of the segments and have a base of 0 with no limit, thus accessing the entire 64-bit address space. But I get confused when they state that we have FS and GS registers for thread-local storage and additional data.
If the default segment can access anything in the linear address space, then what is stopping the program from corrupting or accessing the FS/GS segments? The OS would have to keep track of FS/GS and make sure nothing else gets allocated there, right? How does this work?
Also, if the default area can access anything, then why do we even have FS/GS? I guess FS makes sense because we can just switch the register during a thread switch, but why even use GS? Why not just malloc memory instead? Sorry, I am new to OS internals.

In 64-bit mode, the FS and GS "segment registers" aren't really used, per se. But using an FS or GS segment override for a memory access causes the address to be offset by the value contained in a hidden FSBASE/GSBASE register which can be set by a special instruction (possibly privileged, in which case a system call can ask the kernel to do it). So for instance, if FSBASE = 0x12340000 and RDI = 0x56789, then MOV AL, FS:[RDI] will load from linear address 0x12396789.
This is still just a linear address - it's not separate from the rest of the process's address space in any way, and it's subject to all the same paging and memory protection as any other memory access. The program could get the exact same effect with MOV AL, [0x12396789] (since DS has a base of 0). It is up to the program's usual memory allocation mechanisms to allocate an appropriate block of memory and set FSBASE to point to that block, if it intends to use FS in this way. There are no special protections needed to avoid "corrupting" this memory, any more than they are needed to prevent a program from corrupting any other part of its own memory.
So it doesn't really provide new functionality - it's more a convenience for programmers and compilers. As you say, it's nice for something like a pointer to thread-local storage, but if we didn't have FS/GS, there are plenty of other ways we could keep such a pointer in the thread's state (say, reserving R15). It's true that there's not an obvious general need for two such registers; for the most part, the other one is just there in case someone comes up with a good way to use it in a particular program.
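To make this concrete, here is a minimal sketch, assuming x86-64 Linux, where the arch_prctl(2) system call lets unprivileged code set GSBASE (FSBASE is normally reserved for the C library's thread-local storage). The variable and function names are just for illustration:

    #include <asm/prctl.h>      /* ARCH_SET_GS */
    #include <sys/syscall.h>    /* SYS_arch_prctl */
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t slot = 0x1234;
        /* Ask the kernel to point this thread's GSBASE at our variable. */
        syscall(SYS_arch_prctl, ARCH_SET_GS, &slot);
        uint64_t v;
        /* GS-relative load: reads from linear address GSBASE + 0. */
        __asm__ volatile("movq %%gs:0, %0" : "=r"(v));
        printf("read %#lx through GS\n", (unsigned long)v);  /* 0x1234 */
        return 0;
    }

The load through GS and a plain load from &slot hit the same linear address; the override only changes how the address is computed.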
See also:
How are the fs/gs registers used in Linux AMD64?
What is the "FS"/"GS" register intended for?


How the System V ABI's red zone is implemented

How does the compiler make sure that the red zone is not clobbered? Is there any overallocation of space?
And what factors led to choosing 128 bytes as the size of the red zone?
The compiler doesn't; it just takes advantage of the guarantee that space below RSP won't be asynchronously clobbered (e.g. by signal handlers). Making a function call will of course synchronously clobber it.
In fact, on Linux only signal handlers run asynchronously in user-space code. (The kernel stack gets interrupts: Why can't kernel code use a Red Zone)
The kernel implements the red zone when delivering signals to user space: it leaves a 128-byte gap below the interrupted RSP before pushing the signal frame. I think that's about it; it's really pretty easy to implement.
The other thing that's relevant is when a debugger runs a function when you do something like print foo(123) in GDB. GDB will actually run that function using the stack of the current thread. In an ABI with a red zone, GDB (or any other debugger) has to respect it when invoking that function, by doing rsp -= 128 after saving the register state it will restore when the user does continue or single-step.
In i386 System V, print foo(123) will use space right below the current ESP, stepping on whatever was below ESP. (I think; not tested).
And what factors led to choosing 128 bytes as the size of the red zone?
A signed byte displacement in an addressing mode like [rsp - 128] can reach that far. IIRC, the amd64.org mailing archive I was looking through while answering Why does Windows64 use a different calling convention from all other OSes on x86-64? actually included a message citing that as the reason for that specific choice.
You want it to be large enough that many simple leaf functions don't need to move RSP. e.g. at least 16 or 32 bytes, like the 32-byte shadow space in MS's Windows x64 calling convention.
You want it to be small enough that skipping over it to invoke a signal handler doesn't need to touch huge amounts more space, like new pages. So much less than 4kB.
A leaf function that needs more than 128 bytes of locals is probably big enough that moving RSP is a drop in the bucket. And then the ±disp8 addressing-mode benefit comes into play, giving access to a whole 256 bytes of space with compact addressing modes, from byte [rsp+127] down to byte [rsp-128], or in dword/qword chunks.
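As a sketch of the kind of leaf function that benefits (hypothetical example; compile with gcc -O1 or higher on x86-64 Linux and inspect the asm):

    /* Leaf function whose locals fit in the red zone: the compiler can
       store below RSP (e.g. "mov %edi, -16(%rsp)") without a
       sub rsp, N / add rsp, N pair around the body. */
    int sum4(int a, int b, int c, int d) {
        volatile int tmp[4] = {a, b, c, d};  /* volatile: force the stores */
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }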
Further reading
Reading why it's not safe to use space below ESP on Windows, or Linux without a red-zone, is illuminating.
Raymond Chen's blog: Why do we even need to define a red zone? Can’t I just use my stack for anything?
Also my SO answer covers some of the same ground: Is it valid to write below ESP? (but with more guesswork and less interesting Windows details than Raymond.)

Segmentation registers use

I am trying to understand how memory management goes on low level and have a couple of questions.
1) A book about assembly language by Kip R. Irvine says that in real mode the first three segment registers are loaded with the base addresses of the code, data, and stack segments when the program starts. This is a bit ambiguous to me. Are these values specified manually, or does the assembler generate instructions to write the values into the registers? If it happens automatically, how does it find out what the size of these segments is?
2) I know that Linux uses a flat linear model, i.e. uses segmentation in a very limited way. Also, according to "Understanding the Linux Kernel" by Daniel P. Bovet and Marco Cesati, there are four main segments in the GDT: user data, user code, kernel data and kernel code. All four segments have the same size and base address. I do not understand why there is a need for four of them if they differ only in type and access rights (they all produce the same linear address, right?). Why not use just one of them and write its descriptor to all segment registers?
3) How do operating systems that do not use segmentation divide programs into logical segments? For example, how do they differentiate the stack from the code without segment descriptors? I read that paging can be used to handle such things, but I don't understand how.
You must have read some really old books, because nobody programs for real mode anymore ;-) In real mode, the physical address of a memory access is physical address = segment register * 0x10 + offset, the offset being a value inside one of the general-purpose registers. Because these registers are 16 bits wide, a segment is 64 KB long, and there is nothing you can do about its size, because there is no size attribute! With the * 0x10 multiplication, 1 MB of memory becomes available, but there are overlapping combinations depending on what you put in the segment registers and the address register.
I haven't compiled any code for real mode, but I think it's up to the OS to set up the segment registers during binary loading, just like a loader would allocate some pages when loading an ELF binary. However, I have compiled bare-metal kernel code, and there I had to set up these registers myself.
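As a worked example of that formula (a toy user-space calculation, not real-mode code; phys is a hypothetical helper), note that different segment:offset pairs can alias the same physical byte:

    #include <stdio.h>
    #include <stdint.h>

    /* Real-mode address arithmetic: 20-bit result, at most 0x10FFEF. */
    static uint32_t phys(uint16_t seg, uint16_t off) {
        return (uint32_t)seg * 0x10 + off;
    }

    int main(void) {
        printf("%05x\n", phys(0x1234, 0x5678)); /* 179b8 */
        printf("%05x\n", phys(0x179B, 0x0008)); /* 179b8 again: overlap */
        return 0;
    }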
Four segments are mandatory in the flat model because of architecture constraints. In protected mode the segment registers no longer contain a segment base address, but a segment selector, which is basically an offset into the GDT. Depending on the value of the segment selector, the CPU will run at a given privilege level, the CPL (Current Privilege Level). The segment selector points to a segment descriptor, which has a DPL (Descriptor Privilege Level); this becomes the CPL when the segment register is loaded with that selector (at least for the code-segment selector). Therefore you need at least a pair of segment selectors to differentiate the kernel from userland. Moreover, segments are either code segments or data segments, so you eventually end up with four segment descriptors in the GDT.
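A selector is just a 16-bit value packing a descriptor-table index, a table indicator (GDT/LDT) and a requested privilege level; a quick sketch of the decoding (0x2b happens to be the user-data selector on x86-64 Linux):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint16_t sel = 0x2b;      /* binary 101011 */
        printf("index=%u ti=%u rpl=%u\n",
               sel >> 3,          /* GDT/LDT entry index: 5  */
               (sel >> 2) & 1,    /* 0 = GDT, 1 = LDT        */
               sel & 3);          /* requested privilege: 3  */
        return 0;
    }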
I don't have an example of a serious OS which makes any real use of segmentation; segmentation is mostly still present for backward compatibility, and the flat-model approach is essentially a means of getting rid of it. Anyway, you're right: paging is way more efficient and versatile, and it is available on almost all architectures (the concepts, at least). I can't explain paging internals here, but all the information you need is in the excellent Intel manual: Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1.
Expanding on Benoit's answer to question 3...
The division of programs into logical parts such as code, constant data, modifiable data and stack is done by different agents at different points in time.
First, your compiler (and linker) creates executable files where this division is specified. If you look at a number of executable file formats (PE, ELF, etc.), you'll see that they support some kind of sections or segments or whatever you want to call them. Besides addresses, sizes and locations within the file, those sections bear attributes telling the OS the purpose of each section: this section contains code (and here's the entry point), this one initialized constant data, that one uninitialized data (typically not taking space in the file), here's something about the stack, over there is the list of dependencies (e.g. DLLs), etc.
Next, when the OS starts executing the program, it parses the file to see how much memory the program needs, where each section goes, and what memory protection is needed for every section. The latter is commonly done via page tables. The code pages are marked as executable and read-only, the constant data pages are marked as non-executable and read-only, and the other data pages (including those of the stack) are marked as non-executable and read-write. This is how it ought to be normally.
Oftentimes programs need regions that are read-write and, at the same time, executable, for dynamically generated code or just to be able to modify existing code. The combined RWX access can either be specified in the executable file or requested at run time.
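Requesting such a region at run time might look like the following minimal sketch (assuming Linux/POSIX mmap(2); a real JIT would normally map RW first and flip to RX with mprotect):

    #include <sys/mman.h>
    #include <string.h>

    typedef int (*fn)(void);

    int main(void) {
        /* x86-64 machine code: mov eax, 42 ; ret */
        unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        memcpy(p, code, sizeof code);
        return ((fn)p)();   /* process exits with status 42 */
    }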
There can be other special pages such as guard pages for dynamic stack expansion, they're placed next to the stack pages. For example, your program starts with enough pages allocated for a 64KB stack and then when the program tries to access beyond that point, the OS intercepts access to those guard pages, allocates more pages for the stack (up to the maximum supported size) and moves the guard pages further. These pages don't need to be specified in the executable file, the OS can handle them on its own. The file should only specify the stack size(s) and perhaps the location.
If there's no hardware or code in the OS to distinguish code memory from data memory or to enforce memory access rights, the division is very formal. 16-bit real-mode DOS programs (COM and EXE) didn't have code, data and stack segments marked in any special way. COM programs had everything in one common 64KB segment; they started with IP=0x100 and SP=0xFFxx, and the order of code and data inside could be arbitrary, intertwining practically freely. DOS EXE files only specified the starting CS:IP and SS:SP locations, and beyond that the code, data and stack segments were indistinguishable to DOS. All it needed to do was load the file, perform relocation (for EXEs only), set up the PSP (Program Segment Prefix, containing the command-line parameters and some other control info), and load SS:SP and CS:IP. It could not protect memory because memory protection isn't available in the real address mode, and so the 16-bit DOS executable formats were very simple.
Wikipedia is your friend in this case. http://en.wikipedia.org/wiki/Memory_segmentation and http://en.wikipedia.org/wiki/X86_memory_segmentation should be good starting points.
I'm sure there are others here who can personally provide in-depth explanations, though.

Virtual Memory and sbrk

On a 32-bit Linux system, a process can access up to 4 GB of virtual address space; however, processes seem to be conservative, to varying degrees, about reserving any of it. So a program that uses malloc will occasionally grow its data segment via the brk/sbrk syscall. Even then those pages aren't claimed in physical memory yet. What I don't fully understand is why we need sbrk in the first place. Why not just give me the whole 4 GB of address space, avoiding any sbrk call? Until we touch/claim those blocks, it is essentially a free operation, right?
What happens if you memory-map a file (a very common thing to do under Linux)? It has to go somewhere in the address space, so there must be some means of defining "used" and "not used" parts.
Shared memory (which is really just mapping a file without an actual file) is the same. It has to go somewhere, and the OS must be sure it can place it without overwriting something.
Also, it is preferable to maintain locality of reference, for obvious (and less obvious) efficiency reasons. If you were allowed to just write to and read from any location in your address space, you can bet that some people would do just that.
There are a couple of reasons that come to mind:
You'd no longer get segfaults when accessing unmapped memory
The page tables would be larger, possibly requiring more time to set them up
You'd have to unmap some of that memory anyway if you load in a new shared library or mmap() something
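You can watch the break move with a small sketch (assuming Linux, where the sbrk(2) wrapper is still available):

    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        void *before = sbrk(0);   /* current program break */
        sbrk(4096);               /* grow the data segment by one page */
        void *after = sbrk(0);
        printf("break: %p -> %p\n", before, after);
        return 0;
    }

Nothing is necessarily backed by physical memory yet; the call just moves the boundary between "used" and "not used" address space.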

program life in terms of paged segmentation memory

I have a confused notion of how segmentation and paging work on x86 Linux machines, and would be glad if someone could clarify all the steps involved, from start to end.
x86 uses a paged segmentation technique for memory management.
Can anyone please explain what happens from the moment an executable ELF file is loaded from the hard disk into main memory to the time it dies? When compiled, the executable has different sections in it (text, data, stack, heap, bss). How will these be loaded? How will they be set up under the paged segmentation memory technique?
I want to know how the page tables get set up for the loaded program, and how the GDT gets set up. How are the registers loaded? And why is it said that logical addresses (the ones processed by the segmentation unit of the MMU) are 48 bits (a 16-bit segment selector + a 32-bit offset) when it is a 32-bit machine? Where are the other 16 bits stored? Anything accessed from RAM must be 32 bits or 4 bytes, so how are the remaining 16 bits accessed (to be loaded into the segment registers)?
Thanks in advance. The question covers a lot of things, but I wanted to get clarification about the entire life cycle of an executable. I'd be glad if someone answers and starts a discussion on this.
Unix has traditionally implemented protection via paging. The 286+ provides segmentation, and the 386+ provides paging. Everyone uses paging; few make any real use of segmentation.
In x86, every memory operand has an implicit segment (so the address is really a 16-bit selector + a 32-bit offset), depending on the register used. So if you access [ESP + 8] the implied segment register is SS, if you access [ESI] the implied segment register is DS, if you access [EDI+4] the implied segment register is ES, and so on. You can override this via segment override prefixes.
Linux, and virtually every modern x86 OS, uses a flat memory model (or something similar). Under a flat memory model each segment provides access to the whole memory, with a base of 0 and a limit of 4 GB, so you don't have to worry about the complications segmentation brings about. Basically there are four segments: kernel-space code (RX), kernel-space data (RW), user-space code (RX), user-space data (RW).
An ELF file consists of some headers that point to "program segments" and "sections". Sections are used for linking; program segments are used for loading. Program segments are mapped into memory via mmap(), which sets up page-table entries with the appropriate permissions.
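To see the program segments the loader maps, one can walk the program-header table; a minimal sketch (64-bit ELF only, error handling elided, assuming Linux's <elf.h>):

    #include <elf.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        FILE *f = fopen(argv[1], "rb");
        Elf64_Ehdr eh;
        fread(&eh, sizeof eh, 1, f);
        fseek(f, eh.e_phoff, SEEK_SET);          /* program-header table */
        for (int i = 0; i < eh.e_phnum; i++) {
            Elf64_Phdr ph;
            fread(&ph, sizeof ph, 1, f);
            if (ph.p_type == PT_LOAD)            /* segments that get mmapped */
                printf("LOAD vaddr=%#lx size=%#lx flags=%c%c%c\n",
                       (unsigned long)ph.p_vaddr, (unsigned long)ph.p_memsz,
                       ph.p_flags & PF_R ? 'R' : '-',
                       ph.p_flags & PF_W ? 'W' : '-',
                       ph.p_flags & PF_X ? 'X' : '-');
        }
        fclose(f);
        return 0;
    }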
Now, older x86 CPUs' paging mechanism only provided RW access control (read permission implied execute permission), while segmentation provided RWX access control. The final permission takes both segmentation and paging into account (e.g. RW (data segment) + R (read-only page) = R (read only), while RX (code segment) + R (read-only page) = RX (read and execute)).
So there were patches that provided execution prevention via segmentation: e.g. OpenWall provided a non-executable stack by shrinking the code segment (the one with execute permission), with special emulation in the page-fault handler for anything that needed to execute from a high memory address (e.g. GCC trampolines, self-modifying code created on the stack to efficiently implement nested functions).
There's no such thing as paged segmentation, not in the official documentation at least. There are two different mechanisms working together and more or less independently of each other:
Translation of a logical address of the form 16-bit segment selector : 16/32/64-bit segment offset (that is, a pair of two numbers) into a 32/64-bit virtual address.
Translation of the virtual address into a 32/64-bit physical address.
Logical addresses are what your applications operate with directly. Then follows the above two-step translation of them into what the RAM will understand, physical addresses.
In the first step the GDT (or it can be LDT, depends on the selector value) is indexed by the selector to find the relevant segment's base address and size. The virtual address will be the sum of the segment base address and the offset. The segment size and other things in segment descriptors are needed to provide protection.
In the second step the page tables are indexed by different parts of the virtual address and the last indexed table in the hierarchy gives the final, physical address that goes out on the address bus for the RAM to see. Just like with segment descriptors, page table entries contain not only addresses but also protection control bits.
That's about it on the mechanisms.
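In toy form, the two steps look something like this sketch (hypothetical helpers; it ignores real descriptor encodings, protection checks and multi-level page walks, and assumes 32-bit addresses with 4 KB pages):

    #include <stdint.h>

    struct seg_desc { uint32_t base, limit; };

    /* Step 1: logical (selector:offset) -> virtual. */
    uint32_t to_virtual(const struct seg_desc *gdt, uint16_t sel, uint32_t off) {
        struct seg_desc d = gdt[sel >> 3];  /* low 3 bits: RPL + table bit */
        /* a real CPU faults here if off > d.limit */
        return d.base + off;
    }

    /* Step 2: virtual -> physical, via a (single-level) page table. */
    uint32_t to_physical(const uint32_t *page_table, uint32_t va) {
        uint32_t frame = page_table[va >> 12] & ~0xFFFu;  /* page frame base */
        return frame | (va & 0xFFF);                      /* keep page offset */
    }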
Now, in many x86 OSes the segment selectors that are used for applications are fixed: they are the same in all of them, they never change, and they point to segment descriptors that have base addresses equal to 0 and sizes equal to the possible maximum (e.g. 4GB in non-64-bit modes). Such a GDT setup effectively means that the first step does no useful work and the offset part of the logical address translates into a numerically equal virtual address.
This makes the segment selector values practically useless. They still have to be loaded into the CPU's segment registers (in non-64-bit modes into at least CS, SS, DS and ES), but beyond that point they can be forgotten about.
This all (except Linux-related details and the ELF format) is explained in or directly follows from Intel's and AMD's x86 CPU manuals. You'll find many more details there.
Perhaps read the Assembly HOWTO. When a Linux process starts executing an ELF executable via the execve system call, the kernel is essentially (sort of) mmap-ing some segments (and initializing registers, and a tiny part of the stack). Read also the SVR4 x86 ABI supplement and its x86-64 variant. Don't forget that a Linux process only sees the memory mappings of its own address space and only deals with virtual memory.
There are many good books on operating-system kernels, notably by A. Tanenbaum and by M. Bach, and some on the Linux kernel.
NB: segment registers are nearly (almost) unused on Linux.

Why is the ELF execution entry point virtual address of the form 0x80xxxxx and not zero 0x0?

When executed, program will start running from virtual address 0x80482c0. This address doesn't point to our main() procedure, but to a procedure named _start which is created by the linker.
My Google research so far just led me to some (vague) historical speculations like this:
There is folklore that 0x08048000 once was STACK_TOP (that is, the stack grew downwards from near 0x08048000 towards 0) on a port of *NIX to i386 that was promulgated by a group from Santa Cruz, California. This was when 128MB of RAM was expensive, and 4GB of RAM was unthinkable.
Can anyone confirm/deny this?
As Mads pointed out, in order to catch most accesses through null pointers, Unix-like systems tend to make the page at address zero "unmapped". Thus, such accesses immediately trigger a CPU exception, in other words a segfault. This is much better than letting the application go rogue. The exception vector table, however, can be at any address, at least on x86 processors (there is a special register for that, loaded with the lidt instruction).
The starting point address is part of a set of conventions which describe how memory is laid out. The linker, when it produces an executable binary, must know these conventions, so they are not likely to change. Basically, for Linux, the memory layout conventions are inherited from the very first versions of Linux, in the early 90's. A process must have access to several areas:
The code must be in a range which includes the starting point.
There must be a stack.
There must be a heap, with a limit which is increased with the brk() and sbrk() system calls.
There must be some room for mmap() system calls, including shared library loading.
Nowadays, the heap, where malloc() goes, is backed by mmap() calls which obtain chunks of memory at whatever address the kernel sees fit. But in older times, Linux was like previous Unix-like systems, and its heap required a big area in one uninterrupted chunk, which could grow towards increasing addresses. So whatever was the convention, it had to stuff code and stack towards low addresses, and give every chunk of the address space after a given point to the heap.
But there is also the stack, which is usually quite small but could grow quite dramatically on occasion. The stack grows down, and when the stack is full, we really want the process to crash predictably rather than overwrite some data. So there had to be a wide area for the stack, with, at the low end of that area, an unmapped page. And lo! There is an unmapped page at address zero, to catch null pointer dereferences. Hence it was defined that the stack would get the first 128 MB of address space, except for the first page. This means that the code had to go after those 128 MB, at an address similar to 0x080xxxxx.
As Michael points out, "losing" 128 MB of address space was no big deal because the address space was very large with regards to what could be actually used. At that time, the Linux kernel was limiting the address space for a single process to 1 GB, over a maximum of 4 GB allowed by the hardware, and that was not considered to be a big issue.
Why not start at address 0x0? There are at least two reasons for this:
Because address zero is famously known as the NULL pointer, and is used by programming languages to sanity-check pointers. You can't use that address value for null if you're going to execute code there.
The actual content at address 0 is often (but not always) the exception vector table, which is hence not accessible in non-privileged modes. Consult the documentation of your specific architecture.
As for the entrypoint _start vs main:
If you link against the C runtime (the C standard library), the library wraps the function named main, so it can initialize the environment before main is called. On Linux, that means setting up the argc and argv parameters of the application, the environment variables, and probably some synchronization primitives and locks. It also makes sure that returning from main passes the status code on, and then calls the _exit function, which terminates the process.
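A minimal illustration of that split (hypothetical, freestanding example with made-up names; build with gcc -nostdlib -static on x86-64 Linux):

    /* Without the C runtime we must provide _start ourselves, and exit
       via a raw syscall, since _start has no caller to return to. */
    static void exit_syscall(long code) {
        __asm__ volatile("syscall"
                         :: "a"(60), "D"(code)       /* 60 = SYS_exit */
                         : "rcx", "r11", "memory");
    }

    static int my_main(void) { return 42; }

    void _start(void) {
        exit_syscall(my_main());  /* what crt0 does, much simplified */
    }

The real crt0 additionally reads argc/argv/envp off the initial stack before calling main.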
