Linux multiple page bounds and CPU segments

I am puzzled as to how Linux is able to have so many segments and it can still have bounds checking. To my knowledge, modern CPUs have a couple of segment data registers (code, data, etc).
But Linux has multiple segments of its own: Stack, BSS, heap, code, globals, and many more (especially if the heap is large and composed of many segments). Not every CPU has enough registers to track all these segments.
If I am not mistaken, Linux stores each segment in a separate page, so how is it able to prevent one of these pages from reading or writing out of bounds?
My only possible explanations are that Linux:
performs some manual checking on every write
places all the pages close together in a way such that they can be tracked with a few registers

With the advent of 64-bit Intel, the concept of hardware segments has died the death that should have taken place in the 1970's.
But Linux has multiple segments of its own: Stack, BSS, heap, code, globals, and many more (especially if the heap is large and composed of many segments).
These are pedagogical concepts that have little relationship to reality outside the implementation of linkers--but which bad books on operating systems persist in using.
A stack is just memory. A heap is just memory. The operating system has no knowledge of whether a given piece of memory is being used for a stack or for a heap. The operating system simply allocates memory to a process with different attributes (e.g., read/write, read only, read/execute). What the process does with that memory is its own business.
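To make that concrete, here is a minimal sketch (assuming Linux and glibc) of a process obtaining plain anonymous memory with different attributes via mmap(2) and mprotect(2); whether it later treats a region as stack, heap, or anything else is entirely up to the process:

    /* Minimal sketch (assuming Linux/glibc): the kernel hands out anonymous
     * memory with whatever attributes we ask for; it neither knows nor
     * cares whether the process later treats it as "heap", "stack" or
     * anything else. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4096;

        /* Read/write region: could serve as heap space, as a coroutine
         * stack, or as anything else the process likes. */
        void *rw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Second region, filled and then made read-only. */
        void *ro = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (rw == MAP_FAILED || ro == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(ro, "immutable from now on");
        mprotect(ro, len, PROT_READ);        /* drop the write permission */

        printf("read/write mapping at %p, read-only mapping at %p\n", rw, ro);
        munmap(rw, len);
        munmap(ro, len);
        return 0;
    }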

Related

How are stack and heap segment managed in x86 without utilizing the segmentation mechanism?

From Understanding the Linux Kernel:
Segmentation has been included in 80x86 microprocessors to encourage programmers to split their applications into logically related entities, such as subroutines or global and local data areas. However, Linux uses segmentation in a very limited way. In fact, segmentation and paging are somewhat redundant, because both can be used to separate the physical address spaces of processes: segmentation can assign a different linear address space to each process, while paging can map the same linear address space into different physical address spaces. Linux prefers paging to segmentation for the following reasons:
Memory management is simpler when all processes use the same segment register values—that is, when they share the same set of linear addresses.
One of the design objectives of Linux is portability to a wide range of architectures; RISC architectures, in particular, have limited support for segmentation.
The 2.6 version of Linux uses segmentation only when required by the 80x86 architecture.
The x86-64 architecture does not use segmentation in long mode (64-bit mode). Since the x86 always has segment registers, it is not possible to avoid them entirely; instead, the bases of four of the segment registers (CS, SS, DS, and ES) are forced to 0, and the limit effectively covers the whole 2^64 address space. This raises a few questions:
Stack data (stack segment) and heap data (data segment) are mixed together, so isn't popping from the stack and incrementing the ESP register no longer available?
How does the operating system know which type of data (stack or heap) is at a specific virtual memory address?
How do different programs share the kernel code by sharing memory?
Stack data (stack segment) and heap data (data segment) are mixed together, so isn't popping from the stack and incrementing the ESP register no longer available?
As Peter states in the above comment, even though CS, SS, ES, and DS are all treated as having a zero base, this does not change the behavior of PUSH/POP in any way. It is really no different from any other use of segment descriptors: you could get overlapping segments even in 32-bit multi-segment mode if you pointed multiple selectors at the same descriptor. The only thing that "changes" in 64-bit mode is that the zero base is forced by the CPU, and RSP can point anywhere in addressable memory. PUSH/POP operations work as usual.
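A quick way to see the flat model from user space (a small illustrative program, not part of the original answer): code, globals, heap and stack are all just different ranges of one linear 64-bit address space.

    /* Small illustrative program: with the flat model, code, globals, the
     * heap and the stack are just different ranges of one linear 64-bit
     * address space. */
    #include <stdio.h>
    #include <stdlib.h>

    int global_var;                               /* data/BSS */

    int main(void) {
        int stack_var;                            /* stack */
        int *heap_var = malloc(sizeof *heap_var); /* heap */

        printf("code  : %p\n", (void *)&main);    /* cast is a common POSIX-ism */
        printf("data  : %p\n", (void *)&global_var);
        printf("heap  : %p\n", (void *)heap_var);
        printf("stack : %p\n", (void *)&stack_var);

        free(heap_var);
        return 0;
    }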
How does the operating system know which type of data (stack or heap) is at a specific virtual memory address?
User-space programs can (and will) move the stack and heap around as they please. The operating system doesn't really need to know where the stack and heap are, but it can keep track of them to some extent, assuming the user-space application does everything according to convention, that is, it uses the stack allocated by the kernel at program startup and treats the program break as the heap.
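One place where this convention-based tracking is visible is /proc/<pid>/maps, where the kernel labels the main stack and the program-break area. A small sketch (assuming Linux):

    /* Sketch (assuming Linux): the kernel labels the main stack and the
     * program-break area in /proc/<pid>/maps, based on the conventions
     * described above. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char line[512];
        FILE *maps = fopen("/proc/self/maps", "r");
        if (!maps) { perror("fopen"); return 1; }
        while (fgets(line, sizeof line, maps))
            if (strstr(line, "[stack]") || strstr(line, "[heap]"))
                fputs(line, stdout);         /* print only the tagged lines */
        fclose(maps);
        return 0;
    }

Depending on the C library, the [heap] line may only appear once something has actually extended the program break.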
Using the stack allocated by the kernel at program startup, or a memory area obtained through mmap(2) with MAP_GROWSDOWN, the kernel tries to help by automatically growing the memory area when its size is exceeded (i.e. stack overflow), but this has its limits. Manual MAP_GROWSDOWN mappings are rarely used in practice (see 1, 2, 3, 4). POSIX threads and other more modern implementations use fixed-size mappings for threads.
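For completeness, a hedged sketch of what such a (rarely used) manual MAP_GROWSDOWN mapping looks like; it assumes Linux, with MAP_GROWSDOWN and MAP_STACK exposed by <sys/mman.h> when _GNU_SOURCE is defined:

    /* Hedged sketch of a manual MAP_GROWSDOWN mapping (rare in practice, as
     * noted above).  Assumes Linux; MAP_GROWSDOWN and MAP_STACK come from
     * <sys/mman.h> when _GNU_SOURCE is defined. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64 * 1024;
        void *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS |
                            MAP_STACK | MAP_GROWSDOWN, -1, 0);
        if (region == MAP_FAILED) { perror("mmap"); return 1; }

        /* A thread or coroutine would set its stack pointer to the *end* of
         * this region and grow downward; the kernel may then extend the
         * mapping on faults just below it. */
        printf("growable stack region at %p\n", region);
        munmap(region, len);
        return 0;
    }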
"Heap" is a pretty abstract concept in modern user-space applications. Linux provides user-space applications with the basic ability to manipulate the program break through brk(2) and sbrk(2), but this is rarely in a 1-to-1 correspondence with what we got used to call "heap" nowadays. So in general the kernel does not know where the heap of an application resides.
How do different programs share the kernel code by sharing memory?
This is simply done through paging. You could say there is one hierarchy of page tables for the kernel and many others for user-space processes (one per task). When switching to kernel-space (e.g. through a syscall) the kernel changes the value of the CR3 register to make it point to the kernel's page global directory, and when switching back to user-space, CR3 is changed back to point to the current process's page global directory before giving control to user-space code. (Strictly speaking, that per-entry switching is what happens with kernel page-table isolation enabled; traditionally the kernel mappings were simply present in every process's page tables and CR3 only changed on context switches.)

Does using the program stack involve syscalls?

I'm studying operating system theory, and I know that heap allocation involves a specific syscall, and I know that allocators usually optimize for this by requesting more than needed beforehand.
But I can't find information about stack allocation. What about it? Does it involve a specific syscall every time you read from it or write to it (for example when you call a function with some parameters)? Or is there some other mechanism that doesn't involve syscalls?
Typically when the OS starts your program it examines the executable file's headers and arranges various areas for various things (an area for your executable's code, an area for your executable's data, etc.). This includes setting up an initial stack (and a lot more, e.g. finding shared libraries and doing dynamic linking).
After the OS has done all this, your executable starts executing. At this point you already have memory for a stack and can just use it without any system calls.
Note 1: If you create threads, then there will probably be a system call involved to create the thread and that system call will probably allocate memory for the new thread's stack.
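Regarding Note 1, a sketch of how a thread's stack size is typically chosen up front (assuming POSIX threads; compile with -pthread):

    /* Sketch (assuming POSIX threads; compile with -pthread): the stack for
     * a new thread is a fixed-size mapping whose size can be chosen up
     * front through the thread attributes. */
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        (void)arg;
        char local[1024];                    /* lives on this thread's stack */
        snprintf(local, sizeof local, "hello from the new thread");
        puts(local);
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 1024 * 1024);   /* 1 MiB stack */

        if (pthread_create(&tid, &attr, worker, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }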
Note 2: Typically there's "virtual memory" (what your program sees) and "physical memory" (what the hardware sees); and in between typically the OS does lots of tricks to improve performance and avoid wasting physical memory, and to hide resource limits (so you don't have to worry so much about running out of physical memory). One of these tricks is to allocate virtual memory (e.g. for a large stack) without allocating any actual physical memory, and then allocate the physical memory if/when the virtual memory is first modified. Other tricks include various "swap space" schemes, and memory mapped files. These tricks rely on requests generated by the CPU on your program's behalf (e.g. page fault exceptions) which aren't system calls, but have similar ("ask kernel to do something") characteristics.
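To illustrate the lazy allocation described in Note 2, a hedged sketch (assuming Linux, reading VmRSS from /proc/self/status): mmap hands out virtual memory immediately, but resident (physical) memory only grows once the pages are touched.

    /* Hedged sketch (assuming Linux): mmap hands out virtual memory right
     * away, but resident (physical) memory only grows once the pages are
     * actually touched.  VmRSS is read from /proc/self/status. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static void print_rss(const char *when) {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (!f) return;
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "VmRSS:", 6) == 0)
                printf("%s -> %s", when, line);
        fclose(f);
    }

    int main(void) {
        size_t len = 64u * 1024 * 1024;      /* 64 MiB of virtual memory */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        print_rss("after mmap, nothing touched");
        memset(p, 1, len);                   /* first writes fault pages in */
        print_rss("after touching every page ");

        munmap(p, len);
        return 0;
    }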
Note 3: All of the above depends on which OS. Different operating systems do things differently. I've chosen words carefully - e.g. "Typically" means that most modern operating systems work like I've described (but "typically" does not imply that all possible operating systems work like that; and some operating systems do not work like I've described).
No, the stack is normal memory. From the process's point of view, there is no difference (hence the nasty class of security bugs where you return a pointer to data on the stack, but the stack has since changed).
As Brendan wrote, the OS will set up a stack for the process at program load time. But if you access a non-allocated page of the stack (e.g. when your stack is growing), the kernel may automatically allocate a new stack page for you. (This is not much different from when you try to allocate more memory on the heap and there is none left in your program's address space, except that in that case you explicitly do a syscall to tell the kernel you want more heap memory.)
You will notice that usually the stack grows in one direction and the heap (allocated memory) in the other (usually toward each other). So if your program needs more stack, there is space for it, and if your program does not need much stack, that memory can be used for, e.g., a huge array. Or the contrary: if you do a lot of recursion you use a lot of stack (but you probably need less heap memory).
Two additional considerations: the CPU may have special stack instructions, but you can see them as syntactic sugar (you can simulate PUSH and POP with MOV, and CALL and RET with JMP plus the simulated PUSH and POP).
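A toy illustration of that "syntactic sugar" point in C (illustrative only): a push or pop is just a store or load plus a pointer adjustment on an ordinary block of memory.

    /* Toy sketch of the "syntactic sugar" point: a push or pop is just a
     * store or load plus a pointer adjustment on an ordinary block of
     * memory. */
    #include <stdio.h>

    #define STACK_WORDS 128

    static long stack_mem[STACK_WORDS];
    static long *sp = stack_mem + STACK_WORDS;  /* grows downward, like x86 */

    static void push(long v) { *--sp = v; }     /* "decrement, then store"  */
    static long pop(void)    { return *sp++; }  /* "load, then increment"   */

    int main(void) {
        push(42);
        push(7);
        long a = pop();                         /* 7: last pushed, first out */
        long b = pop();                         /* 42 */
        printf("%ld %ld\n", a, b);
        return 0;
    }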
And the kernel may use a separate stack for its own purposes (especially important for interrupts).

What is the difference between the stack and the heap?

A program is divided into 4 parts: Stack, data, code, heap.
I know what each of them is as a data structure (like those used in Java), but what is their difference (and definition) in operating systems?
A program is divided into 4 parts: Stack, data, code, heap.
That is not an accurate starting point.
A program is divided into program sections with various attributes.
Read Only/No execute (which you call data)
Read Only/Execute (which you call code)
Read/Write (which encompasses both heap and stack).
A stack is simply a block of memory that is allocated and freed using push and pop operations. The allocations and frees are usually implemented using a stack pointer register.
A heap is one or more blocks of memory that can be allocated and freed in any order and in various sizes. The operating system has no knowledge at all of program heaps. They are managed by libraries linked to the code (although the operating system will have heaps of its own). The operating system simply sees them as blocks of memory.
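To illustrate "heaps are managed by libraries", here is a deliberately tiny bump allocator (illustrative only, no free()): the operating system only ever sees the single mmap'd block; carving it into individual allocations happens entirely in user-space code.

    /* Deliberately tiny bump allocator (illustrative only, no free()): the
     * operating system sees just one mmap'd block; splitting it into
     * individual allocations is done entirely by this user-space code. */
    #include <stdio.h>
    #include <sys/mman.h>

    #define ARENA_SIZE (1u << 20)          /* one 1 MiB arena from the kernel */

    static char  *arena;
    static size_t used;

    static void *toy_alloc(size_t n) {
        n = (n + 15) & ~(size_t)15;        /* keep 16-byte alignment */
        if (!arena) {
            void *m = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (m == MAP_FAILED) return NULL;
            arena = m;
        }
        if (used + n > ARENA_SIZE) return NULL;
        void *p = arena + used;
        used += n;
        return p;
    }

    int main(void) {
        int *a = toy_alloc(sizeof *a);
        int *b = toy_alloc(100 * sizeof *b);
        printf("allocations at %p and %p, both inside one OS-level mapping\n",
               (void *)a, (void *)b);
        return 0;
    }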

Split stacks unnecessary on amd64

There seems to be an opinion out there that using a "split stack" runtime model is unnecessary on 64-bit architectures. I say seems to be, because I haven't seen anyone actually say that, only dance around it:
The memory usage of a typical multi-threaded program can decrease
significantly, as each thread does not require a worst-case stack
size. It becomes possible to run millions of threads (either full NPTL
threads or co-routines) in a 32-bit address space.
-- Ian Lance Taylor
...implying that a 64-bit address space can already handle it.
And...
... the constant overhead of split stacks and the narrow use case
(spawning enormous numbers of I/O-bound tasks on 32-bit architectures)
isn't acceptable...
-- bstrie
Two questions: Is this what they are saying? Second, if so, why are they unnecessary on 64-bit architectures?
Yes, that's what they are saying.
Split stacks are (currently) unnecessary on 64bit architectures because the 64bit virtual address space is so large it can contain millions of stack address ranges, each as large as an entire 32bit address space, if needed.
In the flat memory model in use nowadays, the translation from virtual addresses to physical memory locations is done with the support of the hardware MMU. On amd64 it turns out it's better (meaning, overall faster) to reserve big chunks of the 64-bit virtual address space for each new stack you create, while only mapping the first page (4 kB) to actual RAM. This way the stack can grow and shrink as needed over contiguous virtual addresses (meaning less code in each function prologue, a big optimization), while the OS re-configures the MMU to map each page of virtual addresses to an actual free page of RAM whenever the stack grows or shrinks above/below some configurable thresholds.
By choosing the thresholds smartly (see for example the theory of dynamic arrays) you can achieve O(1) complexity on the average stack operation, while retaining the benefits of millions of stacks that can grow as much as you need and only consume the memory they use.
PS: the current Go implementation is far behind any of this :-)
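One way a user-space runtime can spell out the "reserve a big range, map only the first page" idea is to reserve address space with PROT_NONE and commit pages later with mprotect. A hedged sketch (assuming Linux; note that on Linux a plain read/write anonymous mapping is lazily backed anyway, so this mainly makes the reserve/commit split visible):

    /* Hedged sketch: reserve a big range of addresses with PROT_NONE, then
     * "commit" pages with mprotect as the (downward-growing) stack needs
     * them.  On Linux a plain read/write anonymous mapping is lazily
     * backed anyway; this just makes the reserve/commit split explicit. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t reserve = 8u * 1024 * 1024;   /* 8 MiB of address space */
        size_t commit  = 4096;               /* start with a single page */

        char *base = mmap(NULL, reserve, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Make only the topmost page usable; a stack would start there and
         * further pages would be committed on demand as it grows down. */
        if (mprotect(base + reserve - commit, commit,
                     PROT_READ | PROT_WRITE) != 0) {
            perror("mprotect");
            return 1;
        }
        base[reserve - 1] = 42;              /* fine: this page is committed */
        printf("reserved %zu bytes at %p, %zu currently usable\n",
               reserve, (void *)base, commit);
        return 0;
    }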
The Go core team is currently discussing the possibility of using contiguous stacks in a future Go version.
The split stack approach is useful because stacks can grow more flexibly but it also requires that the runtime allocates a relatively big chunk of memory to distribute these stacks across. There has been a lot of confusion about Go's memory usage, in part because of this.
Making contiguous but growable (relocatable) stacks is an option that would provide the same flexibility and maybe reduce the confusion about Go's memory usage. As well as remedying some ill corner-cases on low-memory machines (see linked thread).
As to advantages/disadvantages on 32-bit vs. 64-bit architectures, I don't think there are any directly associated solely with the use of segmented stacks.
Update Go 1.4 (Q4 2014)
Change to the runtime:
Up to Go 1.4, the runtime (garbage collector, concurrency support, interface management, maps, slices, strings, ...) was mostly written in C, with some assembler support.
In 1.4, much of the code has been translated to Go so that the garbage collector can scan the stacks of programs in the runtime and get accurate information about what variables are active.
This rewrite allows the garbage collector in 1.4 to be fully precise, meaning that it is aware of the location of all active pointers in the program. This means the heap will be smaller as there will be no false positives keeping non-pointers alive. Other related changes also reduce the heap size, which is smaller by 10%-30% overall relative to the previous release.
A consequence is that stacks are no longer segmented, eliminating the "hot split" problem. When a stack limit is reached, a new, larger stack is allocated, all active frames for the goroutine are copied there, and any pointers into the stack are updated.
Initial answer (March 2014)
The article "Contiguous stacks in Go" by Agis Anastasopoulo also addresses this issue
In such cases where the stack boundary happens to fall in a tight loop, the overhead of creating and destroying segments repeatedly becomes significant.
This is called the “hot split” problem inside the Go community.
The “hot split” will be addressed in Go 1.3 by implementing contiguous stacks.
Now when a stack needs to grow, instead of allocating a new segment the following happens:
Create a new, somewhat larger stack
Copy the contents of the old stack to the new stack
Re-adjust every copied pointer to point to the new addresses
Destroy the old stack
The following mentions one problem seen mainly on 32-bit architectures:
There is a certain challenge though.
The 1.2 runtime doesn't know if a pointer-sized word in the stack is an actual pointer or not. There may be floats and, more rarely, integers that, if interpreted as pointers, would actually point to data.
Due to the lack of such knowledge the garbage collector has to conservatively consider all the locations in the stack frames to be roots. This leaves the possibility for memory leaks especially on 32-bit architectures since their address pool is much smaller.
When copying stacks however, such cases have to be avoided and only real pointers should be taken into account when re-adjusting.
Work was done though and information about live stack pointers is now embedded in the binaries and is available to the runtime.
This means not only that the collector in 1.3 can precisely scan stack data, but also that re-adjusting stack pointers is now possible.
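As a toy illustration of the grow-by-copying idea (not Go's actual implementation; the names and structure below are made up for the example): grow a contiguous block, copy the contents, and patch any pointer that pointed into the old block by the relocation offset. A real runtime needs precise knowledge of which words are pointers, which is exactly what the work described above provides.

    /* Toy sketch (not Go's implementation): a relocatable stack grows by
     * allocating a larger block, copying the contents, and re-adjusting
     * every pointer that pointed into the old block. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        char  *base;   /* start of the stack block */
        size_t cap;    /* capacity in bytes */
        size_t used;   /* bytes currently in use */
    } toy_stack;

    /* One "pointer into the stack" that must be fixed up after a move
     * (a real runtime would find these via precise pointer maps). */
    static char *saved_frame;

    static void grow(toy_stack *s) {
        /* Remember the pointer as an offset before the block moves. */
        size_t off = (size_t)(saved_frame - s->base);
        char *bigger = realloc(s->base, s->cap * 2); /* copies the contents */
        if (!bigger) { perror("realloc"); exit(1); }
        s->base = bigger;
        s->cap *= 2;
        saved_frame = s->base + off;                 /* re-adjust the pointer */
    }

    int main(void) {
        toy_stack s = { malloc(64), 64, 48 };
        if (!s.base) return 1;
        saved_frame = s.base + 16;   /* pretend this is a saved frame pointer */

        grow(&s);                    /* stack stays contiguous: no "hot split" */
        printf("stack moved to %p, frame pointer fixed to %p\n",
               (void *)s.base, (void *)saved_frame);
        free(s.base);
        return 0;
    }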

How does the amount of memory for a process get determined?

From my understanding, when a process is under execution it has some amount of memory at its disposal. As the stack increases in size it builds from one end of the process's memory (disregarding global variables that come before the stack), while the heap builds from the other end. If you keep adding to the stack or heap, eventually all the memory will be used up for this process.
How does the amount of memory the process is given get determined? I can only imagine it depends on a bunch of different variables, but an as-general-as-possible response would be great. If things have to get specific, I'm interested in linux processes written in C++.
On most platforms you will encounter, Linux runs with virtual memory enabled. This means that each process has its own virtual address space, the size of which is determined only by the hardware and the way the kernel has configured it.
For example, on the x86 architecture with a "3/1" split configuration, every userspace process has 3GB of address space available to it, within which the heap and stack are allocated. This is regardless of how much physical memory is available in the system. On the x86-64 architecture, 128TB of address space is typically available to each userspace process.
Physical memory is separately allocated to back that virtual memory. The amount of this available to a process depends upon the configuration of the system, but in general it's simply supplied "on-demand", limited mostly by how much physical memory and swap file space exists, and how much is currently in use for other purposes.
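The per-process caps mentioned above can be inspected via the rlimit mechanism. A small sketch (assuming Linux/POSIX):

    /* Sketch (assuming Linux/POSIX): the limits that cap the main stack and
     * the total address space of a process are resource limits, queryable
     * with getrlimit(2). */
    #include <stdio.h>
    #include <sys/resource.h>

    static void show(const char *name, int which) {
        struct rlimit rl;
        if (getrlimit(which, &rl) != 0) { perror(name); return; }
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("%-13s soft limit: unlimited\n", name);
        else
            printf("%-13s soft limit: %llu bytes\n", name,
                   (unsigned long long)rl.rlim_cur);
    }

    int main(void) {
        show("RLIMIT_STACK", RLIMIT_STACK);  /* cap on main-stack growth   */
        show("RLIMIT_AS",    RLIMIT_AS);     /* cap on total address space */
        show("RLIMIT_DATA",  RLIMIT_DATA);   /* cap on the data segment    */
        return 0;
    }

From the shell, ulimit -s shows the same stack limit; RLIMIT_STACK is what bounds how far the kernel will grow the main stack.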
The stack does not magically grow. Its size is static, and that size is determined at link time. So when you use up enough of the stack space, it overflows (stack overflow ;)
On the other hand, the heap area "magically" grows, meaning that whenever more memory is needed for the heap, the program asks the operating system for more memory.
EDIT: As Mat pointed out below, the stack actually can increase during runtime on modern operating systems.
