"Red zone": Memory management on linux - linux

Regarding memory management on linux:
On Linux (x86-64, per the System V ABI), there is a non-volatile area called "the red zone": the 128 bytes below the stack pointer (the area you would reach next when pushing onto the stack) that is reserved and safe to use, i.e. the OS and signal handlers will not overwrite it.
The questions are:
How big is the area beyond (below) the red zone, the part that CAN be overwritten (is volatile), before you can safely place data again (using an mmap call or similar)? (E.g. when allocating memory for a coroutine stack.)
What could possibly alter this volatile memory area? Wikipedia lists "interrupt/exception/signal handlers".
Assume a language that doesn't have language-level exception handling (e.g. assembly). It seems that "exception handlers" here are just signal handlers for hardware exceptions such as divide-by-zero. If so, we can reduce the list to "interrupt/signal handlers".
Does the kernel simply write into user-space memory on an interrupt? That seems "wrong". Shouldn't each thread have its own kernel stack in kernel space?
Regarding signals:
"If a signal handler is not installed for a particular signal, the default handler is used."
Does this apply even in assembly (does the kernel supply a default handler), or is the "default handler" really just part of the C runtime?

Related

How are stack and heap segment managed in x86 without utilizing the segmentation mechanism?

From Understanding the Linux Kernel:
Segmentation has been included in 80x86 microprocessors to encourage programmers to split their applications into logically related entities, such as subroutines or global and local data areas. However, Linux uses segmentation in a very limited way. In fact, segmentation and paging are somewhat redundant, because both can be used to separate the physical address spaces of processes: segmentation can assign a different linear address space to each process, while paging can map the same linear address space into different physical address spaces. Linux prefers paging to segmentation for the following reasons:
Memory management is simpler when all processes use the same segment register values—that is, when they share the same set of linear addresses.
One of the design objectives of Linux is portability to a wide range of architectures; RISC architectures, in particular, have limited support for segmentation.
The 2.6 version of Linux uses segmentation only when required by the 80x86 architecture.
The x86-64 architecture does not use segmentation in long mode (64-bit mode). Since the x86 always has segments, it is not possible to not use them; instead, the bases of four of the segment registers (CS, SS, DS, and ES) are forced to 0, and the limit to 2^64. Given that, a few questions have been raised:
Stack data (stack segment) and heap data (data segment) are mixed together; does popping from the stack by adjusting the stack pointer (RSP) still work?
How does the operating system know which type of data is (stack or heap) in a specific virtual memory address?
How do different programs share the kernel code by sharing memory?
Stack data (stack segment) and heap data (data segment) are mixed together; does popping from the stack by adjusting the stack pointer (RSP) still work?
As Peter states in the above comment, even though CS, SS, ES and DS are all treated as having zero base, this does not change the behavior of PUSH/POP in any way. It is no different than any other segment descriptor usage really. You could get overlapping segments even in 32-bit multi-segment mode if you point multiple selectors to the same descriptor. The only thing that "changes" in 64-bit mode is that you have a base forced by the CPU, and RSP can be used to point anywhere in addressable memory. PUSH/POP operations will work as usual.
How does the operating system know which type of data is (stack or heap) in a specific virtual memory address?
User-space programs can (and will) move the stack and heap around as they please. The operating system doesn't really need to know where the stack and heap are, but it can keep track of them to some extent, assuming the user-space application does everything according to convention, that is, uses the stack allocated by the kernel at program startup and uses the program break as its heap.
For the stack allocated by the kernel at program startup, or for a memory area obtained through mmap(2) with MAP_GROWSDOWN, the kernel tries to help by automatically growing the mapping when it is exceeded (i.e. on stack overflow), but this has its limits. Manual MAP_GROWSDOWN mappings are rarely used in practice (see 1, 2, 3, 4). POSIX threads and other more modern implementations use fixed-size mappings for thread stacks.
"Heap" is a pretty abstract concept in modern user-space applications. Linux provides user-space applications with the basic ability to manipulate the program break through brk(2) and sbrk(2), but this is rarely in a 1-to-1 correspondence with what we got used to call "heap" nowadays. So in general the kernel does not know where the heap of an application resides.
How do different programs share the kernel code by sharing memory?
This is simply done through paging. The kernel's code and data are mapped (in the upper part of the linear address space) into every process's page tables, and those page-table entries point to the same physical pages for every process; that is how all processes share one copy of the kernel. Traditionally no CR3 change is needed on a syscall: CR3 changes only on a context switch between tasks, when the kernel makes it point to the next process's page global directory. (With kernel page-table isolation, KPTI, the kernel does additionally switch CR3 on every user/kernel transition, using a shadow set of page tables.)

Does using the program stack involve syscalls?

I'm studying operating system theory. I know that heap allocation involves a specific syscall, and I know that allocators usually optimize for this by requesting more memory than needed beforehand.
But I can't find information about stack allocation. Does it involve a specific syscall every time you read from or write to the stack (for example, when you call a function with some parameters)? Or is there some other mechanism that doesn't involve syscalls?
Typically when the OS starts your program it examines the executable file's headers and arranges various areas for various things (an area for your executable's code, an area for your executable's data, etc). This includes setting up an initial stack (and a lot more - e.g. finding shared libraries and doing dynamic linking).
After the OS has done all this, your executable starts executing. At this point you already have memory for a stack and can just use it without any system calls.
Note 1: If you create threads, then there will probably be a system call involved to create the thread and that system call will probably allocate memory for the new thread's stack.
Note 2: Typically there's "virtual memory" (what your program sees) and "physical memory" (what the hardware sees); and in between typically the OS does lots of tricks to improve performance and avoid wasting physical memory, and to hide resource limits (so you don't have to worry so much about running out of physical memory). One of these tricks is to allocate virtual memory (e.g. for a large stack) without allocating any actual physical memory, and then allocate the physical memory if/when the virtual memory is first modified. Other tricks include various "swap space" schemes, and memory mapped files. These tricks rely on requests generated by the CPU on your program's behalf (e.g. page fault exceptions) which aren't system calls, but have similar ("ask kernel to do something") characteristics.
Note 3: All of the above depends on which OS. Different operating systems do things differently. I've chosen words carefully - e.g. "Typically" means that most modern operating systems work like I've described (but "typically" does not imply that all possible operating systems work like that; and some operating systems do not work like I've described).
No, the stack is normal memory. From the process's point of view, there is no difference (hence the nasty class of bugs where you return a pointer to data on the stack, but the stack has since changed).
As Brendan wrote, the OS sets up the stack for the process at program load. But if you access a not-yet-allocated page of the stack (e.g. because your stack is growing), the kernel may automatically allocate a new stack page for you. (This is not much different from trying to allocate new heap memory when there is none left in the program's address space, except that in that case you explicitly make a syscall to tell the kernel you want more heap memory.)
You will notice that usually the stack grows in one direction and the heap (allocated memory) in the other (usually toward each other). So if your program needs more stack, it has room; and if your program does not need much stack, you can use the memory for, e.g., a huge array. Or the contrary: if you do a lot of recursion, you allocate a lot of stack (but you probably need less heap memory).
Two additional considerations: the CPU may have special stack instructions, but you can see them as syntactic sugar: you can simulate PUSH and POP with MOV, and CALL and RET with JMP (plus a simulated PUSH and POP).
And the kernel may use a special stack for its own purposes (especially important for interrupts).

Performance of read()/write() to/from Linux SKBs

Consider a standard Linux system with a userland application and the kernel network stack. I've read that moving frames from user space to kernel space (and vice versa) can be expensive in terms of CPU cycles.
My questions are,
Why? And does moving the frame in one direction (i.e. from user to kernel) have a higher impact?
Also, how do things differ when you move to TAP-based interfaces? The frame will still be crossing between user and kernel space. Do the same concerns apply, or is there some form of zero-copy in play?
Addressing questions in-line:
Why? And does moving the frame in one direction (i.e. from user to kernel) have a higher impact?
Moving to/from user/kernel spaces is expensive because the OS has to:
Validate the pointers for the copy operation.
Transfer the actual data.
Incur the usual costs involved in transitioning between user/kernel mode.
There are some exceptions to this, such as if your driver implements a strategy such as "page flipping", which effectively remaps a chunk/page of memory so that it is accessible to a userspace application. This is "close enough" to a zero copy operation.
With respect to copy_to_user/copy_from_user performance, the performance of the two functions is apparently comparable.
Also, how do things differ when you move to TAP-based interfaces? The frame will still be crossing between user and kernel space. Do the same concerns apply, or is there some form of zero-copy in play?
With TUN/TAP based interfaces, the same considerations apply, unless you're using some sort of DMA, page-flipping, or similar logic.
Context Switch
Moving frames between user space and kernel space requires a user/kernel transition (often loosely called a context switch), usually caused by a system call (which on 32-bit x86 traditionally invokes the int 0x80 interrupt).
An interrupt happens, entering kernel space;
When the interrupt happens, the OS stores all of the register values (ds, es, fs, eax, cr3, etc.) on the thread's kernel stack;
Then it jumps to the IRQ handler like a function call;
Through some common IRQ execution path, it will choose the next thread to run by some algorithm;
The runtime state (all the registers) of the next thread is loaded;
Back to user space.
As we can see, a lot of work is done when moving data into/out of the kernel, much more than a simple function call (which just adjusts ebp, esp, and eip). That is why this behavior is relatively time-consuming.
Virtual Devices
As a virtual network device, writing to a TAP is no different from writing to any other /dev/xxx device.
If you write to a TAP, the OS is interrupted as described above; it then copies your arguments into the kernel and blocks your current thread (with blocking I/O). The kernel driver thread is notified in some way (e.g. via a message queue) to receive the arguments and consume them.
In Android, there exist some zero-copy system calls, and in my demo implementation this can be done through address translation between user and kernel space. Because the kernel and user threads do not share the same address space and the user thread's data may change, we usually copy data into the kernel. So if we meet these conditions, we can avoid the copy:
the system call must block, i.e. the data won't change;
addresses are translated via the page tables, i.e. the kernel can refer to the right data.
Code
The following is code from my demo OS, related to this question, if you are interested in the details:
interrupt handle procedure: do_irq.S, irq_handle.c
system call: syscall.c, ide.c
address translation: MM_util.c

Where are the stacks for the other threads located in a process virtual address space?

The following image shows where the sections of a process are laid out in the process's virtual address space (in Linux):
You can see that there is only one stack section (I assume because this process has only one thread).
But what if this process has another thread; where will the stack for that second thread be located? Will it be located immediately below the first stack?
Stack space for a new thread is created by the parent thread with mmap(MAP_ANONYMOUS|MAP_STACK). So they're in the "memory map segment", as your diagram labels it. It can end up anywhere that a large malloc() could go. (glibc malloc(3) uses mmap(MAP_ANONYMOUS) for large allocations.)
(MAP_STACK is currently a no-op, and exists in case some future architecture needs special handling).
You pass a pointer to the new thread's stack space to the clone(2) system call which actually creates the thread. (Try using strace -f on a multi-threaded process sometime). See also this blog post about creating a thread using raw Linux syscalls.
See this answer on a related question for some more details about mmaping stacks. e.g. MAP_GROWSDOWN doesn't prevent another mmap() from picking the address right below the thread stack, so you can't depend on it to dynamically grow a small stack the way you can for the main thread's stack (where the kernel reserves the address space even though it's not mapped yet).
So even though mmap(MAP_GROWSDOWN) was designed for allocating stacks, it's so bad that Ulrich Drepper proposed removing it in 2.6.29.
Also, note that your memory-map diagram is for a 32-bit kernel. A 64-bit kernel doesn't have to reserve any user virtual-address space for mapping kernel memory, so a 32-bit process running on an amd64 kernel can use the full 4GB of virtual address space. (Except for the low 64k by default (sysctl vm.mmap_min_addr = 65536), so NULL-pointer dereference does actually fault. And the top page is also reserved, because values there are used as syscall error codes, not valid pointers.)
Related:
See Relation between stack limit and threads for more about stack-size for pthreads. getrlimit(RLIMIT_STACK) is the main thread's stack size. Linux pthreads uses RLIMIT_STACK as the stack size for new threads, too.

Detect Stack overflows

How do operating systems detect stack overflows of user-space programs [and then send SIGTERM or SIGSEGV to those userspace programs] ?
Guard pages. When the OS creates the stack for the program it will allocate a little bit more than is specified. The memory is allocated in pages (usually 4KB each), and the extra page will have settings such that any attempt to access it will result in an exception being thrown.
The answer will depend on the target architecture and the particular OS. Since the question is tagged Linux, you have rather biased the question which on the face of it seems more general.
In a sophisticated OS or RTOS such as Linux or QNX Neutrino, with MMU protection support, memory protection mechanisms may be used such as the guard pages already mentioned. Such OSs require a target with an MMU of course.
Simpler OSs and typical RTOS scheduling kernels without MMU support may use a number of methods. The simplest is to place a guard signature at the top of the stack, which is checked for modification when the scheduler runs. This is a bit hit-and-miss, it requires that the stack-overflow actually modifies the signature, and that the resulting corruption does not cause a crash before the scheduler next runs. Some systems with on-chip debug resources may be able to place an access break-point on the signature word and cause an exception when it is hit.
In development a common technique is to initially fill each thread stack with a signature and to have a thread periodically check for the "high-tide" and issue a warning if it exceeds a certain percentage level.
As well as guard pages mentioned in another answer, some smaller (MMU-less) embedded microcontrollers have specific exceptions for stack overflow (and underflow).

Resources