What is "stack hog" - linux

What does "stack hog" means when talking about Linux kernel?
I read this notion on some linux kernel books(Professional Linux Kernel Architecture by Wolfgang Mauerer), but what exactly does "stack hog" means? Thanks.

"Stack hog" is an informal name used to describe functions that use significant amounts of automatic storage (AKA "the stack"). What exactly counts as "hogging" varies by the execution environment: in general, kernel-level functions have tighter limits on the stack space - just a few kilobytes, so a function considered a "stack hog" in kernel mode may become a "good citizen" in user mode.
A common reason for a function to become a stack hog is allocating buffers or other arrays in the automatic memory. This is more convenient, because you do not need to remember to free the memory and check the results of the allocation. You could also save some CPU cycles on the allocation itself. The downside is a possibility of overflowing the stack, wich results in panic for kernel-level programs. That is why a common remedy of "stack hogging" is moving some of your buffers into dynamic memory.

The Linux kernel uses 4K stacks. Using an inordinate amount of that small space is considered being a hog. If you are "lazy" and allocate a buffer on the stack or have a function with a large number of parameters that is being a hog.
The stack must hold any sequence of calls needed to service a system call as well as any interrupt handlers that may be called. So it is very important to conserve stack space.

Related

Does using the program stack involves syscalls?

I'm studying operating system theory, and I know that heap allocation involves a specific syscall and I know that compilers usually optimize for this requesting more than needed beforehand.
But I don't find information about stack allocation. What about it? It involves a specific syscall every time you read from it or write to it (for example when you call a function with some parameters)? Or there is some other mechanism that don't involve syscall perhaps?
Typically when the OS starts your program it examines the executable file's headers and arranges various areas for various things (an area for your executable's code, and area for your executable's data, etc). This includes setting up an initial stack (and a lot more - e.g. finding shared libraries and doing dynamic linking).
After the OS has done all this, your executable starts executing. At this point you already have memory for a stack and can just use it without any system calls.
Note 1: If you create threads, then there will probably be a system call involved to create the thread and that system call will probably allocate memory for the new thread's stack.
Note 2: Typically there's "virtual memory" (what your program sees) and "physical memory" (what the hardware sees); and in between typically the OS does lots of tricks to improve performance and avoid wasting physical memory, and to hide resource limits (so you don't have to worry so much about running out of physical memory). One of these tricks is to allocate virtual memory (e.g. for a large stack) without allocating any actual physical memory, and then allocate the physical memory if/when the virtual memory is first modified. Other tricks include various "swap space" schemes, and memory mapped files. These tricks rely on requests generated by the CPU on your program's behalf (e.g. page fault exceptions) which aren't system calls, but have similar ("ask kernel to do something") characteristics.
Note 3: All of the above depends on which OS. Different operating systems do things differently. I've chosen words carefully - e.g. "Typically" means that most modern operating systems work like I've described (but "typically" does not imply that all possible operating systems work like that; and some operating systems do not work like I've described).
No, stack is normal memory. For process point of view, there is no difference (and so the nasty security bug, where you return a pointer to a data in stack, but stack now is changed.
As Brendan wrote, OS will setup stack for the process at program loading. But if you access a non-allocated page of stack (e.g. if your stack if growing), kernel may allocate automatically for you a new stack page. (not much different as when you try to allocate new memory in heap, and there is no more memory available on program space: but in this case you explicitly do a syscall to tell kernel you want more heap memory).
You will notice that usually stack go in one direction and heap (allocated memory) in the other direction (usually toward each others). So if you program need more stack you have space, but if you program do not need much stack, you can use memory for e.g. huge array. Or the contrary: if you do a lot of recursion, you allocate much stack (but you probably need less heap memory).
Two additional consideration: CPU may have special stack instruction. But you can see them as syntactic sugar (you can simulate PUSH and POP with MOV. CALL and RET with JMP (and simulated PUSH and POP).
And kernel may use a special stack for his own purposes (especially important for interrupts).

Linux multiple page bounds and cpu segments

I am puzzled as to how Linux is able to have so many segments and it can still have bounds checking. To my knowledge, modern CPUs have a couple of segment data registers (code, data, etc).
But Linux has multiple segments of its own: Stack, BSS, heap, code, globals, and many more (especially if the heap is large and composed of many segments). Not every CPU has enough registers to track all these segments.
If I am not mistaken, Linux stores each segment in a separate page, so how is it able to prevent one of these pages from reading or writing out of bounds?
My only possible explanations are that Linux:
performs some manual checking on every write
places all the pages close together in a way such that they can be tracked with a few registers
With the advent of 64-bit Intel, the concept of hardware segments has died the death that should have taken place in the 1970's.
But Linux has multiple segments of its own: Stack, BSS, heap, code, globals, and many more (especially if the heap is large and composed of many segments).
These are pedagogical concepts that have little relationship to reality outside the implementation of linkers--but which bad books on operating systems persist in using.
A stack is just memory. Heap is just memory. The operating system has no knowledge if memory is being used for a stack whether it is being used for a heap. The operating system simply allocates memory to a process with different attributes (e.g., read/write, read only, read/execute). What the process does with that memory is its own business.

Split stacks unneccesary on amd64

There seems to be an opinion out there that using a "split stack" runtime model is unnecessary on 64-bit architectures. I say seems to be, because I haven't seen anyone actually say that, only dance around it:
The memory usage of a typical multi-threaded program can decrease
significantly, as each thread does not require a worst-case stack
size. It becomes possible to run millions of threads (either full NPTL
threads or co-routines) in a 32-bit address space.
-- Ian Lance Taylor
...implying that a 64-bit address space can already handle it.
And...
... the constant overhead of split stacks and the narrow use case
(spawning enormous numbers of I/O-bound tasks on 32-bit architectures)
isn't acceptable...
-- bstrie
Two questions: Is this what they are saying? Second, if so, why are they unneccesary on 64-bit architectures?
Yes, that's what they are saying.
Split stacks are (currently) unnecessary on 64bit architectures because the 64bit virtual address space is so large it can contain millions of stack address ranges, each as large as an entire 32bit address space, if needed.
In the Flat memory model in use nowadays, the translation from virtual addresses to phisical memory locations is done with the support of the hardware MMU. On amd64 it turns out it's better (meaning, overall faster) to reserve big chunks of the 64bit virtual address space to each new stack you are creating, while only mapping the first page (4kB) to actual RAM. This way, the stack will be able to grow and shrink as needed, over contiguous virtual addresses (meaning less code in each function prologue, a big optimization) while the OS re-configures the MMU to map each page of virtual addresses to an actual free page of RAM, whenever the stack grows or shrinks above/below some configurable thresholds.
By choosing the thresholds smartly (see for example the theory of dynamic arrays) you can achieve O(1) complexity on the average stack operation, while retaining the benefits of millions of stacks that can grow as much as you need and only consume the memory they use.
PS: the current Go implementation is far behind any of this :-)
The Go core team is currently discussing the possibility of using contiguous stacks in a future Go version.
The split stack approach is useful because stacks can grow more flexibly but it also requires that the runtime allocates a relatively big chunk of memory to distribute these stacks across. There has been a lot of confusion about Go's memory usage, in part because of this.
Making contiguous but growable (relocatable) stacks is an option that would provide the same flexibility and maybe reduce the confusion about Go's memory usage. As well as remedying some ill corner-cases on low-memory machines (see linked thread).
As to advantages/disadvantages on 32-bit vs. 64-bit architectures, I don't think there are any directly associated solely with the use of segmented stacks.
Update Go 1.4 (Q4 2014)
Change to the runtime:
Up to Go 1.4, the runtime (garbage collector, concurrency support, interface management, maps, slices, strings, ...) was mostly written in C, with some assembler support.
In 1.4, much of the code has been translated to Go so that the garbage collector can scan the stacks of programs in the runtime and get accurate information about what variables are active.
This rewrite allows the garbage collector in 1.4 to be fully precise, meaning that it is aware of the location of all active pointers in the program. This means the heap will be smaller as there will be no false positives keeping non-pointers alive. Other related changes also reduce the heap size, which is smaller by 10%-30% overall relative to the previous release.
A consequence is that stacks are no longer segmented, eliminating the "hot split" problem. When a stack limit is reached, a new, larger stack is allocated, all active frames for the goroutine are copied there, and any pointers into the stack are updated.
Initial answer (March 2014)
The article "Contiguous stacks in Go" by Agis Anastasopoulo also addresses this issue
In such cases where the stack boundary happens to fall in a tight loop, the overhead of creating and destroying segments repeatedly becomes significant.
This is called the “hot split” problem inside the Go community.
The “hot split” will be addressed in Go 1.3 by implementing contiguous stacks.
Now when a stack needs to grow, instead of allocating a new segment the following happens:
Create a new, somewhat larger stack
Copy the contents of the old stack to the new stack
Re-adjust every copied pointer to point to the new addresses
Destroy the old stack
The following mention one problem seen mainly in 32-bit arhcitectures:
There is a certain challenge though.
The 1.2 runtime doesn’t know if a pointer-sized word in the stack is an actual pointer or not. There may be floats and most rarely integers that if interpreted as pointers, would actually point to data.
Due to the lack of such knowledge the garbage collector has to conservatively consider all the locations in the stack frames to be roots. This leaves the possibility for memory leaks especially on 32-bit architectures since their address pool is much smaller.
When copying stacks however, such cases have to be avoided and only real pointers should be taken into account when re-adjusting.
Work was done though and information about live stack pointers is now embedded in the binaries and is available to the runtime.
This means not only that the collector in 1.3 can precisely stack data but re-adjusting stack pointers is now possible.

How is memory allocated on heap without a system call?

I was wondering that if the space required on heap is not large enough
such that there is no need for a brk/sbrk system all (to shift the break pointer (brk) of data segment), how does a library function (such as malloc) allocates space on heap.
I am not asking about the data-structures and algorithms for heap management. I am just asking how does malloc get the address of the first location of the heap if it doesn't invoke a system call. I am asking this because I have heard that it is not always necessary to invoke a system call (brk/sbrk) as these are only required to expand the space.Please correct me if I am wrong.
The basic idea is that when your program starts, the heap is very small, but not necessarily zero. If you only allocate (malloc) a small amount of memory, the library is able to handle it within the small amount of space it has when it is loaded. However, when malloc runs out of that space, it needs to make a system call to get more memory.
That system call is often sbrk(), which moves the top of the heap's memory region up by a certain amount. Usually, the malloc library routine increases the heap by larger than what is needed for the current allocation, with the hope that future allocations can be performed w/o making a system call.
Other implementations of malloc use mmap() instead -- this allows the program to create a sparse virtual memory mapping. However, mmap() based malloc implementations do the same thing as the sbrk()-based ones: each system call reserves more memory than what is necessarily needed for the current call.
One way to look at this is to trace a program that uses malloc: you'll see that for N calls to malloc, you will see M system calls (where M is much smaller than N).
The short answer is that it uses sbrk() to allocate a big hunk, which at that point belongs to your app process. It can then further parcel out subsections of that as individual malloc calls without needing to ask the system for anything, until it exhausts that space and needs to resort sbrk() again.
You said you didn't want the details on the data structures, but suffice it to say that the implementation of malloc (i.e. your own process, not the OS kernel) is keeping track of which space in the region it got from the system is spoken for and which is still available to dole out as individual mallocs. It's like buying a big tract of land, then subdividing it into lots for individual houses.
Use sbrk() or mmap() — http://linux.die.net/man/2/sbrk, http://linux.die.net/man/2/mmap

How does the amount of memory for a process get determined?

From my understanding, when a process is under execution it has some amount of memory at it's disposal. As the stack increases in size it builds from one end of the process (disregarding global variables that come before the stack), while the heap builds from another end. If you keep adding to the stack or heap, eventually all the memory will be used up for this process.
How does the amount of memory the process is given get determined? I can only imagine it depends on a bunch of different variables, but an as-general-as-possible response would be great. If things have to get specific, I'm interested in linux processes written in C++.
On most platforms you will encounter, Linux runs with virtual memory enabled. This means that each process has its own virtual address space, the size of which is determined only by the hardware and the way the kernel has configured it.
For example, on the x86 architecture with a "3/1" split configuration, every userspace process has 3GB of address space available to it, within which the heap and stack are allocated. This is regardless of how much physical memory is available in the system. On the x86-64 architecture, 128TB of address space is typically available to each userspace process.
Physical memory is separately allocated to back that virtual memory. The amount of this available to a process depends upon the configuration of the system, but in general it's simply supplied "on-demand" - limited mostly how much physical memory and swap file space exists, and how much is currently in use for other purposes.
The stack does not magically grow. It's size is static and the size is determined at linking time. So when you take enough space from the stack, it overflows (stack overflow ;)
On the other hand, the heap area 'magically' grows. Meaning that when ever more memory is needed for heap, the program asks operating system for more memory.
EDIT: As Mat pointed out below, the stack actually can increase during runtime on modern operating systems.

Resources