NPTL Default Stack Size Problem - linux

I am developing a multithreaded modular application using the C programming language and NPTL 2.6. A POSIX thread is created for each plugin. The problem is that each thread has its own stack area; since the default stack size depends on the user's choice (the ulimit setting), this may result in huge memory consumption in some cases.
To prevent unnecessary memory usage I used something like this to change the stack size before creating each thread:
size_t st1, st2;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_getstacksize(&attr, &st1);
if (pthread_attr_setstacksize(&attr, MODULE_THREAD_SIZE) != 0) perror("Stack ERR");
pthread_attr_getstacksize(&attr, &st2);
printf("OLD:%zu, NEW:%zu - MIN: %d\n", st1, st2, PTHREAD_STACK_MIN);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
/* "this" is a static data structure that stores plugin-related data */
pthread_create(&this->runner, &attr, (void *(*)(void *))this->run, NULL);
EDIT I: pthread_create() section added.
This did not work as I expected: the stack size reported by pthread_attr_getstacksize() changed, but the total memory usage of the application (from ps/top/pmap output) did not:
OLD:10485760, NEW:65536 - MIN: 16384
When I use ulimit -s MY_STACK_SIZE_LIMIT before starting the application, I get the expected result.
My questions are:
1-) Is there any portable (between UNIX variants) way to change the (default) thread stack size after starting the application (before creating the thread, of course)?
2-) Is it possible to use the same stack area for every thread?
3-) Is it possible to completely disable the stack for threads without much pain?

Answers for #2 and #3 are no and no. Each thread needs a stack (where else would your local variables and return addresses go?), and stacks need to be unique per thread (otherwise threads would overwrite each other's local variables and return addresses, making everybody crash).
As for #1... the set stack size call is precisely the answer for this. I suggest you figure out an acceptable size to create your threads with, and set it.
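For example, here is a minimal, self-contained sketch of that advice (MODULE_THREAD_SIZE, its 64 KB value, and plugin_main are placeholders for illustration, not taken from the question):
#include <limits.h>      /* PTHREAD_STACK_MIN */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MODULE_THREAD_SIZE (64 * 1024)   /* illustrative choice */

static void *plugin_main(void *arg)      /* hypothetical thread body */
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    size_t size = MODULE_THREAD_SIZE;
    int rc;

    if (size < PTHREAD_STACK_MIN)        /* never ask for less than the system minimum */
        size = PTHREAD_STACK_MIN;

    pthread_attr_init(&attr);
    rc = pthread_attr_setstacksize(&attr, size);
    if (rc != 0)
        fprintf(stderr, "pthread_attr_setstacksize: %s\n", strerror(rc));

    rc = pthread_create(&tid, &attr, plugin_main, NULL);
    if (rc != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(rc));
    else
        pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    return 0;
}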
As for why things don't look right to you in top.... top is a notorious liar about memory usage. :-) Is stuff actually failing to be allocated or getting OOM-killed? Are thread creations failing? Is performance suffering and paging to disk increasing? If the answer to these questions is no, then I don't think there's much to worry about.
Update based on some comments below and above:
First, 16 KB is still pretty big for something that you say doesn't need much stack space. If you really want to go small, I would be tempted to say 4096 or 8192 on x86 Linux. Second, yes, you can set your CPU's stack pointer to something else. But whatever you point it at still has to come from malloc() or mmap(), and that's going to take up space; I don't know how you think it's going to help to set the stack pointer to something else. That said, if you really feel strongly that the thread that calls main() has too big a stack (I would say that is slightly crazy) and that pthread_attr_setstacksize() doesn't let you get small enough (?), then maybe you can look into non-portable stuff like creating threads by calling the clone() syscall and specifying stacks based on the main thread's stack pointer, or a buffer from elsewhere, or whatever. But you're still going to need a stack for each thread, and I have a feeling top is still going to disappoint you. Maybe your expectations are a little high.

I have seen this problem as well. It is unclear how the stacks are accounted for, but the "extra" space is counted against your total VM, and if you run up against your process limit you are in trouble (even though you aren't actually using the space). It seems to depend on which version of Linux you are running (even within the 2.6 family) and whether you are 32-bit or 64-bit.

Related

How much stack space is typically reserved for a thread? (POSIX / OSX)

The answer probably differs depending on the OS, but I'm curious how much stack space does a thread normally preallocate. For example, if I use:
push rax
that will put a value on the stack and adjust rsp. But what if I never use a push op? I imagine some space still gets allocated, but how much? Also, is this a fixed amount, or does it grow dynamically with the amount of stuff pushed?
POSIX does not define any standards regarding stack size; it is entirely implementation dependent. Since you tagged this OSX, the default allocations there are:
Main thread (8MB)
Secondary Thread (512kB)
Naturally, these can be configured to suit your needs. The allocation is dynamic:
The minimum allowed stack size for secondary threads is 16 KB and the
stack size must be a multiple of 4 KB. The space for this memory is
set aside in your process space at thread creation time, but the
actual pages associated with that memory are not created until they
are needed.
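If you want to see the figure on your own system, here is a small portable sketch (the value printed is whatever default your pthreads implementation uses; on macOS it is typically the 512 kB secondary-thread figure quoted above):
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    /* A freshly initialised attribute object carries the default stack size
       that pthread_create() would use. */
    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &size);
    printf("default thread stack size: %zu bytes\n", size);

    pthread_attr_destroy(&attr);
    return 0;
}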
There is too much detail to include here. I suggest you read:
Thread Management (Mac Developer Library)

What is the Linux Stack?

I recently ran into a bug with the "linux stack" and the "linux stack size". I came across a blog directing me to try
ulimit -a
to see what the limit for my box was; it was set to 8192 kB, which seems to be the default.
What is the "linux stack"? How does it work, what does it store, what does it do?
The short answer is:
When programs on your Linux box run, they add and remove data from the stack on a regular basis as they execute. The stack size refers to how much space is allocated in memory for the stack. Increasing the stack size allows the program to nest more routine calls: each time a function is called, its data can be added to the stack (stacked on top of the previous routine's data).
Unless the program is very complex or designed for a special purpose, a stack size of 8192 kB is normally fine. Some programs, such as graphics processing programs, require you to increase the size of the stack to function, as they may store a lot of data on the stack.
Feel free to increase the stack size for those applications; it's not a problem. To do so, use
ulimit -s <size in kilobytes>
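The same limit can also be inspected or adjusted from inside a program through the POSIX resource-limit API. A minimal sketch (the 16 MB figure is just an example; note that raising the limit mid-run lets the main stack grow further, while a pthreads implementation typically samples it only at startup for default thread stacks):
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* Read the current stack limit (the same limit ulimit -s shows, but in bytes). */
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        printf("soft limit: %llu bytes, hard limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    /* Raise the soft limit to 16 MB (it must stay at or below the hard limit). */
    rl.rlim_cur = 16 * 1024 * 1024;
    if (setrlimit(RLIMIT_STACK, &rl) != 0)
        perror("setrlimit");
    return 0;
}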
BTW, What is a StackOverflowError?

Maximum Thread Stack Size .NET?

What is the maximum stack size allowed for a thread in C#.NET 2.0? Also, does this value depend on the version of the CLR and/or the bitness (32 or 64) of the underlying OS?
I have looked at the following resources msdn1 and msdn2
public Thread(
ThreadStart start,
int maxStackSize
)
The only information I can see is that the default size is 1 megabyte, and that in the above method, if maxStackSize is '0', the default maximum stack size specified in the header for the executable will be used. What is the maximum value we can set in the header, and is it advisable to do so? Thanks.
For the record, this fits Raymond Chen's category of "if you need to know then you are doing something wrong".
The default stack size for threads running 64-bit code is 4 megabytes, 1 megabyte for 32-bit code. While the Thread constructor lets you pass an integer value up to int.MaxValue, you'll never get that on a 32-bit machine. The stack must fit in an available hole in the virtual memory address space, which usually tops out at around 600 MB early in the process lifetime and rapidly shrinks as you allocate memory and fragment the address space.
Allocating more than the default is quite unnecessary. You might contemplate doing this when you have a heavily recursive method that blows the stack. Don't, fix the algorithm or you'll blow it anyway when the job gets bigger.
The smallest stack that .NET lets you choose is 250 KB. It silently rounds it up if you pass a value that's smaller. Necessary because both the jitter and the garbage collector need stack space to get their job done. Again, doing so should be quite unnecessary. If you contemplate doing so because you have a lot of threads and consume all virtual memory with their stacks then you have too many threads. A StackOverflowException is one of the nastiest runtime exceptions you can get. Process death is immediate and untrappable.
The stack size for the main thread is determined by an option in the EXE header. The compiler doesn't have an option to change it, you have to use editbin.exe /stack to patch the .exe header.
I am unaware of what the maximum is, but MSDN speaks to whether you should do it or not:
Avoid using this constructor overload. The default stack size used by the Thread(ThreadStart) constructor overload is the recommended stack size for threads. If a thread has memory problems, the most likely cause is programming error, such as infinite recursion.
I have never had a StackOverflow occur in C# which was not due to infinite recursion. If there truly was a case where recursion went to that depth, I would consider replacing it with iteration.

Is a stack overflow a security hole?

Note: this question relates to stack overflows (think infinite recursion), NOT buffer overflows.
If I write a program that is correct, but it accepts an input from the Internet that determines the level of recursion in a recursive function that it calls, is that potentially sufficient to allow someone to compromise the machine?
I know someone might be able to crash the process by causing a stack overflow, but could they inject code? Or does the C runtime detect the stack overflow condition and abort cleanly?
Just curious...
Rapid Refresher
First off, you need to understand that the fundamental units of protection in modern OSes are the process and the memory page. Processes are memory protection domains; they are the level at which an OS enforces security policy, and they thus correspond strongly with a running program. (Where they don't, it's either because the program is running in multiple processes or because the program is being shared in some kind of framework; the latter case has the potential to be “security-interesting” but that's 'nother story.)

Virtual memory pages are the level at which the hardware applies security rules; every page in a process's memory has attributes that determine what the process can do with the page: whether it can read the page, whether it can write to it, and whether it can execute program code on it (though the third attribute is rather more rarely used than perhaps it should be).

Compiled program code is mapped into memory into pages that are both readable and executable, but not writable, whereas the stack should be readable and writable, but not executable. Most memory pages are not readable, writable or executable at all; the OS only lets a process use as many pages as it explicitly asks for, and that's what memory allocation libraries (malloc() et al.) manage for you.
Analysis
Provided each stack frame is smaller than a memory page[1] so that, as the program advances through the stack, it writes to each page, the OS (i.e., the privileged part of the runtime) can at least in principle detect stack overflows reliably and terminate the program if that occurs. Basically, all that happens to do this detection is that there is a page that the program cannot write to at the end of the stack; if the program tries to write to it, the memory management hardware traps it and the OS gets a chance to intervene.
The potential problems with this come if the OS can be tricked into not setting such a page or if the stack frames can become so large and sparsely written to that the guard page is jumped over. (Keeping more guard pages would help prevent the second case with little cost; forcing variable-sized stack allocations – e.g., alloca() – to always write to the space they allocate before returning control to the program, and so detect a smashed stack, would prevent the first with some cost in terms of speed, though the writes could be reasonably sparse to keep the cost fairly small.)
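As a concrete illustration of the guard-page mechanism, here is a minimal Linux/POSIX sketch that installs a SIGSEGV handler on an alternate stack, so a runaway recursion is reported and the process exits cleanly rather than silently dying (the function names, buffer sizes and recursion pattern are illustrative assumptions):
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void on_segv(int sig)
{
    (void)sig;
    /* Only async-signal-safe calls are allowed here; write() and _exit() are. */
    static const char msg[] = "stack overflow (SIGSEGV on guard page) caught, exiting\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(EXIT_FAILURE);
}

static long deep(long n)
{
    char frame[1024];                 /* make each frame reasonably large */
    memset(frame, (int)n, sizeof frame);
    return deep(n + 1) + frame[0];    /* unbounded recursion: will overflow */
}

int main(void)
{
    /* The handler needs its own stack: the normal one is exhausted when it runs. */
    static char alt[64 * 1024];
    stack_t ss = { .ss_sp = alt, .ss_size = sizeof alt, .ss_flags = 0 };
    sigaltstack(&ss, NULL);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sa.sa_flags = SA_ONSTACK;         /* run the handler on the alternate stack */
    sigaction(SIGSEGV, &sa, NULL);

    return (int)deep(0);
}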
Consequences
What are the consequences of this? Well, the OS has to do the right thing with memory management. (@Michael's link illustrates what can happen when it gets that wrong.) But also it is dangerous to let an attacker determine memory allocation sizes where you don't force a write to the whole allocation immediately; alloca and C99 variable-sized arrays are a particular threat. Moreover, I would be more suspicious of C++ code as that tends to do a lot more stack-based memory allocation; it might be OK, but there's a greater potential for things to go wrong.
Personally, I prefer to keep stack sizes and stack-frame sizes small anyway and do all variable-sized allocations on the heap. In part, this is a legacy of working on some types of embedded system and with code which uses very large numbers of threads, but it does make protecting against stack overflow attacks much simpler; the OS can reliably trap them and all the attacker has then is a denial-of-service (annoying, but rarely fatal). I don't know whether this is a solution for all programmers.
[1] Typical page sizes: 4 kB on most systems (including 32- and 64-bit x86); some architectures use larger pages such as 16 kB. Check your system documentation for what it is in your environment.
Most systems (like Windows) exit when the stack is overflowed. I don't think you are likely to see a security issue here. At least, not an elevation of privilege security issue. You could get some denial of service issues.
There is no universally correct answer... on some system the stack might grow down or up to overwrite other memory that the program's using (or another program, or the OS), but on any well designed, vaguely security-conscious OS (Linux, any common UNIX variant, even Windows) there will be no rights escalation potential. On some systems, with stack size checks disabled, the address space might approach or exceed the free virtual memory size, allowing memory exhaustion to negatively affect or even bring down the entire machine rather than just the process, but on good OSes by default there's a limit on that (e.g. Linux's limit / ulimit commands).
Worth mentioning that it's typically pretty easy to use a counter to put an arbitrary but generous limit on recursion depth too: you can use a static local variable if single-threaded, or a final parameter (conveniently defaulted to 0 if your language allows it, else have an outer caller provide 0 the first time).
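A minimal sketch of that counter approach in C (the limit of 10000 and the countdown function are purely illustrative):
#include <stdio.h>

#define MAX_DEPTH 10000                 /* arbitrary but generous bound */

/* Hypothetical routine whose recursion depth is driven by untrusted input n. */
static long countdown(long n, int depth)
{
    if (depth > MAX_DEPTH)
        return -1;                      /* reject instead of overflowing the stack */
    if (n <= 0)
        return 0;
    long rest = countdown(n - 1, depth + 1);
    return rest < 0 ? rest : rest + 1;  /* propagate the rejection upwards */
}

int main(void)
{
    printf("%ld\n", countdown(5, 0));           /* normal input: prints 5 */
    printf("%ld\n", countdown(100000000, 0));   /* hostile input: prints -1 */
    return 0;
}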
Yes, it is. There are techniques for avoiding recursion. For example, for arithmetic expressions, reverse Polish notation lets you evaluate without recursion; the main idea is to transform the original expression. Similar algorithms may help in your case as well.
Another problem with stack overflow is that, if error handling is not appropriate, it can cause almost anything. For example, in Java, StackOverflowError is an Error, and it gets caught if someone catches Throwable, which is a common mistake. So error handling is a key question in the case of stack overflow.
Yes, it is. Availability is an important aspect of security that is often overlooked.
Don't fall into that pit.
edit
As an example of poorly understood security-consciousness in modern OSes, take a look at a relatively recently discovered vulnerability that nobody has yet patched completely. There are countless other examples of privilege-escalation vulnerabilities that OS developers have written off as denial-of-service attacks.

How to mmap the stack for the clone() system call on linux?

The clone() system call on Linux takes a parameter pointing to the stack for the new created thread to use. The obvious way to do this is to simply malloc some space and pass that, but then you have to be sure you've malloc'd as much stack space as that thread will ever use (hard to predict).
I remembered that when using pthreads I didn't have to do this, so I was curious what it did instead. I came across this site which explains, "The best solution, used by the Linux pthreads implementation, is to use mmap to allocate memory, with flags specifying a region of memory which is allocated as it is used. This way, memory is allocated for the stack as it is needed, and a segmentation violation will occur if the system is unable to allocate additional memory."
The only context I've ever heard mmap used in is for mapping files into memory, and indeed reading the mmap man page it takes a file descriptor. How can this be used for allocating a stack of dynamic length to give to clone()? Is that site just crazy? ;)
In either case, doesn't the kernel need to know how to find a free bunch of memory for a new stack anyway, since that's something it has to do all the time as the user launches new processes? Why does a stack pointer even need to be specified in the first place if the kernel can already figure this out?
Stacks are not, and never can be, unlimited in their space for growth. Like everything else, they live in the process's virtual address space, and the amount by which they can grow is always limited by the distance to the adjacent mapped memory region.
When people talk about the stack growing dynamically, what they might mean is one of two things:
Pages of the stack might be copy-on-write zero pages, which do not get private copies made until the first write is performed.
Lower parts of the stack region may not yet be reserved (and thus not count towards the process's commit charge, i.e. the amount of physical memory/swap the kernel has accounted for as reserved for the process) until a guard page is hit, in which case the kernel commits more and moves the guard page, or kills the process if there is no memory left to commit.
Trying to rely on the MAP_GROWSDOWN flag is unreliable and dangerous because it cannot protect you against mmap creating a new mapping just adjacent to your stack, which will then get clobbered. (See http://lwn.net/Articles/294001/)

For the main thread, the kernel automatically reserves the stack-size ulimit worth of address space (not memory) below the stack and prevents mmap from allocating it. (But beware! Some broken vendor-patched kernels disable this behavior leading to random memory corruption!)

For other threads, you simply must mmap the entire range of address space the thread might need for stack when creating it. There is no other way. You could make most of it initially non-writable/non-readable, and change that on faults, but then you'd need signal handlers and this solution is not acceptable in a POSIX threads implementation because it would interfere with the application's signal handlers. (Note that, as an extension, the kernel could offer special MAP_ flags to deliver a different signal instead of SIGSEGV on illegal access to the mapping, and then the threads implementation could catch and act on this signal. But Linux at present has no such feature.)
Finally, note that the clone syscall does not take a stack pointer argument because it does not need it. The syscall must be performed from assembly code, because the userspace wrapper is required to change the stack pointer in the "child" thread to point to the desired stack, and avoid writing anything to the parent's stack.
Actually, clone does take a stack pointer argument, because it's unsafe to wait to change stack pointer in the "child" after returning to userspace. Unless signals are all blocked, a signal handler could run immediately on the wrong stack, and on some architectures the stack pointer must be valid and point to an area safe to write at all times.
Not only is modifying the stack pointer impossible from C, but you also couldn't avoid the possibility that the compiler would clobber the parent's stack after the syscall but before the stack pointer was changed.
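To make the reserve-the-whole-range approach concrete, here is a minimal sketch (the sizes, the single guard page, and the reserve_stack name are illustrative assumptions, not taken from any particular pthreads implementation). The returned pointer could then be passed as the child-stack argument to clone(), or installed on an attribute with pthread_attr_setstack().
#include <stdio.h>
#include <sys/mman.h>

#define STACK_SIZE (256 * 1024)   /* illustrative fixed stack size */
#define GUARD_SIZE (4 * 1024)     /* assume a 4 kB page */

/* Returns a pointer to the TOP of a usable stack (x86 stacks grow down),
   or NULL on failure. */
static void *reserve_stack(void)
{
    /* Reserve the whole range as inaccessible first... */
    char *base = mmap(NULL, GUARD_SIZE + STACK_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* ...then make everything above the lowest (guard) page usable.
       A runaway stack that reaches the guard page faults with SIGSEGV. */
    if (mprotect(base + GUARD_SIZE, STACK_SIZE, PROT_READ | PROT_WRITE) != 0) {
        munmap(base, GUARD_SIZE + STACK_SIZE);
        return NULL;
    }
    return base + GUARD_SIZE + STACK_SIZE;
}

int main(void)
{
    void *top = reserve_stack();
    printf("stack top at %p\n", top);
    return top ? 0 : 1;
}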
You'd want the MAP_ANONYMOUS flag for mmap, and MAP_GROWSDOWN since you want to use it as a stack.
Something like:
void *stack = mmap(NULL, initial_stacksize, PROT_WRITE | PROT_READ, MAP_PRIVATE | MAP_GROWSDOWN | MAP_ANONYMOUS, -1, 0);
See the mmap man page for more info. And remember, clone is a low-level primitive that you're not meant to use unless you really need what it offers. It offers a lot of control - like supplying the child's own stack - in case you want to do some trickery (like having the stack accessible in all the related processes). Unless you have a very good reason to use clone, stick with fork or pthreads.
Joseph, in answer to your last question:
When a user creates a "normal" new process, that's done by fork(). In this case, the kernel doesn't have to worry about creating a new stack at all, because the new process is a complete duplicate of the old one, right down to the stack.
If the user replaces the currently running process using exec(), then the kernel does need to create a new stack - but in this case that's easy, because it gets to start from a blank slate. exec() wipes out the memory space of the process and reinitialises it, so the kernel gets to say "after exec(), the stack always lives HERE".
If, however, we use clone(), then we can say that the new process will share a memory space with the old process (CLONE_VM). In this situation, the kernel can't leave the stack as it was in the calling process (like fork() does), because then our two processes would be stomping on each other's stack. The kernel also can't just put it in a default location (like exec() does), because that location is already taken in this memory space. The only solution is to allow the calling process to find a place for it, which is what it does.
Here is the code, which mmaps a stack region and instructs the clone system call to use this region as the stack.
#define _GNU_SOURCE          /* for clone() */
#include <sys/mman.h>
#include <stdio.h>
#include <sched.h>
#include <unistd.h>

/* Entry point for the cloned child: runs on the mmap'd stack. */
int execute_clone(void *arg)
{
    printf("\nclone function executed\n");
    fflush(stdout);
    return 0;
}

int main()
{
    void *ptr;
    int rc;
    void *start = (void *) 0x0000010000000000;   /* fixed address chosen for the stack mapping */
    size_t len = 0x0000000000200000;             /* 2 MB */

    /* Map an anonymous, private region at the chosen address to serve as the stack. */
    ptr = mmap(start, len, PROT_READ | PROT_WRITE,
               MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_GROWSDOWN, -1, 0);
    if (ptr == MAP_FAILED)
    {
        perror("\nmmap failed");
        return 1;
    }

    /* Stacks grow down, so pass the TOP of the region to clone(). */
    rc = clone(&execute_clone, (char *)ptr + len, CLONE_VM, NULL);
    if (rc <= 0)
    {
        perror("\nclone() failed");
    }

    sleep(1);   /* give the child a moment to run before the parent exits */
    return 0;
}
mmap is more than just mapping a file into memory. In fact, some malloc implementations use mmap for large allocations. If you read the fine man page you'll notice the MAP_ANONYMOUS flag, and you'll see that you need not supply a file descriptor at all.
As for why the kernel can't just "find a bunch of free memory", well if you want someone to do that work for you, either use fork instead, or use pthreads.
Note that the clone system call doesn't take an argument for the stack location. It actually works just like fork. It's just the glibc wrapper which takes that argument.
I think the stack grows downwards until it cannot grow any further, for example when it reaches memory that has already been allocated, at which point a fault is raised. Seen that way, the default is the minimum available stack size: if there is spare space below when the stack is full, it can keep growing downwards; otherwise, the system raises a fault.
