Maximum Thread Stack Size .NET?

What is the maximum stack size allowed for a thread in C#.NET 2.0? Also, does this value depend on the version of the CLR and/or the bitness (32 or 64) of the underlying OS?
I have looked at the following resources: msdn1 and msdn2
public Thread(
ThreadStart start,
int maxStackSize
)
The only information I can see is that the default size is 1 megabyte and that, in the above method, if maxStackSize is '0', the default maximum stack size specified in the header for the executable will be used. What is the maximum value to which we can change the value in the header? Also, is it advisable to do so? Thanks.

For the record, this fits Raymond Chen's category of "if you need to know then you are doing something wrong".
The default stack size for threads running 64-bit code is 4 megabytes, 1 megabyte for 32-bit code. While the Thread constructor lets you pass an integer value up to int.MaxValue, you'll never get that on a 32-bit machine. The stack must fit in an available hole in the virtual memory address space, which usually tops out at ~600 MB early in the process lifetime and rapidly gets smaller as you allocate memory and fragment the address space.
Allocating more than the default is quite unnecessary. You might contemplate doing this when you have a heavily recursive method that blows the stack. Don't, fix the algorithm or you'll blow it anyway when the job gets bigger.
The smallest stack that .NET lets you choose is 250 KB. It silently rounds it up if you pass a value that's smaller. Necessary because both the jitter and the garbage collector need stack space to get their job done. Again, doing so should be quite unnecessary. If you contemplate doing so because you have a lot of threads and consume all virtual memory with their stacks then you have too many threads. A StackOverflowException is one of the nastiest runtime exceptions you can get. Process death is immediate and untrappable.
The stack size for the main thread is determined by an option in the EXE header. The compiler doesn't have an option to change it, you have to use editbin.exe /stack to patch the .exe header.

I am unaware of what the maximum is, but MSDN speaks to whether you should do it or not:
Avoid using this constructor overload. The default stack size used by the Thread(ThreadStart) constructor overload is the recommended stack size for threads. If a thread has memory problems, the most likely cause is programming error, such as infinite recursion.
I have never had a StackOverflow occur in C# which was not due to infinite recursion. If there truly was a case where recursion went to that depth, I would consider replacing it with iteration.

Related

Increase stack size

I'm doing computations with huge arrays, and for some of these computations I need an increased stack size! Is there any downside to setting the stack size to unlimited (ulimit -s unlimited) in my ~/.bashrc?
The program is written in Fortran (F77 & F90) and parallelized with MPI. Some of my arrays have more than 2E7 entries, and when I use a small number of cores with MPI it crashes with a segmentation fault.
The array sizes stay the same through the whole computation, therefore I set them to fixed values:
real :: p(200,200,400)
integer :: ib,ie,jb,je,kb,ke
...
ib=1;ie=199
jb=2;je=198
kb=2;ke=398
call SOLVE_POI_EQ(rank,p(ib:ie,jb:je,kb:ke),R)

Setting the stacksize to unlimited likely won't help you. You are allocating a chunk of 64MB on the stack, and likely don't fill it from the top, but from the bottom.
This is important, because the OS grows the stack as you go. Whenever it detects a page fault right below the stack segment, it will assume that you need more space, and silently insert a new page. The size of this trigger region within your address space is limited, though, and I doubt that it's larger than 64 MB. Since your index variables are likely placed below your array on the stack, accessing them already does the 64 MB jump that kills your process.
Just make your array allocatable, add the corresponding allocate() statement, and you should be fine.

Stack size is never really unlimited, so you would still have some failures. And your code still won't be portable to Linux systems with smaller (or normal-sized) stacks.
BTW, you should explain what kind of programs you are running and show some source code.
If coding in C++, using standard containers should help a lot (regarding actual stack consumption). For example, a local (stack allocated) std::vector<int> v(10000); (instead of int v[10000];) has its data allocated on the heap (and deallocated by the destructor when you exit from the block defining it)
It would be much better to improve your programs to avoid excessive stack consumption. The need of a lot of stack space is really a bug that you should try to correct. A typical rule of thumb is to have call frames smaller than a few kilobytes (so allocate any larger data on the heap).
You might also consider using the Boehm conservative garbage collector: you would use GC_MALLOC instead of malloc (and you would heap-allocate large data structures using GC_MALLOC), but you won't have to bother to free your (GC-heap allocated) data.

How much stack space is typically reserved for a thread? (POSIX / OSX)

The answer probably differs depending on the OS, but I'm curious how much stack space does a thread normally preallocate. For example, if I use:
push rax
that will put a value on the stack and decrement rsp (the stack grows downward). But what if I never use a push op? I imagine some space still gets allocated, but how much? Also, is this a fixed amount or does it grow dynamically with the amount of stuff pushed?

POSIX does not define any standards regarding stack size, it is entirely implementation dependent. Since you tagged this OSX, the default allocations there are :
Main thread (8MB)
Secondary Thread (512kB)
Naturally, these can be configured to suit your needs. The allocation is dynamic :
The minimum allowed stack size for secondary threads is 16 KB and the
stack size must be a multiple of 4 KB. The space for this memory is
set aside in your process space at thread creation time, but the
actual pages associated with that memory are not created until they
are needed.
There is too much detail to include here. I suggest you read :
Thread Management (Mac Developer Library)

Split stacks unnecessary on amd64

There seems to be an opinion out there that using a "split stack" runtime model is unnecessary on 64-bit architectures. I say seems to be, because I haven't seen anyone actually say that, only dance around it:
The memory usage of a typical multi-threaded program can decrease
significantly, as each thread does not require a worst-case stack
size. It becomes possible to run millions of threads (either full NPTL
threads or co-routines) in a 32-bit address space.
-- Ian Lance Taylor
...implying that a 64-bit address space can already handle it.
And...
... the constant overhead of split stacks and the narrow use case
(spawning enormous numbers of I/O-bound tasks on 32-bit architectures)
isn't acceptable...
-- bstrie
Two questions: First, is this what they are saying? Second, if so, why are they unnecessary on 64-bit architectures?

Yes, that's what they are saying.
Split stacks are (currently) unnecessary on 64bit architectures because the 64bit virtual address space is so large it can contain millions of stack address ranges, each as large as an entire 32bit address space, if needed.
In the flat memory model in use nowadays, the translation from virtual addresses to physical memory locations is done with the support of the hardware MMU. On amd64 it turns out it's better (meaning, overall faster) to reserve big chunks of the 64-bit virtual address space for each new stack you are creating, while only mapping the first page (4 kB) to actual RAM. This way, the stack will be able to grow and shrink as needed, over contiguous virtual addresses (meaning less code in each function prologue, a big optimization), while the OS reconfigures the MMU to map each page of virtual addresses to an actual free page of RAM whenever the stack grows or shrinks above/below some configurable thresholds.
By choosing the thresholds smartly (see for example the theory of dynamic arrays) you can achieve O(1) complexity on the average stack operation, while retaining the benefits of millions of stacks that can grow as much as you need and only consume the memory they use.
PS: the current Go implementation is far behind any of this :-)

The Go core team is currently discussing the possibility of using contiguous stacks in a future Go version.
The split stack approach is useful because stacks can grow more flexibly but it also requires that the runtime allocates a relatively big chunk of memory to distribute these stacks across. There has been a lot of confusion about Go's memory usage, in part because of this.
Making contiguous but growable (relocatable) stacks is an option that would provide the same flexibility and maybe reduce the confusion about Go's memory usage. As well as remedying some ill corner-cases on low-memory machines (see linked thread).
As to advantages/disadvantages on 32-bit vs. 64-bit architectures, I don't think there are any directly associated solely with the use of segmented stacks.

Update Go 1.4 (Q4 2014)
Change to the runtime:
Up to Go 1.4, the runtime (garbage collector, concurrency support, interface management, maps, slices, strings, ...) was mostly written in C, with some assembler support.
In 1.4, much of the code has been translated to Go so that the garbage collector can scan the stacks of programs in the runtime and get accurate information about what variables are active.
This rewrite allows the garbage collector in 1.4 to be fully precise, meaning that it is aware of the location of all active pointers in the program. This means the heap will be smaller as there will be no false positives keeping non-pointers alive. Other related changes also reduce the heap size, which is smaller by 10%-30% overall relative to the previous release.
A consequence is that stacks are no longer segmented, eliminating the "hot split" problem. When a stack limit is reached, a new, larger stack is allocated, all active frames for the goroutine are copied there, and any pointers into the stack are updated.
Initial answer (March 2014)
The article "Contiguous stacks in Go" by Agis Anastasopoulos also addresses this issue:
In such cases where the stack boundary happens to fall in a tight loop, the overhead of creating and destroying segments repeatedly becomes significant.
This is called the “hot split” problem inside the Go community.
The “hot split” will be addressed in Go 1.3 by implementing contiguous stacks.
Now when a stack needs to grow, instead of allocating a new segment the following happens:
Create a new, somewhat larger stack
Copy the contents of the old stack to the new stack
Re-adjust every copied pointer to point to the new addresses
Destroy the old stack
The following mentions one problem seen mainly on 32-bit architectures:
There is a certain challenge though.
The 1.2 runtime doesn’t know if a pointer-sized word in the stack is an actual pointer or not. There may be floats and most rarely integers that if interpreted as pointers, would actually point to data.
Due to the lack of such knowledge the garbage collector has to conservatively consider all the locations in the stack frames to be roots. This leaves the possibility for memory leaks especially on 32-bit architectures since their address pool is much smaller.
When copying stacks however, such cases have to be avoided and only real pointers should be taken into account when re-adjusting.
Work was done though and information about live stack pointers is now embedded in the binaries and is available to the runtime.
This means not only that the collector in 1.3 can precisely collect stack data but re-adjusting stack pointers is now possible.

Total stack sizes of threads in one process

I use pthread_attr_getstacksize() to get the default stack size of one thread, which is 8 MB on my machine.
But when I create 8 threads and allocate a very large stack size to them, say hundreds of MB, the program crashes.
So, I guess the following must hold:
("Number of threads" * "stack size per thread") < a constant value (e.g. virtual memory size)
Is that right?

The short answer is "Yes".
The longer answer is that all of your threads share one virtual address space, and the userspace-usable part of this space must therefore be large enough to contain all thread stacks (as well as the code, static data, heap, libraries and any miscellaneous mappings).
Multi-hundred-megabyte stacks are a good indication that You're Doing It Wrong, as they say in the classics.

NPTL Default Stack Size Problem

I am developing a multithreaded modular application using the C programming language and NPTL 2.6. For each plugin, a POSIX thread is created. The problem is that each thread has its own stack area, and since the default stack size depends on the user's choice, this may result in huge memory consumption in some cases.
To prevent unnecessary memory usage I used something similar to this to change stack size before creating each thread:
size_t st1, st2;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_getstacksize(&attr, &st1);
/* pthread functions return the error number instead of setting errno */
int rc = pthread_attr_setstacksize(&attr, MODULE_THREAD_SIZE);
if (rc != 0) fprintf(stderr, "Stack ERR: %s\n", strerror(rc));
pthread_attr_getstacksize(&attr, &st2);
printf("OLD:%zu, NEW:%zu - MIN: %zu\n", st1, st2, (size_t)PTHREAD_STACK_MIN);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
/* "this" is a static data structure that stores plugin-related data */
pthread_create(&this->runner, &attr, (void *(*)(void *))this->run, NULL);
EDIT I: pthread_create() section added.
This did not work as I expected: the stack size reported by pthread_attr_getstacksize() changed, but the total memory usage of the application (from ps/top/pmap output) did not change:
OLD:10485760, NEW:65536 - MIN: 16384
When I use ulimit -s MY_STACK_SIZE_LIMIT before starting application I achieve the expected result.
My questions are:
1-) Is there any portable (between UNIX variants) way to change the (default) thread stack size after starting the application (before creating the thread, of course)?
2-) Is it possible to use the same stack area for every thread?
3-) Is it possible to completely disable the stack for threads without much pain?

Answers for #2 and #3 are no and no. Each thread needs a stack (where else do your local variables and return addresses go?) and they need to be unique per-thread (otherwise threads would overwrite each other's local variables and return addresses, making everybody crash).
As for #1... the set stack size call is precisely the answer for this. I suggest you figure out an acceptable size to create your threads with, and set it.
As for why things don't look right to you in top.... top is a notorious liar about memory usage. :-) Is stuff actually failing to be allocated or getting OOM-killed? Are thread creations failing? Is performance suffering and paging to disk increasing? If the answer to these questions is no, then I don't think there's much to worry about.
Update based on some comments below and above:
First, 16KB is still pretty big for something that you say doesn't need much stack space. If you really want to go small, I would be tempted to say 4096 or 8192 on x86 Linux. Second, yes you can set your CPU's stack pointer to something else.. But when you malloc() or mmap(), that's going to take up space. I don't know how you think it's going to help to set the stack pointer to something else. That said, if you really feel strongly that the thread that calls main() has too big of a stack (I would say that is slightly crazy) and that pthread_attr_setstacksize() doesn't let you get small enough (?), then maybe you can look into non-portable stuff like creating threads by calling the clone() syscall and specifying stacks based on the main thread's stack pointer, or a buffer from elsewhere, or whatever. But you're still going to need a stack for each thread and I have a feeling top is still going to disappoint you. Maybe your expectations are a little high.

I have seen this problem as well. It is unclear how the stacks are accounted for but the "extra" space is counted against your total VM and if you run up against your process boundary you are in trouble (even though you aren't using the space). It seems dependent on what version of Linux you are running (even within the 2.6 family), and whether you are 32 bit or 64 bit.
