Increase stack size - Linux

I'm doing computations with huge arrays, and for some of these computations I need an increased stack size. Is there any downside to setting the stack size to unlimited (ulimit -s unlimited) in my ~/.bashrc?
The program is written in Fortran (F77 & F90) and parallelized with MPI. Some of my arrays have more than 2E7 entries, and when I use a small number of cores with MPI it crashes with a segmentation fault.
The array sizes stay the same throughout the whole computation, so I set them to fixed values:
real :: p(200,200,400)
integer :: ib,ie,jb,je,kb,ke
...
ib=1;ie=199
jb=2;je=198
kb=2;ke=398
call SOLVE_POI_EQ(rank,p(ib:ie,jb:je,kb:ke),R)

Setting the stack size to unlimited likely won't help you. You are allocating a chunk of 64 MB on the stack, and you likely don't fill it from the top, but from the bottom.
This is important, because the OS grows the stack as you go. Whenever it detects a page fault right below the stack segment, it assumes that you need more space and silently inserts a new page. The size of this trigger region within your address space is limited, though, and I doubt it's larger than 64 MB. Since your index variables are likely placed below your array on the stack, accessing them already makes the 64 MB jump that kills your process.
Just make your array allocatable, add the corresponding allocate() statement, and you should be fine.
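In Fortran that is just real, allocatable :: p(:,:,:) followed by allocate(p(200,200,400)). For comparison, a minimal sketch of the same stack-to-heap move in C (sizes taken from the question):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* A large automatic array like "float p[200][200][400]" lives on the
     * stack and can overflow it; the same storage on the heap has no such
     * limit. */
    size_t n = 200UL * 200UL * 400UL;
    float *p = malloc(n * sizeof *p);   /* ~61 MB, heap-allocated */
    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    /* ... use p[i + 200UL*(j + 200UL*k)] where Fortran would use p(i,j,k) ... */
    free(p);
    return 0;
}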

Stack size is never really unlimited, so you would still have some failures. And your code still won't be portable to Linux systems with smaller (or normal-sized) stacks.
BTW, you should explain what kind of program you are running and show some source code.
If coding in C++, using standard containers should help a lot (regarding actual stack consumption). For example, a local (stack-allocated) std::vector<int> v(10000); (instead of int v[10000];) has its data allocated on the heap (and deallocated by the destructor when you exit the block defining it).
It would be much better to improve your programs to avoid excessive stack consumption. Needing a lot of stack space is really a bug that you should try to correct. A typical rule of thumb is to keep call frames smaller than a few kilobytes (so allocate any larger data on the heap).
You might also consider using the Boehm conservative garbage collector: you would use GC_MALLOC instead of malloc (heap-allocating large data structures with GC_MALLOC), but you won't have to bother to free your (GC-heap-allocated) data.
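A minimal sketch of the Boehm collector in use (assuming libgc is installed; link with -lgc):

#include <gc.h>     /* Boehm conservative GC: GC_INIT, GC_MALLOC */
#include <stdio.h>

int main(void)
{
    GC_INIT();                                     /* initialize the collector */
    int *big = GC_MALLOC(10000000 * sizeof *big);  /* large, GC-managed heap data */
    if (big == NULL)
        return 1;
    big[0] = 42;
    printf("%d\n", big[0]);
    /* no free() needed: unreachable GC_MALLOC'd blocks are reclaimed */
    return 0;
}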

Is stack memory contiguous physically in Linux?

As far as I can see, stack memory is contiguous in the virtual address space, but is stack memory also contiguous physically? And does this have something to do with the stack size limit?
Edit:
I used to believe that stack memory doesn't have to be physically contiguous, but then why do we think that stack memory is always quicker than heap memory? If it's not physically contiguous, how can the stack take more advantage of the cache? And there is another thing that always confuses me: the CPU executes instructions from the code segment, which is not near the stack segment in virtual memory. I don't think the operating system will place the stack segment and the code segment close to each other physically, so this might harm the cache effect. What do you think?
Edit again:
Maybe I should give an example to express myself better: if we want to sort a large amount of numbers, using an array to store the numbers is better than using a linked list, because every list node may be constructed by malloc, so it may not take good advantage of the cache. That's why I say stack memory is quicker than heap memory.
As far as I can see, stack memory is contiguous in the virtual address space, but is stack memory also contiguous physically? And does this have something to do with the stack size limit?
No, stack memory is not necessarily contiguous in the physical address space. It's not related to the stack size limit. It's related to how the OS manages memory. The OS only allocates a physical page when the corresponding virtual page is accessed for the first time (or for the first time since it got paged out to the disk). This is called demand-paging, and it helps conserve memory usage.
why do we think that stack memory is always quicker than heap memory? If it's not physically contiguous, how can the stack take more advantage of the cache?
It has nothing to do with the cache. It's just faster to allocate and deallocate memory from the stack than from the heap. That's because allocating and deallocating from the stack takes only a single instruction (incrementing or decrementing the stack pointer). On the other hand, there is a lot more work involved in allocating and/or deallocating memory from the heap. See this article for more information.
Now once memory allocated (from the heap or stack), the time it takes to access that allocated memory region does not depend on whether it's stack or heap memory. It depends on the memory access behavior and whether it's friendly to the cache and memory architecture.
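A rough C illustration of the difference (hypothetical 4 KB buffers; only the allocation mechanism differs, not the speed of using the memory afterwards):

#include <stdlib.h>

void on_stack(void)
{
    char buf[4096];   /* "allocated" by adjusting the stack pointer: one instruction */
    buf[0] = 0;
}                     /* "freed" by restoring the stack pointer on return */

void on_heap(void)
{
    char *buf = malloc(4096);  /* the allocator searches/splits free blocks,
                                  and may have to ask the OS for pages */
    if (buf != NULL)
        buf[0] = 0;
    free(buf);                 /* more allocator bookkeeping */
}

int main(void)
{
    on_stack();
    on_heap();
    return 0;
}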
if we want to sort a large amount of numbers, using an array to store the numbers is better than using a linked list, because every list node may be constructed by malloc, so it may not take good advantage of the cache. That's why I say stack memory is quicker than heap memory.
Using an array is faster not because arrays are allocated from the stack. Arrays can be allocated from any memory (stack, heap, or anywhere). It's faster because arrays are usually accessed contiguously, one element at a time. When the first element is accessed, a whole cache line that contains the element and other elements is fetched from memory into the L1 cache. So accessing the other elements in that cache line can be done very efficiently, but accessing the first element in the cache line is still slow (unless the cache line was prefetched). This is the key part: since cache lines are 64-byte aligned and both virtual and physical pages are 64-byte aligned as well, it's guaranteed that any cache line fully resides within a single virtual page and a single physical page. This is what makes fetching cache lines efficient. Again, all of this has nothing to do with whether the array was allocated from the stack or the heap. It holds true either way.
On the other hand, since the elements of a linked list are typically not contiguous (not even in the virtual address space), then a cache line that contains an element may not contain any other elements. So fetching every single element can be more expensive.
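A small C sketch of the two traversals being compared (illustrative only; real timings depend on the allocator and on prefetching):

#include <stddef.h>

struct node { int value; struct node *next; };

/* Contiguous: one 64-byte cache-line fill serves ~16 ints. */
long sum_array(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Scattered: each node may sit on its own cache line, so every
 * hop can cost a separate memory access. */
long sum_list(const struct node *head)
{
    long s = 0;
    for (; head != NULL; head = head->next)
        s += head->value;
    return s;
}

int main(void)
{
    int a[16] = {0};
    struct node tail = { 2, NULL }, head = { 1, &tail };
    return sum_array(a, 16) + sum_list(&head) == 3 ? 0 : 1;
}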
Memory is memory. Stack memory is no faster than heap memory and no slower. It is all the same. The only thing that makes memory a stack or a heap is how it is allocated by the application. It is entirely possible to allocate memory on the heap and make that the program stack.
The speed difference is in the allocation. Stack memory is allocated by subtracting from the stack pointer: one instruction.
The process of allocating from the heap depends upon the heap manager, but it is much more complex and may require mapping pages into the address space.
No, there is no promise of contiguity of physical addresses. But it doesn't matter, because user-space programs don't use physical addresses and so have no idea whether this is the case.
It is a complex topic.
Heap and stack usually live in the same kind of memory, with the same memory type (MTRR, per-page cache settings, etc.); mmap'd files and driver memory can use different strategies, as can the user by changing the settings explicitly.
The stack can be faster because it is used so often. When you call a function, parameters and local variables are pushed onto the stack, so the cache is fresh. Additionally, because functions are called and return frequently, the top of the stack probably also sits in the lower cache levels, and it is seldom paged out (because it was used recently).
So the stack can be faster, but only if you have few variables. If you put large arrays on the stack, e.g. with alloca, the advantage disappears.
In general, this is a very complex topic, and it is better not to over-optimize, because that leads to complex code that is harder to refactor and harder to optimize at a higher level. (For example, with multi-dimensional arrays, the order of the indices (and thus of memory accesses) in loops can noticeably improve speed, but the code can also quickly become impossible to maintain.)

How does an OS implement or maintain a stack for each thread?

There have been various questions on SO about whether or not threads get their own stacks. However, I fail to understand how OSs generally implement one stack per thread. In OS books the memory layout of a program is shown like this:
[diagram: the classic layout, with text, data, and bss at the bottom, the heap growing upward, and the stack growing downward from the top]
Note that it can be considered one contiguous block of memory (virtual memory). I would imagine some part of the virtual memory space is divided among the stacks for the threads. Which brings me to the second part of this question: a popular technical interview question involves implementing 3 stacks using a single array. Is this problem directly related to the implementation of thread stacks?
I summarize my questions thus:
How does a modern-day OS, say Linux, divide the memory space among the stacks of different threads?
Is the "3 stacks using 1 array" problem directly related to, or an answer to, the above question?
PS: Perhaps images to explain how the memory is divided for different thread stacks would be best to explain.
The picture shown above is totally obsolete on both Windows and Linux. It doesn't really matter at what addresses the individual allocations are located. Virtual address space is big on 32 bit and vast on 64 bit. The OS just needs to carve out some chunk of it somewhere and hand it out.
Each stack is an independent virtual memory allocation that can be placed at an arbitrary location. It is important to note that stacks are generally finite in size. The OS reserves a certain maximum size (such as 1 MB or 8 MB), and the stack cannot exceed it. This is suggested differently in the (obsolete) picture above. The stack indeed grows down, but when the fixed space is exhausted a stack overflow is triggered. This is not a concern in practice; in fact, exceeding a reasonable stack size is considered to be a bug.
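For example, on Linux you can see this per-thread reservation explicitly when creating a thread with pthreads (a minimal sketch; the 8 MB figure matches a common default):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* Reserve an 8 MB virtual range for this thread's stack; physical
     * pages are only committed as the thread actually touches them. */
    pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);

    pthread_t t;
    if (pthread_create(&t, &attr, worker, NULL) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}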
Binary images (above: text, initialized data and bss) are also just placed anywhere. They are fixed in size as well.
The heap consists of multiple segments. It can grow arbitrarily by just adding more segments. The heap is managed by user-mode libraries. The kernel doesn't know about it. All the kernel does is provide slabs of virtual memory at locations chosen at will.
1) A thread's stack is just a contiguous block in virtual memory. Its maximum size is fixed.
2) I don't think it is directly related to that problem, because a thread's stack size limit is known when the thread is created, whereas nothing is known about the sizes of the 3 stacks in the "3 stacks using 1 array" problem.

Split stacks unnecessary on amd64

There seems to be an opinion out there that using a "split stack" runtime model is unnecessary on 64-bit architectures. I say seems to be, because I haven't seen anyone actually say that, only dance around it:
The memory usage of a typical multi-threaded program can decrease significantly, as each thread does not require a worst-case stack size. It becomes possible to run millions of threads (either full NPTL threads or co-routines) in a 32-bit address space.
-- Ian Lance Taylor
...implying that a 64-bit address space can already handle it.
And...
... the constant overhead of split stacks and the narrow use case (spawning enormous numbers of I/O-bound tasks on 32-bit architectures) isn't acceptable...
-- bstrie
Two questions: first, is this what they are saying? Second, if so, why are split stacks unnecessary on 64-bit architectures?
Yes, that's what they are saying.
Split stacks are (currently) unnecessary on 64-bit architectures because the 64-bit virtual address space is so large that it can contain millions of stack address ranges, each as large as an entire 32-bit address space, if needed.
In the flat memory model in use nowadays, the translation from virtual addresses to physical memory locations is done with the support of the hardware MMU. On amd64 it turns out to be better (meaning, overall faster) to reserve a big chunk of the 64-bit virtual address space for each new stack you create, while mapping only the first page (4 kB) to actual RAM. This way, the stack can grow and shrink as needed over contiguous virtual addresses (meaning less code in each function prologue, a big optimization), while the OS reconfigures the MMU to map each page of virtual addresses to an actual free page of RAM whenever the stack grows or shrinks beyond some configurable thresholds.
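A hedged sketch of that reserve-then-commit pattern with POSIX mmap (the kernel grows real thread stacks itself; this only mimics the mechanism in user space):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserve = 1UL << 30;   /* reserve 1 GB of virtual addresses... */

    /* PROT_NONE: addresses are reserved, but no RAM is committed yet */
    char *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ...then commit just the first page to usable memory */
    if (mprotect(base, 4096, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        return 1;
    }
    base[0] = 1;   /* touching the page faults in one frame of physical RAM */

    munmap(base, reserve);
    return 0;
}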
By choosing the thresholds smartly (see, for example, the theory of dynamic arrays) you can achieve amortized O(1) cost per stack operation, while retaining the benefit of millions of stacks that can grow as much as you need and only consume the memory they use.
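The dynamic-array intuition, sketched in C (doubling on growth yields amortized O(1) pushes; real implementations also shrink with hysteresis, e.g. only below one-quarter full, to avoid thrashing):

#include <stdlib.h>

struct stack { int *data; size_t len, cap; };

/* Amortized O(1): each element is copied O(1) times on average,
 * because the capacity doubles on every grow. */
int push(struct stack *s, int v)
{
    if (s->len == s->cap) {
        size_t ncap = s->cap ? s->cap * 2 : 16;
        int *nd = realloc(s->data, ncap * sizeof *nd);
        if (nd == NULL)
            return -1;
        s->data = nd;
        s->cap = ncap;
    }
    s->data[s->len++] = v;
    return 0;
}

int main(void)
{
    struct stack s = { 0 };
    for (int i = 0; i < 1000; i++)
        if (push(&s, i) != 0)
            return 1;
    free(s.data);
    return 0;
}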
PS: the current Go implementation is far behind any of this :-)
The Go core team is currently discussing the possibility of using contiguous stacks in a future Go version.
The split stack approach is useful because stacks can grow more flexibly but it also requires that the runtime allocates a relatively big chunk of memory to distribute these stacks across. There has been a lot of confusion about Go's memory usage, in part because of this.
Making contiguous but growable (relocatable) stacks is an option that would provide the same flexibility and maybe reduce the confusion about Go's memory usage, as well as remedy some bad corner cases on low-memory machines (see the linked thread).
As to advantages/disadvantages on 32-bit vs. 64-bit architectures, I don't think there are any directly associated solely with the use of segmented stacks.
Update Go 1.4 (Q4 2014)
Change to the runtime:
Up to Go 1.4, the runtime (garbage collector, concurrency support, interface management, maps, slices, strings, ...) was mostly written in C, with some assembler support.
In 1.4, much of the code has been translated to Go so that the garbage collector can scan the stacks of programs in the runtime and get accurate information about what variables are active.
This rewrite allows the garbage collector in 1.4 to be fully precise, meaning that it is aware of the location of all active pointers in the program. This means the heap will be smaller as there will be no false positives keeping non-pointers alive. Other related changes also reduce the heap size, which is smaller by 10%-30% overall relative to the previous release.
A consequence is that stacks are no longer segmented, eliminating the "hot split" problem. When a stack limit is reached, a new, larger stack is allocated, all active frames for the goroutine are copied there, and any pointers into the stack are updated.
Initial answer (March 2014)
The article "Contiguous stacks in Go" by Agis Anastasopoulos also addresses this issue:
In such cases where the stack boundary happens to fall in a tight loop, the overhead of creating and destroying segments repeatedly becomes significant.
This is called the “hot split” problem inside the Go community.
The “hot split” will be addressed in Go 1.3 by implementing contiguous stacks.
Now when a stack needs to grow, instead of allocating a new segment the following happens:
Create a new, somewhat larger stack
Copy the contents of the old stack to the new stack
Re-adjust every copied pointer to point to the new addresses
Destroy the old stack
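A toy C illustration of those four steps (the real runtime must also fix return addresses and saved frame pointers; this relocates just one interior pointer):

#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t cap = 64;
    char *old = malloc(cap);              /* the "old stack" */
    if (old == NULL)
        return 1;
    memset(old, 0, cap);
    char *interior = old + 10;            /* a pointer into the stack */

    char *grown = malloc(cap * 2);        /* 1. create a larger stack */
    if (grown == NULL) {
        free(old);
        return 1;
    }
    memcpy(grown, old, cap);              /* 2. copy the contents */
    interior = grown + (interior - old);  /* 3. re-adjust the pointer */
    free(old);                            /* 4. destroy the old stack */

    *interior = 'x';                      /* the relocated pointer still works */
    free(grown);
    return 0;
}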
The following mentions one problem seen mainly on 32-bit architectures:
There is a certain challenge though.
The 1.2 runtime doesn't know whether a pointer-sized word in the stack is an actual pointer or not. There may be floats and, more rarely, integers that, if interpreted as pointers, would actually point to data.
Due to the lack of such knowledge, the garbage collector has to conservatively consider all the locations in the stack frames to be roots. This leaves the possibility for memory leaks, especially on 32-bit architectures, since their address pool is much smaller.
When copying stacks, however, such cases have to be avoided and only real pointers should be taken into account when re-adjusting.
Work was done though and information about live stack pointers is now embedded in the binaries and is available to the runtime.
This means not only that the collector in 1.3 can treat stack data precisely, but also that re-adjusting stack pointers is now possible.

Why does a (C) stack have a maximum of 2MB?

This question is about stack overflows, so where better to ask it than here.
If we consider how memory is used for a program (a.out) in Unix, it is something like this:
| etext | stack, 2mb | heap ->>>
And I have wondered for a few years now why there is a restriction of 2MB for the stack. Considering that we have 64 bits for a memory address, why not allocate like this:
| MIN_ADDR MAX_ADDR|
| heap ->>>> <<<- stack | etext |
MAX_ADDR will be somewhere near 2^64 and MIN_ADDR somewhere near 2^0, so there are many bytes in between which the program can use but which are not necessarily accounted for by the kernel (by actually assigning pages for them). The heap and stack will probably never reach each other, and hence the 2MB limit is not needed (and would instead be a ~1.8446744e+19-byte limit). If we are scared that they will reach each other, then set the limit to 2^63 or some bizarre and enormous number.
Furthermore, the heap grows from low to high, so our kernel can still resize blocks of memory (allocated with for example malloc) without necessarily needing to shift the content.
Moreover, a stack frame is always static in size in some way, so we never need to resize there; if we did, that would be awkward anyway, since we would also need to change the whole pointer structure used by return and created by call.
I read this as an answer on another stackoverflow question:
"My intuition is the following. The stack is not as easy to manage as the heap. The stack need to be stored in continuous memory locations. This means that you cannot randomly allocate the stack as needed, but you need to at least reserve virtual addresses for that purpose. The larger the size of the reserved virtual address space, the fewer threads you can create."
Source: Why is the page size of Linux (x86) 4 KB, how is that calculated
But we have loads of memory addresses! So this makes no sense. So why 2MB?
The reason I ask is that allocating memory on the stack is quite safe with respect to dangling pointers and memory leaks:
e.g. I prefer
int foo[5];
instead of
int *foo = malloc(5*sizeof(int));
Since it will deallocate by itself. Also, allocation on the stack is faster than allocation via malloc. However, if I allocate an image (i.e. a JPEG or PNG) on the stack, I risk overflowing the stack.
Another point on this matter: why not also allow this:
int *huge_list_of_data = malloc(1000*sizeof(char), 10 000 000 000*sizeof(char))
where we allocate a list object which initially has a size of 1KB, but where we ask the kernel to allocate it such that the page it is put on is not used for anything else, and to reserve 10GB of pages behind it, which can be (partially) swapped in when necessary.
This way we don't need 10GB of memory, we only need 10GB of memory addresses.
So why no:
void *malloc( unsigned long, unsigned long );
?
In essence: WHY NOT USE THE PAGING SYSTEM OF UNIX TO SOLVE OUR MEMORY ALLOCATION PROBLEMS?
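For what it's worth, POSIX mmap can already express this two-size idea; a rough sketch of the hypothetical malloc(initial, reserved) (the name reserve_malloc is made up for illustration, and the sizes assume a 64-bit machine):

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical two-size allocator: commit "initial" bytes now, but keep
 * "reserved" bytes of contiguous address space free for later growth. */
void *reserve_malloc(size_t initial, size_t reserved)
{
    char *base = mmap(NULL, reserved, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base, initial, PROT_READ | PROT_WRITE) != 0) {
        munmap(base, reserved);
        return NULL;
    }
    return base;   /* grow later with further mprotect calls on the tail */
}

int main(void)
{
    /* 1 KB usable now, 10 GB of addresses reserved behind it */
    char *p = reserve_malloc(1000, 10000000000UL);
    return p == NULL;
}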
Thank you for reading.

Maximum Thread Stack Size .NET?

What is the maximum stack size allowed for a thread in C#.NET 2.0? Also, does this value depend on the version of the CLR and/or the bitness (32 or 64) of the underlying OS?
I have looked at the following resources: msdn1 and msdn2
public Thread(
    ThreadStart start,
    int maxStackSize
)
The only information I can see is that the default size is 1 megabyte, and that in the above method, if maxStackSize is 0, the default maximum stack size specified in the header for the executable will be used. What's the maximum value we can change the value in the header up to? And is it advisable to do so? Thanks.
For the record, this fits Raymond Chen's category of "if you need to know then you are doing something wrong".
The default stack size for threads running 64-bit code is 4 megabytes, 1 megabyte for 32-bit code. While the Thread constructor lets you pass an integer value up to int.MaxValue, you'll never get that on a 32-bit machine. The stack must fit in an available hole in the virtual memory address space, which usually tops out at ~600 MB early in the process lifetime, rapidly getting smaller as you allocate memory and fragment the address space.
Allocating more than the default is quite unnecessary. You might contemplate doing this when you have a heavily recursive method that blows the stack. Don't; fix the algorithm, or you'll blow it anyway when the job gets bigger.
The smallest stack that .NET lets you choose is 250 KB; it silently rounds up if you pass a value that's smaller. This is necessary because both the jitter and the garbage collector need stack space to get their job done. Again, doing so should be quite unnecessary. If you contemplate doing so because you have a lot of threads and consume all virtual memory with their stacks, then you have too many threads. A StackOverflowException is one of the nastiest runtime exceptions you can get: process death is immediate and untrappable.
The stack size for the main thread is determined by an option in the EXE header. The compiler doesn't have an option to change it; you have to use editbin.exe /stack to patch the .exe header.
I am unaware of what the maximum is, but MSDN speaks to whether you should do it or not:
Avoid using this constructor overload. The default stack size used by the Thread(ThreadStart) constructor overload is the recommended stack size for threads. If a thread has memory problems, the most likely cause is programming error, such as infinite recursion.
I have never had a StackOverflowException occur in C# that was not due to infinite recursion. If there truly were a case where recursion went to that depth, I would consider replacing it with iteration.
