Why does a (C) stack have a maximum of 2 MB? - linux

This question is about stack overflows, so where better to ask it than here.
If we consider how memory is used for a program (a.out) in unix, it is something like this:
| etext | stack, 2mb | heap ->>>
And I have wondered for a few years now why there is a restriction of 2MB for the stack. Consider that we have 64 bits for a memory address, then why not allocate like this:
| MIN_ADDR MAX_ADDR|
| heap ->>>> <<<- stack | etext |
MAX_ADDR will be somewhere near 2^64 and MIN_ADDR somewhere near 2^0, so there are many bytes in between which the program can use but which are not necessarily backed by the kernel (i.e. no pages are actually assigned to them). The heap and the stack will probably never reach each other, and hence the 2MB limit is not needed (the limit would instead be ~1.8446744e+19 bytes). If we are scared that they will reach each other, then set the limit to 2^63 or some other bizarre, enormous number.
Furthermore, the heap grows from low to high, so our kernel can still resize blocks of memory (allocated with for example malloc) without necessarily needing to shift the content.
Moreover, a stack frame is essentially static in size. So we never need to resize it; and if we did, that would be awkward anyway, since we would also need to fix up the whole pointer structure created by call and used by return.
I read this as an answer on another stackoverflow question:
"My intuition is the following. The stack is not as easy to manage as the heap. The stack need to be stored in continuous memory locations. This means that you cannot randomly allocate the stack as needed, but you need to at least reserve virtual addresses for that purpose. The larger the size of the reserved virtual address space, the fewer threads you can create."
Source: Why is the page size of Linux (x86) 4 KB, how is that calculated
But we have loads of memory addresses! So this makes no sense. So why 2MB?
The reason I ask is that allocating memory on the stack is quite safe with respect to dangling pointers and memory leaks:
e.g. I prefer
int foo[5];
instead of
int *foo = malloc(5*sizeof(int));
Since it will deallocate by itself. Also, allocation on the stack is faster than allocation done by malloc. However, if I allocate an image (e.g. a JPEG or PNG) on the stack, I am in danger of overflowing the stack.
Another point on this matter, why not also allow this:
int *huge_list_of_data = malloc(1000*sizeof(char), 10000000000*sizeof(char));
where we allocate a list object, which has initially the size of 1KB, but we ask the kernel to allocate it such that the page it is put on is not used for anything else, and that we want to have 10GB of pages behind it, which can be (partially) swapped in when necessary.
This way we don't need 10GB of memory, we only need 10GB of memory addresses.
So why no:
void *malloc( unsigned long, unsigned long );
?
In essence: WHY NOT USE THE PAGING SYSTEM OF UNIX TO SOLVE OUR MEMORY ALLOCATION PROBLEMS?
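(Aside: the closest existing mechanism I know of is mmap. On Linux you can reserve a huge range of virtual addresses with MAP_ANONYMOUS, optionally adding MAP_NORESERVE, and demand paging commits physical pages only when they are touched. A minimal sketch, where the 10 GB figure is purely illustrative:)

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserve = 10ULL * 1024 * 1024 * 1024;   /* 10 GB of address space, not RAM */

    /* Reserve the addresses; MAP_NORESERVE asks the kernel not to account
       swap for the whole range up front. No physical pages are used yet. */
    char *p = mmap(NULL, reserve, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 1;              /* first touch: the kernel now backs this page */
    p[reserve - 1] = 1;    /* ...and this one; everything in between stays unbacked */

    munmap(p, reserve);
    return 0;
}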
Thank you for reading.

Related

Is stack memory contiguous physically in Linux?

As far as I can see, stack memory is contiguous in the virtual address space, but is stack memory also contiguous physically? And does this have something to do with the stack size limit?
Edit:
I used to believe that stack memory doesn't have to be physically contiguous, but then why do we think that stack memory is always quicker than heap memory? If it's not physically contiguous, how can the stack take more advantage of the cache? And there is another thing that always confuses me: the CPU executes instructions from the text segment, which is not near the stack segment in virtual memory, and I don't think the operating system will place the text segment and the stack segment close to each other physically, so this might hurt the cache effect. What do you think?
Edit again:
Maybe I should give an example to express myself better: if we want to sort a large amount of numbers, using an array to store them is better than using a list, because every list node may come from a separate malloc, so it may not take good advantage of the cache. That's why I say stack memory is quicker than heap memory.
As far as I can see, stack memory is contiguous in the virtual address space, but is stack memory also contiguous physically? And does this have something to do with the stack size limit?
No, stack memory is not necessarily contiguous in the physical address space. It's not related to the stack size limit. It's related to how the OS manages memory. The OS only allocates a physical page when the corresponding virtual page is accessed for the first time (or for the first time since it got paged out to the disk). This is called demand-paging, and it helps conserve memory usage.
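One way to see demand paging at work is to map a large anonymous region and watch the resident set size grow only as pages are touched. A rough sketch, assuming Linux and the VmRSS field of /proc/self/status:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Print the VmRSS line (resident set size) of the current process. */
static void print_rss(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", label, line);
    fclose(f);
}

int main(void)
{
    size_t len = 256 * 1024 * 1024;   /* 256 MB of virtual address space */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    print_rss("before touching:");
    for (size_t i = 0; i < len; i += 4096)   /* touch every page once */
        p[i] = 1;
    print_rss("after touching: ");

    munmap(p, len);
    return 0;
}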
why do we think that stack memory is always quicker than heap memory? If it's not physically contiguous, how can the stack take more advantage of the cache?
It has nothing to do with the cache. It's just faster to allocate and deallocate memory from the stack than from the heap. That's because allocating and deallocating from the stack takes only a single instruction (incrementing or decrementing the stack pointer). On the other hand, there is a lot more work involved in allocating and/or deallocating memory from the heap. See this article for more information.
Now, once memory is allocated (from the heap or the stack), the time it takes to access that memory region does not depend on whether it's stack or heap memory. It depends on the memory access pattern and whether it's friendly to the cache and memory architecture.
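To make the "single instruction" point concrete, here is a small sketch; the comments describe what a typical x86-64 compiler emits, which is an observation about common code generation rather than a guarantee:

#include <stdlib.h>

void on_stack(void)
{
    char buf[4096];          /* "allocation" is just part of the function prologue:
                                roughly sub rsp, 4096; deallocation is the epilogue */
    buf[0] = 0;              /* use it so the compiler keeps it */
}

void on_heap(void)
{
    char *buf = malloc(4096);  /* calls into the allocator: find a free block,
                                  maybe split it, maybe ask the kernel for pages */
    if (buf)
        buf[0] = 0;
    free(buf);                 /* more bookkeeping: return the block, maybe coalesce */
}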
if we want to sort a large amount of numbers, using an array to store them is better than using a list, because every list node may come from a separate malloc, so it may not take good advantage of the cache. That's why I say stack memory is quicker than heap memory.
Using an array is faster not because arrays are allocated from the stack. Arrays can be allocated from any memory (stack, heap, or anywhere else). It's faster because arrays are usually accessed contiguously, one element at a time. When the first element is accessed, a whole cache line that contains it and several neighbouring elements is fetched from memory into the L1 cache, so accessing the other elements in that cache line is very cheap; only accessing the first element in the cache line is still slow (unless the cache line was prefetched). Another key point: since cache lines are 64-byte aligned and page boundaries fall on multiples of 64 bytes (pages are 4 KB or larger), any cache line fully resides within a single virtual page and a single physical page, which is what makes fetching cache lines efficient. Again, all of this has nothing to do with whether the array was allocated from the stack or the heap. It holds true either way.
On the other hand, since the elements of a linked list are typically not contiguous (not even in the virtual address space), then a cache line that contains an element may not contain any other elements. So fetching every single element can be more expensive.
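A small sketch of the two access patterns: the array sum walks memory sequentially and so gets many elements per cache line, while the list sum chases pointers that may land anywhere in the heap:

#include <stddef.h>

struct node {
    int value;
    struct node *next;       /* each node comes from a separate malloc,
                                so consecutive nodes need not be adjacent */
};

long sum_array(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];           /* sequential: one 64-byte cache line serves 16 ints */
    return s;
}

long sum_list(const struct node *head)
{
    long s = 0;
    for (const struct node *p = head; p; p = p->next)
        s += p->value;       /* each step may touch a different cache line (or page) */
    return s;
}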
Memory is memory. Stack memory is no faster than heap memory and no slower; it is all the same. The only thing that makes memory a stack or a heap is how it is allocated by the application. It is entirely possible to allocate memory on the heap and make that the program stack.
The speed difference is in the allocation. Stack memory is allocated by subtracting from the stack pointer: one instruction.
The process of allocating from the heap depends on the heap manager, but it is much more complex and may require mapping pages into the address space.
No, there is no promise of contiguity of physical addresses. But it doesn't matter, because user-space programs don't use physical addresses, so have no idea that this is the case.
It is a complex topic.
The heap and the stack (usually) use the same kind of memory and the same memory type (MTRR, per-page cache settings, etc.). [mmap, files and drivers can use different strategies, or the user can explicitly change it.]
The stack could be faster because it is used so often. When you call a function, parameters and local variables are put on the stack, so the cache there is fresh. Additionally, because functions are called and return frequently, more of the stack probably sits in the outer cache levels, and the top of the stack is seldom paged out (because it was used recently).
So the cache could make it faster, but only if you have few variables. If you put large arrays on the stack, e.g. with alloca, the advantage disappears.
In general this is a very complex topic, and it is better not to optimize too much, because it can lead to complex code that is harder to refactor and gets in the way of high-level optimization. (E.g. with multi-dimensional arrays, the order of the indices (and thus of memory) and of the loops can noticeably improve speed, but the code also quickly becomes impossible to maintain.)

Increase stack size

I'm doing computations with huge arrays, and for some of these computations I need an increased stack size. Is there any downside to setting the stack size to unlimited (ulimit -s unlimited) in my ~/.bashrc?
The program is written in Fortran (F77 & F90) and parallelized with MPI. Some of my arrays have more than 2E7 entries, and when I use a small number of cores with MPI it crashes with a segmentation fault.
The array size stays the same throughout the whole computation, so I set them to fixed values:
real :: p(200,200,400)
integer :: ib,ie,jb,je,kb,ke
...
ib=1;ie=199
jb=2;je=198
kb=2;ke=398
call SOLVE_POI_EQ(rank,p(ib:ie,jb:je,kb:ke),R)
Setting the stacksize to unlimited likely won't help you. You are allocating a chunk of 64MB on the stack, and likely don't fill it from the top, but from the bottom.
This is important, because the OS grows the stack as you go: whenever it detects a page fault just below the stack segment, it assumes that you need more stack space and silently maps in a new page. The size of this trigger region within your address space is limited, though, and I doubt that it's larger than 64 MB. Since your index variables are likely placed below your array on the stack, accessing them already makes the 64 MB jump that kills your process.
Just make your array allocatable, add the corresponding allocate() statement, and you should be fine.
Stack size is never really unlimited, so you would still have some failures. And your code still won't be portable to Linux systems with smaller (or normal-sized) stacks.
BTW, you should explain what kind of program you are running and show some source code.
If coding in C++, using standard containers should help a lot (regarding actual stack consumption). For example, a local (stack allocated) std::vector<int> v(10000); (instead of int v[10000];) has its data allocated on the heap (and deallocated by the destructor when you exit from the block defining it)
It would be much better to improve your programs to avoid excessive stack consumption. The need for a lot of stack space is really a bug that you should try to correct. A typical rule of thumb is to keep call frames smaller than a few kilobytes (so allocate any larger data on the heap).
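In C, following that rule of thumb for the kind of array in the question above looks roughly like the sketch below (the 200x200x400 shape just mirrors the Fortran example and is only illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 200UL * 200 * 400;           /* ~64 MB of floats: far too big for a stack frame */

    /* float big[200*200*400];  -- this automatic array is what overflows the stack */

    float *big = malloc(n * sizeof *big);   /* heap allocation keeps the frame tiny */
    if (!big) { perror("malloc"); return 1; }

    big[0] = 1.0f;
    /* ... computation ... */

    free(big);                              /* explicit, unlike stack storage */
    return 0;
}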
You might also consider using the Boehm conservative garbage collector: you would use GC_MALLOC instead of malloc (and heap-allocate large data structures with GC_MALLOC), but you won't have to bother freeing your (GC-heap allocated) data.

Memory: Stack and Swap

When there isn't enough RAM, dynamically allocated variables on the heap can take advantage of swap space on the disk (albeit causing performance degradations). My question is if the stack in memory can take advantage of the swap space as well.
For example, the following program places a large array on the stack. (Of course, usually we would dynamically allocate large variables on the heap.) If this program crashes when run, can I make it run successfully by adding swap space?
int main()
{
    int myArray[1000000];
    return 0;
}
Actually, that is exactly what swap does: it swaps out program data and stack space:
http://www.linuxjournal.com/article/10678
These are placed in anonymous pages, so named because they have no named filesystem source. Once modified, an anonymous page must remain in RAM for the duration of the program unless there is secondary storage to write it to. The secondary storage used for these modified anonymous pages is what we call swap space.
The classic recommendations on systems that do strict VM accounting vary, but most of them hover around a "twice the amount of RAM" figure. That number assumes your memory mostly will be filled with a bunch of small interactive programs (where their stack space is possibly their largest memory demand).
Say you're running a Web server with 500 threads, each with 8MB of stack space. That stack space alone is going to require that you have 4GB of swap space configured for the memory accountant to be happy.

Total stack sizes of threads in one process

I use pthread_attr_getstacksize() to get the default stack size of one thread: 8 MB on my machine.
But when I create 8 threads and give each of them a very large stack size, say hundreds of MB, the program crashes.
So I guess the rule is: ("number of threads" * "stack size per thread") < some constant value (e.g. the virtual memory size)?
The short answer is "Yes".
The longer answer is that all of your threads share one virtual address space, and the userspace-usable part of that space must therefore be large enough to contain all thread stacks (as well as the code, static data, heap, libraries and any miscellaneous mappings).
Multi-hundred-megabyte stacks are a good indication that You're Doing It Wrong, as they say in the classics.
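For reference, a sketch of setting an explicit per-thread stack size with the POSIX API (compile with -pthread; the 1 MB figure is arbitrary, the point being that number-of-threads times stack size must fit comfortably inside the process's address space):

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 8

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREADS];
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 1024 * 1024);   /* 1 MB per thread instead of the ~8 MB default */

    for (int i = 0; i < N_THREADS; i++)
        if (pthread_create(&tid[i], &attr, worker, NULL) != 0) {
            fprintf(stderr, "pthread_create failed for thread %d\n", i);
            return 1;
        }

    for (int i = 0; i < N_THREADS; i++)
        pthread_join(tid[i], NULL);

    pthread_attr_destroy(&attr);
    return 0;
}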

A way to determine a process's "real" memory usage, i.e. private dirty RSS?

Tools like 'ps' and 'top' report various kinds of memory usages, such as the VM size and the Resident Set Size. However, none of those are the "real" memory usage:
Program code is shared between multiple instances of the same program.
Shared library program code is shared between all processes that use that library.
Some apps fork off processes and share memory with them (e.g. via shared memory segments).
The virtual memory system makes the VM size report pretty much useless.
RSS is 0 when a process is swapped out, making it not very useful.
Etc etc.
I've found that the private dirty RSS, as reported by Linux, is the closest thing to the "real" memory usage. This can be obtained by summing all Private_Dirty values in /proc/somepid/smaps.
However, do other operating systems provide similar functionality? If not, what are the alternatives? In particular, I'm interested in FreeBSD and OS X.
On OS X, Activity Monitor actually gives you a very good estimate.
Private memory is memory that is definitely used only by your application. E.g. stack memory and all memory dynamically reserved using malloc() and comparable functions/methods (the alloc method in Objective-C) is private memory. If you fork, private memory will be shared with your child, but marked copy-on-write. That means as long as a page is not modified by either process (parent or child) it is shared between them. As soon as either process modifies any page, that page is copied before it is modified. Even while this memory is shared with fork children (and it can only be shared with fork children), it is still shown as "private" memory, because in the worst case every page of it will get modified (sooner or later) and then it is private to each process again.
Shared memory is either memory that is currently shared (the same pages are visible in the virtual address space of different processes) or that is likely to become shared in the future (e.g. read-only memory, since there is no reason not to share read-only memory). At least that's how I read the source code of some command line tools from Apple. So if you share memory between processes using mmap (or a comparable call that maps the same memory into multiple processes), this is shared memory. However, the executable code itself is also shared memory, since if another instance of your application is started, there is no reason why it may not share the code already loaded in memory (executable code pages are read-only by default, unless you are running your app in a debugger). Thus shared memory is really memory used by your application, just like private memory, but it might additionally be shared with another process (or it might not, but why would it not count towards your application if it were shared?)
Real memory is the amount of RAM currently "assigned" to your process, no matter whether private or shared. This can be exactly the sum of private and shared, but usually it is not. Your process might have more memory assigned to it than it currently needs (this speeds up requests for more memory in the future), but that is no loss to the system. If another process needs memory and no free memory is available, then before the system starts swapping it will take that extra memory away from your process and assign it to the other process (which is a fast and painless operation); your next malloc call might therefore be somewhat slower. Real memory can also be smaller than private plus shared memory; this is because when your process requests memory from the system, it only receives "virtual memory". This virtual memory is not linked to any real memory pages as long as you don't use it (so if you malloc 10 MB of memory and use only one byte of it, your process will get only a single page, 4096 bytes, of memory assigned; the rest is only assigned if you ever actually need it). Further, memory that is swapped out may not count towards real memory either (I'm not sure about this), but it will count towards shared and private memory.
Virtual memory is the sum of all address blocks that are considered valid in your app's process space. These addresses might be linked to physical memory (which is again private or shared), or they might not be; in that case they will be linked to physical memory as soon as you use the address. Accessing memory addresses outside of the known addresses will cause a SIGBUS and your app will crash. When memory is swapped out, the virtual address space for this memory remains valid, and accessing those addresses causes the memory to be swapped back in.
Conclusion:
If your app does not explicitly or implicitly use shared memory, private memory is the amount of memory your app needs because of the stack size (or sizes if multithreaded) and because of the malloc() calls you made for dynamic memory. You don't have to care a lot for shared or real memory in that case.
If your app uses shared memory, and this includes a graphical UI, where memory is shared between your application and the WindowServer for example, then you might have a look at shared memory as well. A very high shared memory number may mean you have too many graphical resources loaded in memory at the moment.
Real memory is of little interest for app development. If it is bigger than the sum of shared and private, then this means nothing other than that the system is lazy about taking memory away from your process. If it is smaller, then your process has requested more memory than it actually needed, which is not bad either, since as long as you don't use all of the requested memory, you are not "stealing" memory from the system. If it is much smaller than the sum of shared and private, you might consider requesting less memory where possible, as you are over-requesting a bit (again, this is not bad, but it tells me that your code is not optimized for minimal memory usage, and if it is cross-platform, other platforms may not have such sophisticated memory handling; so you may prefer to allocate many small blocks instead of a few big ones, or free memory a lot sooner, and so on).
If you are still not happy with all that information, you can get even more information. Open a terminal and run:
sudo vmmap <pid>
where <pid> is the process ID of your process. This will show you statistics for EVERY block of memory in your process space, with start and end address. It will also tell you where this memory came from (a mapped file? stack memory? malloc'ed memory? a __DATA or __TEXT section of your executable?), how big it is in KB, the access rights, and whether it is private, shared or copy-on-write. If it is mapped from a file, it will even give you the path to the file.
If you want only "actual" RAM usage, use
sudo vmmap -resident <pid>
Now it will show for every memory block how big the memory block is virtually and how much of it is really currently present in physical memory.
At the end of each dump is also an overview table with the sums of different memory types. This table looks like this for Firefox right now on my system:
REGION TYPE [ VIRTUAL/RESIDENT]
=========== [ =======/========]
ATS (font support) [ 33.8M/ 2496K]
CG backing stores [ 5588K/ 5460K]
CG image [ 20K/ 20K]
CG raster data [ 576K/ 576K]
CG shared images [ 2572K/ 2404K]
Carbon [ 1516K/ 1516K]
CoreGraphics [ 8K/ 8K]
IOKit [ 256.0M/ 0K]
MALLOC [ 256.9M/ 247.2M]
Memory tag=240 [ 4K/ 4K]
Memory tag=242 [ 12K/ 12K]
Memory tag=243 [ 8K/ 8K]
Memory tag=249 [ 156K/ 76K]
STACK GUARD [ 101.2M/ 9908K]
Stack [ 14.0M/ 248K]
VM_ALLOCATE [ 25.9M/ 25.6M]
__DATA [ 6752K/ 3808K]
__DATA/__OBJC [ 28K/ 28K]
__IMAGE [ 1240K/ 112K]
__IMPORT [ 104K/ 104K]
__LINKEDIT [ 30.7M/ 3184K]
__OBJC [ 1388K/ 1336K]
__OBJC/__DATA [ 72K/ 72K]
__PAGEZERO [ 4K/ 0K]
__TEXT [ 108.6M/ 63.5M]
__UNICODE [ 536K/ 512K]
mapped file [ 118.8M/ 50.8M]
shared memory [ 300K/ 276K]
shared pmap [ 6396K/ 3120K]
What does this tell us? E.g. the Firefox binary and all the libraries it loads have 108 MB of data in their __TEXT sections together, but only 63 MB of those are currently resident in memory. The font support (ATS) needs 33 MB, but only about 2.5 MB are really in memory. It uses a bit over 5 MB of CG backing stores (CG = Core Graphics); those are most likely window contents, buttons, images and other data that is cached for fast drawing. It has requested 256 MB via malloc calls, and currently 247 MB of that are really mapped to memory pages. It has 14 MB of space reserved for stacks, but only 248 KB of stack space is really in use right now.
vmmap also has a good summary above the table
ReadOnly portion of Libraries: Total=139.3M resident=66.6M(48%) swapped_out_or_unallocated=72.7M(52%)
Writable regions: Total=595.4M written=201.8M(34%) resident=283.1M(48%) swapped_out=0K(0%) unallocated=312.3M(52%)
And this shows an interesting aspect of OS X: for read-only memory that comes from libraries, it plays no role whether it is swapped out or simply unallocated; there is only resident and not resident. For writable memory this makes a difference (in my case 52% of all requested memory has never been used and is thus unallocated, and 0% of memory has been swapped out to disk).
The reason for that is simple: Read-only memory from mapped files is not swapped. If the memory is needed by the system, the current pages are simply dropped from the process, as the memory is already "swapped". It consisted only of content mapped directly from files and this content can be remapped whenever needed, as the files are still there. That way this memory won't waste space in the swap file either. Only writable memory must first be swapped to file before it is dropped, as its content wasn't stored on disk before.
On Linux, you may want the PSS (proportional set size) numbers in /proc/self/smaps. A mapping's PSS is its RSS divided by the number of processes which are using that mapping.
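For completeness, a rough sketch of summing those Pss: lines for a single process in C; it assumes the usual "Pss: <amount> kB" layout of smaps and takes the PID as a command-line argument:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char path[64], line[256];
    long total_kb = 0, kb;

    snprintf(path, sizeof path, "/proc/%s/smaps", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    while (fgets(line, sizeof line, f))
        if (sscanf(line, "Pss: %ld kB", &kb) == 1)   /* one Pss: line per mapping */
            total_kb += kb;

    fclose(f);
    printf("PSS: %ld kB\n", total_kb);
    return 0;
}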
Top knows how to do this. It shows VIRT, RES and SHR by default on Debian Linux. VIRT = SWAP + RES. RES = CODE + DATA. SHR is the memory that may be shared with another process (shared library or other memory.)
Also, 'dirty' memory is merely RES memory that has been used, and/or has not been swapped.
It can be hard to tell, but the best way to understand is to look at a system that isn't swapping. Then, RES - SHR is the process exclusive memory. However, that's not a good way of looking at it, because you don't know that the memory in SHR is being used by another process. It may represent unwritten shared object pages that are only used by the process.
You really can't.
I mean, shared memory between processes... are you going to count it, or not? If you don't count it, you are wrong: the sum of all processes' memory usage is not going to be the total memory usage. If you count it, you are going to count it twice - the sum is not going to be correct.
Me, I'm happy with RSS. And knowing you can't really rely on it completely...
You can get private dirty and private clean RSS from /proc/pid/smaps
Take a look at smem. It will give you PSS information
http://www.selenic.com/smem/
Reworked this to be much cleaner, to demonstrate some proper best practices in bash, and in particular to use awk instead of bc.
find /proc/ -maxdepth 1 -name '[0-9]*' -print0 | while read -r -d $'\0' pidpath; do
    [ -f "${pidpath}/smaps" ] || continue
    awk '!/^Private_Dirty:/ {next;}
        $3=="kB" {pd += $2 * (1024^1); next}
        $3=="mB" {pd += $2 * (1024^2); next}
        $3=="gB" {pd += $2 * (1024^3); next}
        $3=="tB" {pd += $2 * (1024^4); next}
        $3=="pB" {pd += $2 * (1024^5); next}
        {print "ERROR!! "$0 >"/dev/stderr"; exit(1)}
        END {printf("%10d: %d\n", '"${pidpath##*/}"', pd)}' "${pidpath}/smaps" || break
done
On a handy little container on my machine, with | sort -n -k 2 to sort the output, this looks like:
56: 106496
1: 147456
55: 155648
Use the mincore(2) system call. Quoting the man page:
DESCRIPTION
The mincore() system call determines whether each of the pages in the
region beginning at addr and continuing for len bytes is resident. The
status is returned in the vec array, one character per page. Each
character is either 0 if the page is not resident, or a combination of
the following flags (defined in <sys/mman.h>):
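A usage sketch: map a few pages, touch only the first one, and ask mincore() which pages are resident. This assumes the Linux prototype (unsigned char *vec); on FreeBSD the vector parameter is a plain char *:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = 16;
    size_t len = npages * (size_t)page;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 1;                      /* touch the first page only */

    unsigned char vec[16];
    if (mincore(p, len, vec) != 0) { perror("mincore"); return 1; }

    for (size_t i = 0; i < npages; i++)
        printf("page %2zu: %s\n", i, (vec[i] & 1) ? "resident" : "not resident");

    munmap(p, len);
    return 0;
}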
For a question that mentioned FreeBSD, I'm surprised no one has written this yet:
If you want a Linux-style /proc/PROCESSID/status output, do the following:
mount -t linprocfs none /proc
cat /proc/PROCESSID/status
At least in FreeBSD 7.0, the mounting was not done by default (7.0 is a much older release, but for something this basic the answer was hidden in a mailing list!)
Check out the source code of gnome-system-monitor: it considers the memory "really used" by one process (info->mem) to be the sum of X Server Memory (info->memxserver) and Writable Memory (info->memwritable), where "Writable Memory" is the set of memory blocks marked as "Private_Dirty" in the /proc/PID/smaps file.
On systems other than Linux, the calculation is different, as the gnome-system-monitor code shows.
static void
get_process_memory_writable (ProcInfo *info)
{
    glibtop_proc_map buf;
    glibtop_map_entry *maps;
    maps = glibtop_get_proc_map(&buf, info->pid);

    gulong memwritable = 0;
    const unsigned number = buf.number;

    for (unsigned i = 0; i < number; ++i) {
#ifdef __linux__
        memwritable += maps[i].private_dirty;
#else
        if (maps[i].perm & GLIBTOP_MAP_PERM_WRITE)
            memwritable += maps[i].size;
#endif
    }

    info->memwritable = memwritable;
    g_free(maps);
}
static void
get_process_memory_info (ProcInfo *info)
{
    glibtop_proc_mem procmem;
    WnckResourceUsage xresources;

    wnck_pid_read_resource_usage (gdk_screen_get_display (gdk_screen_get_default ()),
                                  info->pid,
                                  &xresources);

    glibtop_get_proc_mem(&procmem, info->pid);

    info->vmsize     = procmem.vsize;
    info->memres     = procmem.resident;
    info->memshared  = procmem.share;
    info->memxserver = xresources.total_bytes_estimate;

    get_process_memory_writable(info);

    // fake the smart memory column if writable is not available
    info->mem = info->memxserver + (info->memwritable ? info->memwritable : info->memres);
}
