Why does a JVM report more committed memory than the Linux process resident set size?

When running a Java app (in YARN) with native memory tracking enabled (-XX:NativeMemoryTracking=detail see https://docs.oracle.com/javase/8/docs/technotes/guides/vm/nmt-8.html and https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html), I can see how much memory the JVM is using in different categories.
My app on jdk 1.8.0_45 shows:
Native Memory Tracking:
Total: reserved=4023326KB, committed=2762382KB
- Java Heap (reserved=1331200KB, committed=1331200KB)
(mmap: reserved=1331200KB, committed=1331200KB)
- Class (reserved=1108143KB, committed=64559KB)
(classes #8621)
(malloc=6319KB #17371)
(mmap: reserved=1101824KB, committed=58240KB)
- Thread (reserved=1190668KB, committed=1190668KB)
(thread #1154)
(stack: reserved=1185284KB, committed=1185284KB)
(malloc=3809KB #5771)
(arena=1575KB #2306)
- Code (reserved=255744KB, committed=38384KB)
(malloc=6144KB #8858)
(mmap: reserved=249600KB, committed=32240KB)
- GC (reserved=54995KB, committed=54995KB)
(malloc=5775KB #217)
(mmap: reserved=49220KB, committed=49220KB)
- Compiler (reserved=267KB, committed=267KB)
(malloc=137KB #333)
(arena=131KB #3)
- Internal (reserved=65106KB, committed=65106KB)
(malloc=65074KB #29652)
(mmap: reserved=32KB, committed=32KB)
- Symbol (reserved=13622KB, committed=13622KB)
(malloc=12016KB #128199)
(arena=1606KB #1)
- Native Memory Tracking (reserved=3361KB, committed=3361KB)
(malloc=287KB #3994)
(tracking overhead=3075KB)
- Arena Chunk (reserved=220KB, committed=220KB)
This shows 2.7GB of committed memory, including 1.3GB of allocated heap and almost 1.2GB of allocated thread stacks (using many threads).
However, when running ps ax -o pid,rss | grep <mypid> or top it shows only 1.6GB of RES/rss resident memory. Checking swap says none in use:
free -m
             total       used       free     shared    buffers     cached
Mem:        129180      99348      29831          0       2689      73024
-/+ buffers/cache:      23633     105546
Swap:        15624          0      15624
Why does the JVM indicate 2.7GB memory is committed when only 1.6GB is resident? Where did the rest go?
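As a sanity check on the numbers above, here is a minimal sketch (in Python, not part of the original post) that re-adds the per-category figures from the NMT summary. The names NMT_SUMMARY and parse_categories are made up for illustration; only the top-level category lines from the question are included.

```python
import re

# Top-level category lines copied from the NMT output in the question.
NMT_SUMMARY = """\
Total: reserved=4023326KB, committed=2762382KB
- Java Heap (reserved=1331200KB, committed=1331200KB)
- Class (reserved=1108143KB, committed=64559KB)
- Thread (reserved=1190668KB, committed=1190668KB)
- Code (reserved=255744KB, committed=38384KB)
- GC (reserved=54995KB, committed=54995KB)
- Compiler (reserved=267KB, committed=267KB)
- Internal (reserved=65106KB, committed=65106KB)
- Symbol (reserved=13622KB, committed=13622KB)
- Native Memory Tracking (reserved=3361KB, committed=3361KB)
- Arena Chunk (reserved=220KB, committed=220KB)
"""

# Matches lines of the form "- Name (reserved=123KB, committed=456KB)".
CATEGORY = re.compile(r"^-\s*(.+?)\s*\(reserved=(\d+)KB, committed=(\d+)KB\)")


def parse_categories(text):
    """Return {category: (reserved_kb, committed_kb)} for each category line."""
    result = {}
    for line in text.splitlines():
        match = CATEGORY.match(line.strip())
        if match:
            result[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return result


categories = parse_categories(NMT_SUMMARY)
reserved_kb = sum(r for r, _ in categories.values())
committed_kb = sum(c for _, c in categories.values())
print("reserved:", reserved_kb, "KB, committed:", committed_kb, "KB")

# Largest committed categories first.
for name, (_, committed) in sorted(categories.items(), key=lambda kv: -kv[1][1])[:3]:
    print(name, committed, "KB")
```

The per-category sums reproduce the Total line (reserved=4023326KB, committed=2762382KB), and the heap and thread stacks together account for most of the committed figure.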

I'm beginning to suspect that stack memory (unlike the JVM heap) seems to be precommitted without becoming resident and over time becomes resident only up to the high water mark of actual stack usage.
Yes, at least on Linux, mmap is lazy unless told otherwise. Anonymous pages are only backed by physical memory once they're written to (reads are not sufficient because of the zero-page optimization).
GC heap memory effectively gets touched by the copying collector or by pre-zeroing (-XX:+AlwaysPreTouch), so it'll always be resident. Thread stacks, on the other hand, aren't affected by this.
For further confirmation you can use pmap -x <java pid> and cross-reference the RSS of various address ranges with the output from the virtual memory map from NMT.
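The lazy behaviour is easy to observe outside the JVM. The following is a rough sketch, assuming Linux with /proc mounted; rss_kb and touch_demo are illustrative names, and the exact numbers will vary from system to system. It maps an anonymous region and only then writes one byte per page.

```python
import mmap


def rss_kb():
    """Current resident set size in KB, read from /proc/self/status (Linux only)."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found in /proc/self/status")


def touch_demo(size=64 * 1024 * 1024):
    """Map `size` bytes anonymously, then write one byte to every page."""
    before = rss_kb()
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    mapped = rss_kb()       # the mapping exists, but no page has been written yet
    for offset in range(0, size, mmap.PAGESIZE):
        region[offset] = 1  # the first write to a page makes the kernel back it
    touched = rss_kb()
    region.close()
    return before, mapped, touched


if __name__ == "__main__":
    before, mapped, touched = touch_demo()
    print("RSS before mmap :", before, "KB")
    print("RSS after mmap  :", mapped, "KB")
    print("RSS after writes:", touched, "KB")
```

On a typical Linux box the second number should stay close to the first, and the third should be larger by roughly the size of the mapping.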
Reserved memory has been mmaped with PROT_NONE, which means the virtual address ranges have entries in the kernel's VMA structs and thus will not be handed out by other mmap/malloc calls. Accessing them is still an error, though: the resulting page fault is forwarded to the process as SIGSEGV.
Reserving address space this way keeps contiguous address ranges available for future use, which in turn simplifies pointer arithmetic.
Committed-but-not-backed-by-storage memory has been mapped with - for example - PROT_READ | PROT_WRITE but accessing it still causes a page fault. But that page fault is silently handled by the kernel by backing it with actual memory and returning to execution as if nothing happened.
I.e. it's an implementation detail/optimization that won't be noticed by the process itself.
To give a breakdown of the concepts:
Used Heap: the amount of memory occupied by live objects according to the last GC
Committed: Address ranges that have been mapped with something other than PROT_NONE. They may or may not be backed by physical memory or swap due to lazy allocation and paging.
Reserved: The total address range that has been pre-mapped via mmap for a particular memory pool.
The reserved − committed difference consists of PROT_NONE mappings, which are guaranteed to not be backed by physical memory.
Resident: Pages which are currently in physical RAM. This means code, stacks, part of the committed memory pools but also portions of mmaped files which have recently been accessed and allocations outside the control of the JVM.
Virtual: The sum of all virtual address mappings. Covers committed, reserved memory pools but also mapped files or shared memory. This number is rarely informative since the JVM can reserve very large address ranges in advance or mmap large files.


Find exact physical memory usage in Ubuntu/Linux

(I'm new to Linux)
Say I have 1300 MB of memory on an Ubuntu machine. The OS and other default programs consume 300 MB of memory, and 1000 MB is free for my own applications.
I installed my application and configured it to use 700 MB of memory when the application starts.
However, I couldn't verify its actual memory usage, even after I disabled swap space.
The "VIRT" value is huge, while "RES", "SHR" and "%MEM" show very small values.
It is difficult to find the actual physical memory usage, in the way that "Resource Monitor" in Windows would say that my application is using 700 MB of memory.
Is there any way to find the actual physical memory usage in Ubuntu/Linux?
TL;DR - Virtual memory is complicated.
The best measure of a Linux process's current usage of physical memory is RES.
The RES value represents the sum of all of the process's pages that are currently resident in physical memory. It includes resident code pages and resident data pages. It also includes shared pages (SHR) that are currently RAM resident, though these pages cannot be exclusively ascribed to >>this<< process.
The VIRT value is actually the sum of all notionally allocated pages for the process, and it includes both pages that are currently RAM resident and pages that are currently swapped to disk.
See https://stackoverflow.com/a/56351211/1184752 for another explanation.
Note that RES is giving you (roughly) instantaneous RAM usage. That is what you asked about ...
The "actual" memory usage over time is more complicated because the OS's virtual memory subsystem is typically swapping pages in and out according to demand. So, for example, some of your application's pages may not have been accessed recently, and the OS may then swap them out (to swap space) to free up RAM for other pages required by your application ... or something else.
The VIRT value, while actually representing virtual address space, is a good approximation of total (virtual) memory usage. However, it may be an over-estimate:
Some pages in a process's address space are shared between multiple processes. This includes read-only code segments, pages shared between parent and child processes between vfork and exec, and shared memory segments created using mmap.
Some pages may be set to have illegal access (e.g. for stack red-zones) and may not be backed by either RAM or swap device pages.
Some pages of the address space in certain states may not have been committed to either RAM or disk yet ... depending on how the virtual memory system is implemented. (Consider the case where a process requests a huge memory segment and neither reads from it nor writes to it. It is possible that the virtual memory implementation will not allocate RAM pages until the first read or write in the page. And if you use lazy swap reservation, swap pages may not be committed either. But beware that you can get into trouble with lazy swap reservation.)
VIRT can also be an under-estimate because the OS usually reserves swap space for all pages ... whether they are currently swapped in or swapped out. So if you count the RAM and swap versions of a given page as separate units of storage, VIRT usually underestimates the total storage used.
Finally, if your real goal is to limit your application to using at most 700 MB (of virtual address space) then you can use ulimit -v ... to do this. If the application tries to request memory beyond its limit, the request fails.
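To illustrate that last point: the limit behind ulimit -v is RLIMIT_AS, and a process can also set it on itself. The sketch below is only an illustration, assuming Linux and Python's standard resource module; the 700 MiB figure mirrors the question, and CHILD and run_limited_child are made-up names. It starts a child interpreter that caps its own address space and then asks for more than the cap.

```python
import subprocess
import sys

# Code run in a child interpreter so the limit does not affect this process.
CHILD = r"""
import resource

limit = 700 * 1024 * 1024            # 700 MiB of virtual address space
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)         # the soft limit may not exceed the hard limit
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))

try:
    buf = bytearray(1024 * 1024 * 1024)   # ask for 1 GiB, more than the limit
    print("allocated")
except MemoryError:
    print("MemoryError")
"""


def run_limited_child():
    """Run CHILD in a fresh interpreter and return what it printed."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )
    return done.stdout.strip()


if __name__ == "__main__":
    print(run_limited_child())
```

Note that the limit applies to virtual address space, not to resident memory, so a program that maps much more than it touches can hit it long before RES gets anywhere near 700 MB.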

Linux stack resident memory not reclaimed after stack unwind

Linux doesn't reclaim memory when it's not used anymore, if allocated on stack.
I dynamically allocate (malloc/mmap) 1GB on heap.
Before the allocation:
$ top
virtual memory 1GB
resident memory ~ 0
memset 1GB
$ top
virtual memory 1GB
resident memory 1GB
deallocate (free/munmap) of 1GB - reclaimed as expected
$ top
virtual memory 1GB
resident memory ~ 0
I dynamically allocate 1GB on stack.
$ top
virtual memory 1GB
resident memory ~ 0
memset 1GB
$ top
virtual memory 1GB
resident memory 1GB
deallocate (stack unwind) of 1GB - resident memory is still 1GB, even after deallocating! Why?
$ top
virtual memory 1GB
resident memory 1GB
Why, after the stack unwinds, is the memory still resident (i.e. the physical pages are still in use)?
The heap segment is allocated with mmap and the stack segment is also allocated with mmap - so why is there a difference in reclaim behavior?
Because the OS thinks that once you have used that much stack, you probably will do that again. The OS can't really know [from outside your application] what your application is about to do in the future. It would be rather difficult to figure out when it's OK to free some of the stack, and you get all sorts of interesting race conditions in the OS where you have to stop the application from running simply to reduce its stack - and then it suddenly needs it again, so it needs to be allocated.
Using mmap, on the other hand, there is a distinct munmap to tell the OS "I have no interest in this memory". So it gets freed then and there [as part of the munmap call itself - specifically, in zap_pte_range the pages themselves are freed and given back to the OS].
It shouldn't really be a big issue, unless the following conditions are fulfilled:
1. You are running on an embedded system that doesn't have swap.
2. Your application runs for a long period of time after it has returned from using a lot of stack (assuming you actually do need this much memory as stack, you will have to have that memory available WHEN it's needed, so it's obviously only a problem if the application then doesn't need the stack later on and that period is long - whatever your definition of long is).
3. Your system doesn't have enough RAM to fulfil other RAM needs in other applications.
The reason I say that is that although the stack is using that much memory, if the application isn't using the RAM for a long time, and the system is running low on memory, it will swap it out to disk - to be swapped in at a later stage IF it's needed.
I would also say that using such large amounts of stackspace is generally considered a bad idea. Running out of space on stack [either hitting the limit or "there just isn't enough memory available"] is nearly always fatal.
So whilst I often suggest using stack-space to store temporary variables, I think 1GB of stack is quite excessive. A few megabytes should be acceptable, but hundreds of megabytes or more is probably a sign of "you should probably store things in another way".
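The heap half of the experiment in the question (map, touch, unmap) can be reproduced with a few lines. This is only a sketch, assuming Linux with /proc mounted; resident_kb and map_touch_unmap are illustrative names, and it uses an explicit anonymous mapping rather than a real stack.

```python
import mmap
import os

PAGE_KB = os.sysconf("SC_PAGE_SIZE") // 1024


def resident_kb():
    """Resident set size in KB, from the second field of /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as statm:
        return int(statm.read().split()[1]) * PAGE_KB


def map_touch_unmap(size):
    """Touch every page of an anonymous mapping, then unmap it again."""
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for offset in range(0, size, mmap.PAGESIZE):
        region[offset] = 1       # make every page resident
    touched = resident_kb()
    region.close()               # close() unmaps the region (munmap)
    released = resident_kb()
    return touched, released


if __name__ == "__main__":
    touched, released = map_touch_unmap(64 * 1024 * 1024)
    print("resident after touching:", touched, "KB")
    print("resident after munmap  :", released, "KB")
```

The resident figure should drop by about the size of the mapping as soon as it is unmapped. There is no equivalent explicit call when a stack frame is popped, which is why the stack pages stay resident.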

Increase of virtual memory without increase of VmSize

I searched for my problem on Google and on this site but I still don't understand the solution.
I have a piece of an MPI program which receives some data (MPI_RECV). The program crashes on big arrays with an insufficient virtual memory error, so I started looking at the /proc/self/status file.
Before MPI_RECV it was:
Name: model.exe
VmPeak: 841640 kB
VmSize: 841640 kB
VmHWM: 15100 kB
VmRSS: 15100 kB
VmData: 760692 kB
And after:
Name: model.exe
VmPeak: 841640 kB
VmSize: 841640 kB
VmHWM: 719980 kB
VmRSS: 719980 kB
VmData: 760692 kB
I tested it on Ubuntu and saw this memory increase in System Monitor. But I was confused that there were no changes in the VmSize (and VmPeak) values.
And the question is - what is the indicator of real memory usage?
Does it mean that the true indicator is VmRSS (and that VmSize is memory that is only allocated but not yet used)?
(The possible solution to your problem is the last paragraph)
Memory allocation on most modern operating systems with virtual memory is a two-phase process. First, a portion of the virtual address space of the process is reserved and the virtual memory size of the process (VmSize) increases accordingly. This creates entries in the so-called process page table. Pages are initially not associated with physical memory frames, i.e. no physical memory is actually used. Whenever some part of this allocated portion is actually read from or written to, a page fault occurs and the operating system installs (maps) a free page from the physical memory. This increases the resident set size of the process (VmRSS). When some other process needs memory, the OS might store the content of some infrequently used page (the definition of "infrequently used page" is highly implementation-dependent) to some persistent storage (a hard drive in most cases, or generally the swap device) and then unmap it. This process decreases the RSS but leaves VmSize intact. If this page is later accessed, a page fault would again occur and it will be brought back. The virtual memory size only decreases when virtual memory allocations are freed. Note that VmSize also accounts for memory-mapped files (i.e. the executable file and all shared libraries it links to or other explicitly mapped files) and shared memory blocks.
There are two generic types of memory in a process - statically allocated memory and heap memory. The statically allocated memory keeps all constants and global/static variables. It is part of the data segment, whose size is shown by the VmData metric. The data segment also hosts part of the program heap, where dynamic memory is being allocated. The data segment is contiguous, i.e. it starts at a certain location and grows upwards towards the stack (which starts at a very high address and then grows downwards). The problem with the heap in the data segment is that it is managed by a special heap allocator that takes care of subdividing the contiguous data segment into smaller memory chunks. On the other hand, in Linux dynamic memory can also be allocated by directly mapping virtual memory. This is usually done only for large allocations in order to conserve memory, since it only allows memory in multiples of the page size (usually 4 KiB) to be allocated.
The stack is also an important source of heavy memory usage, especially if big arrays are allocated in the automatic (stack) storage. The stack starts near the very top of the usable virtual address space and grows downwards. In some cases it could reach the top of the data segment or it could reach the end of some other virtual allocation. Bad things happen then. The stack size is accounted in the VmStack metric and also in the VmSize.
One can summarise it as so:
VmSize accounts for all virtual memory allocations (file mappings, shared memory, heap memory, whatever memory) and grows almost every time new memory is being allocated. Almost, because if the new heap memory allocation is made in the place of a freed old allocation in the data segment, no new virtual memory would be allocated. It decreases whenever virtual allocations are being freed. VmPeak tracks the max value of VmSize - it could only increase in time.
VmRSS grows as memory is being accessed and decreases as memory is paged out to the swap device.
VmData grows as the data segment part of the heap is being utilised. It almost never shrinks as current heap allocators keep the freed memory in case future allocations need it.
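Applied to the two /proc/self/status snapshots from the question, the summary above can be checked with a small sketch like this one (Python; parse_status and delta are illustrative helper names, and the snapshot text is copied from the question):

```python
BEFORE = """\
Name: model.exe
VmPeak: 841640 kB
VmSize: 841640 kB
VmHWM: 15100 kB
VmRSS: 15100 kB
VmData: 760692 kB
"""

AFTER = """\
Name: model.exe
VmPeak: 841640 kB
VmSize: 841640 kB
VmHWM: 719980 kB
VmRSS: 719980 kB
VmData: 760692 kB
"""


def parse_status(text):
    """Return {field: value_in_kB} for the 'Key: <number> kB' lines of a status dump."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        parts = value.split()
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def delta(before, after):
    """Per-field change in kB between two parsed snapshots."""
    return {key: after[key] - before[key] for key in before if key in after}


changes = delta(parse_status(BEFORE), parse_status(AFTER))
for field, change in changes.items():
    print(f"{field}: {change:+d} kB")
```

VmSize, VmPeak and VmData do not move, while VmRSS and VmHWM both grow by 704880 kB: the receive buffer was already part of the address space and only became resident once MPI_RECV wrote into it.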
If you are running on a cluster with InfiniBand or other RDMA-based fabrics, another kind of memory comes into play - the locked (registered) memory (VmLck). This is memory which is not allowed to be paged out. How it grows and shrinks depends on the MPI implementation. Some never unregister an already registered block (the technical details about why are too complex to be described here), others do so in order to play better with the virtual memory manager.
In your case you say that you are running into a virtual memory size limit. This could mean that this limit is set too low or that you are running into an OS-imposed limit. First, Linux (and most Unixes) have means to impose artificial restrictions through the ulimit mechanism. Running ulimit -v in the shell would tell you what the limit on the virtual memory size is in KiB. You can set the limit using ulimit -v <value in KiB>. This only applies to processes spawned by the current shell and to their children, grandchildren and so on. You need to instruct mpiexec (or mpirun) to propagate this value to all other processes, if they are to be launched on remote nodes. If you are running your program under the control of some workload manager like LSF, Sun/Oracle Grid Engine, Torque/PBS, etc., there are job parameters which control the virtual memory size limit. And last but not least, 32-bit processes are usually restricted to 2 GiB of usable virtual memory.

Difference between "memory cache" and "memory pool"

After reading the two books "Understanding Linux Network Internals" and "Understanding the Linux Kernel", as well as other references, I am quite confused and need some clarification about the "memory cache" and "memory pool" techniques.
1) Are they the same or different techniques?
2) If not the same, what makes the difference, or the distinct goals?
3) Also, how does the Slab Allocator come in?
Regarding the slab allocator:
So imagine memory is flat, that is, you have a block of 4 gigs of contiguous memory. Then one of your programs requests 256 bytes of memory, so what the memory allocator has to do is choose a suitable block of 256 bytes from this 4 gigs. So now your memory looks something like
(each = is a contiguous block of memory). Some time passes and a lot of programs operating on the memory request more blocks of 256 bytes, or of larger or smaller sizes, so in the end your memory might look like:
so it gets fragmented and there is no trace of your beautiful 4 gig block of memory - this is fragmentation. Now, what the slab allocator would do is keep track of allocated objects and, once they are not used anymore, say that the memory is free when in fact it is retained in some sort of list (you might want to read about free lists).
So now imagine that the first program relinquishes the 256 bytes it allocated and then a new one would like to have 256 bytes. Instead of allocating a new chunk of main memory, the allocator can reuse the most recently freed 256 bytes without having to go through the burden of searching physical memory for an appropriate contiguous block of space. This is essentially how you implement the memory cache. This is done so that memory fragmentation is reduced overall, because you might end up in a situation where memory is so fragmented that it is unusable and the memory manager has to do some magic to get you a block of appropriate size, whereas using a slab allocator proactively combats (but doesn't eliminate) the problem.
The Linux memory allocator, a.k.a. the slab allocator, maintains lists/pools of frequently used memory objects of the same or similar size. The slab allocator gives programmers the extra flexibility to create their own pool (cache) of frequently used memory objects of the same size, label it as they want, allocate from it, deallocate to it and finally destroy it. This cache is known to your driver and private to it. But there is a problem: under memory pressure there is a high chance of allocation failures, which may not be acceptable in some drivers. The way around this is to always keep some memory in reserve so that we never feel the memory crunch. Since the kmem cache is a more generic pool mechanism, we need something that always maintains a minimum amount of required memory, and that is the job of the memory pool.
Lookaside Caches - The cache manager in the Linux kernel is sometimes called the slab allocator. You might end up allocating many objects of the same size over and over so by using this mechanism you just can allocate many objects in the same size and then use them later, without the need to allocate many objects over and over.
Memory Pool is just a form of lookaside cache that tries to always keep a list of memory around for use in emergencies, so when the memory pool is created, the allocation functions (slab allocators) create a pool of preallocated objects so you can acquire them when you need.
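To make the distinction concrete, here is a toy user-space model of the two ideas. It is only an illustration of the concepts described above and not the kernel's kmem_cache or mempool API; the class and method names (ObjectCache, MemoryPool, cache_can_allocate) are invented for the example.

```python
class ObjectCache:
    """Toy lookaside cache: freed objects go on a free list and are reused."""

    def __init__(self, size):
        self.size = size
        self.free_list = []
        self.fresh_allocations = 0   # how often we had to allocate new memory

    def alloc(self):
        if self.free_list:
            return self.free_list.pop()   # reuse the most recently freed object
        self.fresh_allocations += 1
        return bytearray(self.size)

    def free(self, obj):
        self.free_list.append(obj)        # retained for reuse, not released


class MemoryPool:
    """Toy pool: keeps `min_nr` preallocated objects in reserve for emergencies."""

    def __init__(self, cache, min_nr):
        self.cache = cache
        self.min_nr = min_nr
        self.reserve = [cache.alloc() for _ in range(min_nr)]

    def alloc(self, cache_can_allocate=True):
        # `cache_can_allocate=False` simulates memory pressure, where the
        # normal allocation path would fail.
        if cache_can_allocate:
            return self.cache.alloc()
        if self.reserve:
            return self.reserve.pop()
        raise MemoryError("pool exhausted")

    def free(self, obj):
        if len(self.reserve) < self.min_nr:
            self.reserve.append(obj)      # refill the reserve first
        else:
            self.cache.free(obj)


if __name__ == "__main__":
    cache = ObjectCache(256)
    first = cache.alloc()
    cache.free(first)
    second = cache.alloc()
    print("reused freed object:", second is first)

    pool = MemoryPool(cache, min_nr=2)
    emergency = pool.alloc(cache_can_allocate=False)
    print("served from reserve, objects left:", len(pool.reserve))
```

The cache only avoids repeated allocation; the pool adds the guarantee that a minimum number of objects is available even when the normal allocation path fails.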

How is Linux calculating MemFree

I am trying to understand my embedded linux memory usage.
By using the top utility and the file /proc/meminfo I can see how much virtual memory a process is using, and how much physical memory is available to the system. But it would seem that for any given process the virtual memory can be very much higher than the used physical memory. As this is an embedded system, memory swapping is disabled (SwapTotal = 0).
How is Linux calculating the free physical memory? It doesn't seem to be accounting for everything allocated in the virtual memory space.
MemFree in /proc/meminfo is a count of how many pages are free in the buddy allocator. This buddy allocator is the fundamental unit of physical memory allocation in the kernel; however there are a lot of ways pages can be returned to the buddy allocator in time of need - for example, freeing empty SLABs, discarding cache/buffer RAM (even if this means invalidating PTEs in a running process), or as a last resort, swapping things out.
In fact, MemFree is generally controlled to be only 5-10% of total physical RAM, with any extra free RAM being co-opted into cache as time goes on. As such, MemFree alone is a very incomplete view of the overall memory situation.
As for the virtual memory (VSIZE) of a given process, this refers to the sum total of the sizes of all mapped memory segments in the process's address space. However, not all of these will be physically present - some may be paged in upon first access and as such will not register as memory in use until actually used. The resident size (RSIZE) is a more accurate view, as it only registers pages that are mapped in right now - although this may also not be accurate if a given page is mapped in multiple virtual addresses (which is very common when you consider multiple processes - shared libraries have the same physical RAM mapped to all processes that are using that library)
Try using htop. You will have to install it with sudo apt-get install htop or yum install htop, depending on your distribution.
It will show you a more accurate representation of memory usage.
Basically, it comes down to "buffers/cache".
free -m
Look at the free column in the -/+ buffers/cache row; this is a more accurate representation of what is actually available.
             total       used       free     shared    buffers     cached
Mem:          3770       3586        183          0        112       1498
-/+ buffers/cache:       1976       1793
Swap:         7624        750       6874
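The -/+ buffers/cache row is plain arithmetic on the Mem row: buffers and cache are subtracted from "used" and added to "free". A minimal sketch, using the numbers from the output above (the function name buffers_cache_row is made up for illustration):

```python
def buffers_cache_row(used, free, buffers, cached):
    """Recompute the '-/+ buffers/cache' row of old-style `free` output (same units in and out)."""
    used_without_cache = used - buffers - cached
    free_with_cache = free + buffers + cached
    return used_without_cache, free_with_cache


# Values in MB, taken from the Mem row above.
row = buffers_cache_row(used=3586, free=183, buffers=112, cached=1498)
print(row)  # (1976, 1793)
```

So of the 3586 MB reported as used, only 1976 MB is used by applications, and 1793 MB can be made available without swapping.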
