What are RSS and VSZ in Linux memory management? In a multithreaded environment, how can both of these be managed and tracked?
RSS is the Resident Set Size and is used to show how much memory is allocated to that process and is in RAM. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.
VSZ is the Virtual Memory Size. It includes all memory that the process can access, including memory that is swapped out, memory that is allocated, but not used, and memory that is from shared libraries.
So if process A has a 500K binary and is linked to 2500K of shared libraries, has 200K of stack/heap allocations of which 100K is actually in memory (rest is swapped or unused), and it has only actually loaded 1000K of the shared libraries and 400K of its own binary then:
RSS: 400K + 1000K + 100K = 1500K
VSZ: 500K + 2500K + 200K = 3200K
Since part of the memory is shared, many processes may use it, so if you add up all of the RSS values you can easily end up with more space than your system has.
The memory that is allocated also may not be in RSS until it is actually used by the program. So if your program allocated a bunch of memory up front, then uses it over time, you could see RSS going up and VSZ staying the same.
There is also PSS (proportional set size). This is a newer measure which tracks the shared memory as a proportion used by the current process. So if there were two processes using the same shared library from before:
PSS: 400K + (1000K/2) + 100K = 400K + 500K + 100K = 1000K
Threads all share the same address space, so the RSS, VSZ and PSS for each thread are identical to those of all the other threads in the process. Use ps or top to view this information on Linux/Unix.
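To read the same numbers from inside a process rather than from ps or top, the kernel exposes them in /proc/self/status. A minimal sketch, assuming a Linux /proc filesystem, that prints the VmSize (VSZ) and VmRSS (RSS) fields for the current process:
#include <stdio.h>
#include <string.h>

/* Print the kernel's own VSZ (VmSize) and RSS (VmRSS) for this process.
 * Both values are reported by the kernel in kB. */
int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    if (!f) {
        perror("/proc/self/status");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}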
There is way more to it than this; to learn more, check the following references:
http://manpages.ubuntu.com/manpages/en/man1/ps.1.html
https://web.archive.org/web/20120520221529/http://emilics.com/blog/article/mconsumption.html
Also see:
A way to determine a process's "real" memory usage, i.e. private dirty RSS?
RSS is Resident Set Size (physically resident memory - this is currently occupying space in the machine's physical memory), and VSZ is Virtual Memory Size (address space allocated - this has addresses allocated in the process's memory map, but there isn't necessarily any actual memory behind it all right now).
Note that in these days of commonplace virtual machines, physical memory from the machine's view point may not really be actual physical memory.
Minimal runnable example
For this to make sense, you have to understand the basics of paging: How does x86 paging work? and in particular that the OS can allocate virtual memory via page tables / its internal memory book keeping (VSZ virtual memory) before it actually has a backing storage on RAM or disk (RSS resident memory).
Now to observe this in action, let's create a program that:
allocates more RAM than our physical memory with mmap
writes one byte on each page to ensure that each of those pages goes from virtual only memory (VSZ) to actually used memory (RSS)
checks the memory usage of the process with one of the methods mentioned at: Memory usage of current process in C
main.c
#define _GNU_SOURCE
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
typedef struct {
unsigned long size,resident,share,text,lib,data,dt;
} ProcStatm;
/* https://stackoverflow.com/questions/1558402/memory-usage-of-current-process-in-c/7212248#7212248 */
void ProcStat_init(ProcStatm *result) {
const char* statm_path = "/proc/self/statm";
FILE *f = fopen(statm_path, "r");
if(!f) {
perror(statm_path);
abort();
}
if(7 != fscanf(
f,
"%lu %lu %lu %lu %lu %lu %lu",
&(result->size),
&(result->resident),
&(result->share),
&(result->text),
&(result->lib),
&(result->data),
&(result->dt)
)) {
perror(statm_path);
abort();
}
fclose(f);
}
int main(int argc, char **argv) {
ProcStatm proc_statm;
char *base, *p;
char system_cmd[1024];
long page_size;
size_t i, nbytes, print_interval, bytes_since_last_print;
int snprintf_return;
/* Decide how many bytes to allocate. */
if (argc < 2) {
nbytes = 0x10000;
} else {
nbytes = strtoull(argv[1], NULL, 0);
}
if (argc < 3) {
print_interval = 0x1000;
} else {
print_interval = strtoull(argv[2], NULL, 0);
}
page_size = sysconf(_SC_PAGESIZE);
/* Allocate the memory. */
base = mmap(
NULL,
nbytes,
PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS,
-1,
0
);
if (base == MAP_FAILED) {
perror("mmap");
exit(EXIT_FAILURE);
}
/* Write to all the allocated pages. */
i = 0;
p = base;
bytes_since_last_print = 0;
/* Produce the ps command that lists only our VSZ and RSS. */
snprintf_return = snprintf(
system_cmd,
sizeof(system_cmd),
"ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == \"%ju\") print}'",
(uintmax_t)getpid()
);
assert(snprintf_return >= 0);
assert((size_t)snprintf_return < sizeof(system_cmd));
bytes_since_last_print = print_interval;
do {
/* Modify a byte in the page. */
*p = i;
p += page_size;
bytes_since_last_print += page_size;
/* Print process memory usage every print_interval bytes.
* We count memory using a few techniques from:
* https://stackoverflow.com/questions/1558402/memory-usage-of-current-process-in-c */
if (bytes_since_last_print > print_interval) {
bytes_since_last_print -= print_interval;
printf("extra_memory_committed %lu KiB\n", (i * page_size) / 1024);
ProcStat_init(&proc_statm);
/* Check /proc/self/statm */
printf(
"/proc/self/statm size resident %lu %lu KiB\n",
(proc_statm.size * page_size) / 1024,
(proc_statm.resident * page_size) / 1024
);
/* Check ps. */
puts(system_cmd);
system(system_cmd);
puts("");
}
i++;
} while (p < base + nbytes);
/* Cleanup. */
munmap(base, nbytes);
return EXIT_SUCCESS;
}
GitHub upstream.
Compile and run:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
echo 1 | sudo tee /proc/sys/vm/overcommit_memory
sudo dmesg -c
./main.out 0x1000000000 0x200000000
echo $?
sudo dmesg
where:
0x1000000000 == 64GiB: 2x my computer's physical RAM of 32GiB
0x200000000 == 8GiB: print the memory every 8GiB, so we should get 4 prints before the crash at around 32GiB
echo 1 | sudo tee /proc/sys/vm/overcommit_memory: required for Linux to allow us to make a mmap call larger than physical RAM: Maximum memory which malloc can allocate
Program output:
extra_memory_committed 0 KiB
/proc/self/statm size resident 67111332 768 KiB
ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == "29827") print}'
PID VSZ RSS
29827 67111332 1648
extra_memory_committed 8388608 KiB
/proc/self/statm size resident 67111332 8390244 KiB
ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == "29827") print}'
PID VSZ RSS
29827 67111332 8390256
extra_memory_committed 16777216 KiB
/proc/self/statm size resident 67111332 16778852 KiB
ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == "29827") print}'
PID VSZ RSS
29827 67111332 16778864
extra_memory_committed 25165824 KiB
/proc/self/statm size resident 67111332 25167460 KiB
ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == "29827") print}'
PID VSZ RSS
29827 67111332 25167472
Killed
Exit status:
137
which by the 128 + signal number rule means we got signal number 9, which man 7 signal says is SIGKILL, which is sent by the Linux out-of-memory killer.
Output interpretation:
VSZ (virtual memory) remains constant at 67111332 KiB (0x40009A4 KiB) ~= 64 GiB after the mmap (ps values are in KiB).
RSS "real memory usage" increases lazily only as we touch the pages. For example:
on the first print, we have extra_memory_committed 0, which means we haven't yet touched any pages. RSS is a small 1648 KiB which has been allocated for normal program startup like text area, globals, etc.
on the second print, we have written to 8388608 KiB == 8 GiB worth of pages. As a result, RSS increased by exactly 8 GiB to 8390256 KiB == 8388608 KiB + 1648 KiB
RSS continues to increase in 8 GiB increments. The last print shows about 24 GiB of memory, and before 32 GiB could be printed, the OOM killer killed the process.
See also: https://unix.stackexchange.com/questions/35129/need-explanation-on-resident-set-size-virtual-size
OOM killer logs
Our dmesg commands have shown the OOM killer logs.
An exact interpretation of those has been asked at:
Understanding the Linux oom-killer's logs but let's have a quick look here.
https://serverfault.com/questions/548736/how-to-read-oom-killer-syslog-messages
The very first line of the log was:
[ 7283.479087] mongod invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
So, interestingly, it was the MongoDB daemon that always runs in the background on my laptop that first triggered the OOM killer, presumably when the poor thing was trying to allocate some memory.
However, the OOM killer does not necessarily kill the one who awoke it.
After the invocation, the kernel prints a table of processes including their oom_score_adj:
[ 7283.479292] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 7283.479303] [ 496] 0 496 16126 6 172032 484 0 systemd-journal
[ 7283.479306] [ 505] 0 505 1309 0 45056 52 0 blkmapd
[ 7283.479309] [ 513] 0 513 19757 0 57344 55 0 lvmetad
[ 7283.479312] [ 516] 0 516 4681 1 61440 444 -1000 systemd-udevd
and further ahead we see that our own little main.out actually got killed on the previous invocation:
[ 7283.479871] Out of memory: Kill process 15665 (main.out) score 865 or sacrifice child
[ 7283.479879] Killed process 15665 (main.out) total-vm:67111332kB, anon-rss:92kB, file-rss:4kB, shmem-rss:30080832kB
[ 7283.479951] oom_reaper: reaped process 15665 (main.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:30080832kB
This log mentions the score of 865 that the process had, presumably the highest (worst) OOM-killer score, as discussed at: https://unix.stackexchange.com/questions/153585/how-does-the-oom-killer-decide-which-process-to-kill-first
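As a quick cross-check (a sketch, not part of the dmesg analysis above; it assumes a Linux /proc filesystem), the current badness score and its adjustment can be read for any PID from /proc/<pid>/oom_score and /proc/<pid>/oom_score_adj, here shown for the calling process:
#include <stdio.h>

/* Print a small /proc file (first line only). */
static void print_proc_file(const char *path) {
    FILE *f = fopen(path, "r");
    char buf[64];
    if (!f) {
        perror(path);
        return;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("%s: %s", path, buf);
    fclose(f);
}

int main(void) {
    /* A higher oom_score makes the process a more likely OOM-killer victim. */
    print_proc_file("/proc/self/oom_score");
    /* oom_score_adj ranges from -1000 to 1000; -1000 exempts the process. */
    print_proc_file("/proc/self/oom_score_adj");
    return 0;
}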
Also interestingly, everything apparently happened so fast that, before the freed memory was accounted for, the OOM killer was woken up again, this time by the DeadlineMonitor process:
[ 7283.481043] DeadlineMonitor invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
and this time it killed a Chromium process, which is usually my computer's normal memory hog:
[ 7283.481773] Out of memory: Kill process 11786 (chromium-browse) score 306 or sacrifice child
[ 7283.481833] Killed process 11786 (chromium-browse) total-vm:1813576kB, anon-rss:208804kB, file-rss:0kB, shmem-rss:8380kB
[ 7283.497847] oom_reaper: reaped process 11786 (chromium-browse), now anon-rss:0kB, file-rss:0kB, shmem-rss:8044kB
Tested in Ubuntu 19.04, Linux kernel 5.0.0.
Linux kernel docs
https://github.com/torvalds/linux/blob/v5.17/Documentation/filesystems/proc.rst has some points. The term "VSZ" is not used there but "RSS" is, and there's nothing too enlightening (surprise?!)
Instead of VSZ, the kernel seems to use the term VmSize, which appears e.g. on /proc/$PID/status.
Some quotes of interest:
The first of these lines shows the same information as is displayed for the mapping in /proc/PID/maps. Following lines show the size of the mapping (size); the size of each page allocated when backing a VMA (KernelPageSize), which is usually the same as the size in the page table entries; the page size used by the MMU when backing a VMA (in most cases, the same as KernelPageSize); the amount of the mapping that is currently resident in RAM (RSS); the process' proportional share of this mapping (PSS); and the number of clean and dirty shared and private pages in the mapping.
The "proportional set size" (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it. So if a process has 1000 pages all to itself, and 1000 shared with one other process, its PSS will be 1500.
Note that even a page which is part of a MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used by only one process, is accounted as private and not as shared.
So we can guess a few more things:
memory from a shared library mapped by only one process is accounted as private in smaps (it still counts towards that process's RSS); once more than one process maps the same pages, they are accounted as shared
PSS, mentioned by jmh, takes a more proportional approach between "I'm the only process that holds the shared library" and "there are N processes holding the shared library, so each one holds memory/N on average"; a small sketch of summing it follows below
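A rough sketch (mine, not from the kernel docs above) of reading PSS directly by summing the Pss: fields of /proc/self/smaps, assuming that file is readable:
#include <stdio.h>

/* Sum all "Pss:" lines of /proc/self/smaps to get this process's
 * proportional set size in KiB. */
int main(void) {
    FILE *f = fopen("/proc/self/smaps", "r");
    char line[256];
    unsigned long pss_kib = 0, value;
    if (!f) {
        perror("/proc/self/smaps");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        /* Each mapping contributes a line like "Pss:      12 kB". */
        if (sscanf(line, "Pss: %lu", &value) == 1)
            pss_kib += value;
    }
    fclose(f);
    printf("PSS: %lu KiB\n", pss_kib);
    return 0;
}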
VSZ - Virtual Set Size
The Virtual Set Size is the memory size assigned to a process (program) when it is first executed. It is simply a number for how much memory the process has available for its execution.
RSS - Resident Set Size (kinda RAM)
As opposed to VSZ (Virtual Set Size), RSS is the memory currently used by a process. It is an actual number, in kilobytes, of how much RAM the current process is using.
Source
I think much has already been said about RSS vs VSZ. From an administrator/programmer/user perspective, when I design/code applications I am more concerned about RSS (resident memory): as you keep allocating more and more variables on the heap, you will see this value shoot up. Try a simple program that allocates space with malloc in a loop and make sure you fill data into that malloc'd space; RSS keeps moving up.
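A rough sketch of such a test program (sizes, step count and names are arbitrary choices of mine): it mallocs a block per step, touches every byte so the pages are actually faulted in, and prints VmRSS after each step, so you can watch RSS climb:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the VmRSS line of /proc/self/status. */
static void print_vmrss(void) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            fputs(line, stdout);
    fclose(f);
}

int main(void) {
    enum { STEPS = 10, CHUNK = 16 * 1024 * 1024 }; /* 16 MiB per step */
    for (int i = 0; i < STEPS; i++) {
        char *p = malloc(CHUNK);
        if (!p)
            break;
        memset(p, 1, CHUNK); /* touch every page so it counts towards RSS */
        printf("after %2d x 16 MiB: ", i + 1);
        print_vmrss();
        /* Intentionally not freed: we want RSS to keep moving up. */
    }
    return 0;
}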
As far as VSZ is concerned, it is more about the virtual memory mapping that Linux does, one of its core features derived from conventional operating system concepts. VSZ management is done by the kernel's virtual memory management; for more on VSZ, see Robert Love's description of mm_struct and vm_area_struct, which are part of the basic task_struct data structure in the kernel.
To summarize jmh's excellent answer:
In Linux, a process's memory comprises:
its own binary
its shared libs
its stack and heap
Due to paging, not all of those are always fully in memory, only the useful, most recently used parts (pages) are. Other parts are paged out (or swapped out) to make room for other processes.
The table below, taken from jmh's answer, shows an example of what resident and virtual memory are for a specific process.
+-------------+-------------------------+------------------------+
| portion | actually in memory | total (allocated) size |
|-------------+-------------------------+------------------------|
| binary | 400K | 500K |
| shared libs | 1000K | 2500K |
| stack+heap | 100K | 200K |
|-------------+-------------------------+------------------------|
| | RSS (Resident Set Size) | VSZ (Virtual Set Size) |
|-------------+-------------------------+------------------------|
| | 1500K | 3200K |
+-------------+-------------------------+------------------------+
To summarize: resident memory is what is actually in physical memory right now, and the virtual size is the total size needed to hold all components, whether they are currently in RAM or not.
Of course, numbers don't add up, because libraries are shared between multiple processes and their memory is counted for each process separately, even if they are loaded only once.
They are not managed, but measured and possibly limited (see the getrlimit(2) system call).
RSS means resident set size (the part of your virtual address space sitting in RAM).
You can query the virtual address space of process 1234 using proc(5) with cat /proc/1234/maps and its status (including memory consumption) through cat /proc/1234/status.
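For the "possibly limited" part, a minimal sketch with getrlimit; the setrlimit counterpart would lower the soft limits. Note that RLIMIT_RSS is accepted by the API but largely ignored by modern Linux kernels:
#include <stdio.h>
#include <sys/resource.h>

/* Show the limits that can cap a process's address space (VSZ)
 * and resident set (RSS). */
static void show(const char *name, int resource) {
    struct rlimit rl;
    if (getrlimit(resource, &rl) != 0) {
        perror(name);
        return;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("%s: unlimited\n", name);
    else
        printf("%s: %llu bytes\n", name, (unsigned long long)rl.rlim_cur);
}

int main(void) {
    show("RLIMIT_AS (virtual address space)", RLIMIT_AS);
    show("RLIMIT_RSS (resident set size)", RLIMIT_RSS);
    return 0;
}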
Related
I am working on a per-process memory monitoring (Bash) script, but it turns out to be more of a headache than I thought, especially with forked processes such as PostgreSQL. There are a couple of reasons:
RSS is a potential value to be used as memory usage; however, it also contains shared libraries etc. which are used in other processes
PSS is another potential value which (should) show only the private memory of a process. The problem here is that PSS can only be retrieved from /proc/<pid>/smaps, which requires elevated capability privileges (or root)
USS (calculated as Private_Dirty + Private_Clean, source: How does smem calculate RSS, USS and PSS?) could also be a potential candidate, but here again we need access to /proc/<pid>/smaps
For now I am trying to solve the forked process problem by looping through each PID's smaps (as suggested in https://www.depesz.com/2012/06/09/how-much-ram-is-postgresql-using/), for example:
for pid in $(pgrep -a -f "postgres" | awk '{print $1}' | tr "\n" " " ); do grep "^Pss:" /proc/$pid/smaps; done
Maybe some of the postgres processes should be excluded, I am not sure.
Using this method to calculate and sum the PSS and USS values results in:
PSS: 4817 MB - USS: 4547 MB - RES: 6176 MB - VIRT: 26851 MB used
Obviously this only works with elevated privileges, which I would prefer to avoid. Whether these values actually represent the truth is unknown, because other tools/commands show yet again different values.
Unfortunately top and htop are unable to combine the postgres processes. atop is able to do this and seems (subjectively) to be the most accurate, with the following values:
NPROCS SYSCPU USRCPU VSIZE RSIZE PSIZE SWAPSZ RDDSK WRDSK RNET SNET MEM CMD 1/1
27 56m50s 16m40s 5.4G 1.1G 0K 2308K 0K 0K 0 0 11% postgres
Now to the question: What is the suggested and best way to retrieve the most accurate memory usage of an application with forked processes, such as PostgreSQL?
And in case atop already does an accurate calculation, how does atop get to the RSIZE value? Note that this value is shown for both root and non-root users, which would probably mean that /proc/<pid>/smaps is not used for the calculation.
Please comment if more information is needed.
EDIT: I actually found a bug in the pgrep pattern of my final script; it falsely matched a lot more than just the postgres processes.
The new output now shows the same RES value as seen in atop RSIZE:
Script output:
PSS: 205 MB - USS: 60 MB - RES: 1162 MB - VIRT: 5506 MB
atop summarized postgres output:
NPROCS SYSCPU USRCPU VSIZE RSIZE PSIZE SWAPSZ RDDSK WRDSK RNET SNET MEM CMD
27 0.04s 0.10s 5.4G 1.1G 0K 2308K 0K 32K 0 0 11% postgres
But the question remains, of course, unless summing the RSS (RES) values is already the most accurate way. Let me know your thoughts, thanks :)
Symptoms:
I allocate a TLS key with a destructor, create a bundle of threads and pass the TLS key to each thread. Each thread allocates memory and sets its pointer in TLS; the TLS destructor deallocates the memory. I wait for the threads to finish before the app exits.
The app is run under valgrind/massif, which reports this memory as not deallocated.
#include <stdlib.h>    /* malloc/free used by the thread and destructor below */
#include <pthread.h>

int main(int argc, char **argv)
{
pthread_key_t* key = new pthread_key_t();
pthread_key_create(key, my_destructor);
pthread_t threads[32000];
for(int i=0; i<32000; ++i)
pthread_create(&threads[i], NULL, my_thread, key);
for(int i=0; i<32000; ++i)
pthread_join(threads[i], NULL);
return 0;
}
In the thread runner I allocate the memory and set it up in the TLS:
extern "C" void* my_thread(void* p)
{
pthread_setspecific(*(pthread_key_t*)p, malloc(100));
return NULL;
}
In the TLS destructor, I release the memory:
extern "C" void my_destructor(void *p)
{
free(p);
}
I run this under valgrind/massif 3.19 with the following options:
--tool=massif
--heap=yes
--pages-as-heap=yes
--log-file=/tmp/my.log
--massif-out-file=/tmp/my.massif.log
Then I run ms_print /tmp/my.massif.log and get leaks reported like the following:
| ->01.75% (67,108,864B) 0x76F92D0: new_heap (in /usr/lib64/libc-2.17.so)
| | ->01.75% (67,108,864B) 0x76F98D3: arena_get2.isra.3 (in /usr/lib64/libc-2.17.so)
| | ->01.75% (67,108,864B) 0x76FF77D: malloc (in /usr/lib64/libc-2.17.so)
| | ->01.75% (67,108,864B) 0x410300: my_thread (threadsT.cpp:136)
| | ...
| | <skipped by author>
| | ...
| |
| ->00.00% (73,728B) in 1+ places, all below ms_print's threshold (01.00%)
...while I would not expect anything to be reported as leaked at all.
I added instrumentation to my_destructor and manually verified that:
it is invoked, indeed
it deallocates the memory, as it is supposed to do
Is there something apparent I am doing wrong here that makes valgrind/massif report these?
Is it a valgrind/massif limitation that it cannot detect the memory deallocation when invoked from TLS destructors?
I am building and running this with gcc 4.9.4 on Red Hat Enterprise Linux Server release 7.9 (Maipo).
A second answer, this time concentrating on the 'leak' aspect.
Massif isn't really a leak detector. It's for profiling heap use.
If I compile the example (with 320 threads) then at the end I get about 89 million bytes still allocated. That is made up of
75% the arena used by malloc called from start_thread
9% pthread_create
15% loading shared libraries
None of that looks like much of a concern to me. I assume that the start_thread memory is the pthread stack cache.
If I use massif for profiling malloc/new, then the last sample is
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
73 2,929,610 2,360 2,308 52 0
You should check the return status for your thread creation. It's unlikely that you are succeeding in creating 32000 threads.
A bit of Valgrind source:
coregrind/pub_core_options.h:#define MAX_THREADS_DEFAULT 500
coregrind/m_scheduler/scheduler.c: VG_(printf)("Use --max-threads=INT to specify a larger number of threads\n"
Assuming that this is amd64 Linux, I believe that the default pthread stack size is 8Mbytes. That means you need 256Gbytes for stack memory. Does your machine have that much?
Please try the following
Use pthread_attr_setstacksize to set the stack size to PTHREAD_STACK_MIN (16k); see the sketch after this list.
Run Valgrind with --max-threads=32001
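A sketch of the first suggestion (not taken from the question's code; compile with -pthread):
#include <limits.h>   /* PTHREAD_STACK_MIN */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    pthread_attr_init(&attr);
    /* PTHREAD_STACK_MIN is 16 KiB on typical glibc/amd64 setups,
     * instead of the 8 MiB default stack. */
    rc = pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN);
    if (rc != 0)
        fprintf(stderr, "pthread_attr_setstacksize: %d\n", rc);

    rc = pthread_create(&tid, &attr, worker, NULL);
    if (rc != 0) { /* always check this: it is what fails at 32000 threads */
        fprintf(stderr, "pthread_create: %d\n", rc);
        return 1;
    }
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}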
Even with the above you may still hit some Valgrind limits such as VG_N_SEGMENTS.
If you see a message like
"Valgrind: FATAL: VG_N_SEGMENTS is too low.
Increase it and rebuild.
Exiting now."
Then you will need to rebuild Valgrind with an increased limit.
I am running an application on an ARMv7-A machine with Fedora 18 and 2 GB of RAM.
The application terminates:
130413 15:49:34 19344 Xrd: PhyConnection: Can't run reader thread: out of system resources. Critical error.
If I strace it, I see that the stack allocation for a new thread fails:
mmap2(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = -1 ENOMEM (Cannot allocate memory)
gettimeofday({1365921367, 588018}, NULL) = 0
gettid() = 6309
writev(2, [{"130414 02:36:07 6309 ", 21}, {"Xrd", 3}, {"", 0}, {": ", 2}, {"PhyConnection: Can't run reader "..., 80}, {"\n", 1}], 6130414 02:36:07 6309 Xrd: PhyConnection: Can't ru
n reader thread: out of system resources. Critical error.
) = 107
munmap(0x48172000, 292) = 0
munmap(0x48225000, 292) = 0
Actual code:
if (fReaderthreadhandler[i]->Run(this)) {
   Error("PhyConnection",
         "Can't run reader thread: out of system resources. Critical error.");
   // HELP: what do we do here
   exit(-1);
}
The application had 300-350 MB of virtual memory and ~250 MB of resident memory. The high-memory limit is 1.3 GB. The virtual address space is not limited:
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 1024
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) 64
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 15870
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
But it does work from GDB! I also looked at what limits are reported from within GDB, and they are the same; thus GDB does not adjust the soft limits, which would be inherited.
Summary:
I have enough memory to run the application. It even works fine inside GDB.
It doesn't seem that it hits any of the resource limits.
Works in GDB, but not outside.
Any hints as to what could be wrong here?
Works in GDB, but not outside.
One thing that is different "inside GDB" is address layout (randomization).
In order to make debugging easier, GDB disables ASLR by default. You can turn it back on with
(gdb) set disable-randomization off
and then run the app several times, and check whether it still reliably works.
I have enough memory to run the application.
The allocation (mapping) that is failing requests 8 MB of contiguous address space, which you may not have if your address space is fragmented. If you don't actually need 8 MB of stack (most applications don't), you could fit many more threads in by setting ulimit -s (or using setrlimit(RLIMIT_STACK, ...) from within the application) to a significantly smaller value.
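A small sketch (an illustration under my own assumptions, not from the original application) to confirm that the 8388608-byte mmap2 in the strace is simply the soft RLIMIT_STACK, which glibc's NPTL captures at program start as the default thread stack size:
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit(RLIMIT_STACK)");
        return 1;
    }
    /* With "ulimit -s 8192" this prints 8388608, matching the failing mmap2. */
    printf("default thread stack (soft RLIMIT_STACK): %llu bytes\n",
           (unsigned long long)rl.rlim_cur);
    return 0;
}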
The manual page told me a lot, and through it I learned much of the background of glibc's memory management.
But I am still confused: does malloc_trim(0) (note the zero argument) mean (1) all the memory in the heap section will be returned to the OS, or (2) just the unused memory at the top of the heap will be returned to the OS?
If the answer is (1), what about memory in the heap that is still in use? If the heap has used memory somewhere in the middle, will it be discarded, or will the call simply fail?
And if the answer is (2), what about the "holes" below the top of the heap? They are unused memory, but the topmost region of the heap is still in use; will the call still do anything useful?
Thanks.
The man page for malloc_trim was committed here: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and, as I understand it, was written from scratch in 2012 by the man-pages project maintainer, Michael Kerrisk: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
Grepping glibc's git, there are no man pages in glibc itself and no commit to a malloc_trim man page documenting this behavior. The best and only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
These are the malloc_trim comments in malloc/malloc.c:
Additional functions:
malloc_trim(size_t pad);
/*
  malloc_trim(size_t pad);

  If possible, gives memory back to the system (via negative
  arguments to sbrk) if there is unused memory at the `high' end of
  the malloc pool. You can call this after freeing large blocks of
  memory to potentially reduce the system-level memory requirements
  of a program. However, it cannot guarantee to reduce memory. Under
  some allocation patterns, some large free blocks of memory will be
  locked between two used chunks, so they cannot be given back to
  the system.

  The `pad' argument to malloc_trim represents the amount of free
  trailing space to leave untrimmed. If this argument is zero,
  only the minimum amount of memory to maintain internal data
  structures will be left (one page or less). Non-zero arguments
  can be supplied to maintain enough trailing space to service
  future expected allocations without having to re-obtain memory
  from the system.

  Malloc_trim returns 1 if it actually released any memory, else 0.
  On systems that do not support "negative sbrks", it will always
  return 0.
*/
int __malloc_trim(size_t);
Freeing from the middle of a chunk is not documented in the comments in malloc/malloc.c and is not documented in the man-pages project. The man page from 2012 may be the first man page of the function, and it was not written by the glibc authors. The glibc info page only mentions the M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and doesn't list the malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (and it also doesn't document memusage/memusagestat/libmemusage.so).
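For completeness, a hedged sketch of the one tunable the info page does document: M_TRIM_THRESHOLD, set via mallopt, controls when free() automatically trims the top of the heap, which is separate from an explicit malloc_trim call:
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Trim the heap top once 128 KiB of free space accumulates there
     * (the documented default); larger values delay automatic trimming. */
    if (mallopt(M_TRIM_THRESHOLD, 128 * 1024) == 0)
        fprintf(stderr, "mallopt(M_TRIM_THRESHOLD) failed\n");

    /* Allocate and free a few blocks on the main heap; once the free space
     * at the top exceeds the threshold, free() may hand it back via sbrk. */
    char *blocks[8];
    for (int i = 0; i < 8; i++)
        blocks[i] = malloc(64 * 1024); /* 64 KiB: stays on the sbrk heap */
    for (int i = 0; i < 8; i++)
        free(blocks[i]);
    return 0;
}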
In December 2007 there was commit https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc by Ulrich Drepper (it is part of glibc 2.9 and newer), which changed the mtrim implementation (but it didn't change any documentation or man page, as there are no man pages in glibc):
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
Unused parts of chunks (anywhere, including chunks in the middle), aligned to the page size and larger than a page, may be marked MADV_DONTNEED: https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=malloc/malloc.c;h=c54c203cbf1f024e72493546221305b4fd5729b7;hp=1e716089a2b976d120c304ad75dd95c63737ad75;hb=68631c8eb92ff38d9da1ae34f6aa048539b199cc;hpb=52386be756e113f20502f181d780aecc38cbb66a
INTERNAL_SIZE_T size = chunksize (p);
if (size > psm1 + sizeof (struct malloc_chunk))
{
/* See whether the chunk contains at least one unused page. */
char *paligned_mem = (char *) (((uintptr_t) p
+ sizeof (struct malloc_chunk)
+ psm1) & ~psm1);
assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
assert ((char *) p + size > paligned_mem);
/* This is the size we could potentially free. */
size -= paligned_mem - (char *) p;
if (size > psm1)
madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
}
This is one of only two uses of madvise with MADV_DONTNEED in glibc today: one for the top part of heaps (shrink_heap) and one for marking any chunk (mtrim): http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
H A D arena.c 643 __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
H A D malloc.c 4535 __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
We can test malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
int main()
{
int *m1,*m2,*m3,*m4;
printf("%s\n","Test started");
m1=(int*)malloc(20000);
m2=(int*)malloc(40000);
m3=(int*)malloc(80000);
m4=(int*)malloc(10000);
// check that all arrays are allocated on the heap and not with mmap
printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
// free 40000 bytes in the middle
free(m2);
// call trim (same result with 2000 or 2000000 argument)
malloc_trim(0);
// call some syscall to find this point in the strace output
sleep(1);
free(m1);
free(m3);
free(m4);
// malloc_stats(); malloc_info(0, stdout);
return 0;
}
Compile with gcc test_malloc_trim.c -o test_malloc_trim and run strace ./test_malloc_trim:
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
...
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So, there was a madvise with MADV_DONTNEED for 9 pages after the malloc_trim(0) call, when there was a hole of 40008 bytes in the middle of the heap.
The man page for malloc_trim says it releases free memory, so if there is allocated memory in the heap, it won't release the whole heap. The parameter is there if you know you're still going to need a certain amount of memory, so freeing more than that would cause glibc to have to do unnecessary work later.
As for holes, this is a standard problem with memory management and returning memory to the OS. The primary low-level heap management available to the program is brk and sbrk, and all they can do is extend or shrink the heap area by changing the top. So there's no way for them to return holes to the operating system; once the program has called sbrk to allocate more heap, that space can only be returned if the top of that space is free and can be handed back.
Note that there are other, more complex ways to allocate memory (with anonymous mmap, for example), which may have different constraints than sbrk-based allocation.
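A tiny sketch of that constraint (my own test, in the same spirit as the strace experiment above; the exact break movements depend on the glibc version and its thresholds): watch the program break with sbrk(0) while freeing a block in the middle of the heap versus the block at the top:
#define _DEFAULT_SOURCE /* for sbrk() */
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void show_brk(const char *when) {
    printf("%-20s brk = %p\n", when, sbrk(0));
}

int main(void) {
    show_brk("start");
    /* 100000 bytes each: small enough to stay on the sbrk heap. */
    char *a = malloc(100000); /* lower in the heap: will become a hole */
    char *b = malloc(100000); /* at the top of the heap */
    show_brk("after two mallocs");

    free(a); /* a hole in the middle: the break cannot move down */
    malloc_trim(0);
    show_brk("after freeing a");

    free(b); /* the top is now free: the break can shrink */
    malloc_trim(0);
    show_brk("after freeing b");
    return 0;
}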
I tried to get the shared memory size of a process on Linux. Here's the result of using 2 different commands:
top, checking the SHR field:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1734 root 20 0 201m 4072 1012 S 0.0 0.1 22:00.65 php-fpm
pmap -d <pid>:
mapped: 206672K writeable/private: 4352K shared: 128K
You can see that the shared memory size in pmap is much smaller than top.
I read some source code to find the reason. It seems that top reads the value from /proc/<pid>/statm, where the values are calculated by:
unsigned long task_statm(struct mm_struct *mm,
unsigned long *shared, unsigned long *text,
unsigned long *data, unsigned long *resident)
{
*shared = get_mm_counter(mm, MM_FILEPAGES);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
return mm->total_vm;
}
It seems that all the file pages are counted as shared memory?
And the pmap command reads the info from /proc/<pid>/maps and then calculates the shared memory through some flags:
3dc822a000-3dc822d000 rw-p 0002a000 08:13 5134288 /usr/lib64/libmcrypt.so.4.4.8
start-end flags file_offset dev_major:dev_minor inode
If flags[3] == 's', then this map is counted as a shared one.
So my question is: which one is more accurate? And why do they have different methods to calculate the shared memory size?
Thanks in advance!
The SHR column in top does not report the same thing that pmap's shared entry does. top reports the amount of memory that is shared with other processes because it is in a dynamic library that is loaded once into memory; all processes using that library include the same pages in their image, since those pages are read-only. pmap seems to be showing "shared memory" segments, which are data pages that may be read-write or read-only and are shared between processes via shmget() and related functions.
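For illustration, a hedged sketch of the kind of mapping pmap reports as shared: a System V shared-memory segment created with shmget()/shmat(). While it sleeps, inspect the process from another terminal with pmap -d <pid> and top:
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

int main(void) {
    /* 1 MiB private System V segment, readable/writable by the owner. */
    int id = shmget(IPC_PRIVATE, 1024 * 1024, IPC_CREAT | 0600);
    if (id == -1) {
        perror("shmget");
        return 1;
    }
    char *p = shmat(id, NULL, 0);
    if (p == (char *)-1) {
        perror("shmat");
        return 1;
    }
    memset(p, 0, 1024 * 1024); /* fault the pages in so they show up as resident */
    printf("attached segment %d at %p; pid %ld\n", id, (void *)p, (long)getpid());
    sleep(60); /* time to inspect with pmap/top from another terminal */

    shmdt(p);
    shmctl(id, IPC_RMID, NULL); /* mark the segment for removal */
    return 0;
}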