psutil vs dd: monitoring disk I/O - python-3.x

I'm writing y.a.t. (yet-another-tool :)) for monitoring disk usage on Linux.
I'm using Python 3.3.2 and psutil 3.3.0.
The process I'm monitoring does something really basic: I run the dd tool and vary the block size (128, 512, 1024, 4096):
#!/bin/bash
dd if=./bigfile.txt of=./copy.img bs=4096
bigfile.txt:
$ stat bigfile.txt
File: ‘bigfile.txt’
Size: 87851423 Blocks: 171600 IO Block: 4096 regular file
And the snippet of the monitor is as follows:
import time
from collections import OrderedDict

def poll(interval, proc):
    d_before = proc.io_counters()
    time.sleep(interval)
    tst = time.time()
    d_after = proc.io_counters()
    usage = OrderedDict.fromkeys(d_after.__dict__.keys())
    for k, v in usage.items():
        usage[k] = d_after.__dict__[k] - d_before.__dict__[k]
    return tst, usage
At each run, I clear the cache (as suggested many times on stackoverflow):
rm copy.img && sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
My question is: why aren't the numbers matching?
bs=128:
dd:
686339+1 records in
686339+1 records out
87851423 bytes (88 MB) copied, 1.21664 s, 72.2 MB/s
monitor.py:
1450778750.104943 OrderedDict([('read_count', 686352), ('write_count', 686343), ('read_bytes', 87920640), ('write_bytes', 87855104)])
bs=4096
dd:
21448+1 records in
21448+1 records out
87851423 bytes (88 MB) copied, 0.223911 s, 392 MB/s
monitor.py:
1450779294.5541275 OrderedDict([('read_count', 21468), ('write_count', 21452), ('read_bytes', 88252416), ('write_bytes', 87855104)])
The difference is still there with all the values of bs.
Is it a matter of certain reads/writes not being counted? Does psutil perform some extra work? For example, with bs=4096, why does psutil report 400993 more bytes for reads and 3681 more for writes?
Am I missing something big?
Thanks a lot.
EDIT: as an update, the granularity of the timing in the measurement, i.e. the time.sleep(interval) call, doesn't matter. I tried different values and summed up the total number of reads and writes reported by psutil; the difference remains.
EDIT2: typo in snippet code

write_bytes
The read_bytes and write_bytes correspond to the same fields from /proc/<PID>/io. Quoting the documentation (emphasis mine):
read_bytes
----------
I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
be fetched from the storage layer. Done at the submit_bio() level, so it is
accurate for block-backed filesystems.
write_bytes
-----------
I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to
the storage layer. This is done at page-dirtying time.
As you know, most (all?) filesystems are block-based. This implies that if you have a program that, say, writes just 5 bytes to a file, and your block size is 4 KiB, then 4 KiB will be written.
If you don't trust dd, let's try with a simple Python script:
with open('something', 'wb') as f:
    f.write(b'12345')

input('press Enter to exit')
This script should write only 5 bytes, but if we inspect /proc/<PID>/io, we can see that 4 KiB were written:
$ cat /proc/3455/io
rchar: 215317
wchar: 24
syscr: 66
syscw: 2
read_bytes: 0
write_bytes: 4096
cancelled_write_bytes: 0
This is the same thing that is happening with dd in your case.
You have asked dd to write 87851423 bytes. How many 4 KiB blocks are 87851423 bytes?
87851423 - (87851423 mod 4096) + 4096 = 87855104
Not by chance, 87855104 is exactly the number reported by psutil.
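To double-check the arithmetic, here is a minimal Python sketch (assuming the 4 KiB block size reported by stat above) that rounds the file size up to the next block multiple:

size = 87851423                                   # bytes dd was asked to copy
block = 4096                                      # filesystem block size from stat
rounded = (size + block - 1) // block * block     # ceil to a block multiple
print(rounded)                                    # 87855104, what psutil reports for write_bytes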
read_bytes
How about read_bytes? In theory we should have read_bytes equal to write_bytes, but actually read_bytes shows 16 more blocks in the first run, and 97 more blocks in the second run.
Well, first of all, let's see what files dd is actually reading:
$ strace -e trace=open,read -- dd if=/dev/zero of=zero bs=1M count=2
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\v\2\0\0\0\0\0"..., 832) = 832
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/dev/zero", O_RDONLY) = 3
open("zero", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 0
read(0, "# Locale name alias data base.\n#"..., 4096) = 2570
read(0, "", 4096) = 0
open("/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = 0
+++ exited with 0 +++
As you can see, dd is opening and reading the linker, the GNU C library, and locale files. It is reading more bytes than you can see above, because it's also using mmap, not just read.
The point is: dd reads many more files than just the source file, so it's expected that read_bytes is higher than write_bytes. But why is it inconsistent across runs?
Those files read by dd are also used by many other programs. Even if you drop_caches just before executing dd, there is a chance that some other process reloads one of these files into memory. You can try with this very simple C program:
int main()
{
    while (1) {
    }
}
Compiled with the default GCC options, this program does nothing except opening the linker and the GNU C library. If you drop_caches, run the program and cat /proc/<PID>/io more than once, you'll see that read_bytes varies across runs (unless you perform the steps very quickly, in which case the probability that some other program has loaded one of those files into the cache is low).
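If you want to automate that experiment, here is a hedged Python sketch (the ./loop binary name is an assumption for the compiled busy-loop program above; the drop_caches command is the one from the question). It drops the caches, starts the program, and prints its read_bytes so you can watch the value change between runs:

import subprocess
import time

for run in range(3):
    # drop the page cache, as in the question
    subprocess.run(['sudo', 'sh', '-c', 'echo 3 > /proc/sys/vm/drop_caches'], check=True)
    proc = subprocess.Popen(['./loop'])          # the compiled busy-loop program above
    time.sleep(0.5)                              # give the loader time to map libc
    with open('/proc/%d/io' % proc.pid) as f:
        for line in f:
            if line.startswith('read_bytes'):
                print('run %d: %s' % (run, line.strip()))
    proc.kill()
    proc.wait()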

Related

What is the correct way to get a consistent snapshot of /proc/pid/smaps?

I am trying to parse the PSS value from /proc/<pid>/smaps of a process in my C++ binary.
According to this SO answer, naively reading the /proc/<pid>/smaps file for example with ifstream::getLine() will result in an inconsistent dataset. The solution suggested is to use the read() system call to read the whole data in one go, something like:
#include <unistd.h>
#include <fcntl.h>
...
char rawData[102400];
int file = open("/proc/12345/smaps", O_RDONLY, 0);
auto bytesRead = read(file, rawData, 102400); // this returns 3722 instead of expected ~64k
close(file);
std::cout << bytesRead << std::endl;
// do some parsing here after null-terminating the buffer
...
My problem now is that despite me using a 100kB buffer, only 3722 bytes are returned. Looking at what cat does when parsing the file using strace, I see that it is using multiple calls to read() (also getting around 3k bytes on every read) until read() returns 0 - as described in the documentation of read():
...
read(3, "7fa8db3d7000-7fa8db3d8000 r--p 0"..., 131072) = 3588
write(1, "7fa8db3d7000-7fa8db3d8000 r--p 0"..., 3588) = 3588
read(3, "7fa8db3df000-7fa8db3e0000 r--p 0"..., 131072) = 3632
write(1, "7fa8db3df000-7fa8db3e0000 r--p 0"..., 3632) = 3632
read(3, "7fa8db3e8000-7fa8db3ed000 r--s 0"..., 131072) = 3603
write(1, "7fa8db3e8000-7fa8db3ed000 r--s 0"..., 3603) = 3603
read(3, "7fa8db41d000-7fa8db425000 r--p 0"..., 131072) = 3445
write(1, "7fa8db41d000-7fa8db425000 r--p 0"..., 3445) = 3445
read(3, "7fff05467000-7fff05496000 rw-p 0"..., 131072) = 2725
write(1, "7fff05467000-7fff05496000 rw-p 0"..., 2725) = 2725
read(3, "", 131072) = 0
munmap(0x7f8d29ad4000, 139264) = 0
close(3) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
But isn't this supposed to produce inconsistent data according to the SO answer linked above?
I have also found some information about proc here, that seem to support the previous SO answer:
To see a precise
snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
Then later in the text it says:
Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent
output can be achieved only in the single read call).
This typically manifests when doing partial reads of these files while the
memory map is being modified.
Despite the races, we do provide the following
guarantees:
1) The mapped addresses never go backwards, which implies no two regions will ever overlap.
2) If there is something at a given vaddr during the entirety of the
life of the smaps/maps walk, there will be some output for it.
So it seems to me, I can only trust the data I'm getting if I get it in a single read() call.
Which only returns a small chunk of data despite the buffer being big enough.
Which in turn means there is actually no way to get a consistent snapshot of /proc/<pid>/smaps and the data returned by cat/using multiple read() calls may be garbage depending on the sun to moon light ratio?
Or does 2) actually mean I'm too hung up on the previous SO answer listed above?
You are being limited by the internal kernel buffer size in fs/seq_file.c, which is used to generate many /proc files.
The buffer starts at one page in size, is grown exponentially until it can fit at least one record, and is then crammed with as many entire records as will fit; it is not grown any further once it can hold the first record. When the internal buffer cannot fit any more entries, the read ends there.
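You can see this limit from userspace with a small Python sketch (hedged; the exact sizes depend on your kernel): a single read() returns only what fits in the seq_file buffer, no matter how large a buffer you pass, so you need the whole loop to get the complete file.

import os

fd = os.open('/proc/self/smaps', os.O_RDONLY)
first = os.read(fd, 1024 * 1024)         # ask for 1 MiB in one call
total = len(first)
while True:
    chunk = os.read(fd, 1024 * 1024)
    if not chunk:
        break
    total += len(chunk)
os.close(fd)
print('first read():', len(first), 'bytes; whole file:', total, 'bytes')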

Python3 open buffering argument looks strange

From the doc
buffering is an optional integer used to set the buffering policy.
Pass 0 to switch buffering off (only allowed in binary mode), 1 to
select line buffering (only usable in text mode), and an integer > 1
to indicate the size in bytes of a fixed-size chunk buffer. When no
buffering argument is given, the default buffering policy works as
follows:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying
device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On
many systems, the buffer will typically be 4096 or 8192 bytes long.
“Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above
for binary files.
I open a file named test.log in text mode and set buffering to 16. So I think the chunk size is 16, and when I write a 32-byte string to the file it will call write (the syscall) twice. But actually, it calls it only once (tested with Python 3.7.2, GCC 8.2.1 20181127, on Linux).
import os

try:
    os.unlink('test.log')
except Exception:
    pass

with open('test.log', 'a', buffering=16) as f:
    for _ in range(10):
        f.write('a' * 32)
Using strace -e write python3 test.py to trace the syscalls, I get the following:
write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 320) = 320
What does buffering mean, then?
This answer is valid for CPython 3.7; other Python implementations can differ.
The open() function in text mode returns an _io.TextIOWrapper(). The _io.TextIOWrapper() has an internal buffer called pending_bytes with a size of 8192 bytes (hard-coded), and it also holds a handle on an _io.BufferedWriter() (text mode w) or an _io.BufferedRandom() (text mode a). The size of the _io.BufferedWriter()/_io.BufferedRandom() buffer is specified by the buffering argument of the open() function.
When you call _io.TextIOWrapper().write("some text"), it adds the text to the internal pending_bytes buffer. After some writes the pending_bytes buffer fills up, and its contents are written into the buffer inside _io.BufferedWriter(). When the buffer inside _io.BufferedWriter() also fills up, it is written to the target file.
When you open a file in binary mode, you get the _io.BufferedWriter()/_io.BufferedRandom() object directly, initialized with the buffer size from the buffering parameter.
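You can see this two-layer structure from Python itself; a minimal sketch (the .buffer attribute and the _io class names are CPython implementation details, and test.log is just the file name from the question):

# text mode: TextIOWrapper on top of a buffered object sized by `buffering`
f_text = open('test.log', 'w', buffering=16)
print(type(f_text))           # <class '_io.TextIOWrapper'>
print(type(f_text.buffer))    # the underlying buffered writer
f_text.close()

# binary mode: the buffered writer is returned directly
f_bin = open('test.log', 'wb', buffering=16)
print(type(f_bin))            # <class '_io.BufferedWriter'>
f_bin.close()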
Let's look at some examples. I will start with a simpler one using binary mode.
# Case 1
with open('test.log', 'wb', buffering=16) as f:
    for _ in range(5):
        f.write(b'a' * 15)
strace output:
write(3, "aaaaaaaaaaaaaaa", 15) = 15
write(3, "aaaaaaaaaaaaaaa", 15) = 15
write(3, "aaaaaaaaaaaaaaa", 15) = 15
write(3, "aaaaaaaaaaaaaaa", 15) = 15
write(3, "aaaaaaaaaaaaaaa", 15) = 15
In the first iteration it fills the buffer with 15 bytes. In the second iteration it discovers that adding another 15 bytes would overflow the buffer, so it first flushes it (calls the system write) and then stores the new 15 bytes. The same happens in the following iterations. After the last iteration there are 15 B left in the buffer, which are written when the file is closed (on leaving the with context).
In the second case, I try to write more data than the buffer's size:
# Case 2
with open('test.log', 'wb', buffering=16) as f:
    for _ in range(5):
        f.write(b'a' * 17)
strace output:
write(3, "aaaaaaaaaaaaaaaaa", 17) = 17
write(3, "aaaaaaaaaaaaaaaaa", 17) = 17
write(3, "aaaaaaaaaaaaaaaaa", 17) = 17
write(3, "aaaaaaaaaaaaaaaaa", 17) = 17
write(3, "aaaaaaaaaaaaaaaaa", 17) = 17
What happens here is that in the first iteration it tries to put 17 B into the buffer, but they don't fit, so they are written directly into the file and the buffer stays empty. The same applies to every iteration.
Now let's look at the text mode.
# Case 3
with open('test.log', 'w', buffering=16) as f:
    for _ in range(5):
        f.write('a' * 8192)
strace output:
write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 16384) = 16384
write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 16384) = 16384
write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 8192) = 8192
First, recall that pending_bytes has a size of 8192 B. In the first iteration it writes 8192 bytes (from the code: 'a'*8192) into the pending_bytes buffer. In the second iteration it adds another 8192 bytes to pending_bytes, discovers that this is more than 8192 (the size of the pending_bytes buffer), and writes it into the underlying _io.BufferedWriter(). The buffer in _io.BufferedWriter() has a size of 16 B (the buffering parameter), so it writes into the file immediately (same as case 2). Now pending_bytes is empty, and in the third iteration it is filled again with 8192 B. In the fourth iteration it adds another 8192 B, pending_bytes overflows, and the data is again written directly into the file, as in the second iteration. In the last iteration it adds 8192 B into the pending_bytes buffer, which is flushed when the file is closed.
The last example uses buffering larger than 8192 B. For a better explanation I added 2 more iterations.
# Case 4
with open('test.log', 'w', buffering=30000) as f:
    for _ in range(7):
        f.write('a' * 8192)
strace output:
write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 16384) = 16384
write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 16384) = 16384
write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 24576) = 24576
Iterations:
1. Add 8192 B into pending_bytes.
2. Add 8192 B into pending_bytes, but that is more than its maximal size, so it is written into the underlying _io.BufferedWriter() and stays there (pending_bytes is empty now).
3. Add 8192 B into pending_bytes.
4. Add 8192 B into pending_bytes, but that is more than its maximal size, so it tries to write it into the underlying _io.BufferedWriter(). But that would exceed the maximal capacity of the underlying buffer, because 16384 + 16384 > 30000 (the first 16384 B are still there from iteration 2), so it first writes the old 16384 B into the file and then puts the new 16384 B (from pending_bytes) into the buffer. (Now the pending_bytes buffer is empty again.)
5. Same as 3.
6. Same as 4.
7. Now pending_bytes is empty and _io.BufferedWriter() contains 16384 B. In this iteration it fills pending_bytes with 8192 B. And that's it.
When the program leaves the with block, it closes the file. The closing process is as follows:
1. Write the 8192 B from pending_bytes into _io.BufferedWriter() (possible because 8192 + 16384 < 30000).
2. Write (8192 + 16384 =) 24576 B into the file.
3. Close the file descriptor.
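As a quick sanity check of the byte accounting, the write() sizes seen in strace add up exactly to what the loops produced:

assert 16384 + 16384 + 8192 == 5 * 8192     # case 3: 40960 bytes
assert 16384 + 16384 + 24576 == 7 * 8192    # case 4: 57344 bytes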
Btw, I currently have no idea why that pending_bytes buffer is there when the underlying buffer from _io.BufferedWriter() could be used for buffering. My best guess is that it's there because it improves performance for files in text mode.

determining the optimal buffer size for file read in linux

I am writing a C program which reads from stdin and writes to stdout. But it buffers the data so that a write is performed only after it reads a specific number of bytes(=SIZE)
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* for read() and write() */

#define SIZE 100

int main()
{
    char buf[SIZE];
    int n = 0;

    //printf("Block size = %d\n", BUFSIZ);
    while ((n = read(0, buf, sizeof(buf))) > 0)
        write(1, buf, n);
    exit(0);
}
I am running this program on Ubuntu 18.04 hosted in Oracle VirtualBox (4 GB RAM, 2 cores) and testing it with different buffer sizes. I have redirected standard input to come from a file (containing random data created dynamically) and standard output to go to /dev/null. Here is the shell script used to run the test:
#!/bin/bash
# $1 - step size (bytes)
# $2 - start size (bytes)
# $3 - stop size (bytes)
echo "Changing buffer size from $2 to $3 in steps of $1, and measuring time for copying."
buff_size=$2
echo "Test Data" >testData
echo "Step Size:(doubles from previous size) Start Size:$2 Stop Size:$3" >>testData
while [ $buff_size -le $3 ]
do
echo "" >>testData
echo -n "$buff_size," >>testData
gcc -DSIZE=$buff_size copy.c # Compile the program for cat, with new buffer size
dd bs=1000 count=1000000 </dev/urandom >testFile #Create testFile with random data of 1GB
(/usr/bin/time -f "\t%U, \t%S," ./a.out <testFile 1>/dev/null) 2>>testData
buff_size=$(($buff_size * 2))
rm -f a.out
rm -f testFile
done
I am measuring the time taken to execute the program and tabulate it. A test run produces the following data:
Test Data
Step Size:(doubles from previous size) Start Size:1 Stop Size:524288
1, 5.94, 17.81,
2, 5.53, 18.37,
4, 5.35, 18.37,
8, 5.58, 18.78,
16, 5.45, 18.96,
32, 5.96, 19.81,
64, 5.60, 18.64,
128, 5.62, 17.94,
256, 5.37, 18.33,
512, 5.70, 18.45,
1024, 5.43, 17.45,
2048, 5.22, 17.95,
4096, 5.57, 18.14,
8192, 5.88, 17.39,
16384, 5.39, 18.64,
32768, 5.27, 17.78,
65536, 5.22, 17.77,
131072, 5.52, 17.70,
262144, 5.60, 17.40,
524288, 5.96, 17.99,
I don't see any significant variation in user+system time as the block size changes. But theoretically, as the block size becomes smaller, many more system calls are made for the same file size, and it should take more time to execute. I have seen test results for a similar experiment in the book 'Advanced Programming in the UNIX Environment' by W. Richard Stevens, which show that user+system time drops significantly when the buffer size used in the copy is close to the block size. (In my case, the block size is 4096 bytes on an ext4 partition.)
Why am I not able to reproduce these results? Am I missing some factors in these tests?
You did not disable the line #define SIZE 100 in your source code, so the definition via the command-line option (e.g. -DSIZE=1000) only has effect above this #define; the rest of the code still uses 100. On my compiler I get a warning for this (<command-line>:0:0: note: this is the location of the previous definition) at compile time.
If you comment out the #define (or guard it with #ifndef SIZE ... #endif so the -D option can override it), you should be able to fix this error.
Another aspect which comes to mind:
If you create a file on a machine and read it right away after that, it will be in the OS's disk cache (which is large enough to store all of this file), so the actual disk block size won't have much of an influence here.
Stevens's book was written in 1992, when RAM was far more expensive than it is today, so some of the information in there may be outdated. I also doubt that newer editions of the book have taken things like this out, because in general they are still true.
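If you want cold-cache numbers without rebooting or dropping the whole page cache, one option is to evict just the test file before each timed run. A hedged Python sketch (assuming the testFile created by the script above, on Linux):

import os

fd = os.open('testFile', os.O_RDONLY)
os.fsync(fd)                                         # make sure no dirty pages remain
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)   # ask the kernel to drop its cached pages
os.close(fd)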

How to use dd to read and write different block sizes

I am trying to transfer data from a file to another one using dd.
Since I invoke dd multiple times in a loop, I have to skip a certain number of blocks from the input file and then start copying.
However, in the last iteration the output block size may be different from the previous ones, so I need to set the ibs and the obs operands with different values and to skip the first N ibs-sized blocks from the input file.
I know that in dd 8.2 this can be done by setting the iflag=skip_bytes operand and specifying the exact number of bytes to skip instead of the number of blocks, but that flag is not available in dd 8.4 that I have to use.
So, I tried to do this by setting the iflag=count_bytes option and count equal to the number of bytes to be copied.
The resulting commands for a normal iteration of the loop and for the last one are the following:
dd if=./ifile of=./ofile iflag=count_bytes ibs=512 obs=512 skip=0 count=512
...
dd if=./ifile of=./ofile iflag=count_bytes ibs=512 obs=256 skip=2 count=256
but dd hangs while copying the data. If I force to terminate each iteration I get the following output from dd:
// ibs:512 obs:512 skip:0 count:512
^C1+0 records in
0+0 records out
0 bytes (0 B) copied, 9.25573 s, 0.0 kB/s
...
// ibs:512 obs:256 skip:2 count:256
^C0+1 records in
0+0 records out
0 bytes (0 B) copied, 5.20154 s, 0.0 kB/s
Am I missing something?
EDIT: dd is invoked from the following C code:
while (bytes_sent < tot_bytes) {
    unsigned int packet_size = (tot_bytes - bytes_sent < MAX_BLOCK_SIZE) ?
                               (tot_bytes - bytes_sent) : MAX_BLOCK_SIZE;

    sprintf(cmd, "dd if=./ifile of=./ofile iflag=count_bytes ibs=%u obs=%u skip=%u count=%u",
            MAX_BLOCK_SIZE, packet_size, sent, packet_size);
    system(cmd);

    bytes_sent += packet_size;
    sent++;
}
Thanks in advance for the help!

Can malloc_trim() release memory from the middle of the heap?

I am confused about the behaviour of malloc_trim as implemented in the glibc.
man malloc_trim
[...]
malloc_trim - release free memory from the top of the heap
[...]
This function cannot release free memory located at places other than the top of the heap.
When I now look up the source of malloc_trim() (in malloc/malloc.c) I see that it calls mtrim() which is utilizing madvise(x, MADV_DONTNEED) to release memory back to the operating system.
So I wonder if the man-page is wrong or if I misinterpret the source in malloc/malloc.c.
Can malloc_trim() release memory from the middle of the heap?
There are two usages of madvise with MADV_DONTNEED in glibc now: http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
arena.c:643   __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
malloc.c:4535 __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
There was a commit by Ulrich Drepper on 16 Dec 2007 (part of glibc 2.9 and newer): https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
The mTRIm (now mtrim) implementation was changed. Unused parts of free chunks, aligned to the page size and larger than a page, may now be marked as MADV_DONTNEED:
/* See whether the chunk contains at least one unused page.  */
char *paligned_mem = (char *) (((uintptr_t) p
                                + sizeof (struct malloc_chunk)
                                + psm1) & ~psm1);

assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
assert ((char *) p + size > paligned_mem);

/* This is the size we could potentially free.  */
size -= paligned_mem - (char *) p;

if (size > psm1)
    madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
Man page of malloc_trim is there: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and it was committed by kerrisk in 2012: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
As far as I can tell from grepping glibc's git, there are no man pages in glibc itself, and no commit to the malloc_trim man page documenting this patch. The best and only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
Additional functions:
malloc_trim(size_t pad);
/*
  malloc_trim(size_t pad);

  If possible, gives memory back to the system (via negative
  arguments to sbrk) if there is unused memory at the `high' end of
  the malloc pool. You can call this after freeing large blocks of
  memory to potentially reduce the system-level memory requirements
  of a program. However, it cannot guarantee to reduce memory. Under
  some allocation patterns, some large free blocks of memory will be
  locked between two used chunks, so they cannot be given back to
  the system.

  The `pad' argument to malloc_trim represents the amount of free
  trailing space to leave untrimmed. If this argument is zero,
  only the minimum amount of memory to maintain internal data
  structures will be left (one page or less). Non-zero arguments
  can be supplied to maintain enough trailing space to service
  future expected allocations without having to re-obtain memory
  from the system.

  Malloc_trim returns 1 if it actually released any memory, else 0.
  On systems that do not support "negative sbrks", it will always
  return 0.
*/
int __malloc_trim(size_t);
This freeing from the middle of the heap is not documented as text in malloc/malloc.c (the malloc_trim description in the comment was not updated in 2007) and is not documented in the man-pages project. The man page from 2012 may be the first man page of the function, written not by the authors of glibc. The info page of glibc only mentions the M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and doesn't list the malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (it also doesn't document memusage/memusagestat/libmemusage.so).
You may ask Drepper and other glibc developers again as you already did in https://sourceware.org/ml/libc-help/2015-02/msg00022.html "malloc_trim() behaviour", but there is still no reply from them. (Only wrong answers from other users like https://sourceware.org/ml/libc-help/2015-05/msg00007.html https://sourceware.org/ml/libc-help/2015-05/msg00008.html)
Or you may test malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>

int main()
{
    int *m1, *m2, *m3, *m4;

    printf("%s\n", "Test started");
    m1 = (int*) malloc(20000);
    m2 = (int*) malloc(40000);
    m3 = (int*) malloc(80000);
    m4 = (int*) malloc(10000);
    printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);

    free(m2);
    malloc_trim(0); // 20000, 2000000
    sleep(1);

    free(m1);
    free(m3);
    free(m4);

    // malloc_stats(); malloc_info(0, stdout);
    return 0;
}
gcc test_malloc_trim.c -o test_malloc_trim && strace ./test_malloc_trim
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So, there is a madvise with MADV_DONTNEED for 9 pages after the malloc_trim(0) call, when there was a hole of 40008 bytes in the middle of the heap.
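For reference, the page arithmetic behind that madvise call can be reproduced from the addresses printed above. A hedged illustration (assuming glibc's usual 16-byte chunk header before each user pointer, and ignoring the sizeof (struct malloc_chunk) term from the quoted source, which does not change the result here):

PAGE = 4096
chunk_start = 0xccee40 - 16      # m2's chunk (user pointer minus assumed header)
chunk_end   = 0xcd8a90 - 16      # start of m3's chunk
aligned_start = (chunk_start + PAGE - 1) & ~(PAGE - 1)   # first fully unused page
trimmed = (chunk_end - aligned_start) & ~(PAGE - 1)      # keep whole pages only
print(hex(aligned_start), trimmed, trimmed // PAGE)      # 0xccf000 36864 9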
... utilizing madvise(x, MADV_DONTNEED) to release memory back to the
operating system.
madvise(x, MADV_DONTNEED) does not release memory. man madvise:
MADV_DONTNEED
Do not expect access in the near future. (For the time being,
the application is finished with the given range, so the kernel
can free resources associated with it.) Subsequent accesses of
pages in this range will succeed, but will result either in
reloading of the memory contents from the underlying mapped file
(see mmap(2)) or zero-fill-on-demand pages for mappings without
an underlying file.
So, the usage of madvise(x, MADV_DONTNEED) does not contradict man malloc_trim's statement:
This function cannot release free memory located at places other than the top of the heap.
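To see what MADV_DONTNEED actually does to resident memory, here is a hedged sketch using an anonymous mapping (Linux, CPython 3.8+ for mmap.madvise): the RSS drops, but the mapping stays valid and later reads come back zero-filled, which is the sense in which the memory is "not released".

import mmap
import re

def rss_kb():
    with open('/proc/self/status') as f:
        return int(re.search(r'VmRSS:\s+(\d+) kB', f.read()).group(1))

m = mmap.mmap(-1, 50 * 1024 * 1024)              # 50 MiB anonymous mapping
for off in range(0, len(m), mmap.PAGESIZE):
    m[off] = 0x78                                # touch every page so it becomes resident
print('RSS after touching:', rss_kb(), 'kB')

m.madvise(mmap.MADV_DONTNEED)
print('RSS after MADV_DONTNEED:', rss_kb(), 'kB')   # roughly 50 MiB lower
print('first byte now reads as:', m[0])             # 0 -> zero-fill-on-demand
m.close()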
