Does RCHAR include READ_BYTES (/proc/<pid>/io)?

I read /proc/<pid>/io to measure the IO activity of SQL queries, where <pid> is the PID of the database server. I read the values before and after each query and compute the difference to get the number of bytes the query caused to be read and/or written.
As far as I know, the field READ_BYTES counts actual disk IO, while RCHAR includes more, such as reads that could be satisfied by the Linux page cache (see "Understanding the counters in /proc/[pid]/io" for clarification).
This leads to the assumption that RCHAR should come up with a value equal to or greater than READ_BYTES, but my results contradict this assumption.
I could imagine some minor block or page overhead for the results I get for Infobright ICE (values are in MB):
Query        | RCHAR    | READ_BYTES
tpch_q01.sql | 34.44180 | 34.89453
tpch_q02.sql |  2.89191 |  3.64453
tpch_q03.sql | 32.58994 | 33.19531
tpch_q04.sql | 17.78325 | 18.27344
But I completely fail to understand the IO counters for MonetDB (values are in MB):
Query        | RCHAR   | READ_BYTES
tpch_q01.sql | 0.07501 | 220.58203
tpch_q02.sql | 1.37840 |  18.16016
tpch_q03.sql | 0.08272 | 162.38281
tpch_q04.sql | 0.06604 |  83.25391
Am I wrong in my assumption that RCHAR includes READ_BYTES? Is there a way to bypass the kernel's counters that MonetDB could be using? What is going on here?
I might add that I clear the page cache and restart the database server before each query.
I'm on Ubuntu 11.10, running kernel 3.0.0-15-generic.
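For reference, a minimal sketch of the measurement described above (the PID lookup and the query runner here are placeholders, not my actual tooling):
pid=$(pidof mserver5 | head -n1)                         # PID of the database server process
before=$(awk '$1 == "read_bytes:" {print $2}' /proc/$pid/io)
run_query tpch_q01.sql                                   # hypothetical query runner
after=$(awk '$1 == "read_bytes:" {print $2}' /proc/$pid/io)
echo "read_bytes delta: $(( after - before )) bytes"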

I can only think of two things:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/proc.txt;hb=HEAD#l1305
1:
1446 read_bytes
1447 ----------
1448
1449 I/O counter: bytes read
1450 Attempt to count the number of bytes which this process really did cause to
1451 be fetched from the storage layer.
I read "Caused to be fetched from the storage layer" to include readahead, whatever.
2:
1411 rchar
1412 -----
1413
1414 I/O counter: chars read
1415 The number of bytes which this task has caused to be read from storage. This
1416 is simply the sum of bytes which this process passed to read() and pread().
1417 It includes things like tty IO and it is unaffected by whether or not actual
1418 physical disk IO was required (the read might have been satisfied from
1419 pagecache)
Note that this says nothing about disk access via memory-mapped files. I think this is the more likely reason: MonetDB probably mmaps its database files and then does everything on them directly.
I'm not really sure how you could measure the bandwidth used via mmap, because of its nature.
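To illustrate the point, here is a minimal sketch (mine, not from the kernel documentation): it reads a non-empty file purely through mmap and prints /proc/self/io before and after. You should see read_bytes grow by roughly the file size while rchar barely moves, because page-fault IO never goes through read()/pread().
/* mmap_io_demo.c -- sketch: compare rchar vs read_bytes for mmap-based reads.
   Build: gcc mmap_io_demo.c -o mmap_io_demo
   Run:   ./mmap_io_demo <some non-empty file not already in the page cache> */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void dump_io(const char *label)
{
    char buf[4096];
    int fd = open("/proc/self/io", O_RDONLY);
    /* note: this read() itself adds a little to rchar */
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("--- %s ---\n%s", label, buf);
    }
    close(fd);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0) {
        perror(argv[1]);
        return 1;
    }

    dump_io("before mmap read");

    /* Map the file and touch every page: the data is faulted in from storage
       (accounted in read_bytes) without any read() syscalls (rchar). */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];

    dump_io("after mmap read");
    printf("checksum: %lu\n", sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}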

You can also read the Linux kernel source file include/linux/task_io_accounting.h:
struct task_io_accounting {
#ifdef CONFIG_TASK_XACCT
        /* bytes read */
        u64 rchar;
        /* bytes written */
        u64 wchar;
        /* # of read syscalls */
        u64 syscr;
        /* # of write syscalls */
        u64 syscw;
#endif /* CONFIG_TASK_XACCT */
#ifdef CONFIG_TASK_IO_ACCOUNTING
        /*
         * The number of bytes which this task has caused to be read from
         * storage.
         */
        u64 read_bytes;
        /*
         * The number of bytes which this task has caused, or shall cause to be
         * written to disk.
         */
        u64 write_bytes;
        /*
         * A task can cause "negative" IO too. If this task truncates some
         * dirty pagecache, some IO which another task has been accounted for
         * (in its write_bytes) will not be happening. We _could_ just
         * subtract that from the truncating task's write_bytes, but there is
         * information loss in doing that.
         */
        u64 cancelled_write_bytes;
#endif /* CONFIG_TASK_IO_ACCOUNTING */
};

Related

SHC "Argument list too long" not able to resolve by increasing stack size to unlimited [duplicate]

I have to pass 256 KB of text as an argument to the "aws sqs" command but am running into a limit on the command line at around 140 KB. It has been discussed in many places that this was supposedly solved in the Linux kernel as of 2.6.23.
But I cannot get it to work. I am using kernel 3.14.48-33.39.amzn1.x86_64.
Here's a simple example to test:
#!/bin/bash
SIZE=1000
while [ $SIZE -lt 300000 ]
do
    echo "$SIZE"
    VAR="`head -c $SIZE < /dev/zero | tr '\0' 'a'`"
    ./foo "$VAR"
    let SIZE="( $SIZE * 20 ) / 19"
done
And the foo script is just:
#!/bin/bash
echo -n "$1" | wc -c
And the output for me is:
117037
123196
123196
129680
129680
136505
./testCL: line 11: ./foo: Argument list too long
143689
./testCL: line 11: ./foo: Argument list too long
151251
./testCL: line 11: ./foo: Argument list too long
159211
So, the question is: how do I modify the testCL script so it can pass 256 KB of data? BTW, I have tried adding ulimit -s 65536 to the script and it didn't help.
And if this is plainly impossible I can deal with that, but can you shed light on this quote from my link above:
"While Linux is not Plan 9, in 2.6.23 Linux is adding variable
argument length. Theoretically you shouldn't hit frequently "argument
list too long" errors again, but this patch also limits the maximum
argument length to 25% of the maximum stack limit (ulimit -s)."
edit:
I was finally able to pass <= 256 KB as a single command line argument (see edit (4) at the bottom). However, please read carefully how I did it and decide for yourself whether this is a way you want to go. At least you should be able to understand from what I found out why you are otherwise 'stuck'.
With the coupling of ARG_MAX to ulimit -s / 4 came the introduction of MAX_ARG_STRLEN as the maximum length of a single argument:
/*
* linux/fs/exec.c
*
* Copyright (C) 1991, 1992 Linus Torvalds
*/
...
#ifdef CONFIG_MMU
/*
* The nascent bprm->mm is not visible until exec_mmap() but it can
* use a lot of memory, account these pages in current->mm temporary
* for oom_badness()->get_mm_rss(). Once exec succeeds or fails, we
* change the counter back via acct_arg_size(0).
*/
...
static bool valid_arg_len(struct linux_binprm *bprm, long len)
{
        return len <= MAX_ARG_STRLEN;
}
...
#else
...
static bool valid_arg_len(struct linux_binprm *bprm, long len)
{
        return len <= bprm->p;
}
#endif /* CONFIG_MMU */
...
static int copy_strings(int argc, struct user_arg_ptr argv,
                        struct linux_binprm *bprm)
{
        ...
        str = get_user_arg_ptr(argv, argc);
        ...
        len = strnlen_user(str, MAX_ARG_STRLEN);
        if (!len)
                goto out;
        ret = -E2BIG;
        if (!valid_arg_len(bprm, len))
                goto out;
        ...
}
...
MAX_ARG_STRLEN is defined as 32 times the page size in linux/include/uapi/linux/binfmts.h:
...
/*
* These are the maximum length and maximum number of strings passed to the
* execve() system call. MAX_ARG_STRLEN is essentially random but serves to
* prevent the kernel from being unduly impacted by misaddressed pointers.
* MAX_ARG_STRINGS is chosen to fit in a signed 32-bit integer.
*/
#define MAX_ARG_STRLEN (PAGE_SIZE * 32)
#define MAX_ARG_STRINGS 0x7FFFFFFF
...
The default page size is 4 KB so you cannot pass arguments longer than 128 KB.
I can't try it right now, but maybe switching to huge-page mode (page size 4 MB), if possible on your system, solves this problem.
For more detailed information and references see this answer to a similar question on Unix & Linux SE.
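A quick way to check the relevant limits on a given system (a short sketch; the exact numbers of course depend on your kernel and configuration):
getconf PAGE_SIZE                        # typically 4096
echo $(( $(getconf PAGE_SIZE) * 32 ))    # per-argument limit MAX_ARG_STRLEN, typically 131072
getconf ARG_MAX                          # total space available for argv + environ
ulimit -s                                # stack limit in KiB; ARG_MAX is coupled to a quarter of it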
edits:
(1)
According to this answer one can change the page size of x86_64 Linux to 1 MB by enabling CONFIG_TRANSPARENT_HUGEPAGE and setting CONFIG_TRANSPARENT_HUGEPAGE_MADVISE to n in the kernel config.
(2)
After recompiling my kernel with the above configuration changes getconf PAGESIZE still returns 4096.
According to this answer CONFIG_HUGETLB_PAGE is also needed which I could pull in via CONFIG_HUGETLBFS. I am recompiling now and will test again.
(3)
I recompiled my kernel with CONFIG_HUGETLBFS enabled and now /proc/meminfo contains the corresponding HugePages_* entries mentioned in the corresponding section of the kernel documentation.
However, the page size according to getconf PAGESIZE is still unchanged. So while I should be able now to request huge pages via mmap calls, the kernel's default page size determining MAX_ARG_STRLEN is still fixed at 4 KB.
(4)
I modified linux/include/uapi/linux/binfmts.h to #define MAX_ARG_STRLEN (PAGE_SIZE * 64), recompiled my kernel and now your code produces:
...
117037
123196
123196
129680
129680
136505
143689
151251
159211
...
227982
227982
239981
239981
252611
252611
265906
./testCL: line 11: ./foo: Argument list too long
279901
./testCL: line 11: ./foo: Argument list too long
294632
./testCL: line 11: ./foo: Argument list too long
So now the limit moved from 128 KB to 256 KB as expected.
I don't know about potential side effects though.
As far as I can tell, my system seems to run just fine.
Just put the arguments into some file, and modify your program to accept "arguments" from a file. A common convention (notably used by GCC and several other GNU programs) is that an argument like @/tmp/arglist.txt asks your program to read arguments from the file /tmp/arglist.txt, often one line per argument.
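A minimal sketch of that convention (my own illustration; the file format and output are arbitrary, not a standard API):
/* argfile_demo.c -- expand @<path> arguments GCC-style, one argument per line */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        if (argv[i][0] == '@') {
            FILE *f = fopen(argv[i] + 1, "r");
            if (!f) { perror(argv[i] + 1); return 1; }
            char line[1 << 16];
            while (fgets(line, sizeof(line), f)) {
                line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */
                printf("expanded arg: %s\n", line); /* a real program would collect these */
            }
            fclose(f);
        } else {
            printf("literal arg: %s\n", argv[i]);
        }
    }
    return 0;
}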
You might perhaps pass some data through long environment variables, but they are also limited (and what the kernel actually limits is the size of main's initial stack, which contains both the program arguments and the environment).
Alternatively, modify your program to be configurable through some configuration file which would contain the information you want to pass through arguments.
(If you can recompile your kernel, you might try to increase ARG_MAX, which is #define-d in linux-4.*/include/uapi/linux/limits.h, to a bigger power of two that is still much smaller than your available RAM, e.g. to 2097152, before recompiling your kernel.)
Otherwise, there is no way to circumvent that limitation (see the execve(2) man page and its "Limits on size of arguments and environment" section) once you have raised your stack limit (using setrlimit(2) with RLIMIT_STACK, generally via the ulimit builtin in the parent shell). You need to deal with it some other way.

Determine the number of "logical" bytes read/written in a Linux system

I would like to determine the number of bytes logically read/written by all processes via syscalls such as read() and write(). This is different from the number of bytes actually fetched from the storage layer (displayed by tools like iotop), since it includes (for example) reads that hit the pagecache; it also differs in when writes are recognized: the logical write IO happens immediately when the write call is issued, while the actual physical IO may occur some time later depending on various factors (Linux usually buffers writes and does the physical IO some time later).
I know how to do it on a per-process basis (see this question for example), but not how to get the system-wide count.
If you want to use the /proc filesystem for total counts (and not for per-second rates), it is quite easy.
This also works on quite old kernels (tested on a Debian Squeeze 2.6.32 kernel).
# cat /proc/1979/io
rchar: 111195372883082
wchar: 10424431162257
syscr: 130902776102
syscw: 6236420365
read_bytes: 2839822376960
write_bytes: 803408183296
cancelled_write_bytes: 374812672
For a system-wide figure, just sum the numbers from all processes. This will only be good enough in the short term, however, because as processes die their statistics are removed from memory; you would need process accounting enabled to preserve them.
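A minimal sketch of such a sum (run as root to see every process; anything that has already exited is, as noted, not counted):
#!/bin/bash
total_rchar=0 total_read_bytes=0
for f in /proc/[0-9]*/io; do
    # a process may vanish or be unreadable between the glob and the read; just skip it
    while read -r key val; do
        case $key in
            rchar:)      (( total_rchar += val )) ;;
            read_bytes:) (( total_read_bytes += val )) ;;
        esac
    done 2>/dev/null < "$f"
done
echo "rchar:      $total_rchar"
echo "read_bytes: $total_read_bytes"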
The meaning of these counters is documented in the kernel source file Documentation/filesystems/proc.txt:
rchar - I/O counter: chars read
The number of bytes which this task has caused
to be read from storage. This is simply the sum of bytes which this
process passed to read() and pread(). It includes things like tty IO
and it is unaffected by whether or not actual physical disk IO was
required (the read might have been satisfied from pagecache)
wchar - I/O counter: chars written
The number of bytes which this task has
caused, or shall cause to be written to disk. Similar caveats apply
here as with rchar.
syscr - I/O counter: read syscalls
Attempt to count the number of read I/O
operations, i.e. syscalls like read() and pread().
syscw - I/O counter: write syscalls
Attempt to count the number of write I/O
operations, i.e. syscalls like write() and pwrite().
read_bytes - I/O counter: bytes read
Attempt to count the number of bytes which
this process really did cause to be fetched from the storage layer.
Done at the submit_bio() level, so it is accurate for block-backed
filesystems.
write_bytes - I/O counter: bytes written
Attempt to count the number of bytes which
this process caused to be sent to the storage layer. This is done at
page-dirtying time.
cancelled_write_bytes
The big inaccuracy here is truncate. If a process writes 1MB to a file
and then deletes the file, it will in fact perform no writeout. But it
will have been accounted as having caused 1MB of write. In other
words: The number of bytes which this process caused to not happen, by
truncating pagecache. A task can cause "negative" IO too.
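The truncate caveat is easy to reproduce. The following sketch (my own, not from the documentation) writes 1 MiB to a temporary file, dirtying the pagecache (accounted in write_bytes), then unlinks the file before writeback has happened, so the same amount shows up again in cancelled_write_bytes. The file name is arbitrary and the exact numbers depend on writeback timing:
/* cwb_demo.c -- sketch: demonstrate cancelled_write_bytes via write + unlink */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void dump_io(void)
{
    char buf[4096];
    int fd = open("/proc/self/io", O_RDONLY);
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) { buf[n] = '\0'; fputs(buf, stdout); }
    close(fd);
}

int main(void)
{
    char block[4096];
    memset(block, 'x', sizeof(block));

    int fd = open("/tmp/cwb_demo", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    for (int i = 0; i < 256; i++)          /* 256 * 4 KiB = 1 MiB, dirtied in the pagecache */
        write(fd, block, sizeof(block));
    close(fd);

    unlink("/tmp/cwb_demo");               /* drop the dirty pages before they are written out */
    dump_io();                             /* expect write_bytes and cancelled_write_bytes around 1 MiB */
    return 0;
}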
Here is a SystemTap script that tracks the logical IO. It is based on the script at https://sourceware.org/systemtap/SystemTap_Beginners_Guide/traceiosect.html
#! /usr/bin/env stap
# traceio.stp
# Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo@redhat.com>
# Copyright (C) 2009 Kai Meyer <kai@unixlords.com>
#   Fixed a bug that allows this to run longer
#   And added the humanreadable function
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
#
global reads, writes

probe vfs.read.return {
    if ($return > 0) {
        reads += $return
    }
}

probe vfs.write.return {
    if ($return > 0) {
        writes += $return
    }
}

function humanreadable(bytes) {
    if (bytes > 1024*1024*1024) {
        return sprintf("%d GiB", bytes/1024/1024/1024)
    } else if (bytes > 1024*1024) {
        return sprintf("%d MiB", bytes/1024/1024)
    } else if (bytes > 1024) {
        return sprintf("%d KiB", bytes/1024)
    } else {
        return sprintf("%d B", bytes)
    }
}

probe timer.s(1) {
    printf("reads: %12s writes: %12s\n", humanreadable(reads), humanreadable(writes))
    # Note we don't zero out reads and writes,
    # so the values are cumulative since the script started.
}
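To try it (this assumes the systemtap package and the debug symbols for the running kernel are installed, which the vfs.* probes typically need):
sudo stap traceio.stp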

what does "malloc_trim(0)" really mean?

The manual page told me quite a lot, and through it I learned much of the background of glibc's memory management.
But I am still confused. Does malloc_trim(0) (note the zero as the parameter) mean that (1) all the memory in the heap section will be returned to the OS, or (2) only the unused memory at the topmost region of the heap will be returned to the OS?
If the answer is (1), what about the memory in the heap that is still in use? If the heap has used memory somewhere in the middle, will it be eliminated, or will the function simply not succeed?
And if the answer is (2), what about the "holes" at places other than the top of the heap? They are unused memory too, but if the topmost region of the heap is still in use, will this call do anything useful?
Thanks.
The man page of malloc_trim was committed here: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and, as I understand it, it was written from scratch in 2012 by the man-pages project maintainer, Michael Kerrisk: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
As far as I can grep the glibc git repository, there are no man pages in glibc and no commit to the malloc_trim man page documenting this patch. The best, and only, documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
These are the malloc_trim comments from malloc/malloc.c:
Additional functions:
malloc_trim(size_t pad);
609 /*
610 malloc_trim(size_t pad);
611
612 If possible, gives memory back to the system (via negative
613 arguments to sbrk) if there is unused memory at the `high' end of
614 the malloc pool. You can call this after freeing large blocks of
615 memory to potentially reduce the system-level memory requirements
616 of a program. However, it cannot guarantee to reduce memory. Under
617 some allocation patterns, some large free blocks of memory will be
618 locked between two used chunks, so they cannot be given back to
619 the system.
620
621 The `pad' argument to malloc_trim represents the amount of free
622 trailing space to leave untrimmed. If this argument is zero,
623 only the minimum amount of memory to maintain internal data
624 structures will be left (one page or less). Non-zero arguments
625 can be supplied to maintain enough trailing space to service
626 future expected allocations without having to re-obtain memory
627 from the system.
628
629 Malloc_trim returns 1 if it actually released any memory, else 0.
630 On systems that do not support "negative sbrks", it will always
631 return 0.
632 */
633 int __malloc_trim(size_t);
634
Freeing from the middle of a chunk is not documented as text in malloc/malloc.c and is not documented in the man-pages project. The man page from 2012 may be the first man page of the function, and it was not written by the authors of glibc. The glibc info page only mentions the M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and doesn't list the malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (and it also doesn't document memusage/memusagestat/libmemusage.so).
In December 2007 there was commit https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc by Ulrich Drepper (it is part of glibc 2.9 and newer) which changed the mtrim implementation (but it didn't change any documentation or man page, as there are no man pages in glibc):
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
Unused parts of chunks (anywhere, including chunks in the middle), aligned on the page size and larger than a page, may now be marked as MADV_DONTNEED: https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=malloc/malloc.c;h=c54c203cbf1f024e72493546221305b4fd5729b7;hp=1e716089a2b976d120c304ad75dd95c63737ad75;hb=68631c8eb92ff38d9da1ae34f6aa048539b199cc;hpb=52386be756e113f20502f181d780aecc38cbb66a
INTERNAL_SIZE_T size = chunksize (p);
if (size > psm1 + sizeof (struct malloc_chunk))
  {
    /* See whether the chunk contains at least one unused page.  */
    char *paligned_mem = (char *) (((uintptr_t) p
                                    + sizeof (struct malloc_chunk)
                                    + psm1) & ~psm1);
    assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
    assert ((char *) p + size > paligned_mem);
    /* This is the size we could potentially free.  */
    size -= paligned_mem - (char *) p;
    if (size > psm1)
      madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
  }
This is one of only two usages of madvise with MADV_DONTNEED in glibc now: one for the top part of heaps (shrink_heap) and the other for marking any chunk (mtrim): http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
arena.c:643    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
malloc.c:4535  __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
We can test the malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
int main()
{
    int *m1, *m2, *m3, *m4;
    printf("%s\n", "Test started");
    m1 = (int*) malloc(20000);
    m2 = (int*) malloc(40000);
    m3 = (int*) malloc(80000);
    m4 = (int*) malloc(10000);
    // check that all arrays are allocated on the heap and not with mmap
    printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
    // free 40000 bytes in the middle
    free(m2);
    // call trim (same result with 2000 or 2000000 argument)
    malloc_trim(0);
    // call some syscall to find this point in the strace output
    sleep(1);
    free(m1);
    free(m3);
    free(m4);
    // malloc_stats(); malloc_info(0, stdout);
    return 0;
}
Compile and trace it: gcc test_malloc_trim.c -o test_malloc_trim; strace ./test_malloc_trim
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
...
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So, there was a madvise with MADV_DONTNEED for 9 pages after the malloc_trim(0) call, when there was a hole of 40008 bytes in the middle of the heap.
The man page for malloc_trim says it releases free memory, so if there is allocated memory in the heap, it won't release the whole heap. The parameter is there if you know you're still going to need a certain amount of memory, so freeing more than that would cause glibc to have to do unnecessary work later.
As for holes, this is a standard problem with memory management and returning memory to the OS. The primary low-level heap management available to the program is brk and sbrk, and all they can do is extend or shrink the heap area by changing the top. So there's no way for them to return holes to the operating system; once the program has called sbrk to allocate more heap, that space can only be returned if the top of that space is free and can be handed back.
Note that there are other, more complex ways to allocate memory (with anonymous mmap, for example), which may have different constraints than sbrk-based allocation.
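As a footnote on the pad argument, here is a minimal sketch (my own; the sizes are arbitrary, not a recommendation) of trimming while keeping some trailing slack for upcoming allocations:
/* trim_pad_demo.c -- sketch: malloc_trim with a non-zero pad */
#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    enum { N = 1024 };
    void *p[N];

    /* Allocate and free many small blocks; they stay in glibc's free lists
       rather than being returned to the kernel automatically. */
    for (int i = 0; i < N; i++)
        p[i] = malloc(4096);
    for (int i = 0; i < N; i++)
        free(p[i]);

    /* Trim, but keep 64 KiB of trailing free space for future allocations.
       Returns 1 if memory was actually released to the system, else 0. */
    return malloc_trim(64 * 1024) ? 0 : 1;
}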

What do the counters in /proc/[pid]/io mean?

I'm creating a plugin for Munin to monitor stats of named processes. One of the sources of information would be /proc/[pid]/io. But I have a hard time finding out what the difference is between rchar/wchar and read_bytes/written_bytes.
They are not the same, as they provide different values. What do they represent?
While the proc manpage is woefully behind (and so are most manpages/documentation on anything not relating to cookie-cutter user-space development), this stuff is fortunately documented completely in the Linux kernel source under Documentation/filesystems/proc.rst. Here are the relevant bits:
rchar
-----
I/O counter: chars read
The number of bytes which this task has caused to be read from storage. This
is simply the sum of bytes which this process passed to read() and pread().
It includes things like tty IO and it is unaffected by whether or not actual
physical disk IO was required (the read might have been satisfied from
pagecache)
wchar
-----
I/O counter: chars written
The number of bytes which this task has caused, or shall cause to be written
to disk. Similar caveats apply here as with rchar.
read_bytes
----------
I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
be fetched from the storage layer. Done at the submit_bio() level, so it is
accurate for block-backed filesystems. <please add status regarding NFS and
CIFS at a later time>
write_bytes
-----------
I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to
the storage layer. This is done at page-dirtying time.
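For the Munin use case, a minimal sketch of a plugin (my own illustration, not an official plugin) that prints the four counters for the oldest process matching a given name, in Munin's "field.value" output format:
#!/bin/bash
# usage: io_counters <process-name>  (needs permission to read that process's /proc/<pid>/io)
pid=$(pgrep -o "$1") || exit 1
awk '
    $1 == "rchar:"       { print "rchar.value "       $2 }
    $1 == "wchar:"       { print "wchar.value "       $2 }
    $1 == "read_bytes:"  { print "read_bytes.value "  $2 }
    $1 == "write_bytes:" { print "write_bytes.value " $2 }
' "/proc/$pid/io"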
