What is the correct way to get a consistent snapshot of /proc/pid/smaps? - linux

I am trying to parse the PSS value from /proc/<pid>/smaps of a process in my C++ binary.
According to this SO answer, naively reading the /proc/<pid>/smaps file, for example with ifstream::getline(), will result in an inconsistent dataset. The suggested solution is to use the read() system call to read the whole data in one go, something like:
#include <unistd.h>
#include <fcntl.h>
...
char rawData[102400];
int file = open("/proc/12345/smaps", O_RDONLY, 0);
auto bytesRead = read(file, rawData, 102400); // this returns 3722 instead of expected ~64k
close(file);
std::cout << bytesRead << std::endl;
// do some parsing here after null-terminating the buffer
...
My problem now is that despite using a 100 kB buffer, only 3722 bytes are returned. Looking with strace at what cat does when reading the file, I see that it uses multiple calls to read() (also getting around 3k bytes on every read) until read() returns 0, as described in the documentation of read():
...
read(3, "7fa8db3d7000-7fa8db3d8000 r--p 0"..., 131072) = 3588
write(1, "7fa8db3d7000-7fa8db3d8000 r--p 0"..., 3588) = 3588
read(3, "7fa8db3df000-7fa8db3e0000 r--p 0"..., 131072) = 3632
write(1, "7fa8db3df000-7fa8db3e0000 r--p 0"..., 3632) = 3632
read(3, "7fa8db3e8000-7fa8db3ed000 r--s 0"..., 131072) = 3603
write(1, "7fa8db3e8000-7fa8db3ed000 r--s 0"..., 3603) = 3603
read(3, "7fa8db41d000-7fa8db425000 r--p 0"..., 131072) = 3445
write(1, "7fa8db41d000-7fa8db425000 r--p 0"..., 3445) = 3445
read(3, "7fff05467000-7fff05496000 rw-p 0"..., 131072) = 2725
write(1, "7fff05467000-7fff05496000 rw-p 0"..., 2725) = 2725
read(3, "", 131072) = 0
munmap(0x7f8d29ad4000, 139264) = 0
close(3) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
But isn't this supposed to produce inconsistent data according to the SO answer linked above?
I have also found some information about proc here, which seems to support the previous SO answer:
To see a precise
snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
Then later in the text it says:
Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent
output can be achieved only in the single read call).
This typically manifests when doing partial reads of these files while the
memory map is being modified.
Despite the races, we do provide the following
guarantees:
1) The mapped addresses never go backwards, which implies no two regions will ever overlap.
2) If there is something at a given vaddr during the entirety of the
life of the smaps/maps walk, there will be some output for it.
So it seems to me, I can only trust the data I'm getting if I get it in a single read() call.
Which only returns a small chunk of data despite the buffer being big enough.
Which in turn means there is actually no way to get a consistent snapshot of /proc/<pid>/smaps and the data returned by cat/using multiple read() calls may be garbage depending on the sun to moon light ratio?
Or does 2) actually mean I'm too hung up on the previous SO answer listed above?

You are being limited by the size of the internal kernel buffer in fs/seq_file.c, which is used to generate many /proc files.
The buffer starts out at the size of one page, is grown exponentially until it fits at least one record, and is then filled with as many complete records as will fit. It is not grown any further once it is large enough to hold the first record. When the internal buffer cannot fit any more records, the read ends.
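Because each read() returns at most one kernel buffer's worth of records, a reader has to loop until read() returns 0, just as cat does in the strace output above. The following is a minimal sketch of that loop in Python, plus a helper that sums the Pss lines. The names read_all and total_pss_kb are made up for illustration, and this does not give an atomic snapshot: per the kernel documentation quoted in the question, only the two listed guarantees hold across multiple reads.

```python
import os
import re


def read_all(path, chunk_size=1 << 20):
    """Read a file with repeated read() calls until read() returns 0 bytes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)  # may return far less than chunk_size
            if not chunk:                    # empty result means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


# Matches only lines that start with "Pss:", not "Pss_Anon:" or "SwapPss:".
PSS_LINE = re.compile(rb"^Pss:\s+(\d+) kB$", re.MULTILINE)


def total_pss_kb(smaps_bytes):
    """Sum all per-mapping Pss values (in kB) found in smaps content."""
    return sum(int(m.group(1)) for m in PSS_LINE.finditer(smaps_bytes))
```

On a Linux system, total_pss_kb(read_all("/proc/self/smaps")) would give the summed PSS of the calling process, subject to the race described above.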

Related

Multiple global offset tables in linux process

I'm examining memory layout of a running process and made an interesting observation. There seems to be multiple GOTs (global offset table). Here is what I see in the debugger when I study a malloc function:
(gdb) p (void *) 0x7ff5806ae020
$5 = (void *) 0x7ff5806ae020 <malloc@got.plt>
(gdb) p (void *) 0x7ff5806471d0
$6 = (void *) 0x7ff5806471d0 <malloc@got.plt>
(gdb) p (void *) 0x5634ef446030
$7 = (void *) 0x5634ef446030 <malloc@got.plt>
I examine 3 different addresses of a malloc trampoline. When I look at the memory maps of the process, these addresses correspond to following entries:
7ff580647000-7ff580648000 rw-p 0001c000 fd:01 547076 /lib/x86_64-linux-gnu/libpthread-2.31.so
5634ef446000-5634ef447000 rw-p 00003000 fd:02 12248955 /home/user/binary
7ff5806ae000-7ff5806af000 rw-p 0002a000 fd:01 523810 /lib/x86_64-linux-gnu/ld-2.31.so
I see the different entries correspond to different "linkable objects": the binary and two dynamic libraries.
Further, two out of three trampolines point to the actual function. And both the pointers are the same. The third trampoline points to the stub.
(gdb) p *(void **) 0x5634ef446030
$8 = (void *) 0x7ff5804ef1b0 <__GI___libc_malloc>
(gdb) p *(void **) 0x7ff5806471d0
$9 = (void *) 0x7ff580631396 <malloc@plt+6>
(gdb) p *(void **) 0x7ff5806ae020
$10 = (void *) 0x7ff5804ef1b0 <__GI___libc_malloc>
Is there really a need for three trampolines? If yes, then why?
I realised that such a system is the only sensible way to implement trampolines.
In assembly, each call instruction to a dynamically linked function basically refers to the index of that function in the GOT. The index is encoded directly in the instruction, so it must be known by static link time at the latest. Otherwise, the dynamic linker would have to patch the program code each time the program starts, which is clearly a cumbersome task.
Moreover, each library is compiled separately and therefore must not depend on other libraries, including their exact GOT layout. If there were a single GOT, all libraries loaded together would somehow have to agree on the meaning of each entry in it. A shared data structure (the GOT) filled by all libraries together would almost certainly create such a dependency.
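The argument above can be illustrated with a toy model. This is only a sketch in Python, not how a real dynamic linker is implemented: LinkableObject and dynamic_link are hypothetical names, and the address used for puts in the usage note is invented. Each object fixes its own slot indices at "static link" time, and the "dynamic linker" later fills every object's table independently.

```python
class LinkableObject:
    """Toy model of an executable or shared library with its own GOT."""

    def __init__(self, name, imported_symbols):
        self.name = name
        # Slot indices are decided per object at static link time.
        self.slot_of = {sym: i for i, sym in enumerate(imported_symbols)}
        # Slots stay empty until the dynamic linker fills them in.
        self.got = [None] * len(imported_symbols)

    def call(self, symbol):
        """Resolve a call the way a call site would: via this object's own slot."""
        return self.got[self.slot_of[symbol]]


def dynamic_link(objects, definitions):
    """Fill each object's GOT from a symbol -> address table."""
    for obj in objects:
        for symbol, slot in obj.slot_of.items():
            obj.got[slot] = definitions[symbol]
```

For example, a binary importing ["puts", "malloc"] and a library importing only ["malloc"] keep malloc in slots 1 and 0 respectively, yet after dynamic_link both slots hold the same address, which matches the gdb observation that separate tables point to the same __GI___libc_malloc.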
For example, readelf shows that .so files also have a GOT:
$ readelf -S /lib/ld-linux.so.2
[18] .got PROGBITS 00029ff4 028ff4 000008 04 WA 0 0 4
[19] .got.plt PROGBITS 0002a000 029000 000028 04 WA 0 0 4
$ readelf -S /usr/lib/libpurple.so.0.13.0
[21] .got PROGBITS 0000000000137318 00136318
0000000000003cd8 0000000000000008 WA 0 0 8
However, libpurple does not have a .got.plt section, which I don't fully understand.
My confusion came from the fact that the table is called "global". The word "global" actually means that the table is global at the level of a linkable object, in contrast to a compilation module (a .o file).
Second, I was under the impression that the GOT belongs only to the executable application, rather than to every dynamically linkable object.

How are multiple copies of shared library text section avoided in physical memory?

When Linux loads shared libraries, my understanding is that the text section is loaded only once into physical memory and is then mapped into the page tables of the different processes that reference it.
But where/who ensures/checks that the same shared library text section has not been loaded into physical memory multiple times?
Is the duplication avoided by the loader or by the mmap() system call or is there some other way and how?
Edit1:
I should have shown what I have done so far (research). Here it is...
Tried to trace a simple sleep command.
$ strace sleep 100 &
[1] 22824
$ execve("/bin/sleep", ["sleep", "100"], [/* 26 vars */]) = 0
brk(0) = 0x89bd000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=92360, ...}) = 0
mmap2(NULL, 92360, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f56000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0`G\0004\0\0\0"..., 512) = 512
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f55000
fstat64(3, {st_mode=S_IFREG|0755, st_size=1706232, ...}) = 0
mmap2(0x460000, 1426884, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x460000
mmap2(0x5b7000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x156) = 0x5b7000
mmap2(0x5ba000, 9668, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x5ba000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f54000
...
munmap(0xb7f56000, 92360) = 0
...
Then checked the /proc/pid/maps file for this process;
$ cat /proc/22824/maps
00441000-0045c000 r-xp 00000000 fd:00 2622360 /lib/ld-2.5.so
...
00460000-005b7000 r-xp 00000000 fd:00 2622361 /lib/libc-2.5.so
...
00e3e000-00e3f000 r-xp 00e3e000 00:00 0 [vdso]
08048000-0807c000 r-xp 00000000 fd:00 5681559 /usr/bin/strace
...
Here it was seen that the addr argument for mmap2() of libc.so.6 with PROT_READ|PROT_EXEC was a specific address. This led me to believe that the shared library mapping in physical memory was somehow managed by the loader.
Shared libraries are loaded with the mmap() syscall, and the Linux kernel is smart. It keeps an internal data structure (the page cache) that maps a file, identified by its mount instance and inode number, to the pages of that file already in memory.
The dynamic linker (its code lives in /lib/ld-linux.so or similar) only uses this mmap() call to map the libraries (and then relocates their symbol tables); the page-level deduplication is done entirely by the kernel.
The text mappings are made with PROT_READ|PROT_EXEC protection and the MAP_PRIVATE flag (there is no PROT_SHARED), which you can easily check by stracing any tool (for example strace /bin/echo). Pages of a private file mapping that are never written stay shared with the page cache.
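To see which file each executable mapping is backed by, the maps output from the question can be parsed. The sketch below, with the hypothetical helpers parse_maps_line and file_backed_text, extracts the device and inode of every executable, file-backed mapping. It only illustrates that the kernel can identify the backing file from these fields; it does not inspect the page cache itself.

```python
def parse_maps_line(line):
    """Parse one /proc/<pid>/maps line; return None for lines that do not fit."""
    fields = line.split(None, 5)
    if len(fields) < 5:
        return None
    address, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in address.split("-"))
    return {
        "start": start,
        "end": end,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,
        "inode": int(inode),
        "path": path,
    }


def file_backed_text(maps_text):
    """Return (dev, inode, path) for executable mappings backed by a real file."""
    result = []
    for line in maps_text.splitlines():
        entry = parse_maps_line(line)
        if entry is None:
            continue
        # inode 0 means an anonymous or special mapping such as [vdso]
        if "x" in entry["perms"] and entry["inode"] != 0:
            result.append((entry["dev"], entry["inode"], entry["path"]))
    return result
```

Two processes whose maps show the same device and inode for libc are mapping the same file, so unmodified pages can be served from the same page-cache pages.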

psutil vs dd: monitoring disk I/O

I'm writing y.a.t. (yet-another-tool :)) for monitoring disk usage on Linux.
I'm using python 3.3.2 and psutil 3.3.0.
The process I'm monitoring does something really basic: I use the dd tool and I vary the block size (128, 512, 1024, 4096)
#!/bin/bash
dd if=./bigfile.txt of=./copy.img bs=4096
bigfile.txt:
$ stat bigfile.txt
File: ‘bigfile.txt’
Size: 87851423 Blocks: 171600 IO Block: 4096 regular file
And the snippet of the monitor is as follows:
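import time
from collections import OrderedDict

def poll(interval, proc):
    # proc is a psutil.Process; io_counters() returns a namedtuple
    d_before = proc.io_counters()
    time.sleep(interval)
    tst = time.time()
    d_after = proc.io_counters()
    # Use _fields/getattr rather than __dict__, which namedtuples
    # no longer provide on newer Python versions.
    usage = OrderedDict.fromkeys(d_after._fields)
    for k in usage:
        usage[k] = getattr(d_after, k) - getattr(d_before, k)
    return tst, usage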
At each run, I clear the cache (as suggested many times on stackoverflow):
rm copy.img && sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
My question is: why aren't the numbers matching?
bs=128:
dd:
686339+1 records in
686339+1 records out
87851423 bytes (88 MB) copied, 1.21664 s, 72.2 MB/s
monitor.py:
1450778750.104943 OrderedDict([('read_count', 686352), ('write_count', 686343), ('read_bytes', 87920640), ('write_bytes', 87855104)])
bs=4096
dd:
21448+1 records in
21448+1 records out
87851423 bytes (88 MB) copied, 0.223911 s, 392 MB/s
monitor.py:
1450779294.5541275 OrderedDict([('read_count', 21468), ('write_count', 21452), ('read_bytes', 88252416), ('write_bytes', 87855104)])
The difference is still there with all the values of bs.
Is it a matter of certain reads/writes not being counted? Does psutil perform some extra work? For example, with bs=4096, why does psutil report 400993 more bytes read and 3681 more bytes written?
Am I missing something big?
Thanks a lot.
EDIT: as an update, it doesn't matter the granularity of timings in the measurement, i.e., the time.sleep(interval) call. I tried with different values, and summing up the total number of reads and writes reported by psutil. The difference remains.
EDIT2: typo in snippet code
write_bytes
The read_bytes and write_bytes correspond to the same fields from /proc/<PID>/io. Quoting the documentation (emphasis mine):
read_bytes
----------
I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
be fetched from the storage layer. Done at the submit_bio() level, so it is
accurate for block-backed filesystems.
write_bytes
-----------
I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to
the storage layer. This is done at page-dirtying time.
As you know, most (all?) filesystems are block-based. This implies that if a program writes, say, just 5 bytes to a file, and the block size is 4 KiB, then 4 KiB will be written.
If you don't trust dd, let's try with a simple Python script:
with open('something', 'wb') as f:
f.write(b'12345')
input('press Enter to exit')
This script should write only 5 bytes, but if we inspect /proc/<PID>/io, we can see that 4 KiB were written:
$ cat /proc/3455/io
rchar: 215317
wchar: 24
syscr: 66
syscw: 2
read_bytes: 0
write_bytes: 4096
cancelled_write_bytes: 0
This is the same thing that is happening with dd in your case.
You have asked dd to write 87851423 bytes. How many 4 KiB blocks are 87851423 bytes?
87851423 - (87851423 mod 4096) + 4096 = 87855104
Not by chance 87855104 is the number reported by psutil.
read_bytes
How about read_bytes? In theory we should have read_bytes equal to write_bytes, but actually read_bytes shows 16 more blocks in the first run, and 97 more blocks in the second run.
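The arithmetic for both counters can be checked with a few lines of Python. This is just a sketch using the numbers reported in the question; the helper names round_up_to_block and extra_blocks are made up for illustration, and a 4 KiB block size is assumed as in the answer.

```python
def round_up_to_block(nbytes, block_size=4096):
    """Round a byte count up to the next multiple of block_size."""
    return -(-nbytes // block_size) * block_size  # ceiling division


def extra_blocks(read_bytes, write_bytes, block_size=4096):
    """Number of whole blocks by which read_bytes exceeds write_bytes."""
    return (read_bytes - write_bytes) // block_size


# 87851423 bytes copied by dd, rounded up to whole 4 KiB blocks
write_bytes = round_up_to_block(87851423)

# read_bytes values reported by psutil for bs=128 and bs=4096
first_run = extra_blocks(87920640, write_bytes)
second_run = extra_blocks(88252416, write_bytes)
```

Here write_bytes comes out as 87855104, and the two runs show 16 and 97 extra blocks, matching the figures in the answer.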
Well, first of all, let's see what files dd is actually reading:
$ strace -e trace=open,read -- dd if=/dev/zero of=zero bs=1M count=2
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\v\2\0\0\0\0\0"..., 832) = 832
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/dev/zero", O_RDONLY) = 3
open("zero", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 0
read(0, "# Locale name alias data base.\n#"..., 4096) = 2570
read(0, "", 4096) = 0
open("/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = 0
+++ exited with 0 +++
As you can see, dd is opening and reading the linker, the GNU C library, and locale files. It is reading more bytes than you can see above, because it also uses mmap, not just read.
The point is: dd reads many more files than just the source file, so it is expected that read_bytes is higher than write_bytes. But why is it inconsistent?
The files read by dd are also used by many other programs. Even if you drop_caches just before executing dd, there is a chance that some other process reloads one of these files into memory. You can try with this very simple C program:
int main()
{
while(1) {
}
}
Compiled with the default GCC options, this program does nothing except open the linker and the GNU C library. If you drop_caches, execute the program and cat /proc/<PID>/io, and repeat this more than once, you'll see that read_bytes varies across runs (unless you perform the steps very fast, in which case the probability that some other program has loaded some of the files into the cache is low).
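To compare runs without going through psutil, the /proc/<PID>/io text can be parsed directly. A minimal sketch, assuming the "name: value" layout shown in the cat output earlier; parse_proc_io is a hypothetical helper name.

```python
def parse_proc_io(text):
    """Turn the contents of /proc/<PID>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip anything that is not a "name: value" line
            counters[name.strip()] = int(value)
    return counters
```

Reading the file for the same program after each drop_caches and comparing the read_bytes entry would show the run-to-run variation described above.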

Can malloc_trim() release memory from the middle of the heap?

I am confused about the behaviour of malloc_trim as implemented in the glibc.
man malloc_trim
[...]
malloc_trim - release free memory from the top of the heap
[...]
This function cannot release free memory located at places other than the top of the heap.
When I now look up the source of malloc_trim() (in malloc/malloc.c) I see that it calls mtrim() which is utilizing madvise(x, MADV_DONTNEED) to release memory back to the operating system.
So I wonder if the man-page is wrong or if I misinterpret the source in malloc/malloc.c.
Can malloc_trim() release memory from the middle of the heap?
There are now two uses of madvise with MADV_DONTNEED in glibc: http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
arena.c:643 __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
malloc.c:4535 __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
There was a commit by Ulrich Drepper on 16 Dec 2007 (part of glibc 2.9 and newer), https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc:
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
The mTRIm (now mtrim) implementation was changed. Unused parts of free chunks that are page-aligned and at least one page in size may be marked with MADV_DONTNEED:
/* See whether the chunk contains at least one unused page. */
char *paligned_mem = (char *) (((uintptr_t) p
                                + sizeof (struct malloc_chunk)
                                + psm1) & ~psm1);
assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
assert ((char *) p + size > paligned_mem);
/* This is the size we could potentially free. */
size -= paligned_mem - (char *) p;
if (size > psm1)
  madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
The man page of malloc_trim is here: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3. It was committed by kerrisk in 2012: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
As far as I can tell from grepping glibc's git, there are no man pages in glibc, and no commit to the malloc_trim man page documents this patch. The best and only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
Additional functions:
malloc_trim(size_t pad);
/*
  malloc_trim(size_t pad);

  If possible, gives memory back to the system (via negative
  arguments to sbrk) if there is unused memory at the `high' end of
  the malloc pool. You can call this after freeing large blocks of
  memory to potentially reduce the system-level memory requirements
  of a program. However, it cannot guarantee to reduce memory. Under
  some allocation patterns, some large free blocks of memory will be
  locked between two used chunks, so they cannot be given back to
  the system.

  The `pad' argument to malloc_trim represents the amount of free
  trailing space to leave untrimmed. If this argument is zero,
  only the minimum amount of memory to maintain internal data
  structures will be left (one page or less). Non-zero arguments
  can be supplied to maintain enough trailing space to service
  future expected allocations without having to re-obtain memory
  from the system.

  Malloc_trim returns 1 if it actually released any memory, else 0.
  On systems that do not support "negative sbrks", it will always
  return 0.
*/
int __malloc_trim(size_t);
Freeing from the middle of the chunk is not documented as text in malloc/malloc.c (the malloc_trim description in the comment was not updated in 2007) and is not documented in the man-pages project. The man page from 2012 may be the first man page of the function, and it was not written by the authors of glibc. The info page of glibc only mentions M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and does not list the malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (it also does not document memusage/memusagestat/libmemusage.so).
You may ask Drepper and other glibc developers again, as you already did in https://sourceware.org/ml/libc-help/2015-02/msg00022.html "malloc_trim() behaviour", but there is still no reply from them. (Only wrong answers from other users, like https://sourceware.org/ml/libc-help/2015-05/msg00007.html https://sourceware.org/ml/libc-help/2015-05/msg00008.html)
Or you may test malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
int main()
{
    int *m1,*m2,*m3,*m4;
    printf("%s\n","Test started");
    m1=(int*)malloc(20000);
    m2=(int*)malloc(40000);
    m3=(int*)malloc(80000);
    m4=(int*)malloc(10000);
    printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
    free(m2);
    malloc_trim(0); // 20000, 2000000
    sleep(1);
    free(m1);
    free(m3);
    free(m4);
    // malloc_stats(); malloc_info(0, stdout);
    return 0;
}
gcc test_malloc_trim.c -o test_malloc_trim, strace ./test_malloc_trim
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So, there is a madvise call with MADV_DONTNEED for 9 pages after the malloc_trim(0) call, when there was a hole of 40008 bytes in the middle of the heap.
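The madvise arguments in this trace can be reproduced from the printed pointers by redoing the page-alignment arithmetic of the mtrim snippet. The sketch below is in Python and rests on assumptions that are not stated in the original: a 64-bit glibc where the chunk header before the user pointer is 16 bytes and sizeof(struct malloc_chunk) is 48, a 4096-byte page, and a chunk size taken as the distance between the m2 and m3 pointers. The function name trimmable_range is made up.

```python
PAGE_SIZE = 4096            # assumed page size
CHUNK_HEADER = 16           # assumed: 2 * SIZE_SZ on 64-bit glibc
SIZEOF_MALLOC_CHUNK = 48    # assumed: six 8-byte fields on 64-bit glibc


def trimmable_range(chunk_start, chunk_size,
                    struct_size=SIZEOF_MALLOC_CHUNK, page_size=PAGE_SIZE):
    """Mirror the mtrim arithmetic: return (address, length) or None."""
    psm1 = page_size - 1
    # First page boundary after the chunk's bookkeeping fields
    paligned_mem = (chunk_start + struct_size + psm1) & ~psm1
    # Bytes of the free chunk that lie at or after that boundary
    size = chunk_size - (paligned_mem - chunk_start)
    if size > psm1:
        return paligned_mem, size & ~psm1
    return None


m2 = 0xccee40  # pointers printed by the test program
m3 = 0xcd8a90
result = trimmable_range(m2 - CHUNK_HEADER, m3 - m2)
```

Under these assumptions result is (0xccf000, 36864), the same address and length as in the madvise line of the strace output.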
... utilizing madvise(x, MADV_DONTNEED) to release memory back to the
operating system.
madvise(x, MADV_DONTNEED) does not release memory. man madvise:
MADV_DONTNEED
Do not expect access in the near future. (For the time being,
the application is finished with the given range, so the kernel
can free resources associated with it.) Subsequent accesses of
pages in this range will succeed, but will result either in
reloading of the memory contents from the underlying mapped file
(see mmap(2)) or zero-fill-on-demand pages for mappings without
an underlying file.
So, the usage of madvise(x, MADV_DONTNEED) does not contradict man malloc_trim's statement:
This function cannot release free memory located at places other than the top of the heap.

How to trace per-file IO operations in Linux?

I need to track read system calls for specific files, and I'm currently doing this by parsing the output of strace. Since read operates on file descriptors I have to keep track of the current mapping between fd and path. Additionally, seek has to be monitored to keep the current position up-to-date in the trace.
Is there a better way to get per-application, per-file-path IO traces in Linux?
You could wait for the files to be opened so you can learn the fd and attach strace after the process launch like this:
strace -p pid -e trace=file -e read=fd
First, you probably don't need to keep track yourself, because the mapping between fd and path is available in /proc/PID/fd/.
Second, maybe you should use the LD_PRELOAD trick and override the open, seek and read calls in C. There are some articles here and there about how to override malloc/free.
I guess it won't be too different to apply the same kind of trick to those system calls. It needs to be implemented in C, but it should take far less code and be more precise than parsing strace output.
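The first suggestion, looking up the fd-to-path mapping in /proc/PID/fd/, could look roughly like this in Python. It is a sketch only: fd_paths is a hypothetical helper, and the proc_root parameter exists just so the function can be pointed at a different directory for testing.

```python
import os


def fd_paths(pid, proc_root="/proc"):
    """Return {fd: target} by reading the symlinks in <proc_root>/<pid>/fd."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    paths = {}
    for name in os.listdir(fd_dir):
        try:
            paths[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink()
            continue
    return paths
```

Note that this is a snapshot at one point in time; descriptors opened and closed between two calls are never seen.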
systemtap, a kind of DTrace reimplementation for Linux, could be of help here.
As with strace, you only have the fd, but with its scripting ability it is easy to maintain the filename for an fd (except for fun stuff like dup). The example script iotime illustrates this.
#! /usr/bin/env stap
/*
* Copyright (C) 2006-2007 Red Hat Inc.
*
* This copyrighted material is made available to anyone wishing to use,
* modify, copy, or redistribute it subject to the terms and conditions
* of the GNU General Public License v.2.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*
* Print out the amount of time spent in the read and write systemcall
* when each file opened by the process is closed. Note that the systemtap
* script needs to be running before the open operations occur for
* the script to record data.
*
* This script could be used to to find out which files are slow to load
* on a machine. e.g.
*
* stap iotime.stp -c 'firefox'
*
* Output format is:
* timestamp pid (executabable) info_type path ...
*
* 200283135 2573 (cupsd) access /etc/printcap read: 0 write: 7063
* 200283143 2573 (cupsd) iotime /etc/printcap time: 69
*
*/
global start
global time_io
function timestamp:long() { return gettimeofday_us() - start }
function proc:string() { return sprintf("%d (%s)", pid(), execname()) }
probe begin { start = gettimeofday_us() }
global filehandles, fileread, filewrite
probe syscall.open.return {
  filename = user_string($filename)
  if ($return != -1) {
    filehandles[pid(), $return] = filename
  } else {
    printf("%d %s access %s fail\n", timestamp(), proc(), filename)
  }
}
probe syscall.read.return {
  p = pid()
  fd = $fd
  bytes = $return
  time = gettimeofday_us() - @entry(gettimeofday_us())
  if (bytes > 0)
    fileread[p, fd] += bytes
  time_io[p, fd] <<< time
}
probe syscall.write.return {
  p = pid()
  fd = $fd
  bytes = $return
  time = gettimeofday_us() - @entry(gettimeofday_us())
  if (bytes > 0)
    filewrite[p, fd] += bytes
  time_io[p, fd] <<< time
}
probe syscall.close {
  if ([pid(), $fd] in filehandles) {
    printf("%d %s access %s read: %d write: %d\n",
           timestamp(), proc(), filehandles[pid(), $fd],
           fileread[pid(), $fd], filewrite[pid(), $fd])
    if (@count(time_io[pid(), $fd]))
      printf("%d %s iotime %s time: %d\n", timestamp(), proc(),
             filehandles[pid(), $fd], @sum(time_io[pid(), $fd]))
  }
  delete fileread[pid(), $fd]
  delete filewrite[pid(), $fd]
  delete filehandles[pid(), $fd]
  delete time_io[pid(),$fd]
}
It only works up to a certain number of files because the hash map is size limited.
strace now has new options to track file descriptors:
--decode-fds=set
Decode various information associated with file descriptors. The default is decode-fds=none. set can include the following elements:
path Print file paths.
socket Print socket protocol-specific information,
dev Print character/block device numbers.
pidfd Print PIDs associated with pidfd file descriptors.
This is useful because file descriptors are reused after being closed, and /proc/$PID/fd only provides one snapshot in time, which is useless when debugging something in real time.
Sample output; note how file names are displayed in angle brackets and FD 3 is reused for all of /etc/ld.so.cache, /lib/x86_64-linux-gnu/libc.so.6, /usr/lib/locale/locale-archive, /home/florian/hello.
$ strace -e trace=desc --decode-fds=all cat hello 1>/dev/null
execve("/usr/bin/cat", ["cat", "hello"], 0x7fff42e20710 /* 102 vars */) = 0
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3</etc/ld.so.cache>
newfstatat(3</etc/ld.so.cache>, "", {st_mode=S_IFREG|0644, st_size=167234, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 167234, PROT_READ, MAP_PRIVATE, 3</etc/ld.so.cache>, 0) = 0x7f22edeee000
close(3</etc/ld.so.cache>) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>
read(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\206\2\0\0\0\0\0"..., 832) = 832
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\6\0\0\0\4\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 48, 848) = 48
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0+H)\227\201T\214\233\304R\352\306\3379\220%"..., 68, 896) = 68
newfstatat(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "", {st_mode=S_IFREG|0755, st_size=1983576, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22edeec000
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\6\0\0\0\4\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2012056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0) = 0x7f22edd00000
mmap(0x7f22edd26000, 1486848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0x26000) = 0x7f22edd26000
mmap(0x7f22ede91000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0x191000) = 0x7f22ede91000
mmap(0x7f22ededd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0x1dc000) = 0x7f22ededd000
mmap(0x7f22edee3000, 33688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f22edee3000
close(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22edcfe000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3</usr/lib/locale/locale-archive>
newfstatat(3</usr/lib/locale/locale-archive>, "", {st_mode=S_IFREG|0644, st_size=6055600, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 6055600, PROT_READ, MAP_PRIVATE, 3</usr/lib/locale/locale-archive>, 0) = 0x7f22ed737000
close(3</usr/lib/locale/locale-archive>) = 0
fstat(1</dev/null<char 1:3>>, {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}) = 0
openat(AT_FDCWD, "hello", O_RDONLY) = 3</home/florian/hello>
fstat(3</home/florian/hello>, {st_mode=S_IFREG|0664, st_size=6, ...}) = 0
fadvise64(3</home/florian/hello>, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22edef5000
read(3</home/florian/hello>, "world\n", 131072) = 6
write(1</dev/null<char 1:3>>, "world\n", 6) = 6
read(3</home/florian/hello>, "", 131072) = 0
close(3</home/florian/hello>) = 0
close(1</dev/null<char 1:3>>) = 0
close(2</dev/pts/5<char 136:5>>) = 0
+++ exited with 0 +++
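With the path already attached to every descriptor, per-file read totals can be pulled out of output like the sample above with a small parser. The following Python sketch assumes the line layout shown in that sample and only handles successful read and pread64 calls; bytes_read_per_path is a hypothetical name, and paths that themselves contain angle brackets (such as the /dev/null<char 1:3> form) are not handled.

```python
import re
from collections import defaultdict

# Matches e.g.: read(3</home/florian/hello>, "world\n", 131072) = 6
READ_CALL = re.compile(r'^(?:read|pread64)\(\d+<([^>]+)>,.*\)\s+=\s+(\d+)$')


def bytes_read_per_path(strace_lines):
    """Sum the return values of read/pread64 calls, grouped by decoded path."""
    totals = defaultdict(int)
    for line in strace_lines:
        match = READ_CALL.match(line.strip())
        if match:
            totals[match.group(1)] += int(match.group(2))
    return dict(totals)
```

Feeding it the lines of a saved strace log (for example one written with strace's -o option) gives one total per path, without any manual fd bookkeeping.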
I think overriding open, seek and read is a good solution. But just FYI, if you want to parse and analyze the strace output programmatically, I did something similar before and put my code on GitHub: https://github.com/johnlcf/Stana/wiki
(I did that because I have to analyze the strace results of programs run by others, and it is not easy to ask them to use LD_PRELOAD.)
Probably the least ugly way to do this is to use fanotify. Fanotify is a Linux kernel facility that allows cheaply watching filesystem events. I'm not sure if it allows filtering by PID, but it does pass the PID to your program so you can check if it's the one you're interested in.
Here's a nice code sample:
http://bazaar.launchpad.net/~pitti/fatrace/trunk/view/head:/fatrace.c
However, it seems to be under-documented at the moment. All the docs I could find are http://www.spinics.net/lists/linux-man/msg02302.html and http://lkml.indiana.edu/hypermail/linux/kernel/0811.1/01668.html
Parsing the output of command-line utilities like strace is cumbersome; you could use the ptrace() syscall instead. See man ptrace for details.
