Linux malloc segmentation fault missing - linux

The included program should produce a segmentation fault in Linux, but it doesn't:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
/* This program should produce a segmentation error */
/* but it doesn't */
/* Therefore memory bounds are not being checked by the kernel */
typedef struct {
double x;
} blkfmt;
int main()
{
blkfmt *blk;
unsigned char *p;
blk = (blkfmt *) malloc(sizeof(blkfmt));
p = (unsigned char *) blk + 16384;
/* this assignment should produce a segmentation error */
*p = (unsigned char) 0xff;
/* this print statement should produce a segmentation error */
printf("%02x %d\n", *p, sizeof(blkfmt));
free(blk);
return(0);
} /* main */
makefile:
OBJ=bug.o
CC=gcc
CFLAGS=-c -Wall -O2
LDFLAGS=
bug: $(OBJ)
$(CC) -Wall -O2 $(OBJ) -o bug $(LDFLAGS)
bug.o: bug.c
$(CC) $(CFLAGS) bug.c
clean:
rm -f bug $(OBJ)

A memory allocator can do almost anything it wants to. Which is why this isn't acting as you expect.
The following is the output of strace running your program. In it you can see the brk system call is used to allocate 0x21000 more bytes for the program. That's 135,168 bytes. Much more than the 16,384 your program adds to the returned pointer.
execve("./malloc-outside-test", ["./malloc-outside-test"], [/* 62 vars */]) = 0
brk(NULL) = 0x1211000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2efad76000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=183184, ...}) = 0
mmap(NULL, 183184, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f2efad49000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\10\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2089496, ...}) = 0
mmap(NULL, 3938656, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f2efa792000
mprotect(0x7f2efa94b000, 2093056, PROT_NONE) = 0
mmap(0x7f2efab4a000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x7f2efab4a000
mmap(0x7f2efab50000, 14688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f2efab50000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2efad48000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2efad47000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2efad46000
arch_prctl(ARCH_SET_FS, 0x7f2efad47700) = 0
mprotect(0x7f2efab4a000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f2efad77000, 4096, PROT_READ) = 0
munmap(0x7f2efad49000, 183184) = 0
brk(NULL) = 0x1211000
brk(0x1232000) = 0x1232000
brk(NULL) = 0x1232000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
write(1, "ff 8\n", 5) = 5
exit_group(0)

Undefined behaviour is undefined - there's no requirement to detect an report the error. Most implementations favour performance over error detection.
If you want to check your program for correctness (hint: you do, just like the rest of us!), then you'll want to use Valgrind. When I run Valgrind with your program, it reports both your errors:
==26331== Invalid write of size 1
==26331== at 0x40059E: main (38255191.c:21)
==26331== Address 0x51de040 is 16,304 bytes inside an unallocated block of size 4,194,128 in arena "client"
==26331==
==26331== Invalid read of size 1
==26331== at 0x4005A5: main (38255191.c:23)
==26331== Address 0x51de040 is 16,304 bytes inside an unallocated block of size 4,194,128 in arena "client"
==26331==
In passing, I noticed a couple of problems in your code. sizeof blkfmt is a size_t, so you need the %zd conversion.
Also, one should never cast the result of malloc, as this can mask other problems. A good idiom is p = malloc(sizeof *p) - this is obviously correct without having to even look up the type of p.
Also, you didn't need to include <unistd.h>.
Simplified example:
#include <stdio.h>
#include <stdlib.h>
int main()
{
double *blk = malloc(sizeof *blk);;
unsigned char *p = malloc(sizeof *blk);
p = (unsigned char*)blk + 16384;
/* this assignment writes to unallocated memory */
*p = 0xff;
/* this reads from unallocated memory */
printf("%02x %zd\n", *p, sizeof *blk);
free(blk);
return 0;
}

malloc will get bigger space from kernel by page(4096), so if the pointer point to a valid space, it will not crash.

Related

How are multiple copies of shared library text section avoided in physical memory?

When Linux loads shared libraries, my understanding is that, the text section is loaded only once into physical memory and is then mapped across page tables of different processes that reference it.
But where/who ensures/checks that the same shared library text section has not been loaded into a physical memory multiple times?
Is the duplication avoided by the loader or by the mmap() system call or is there some other way and how?
Edit1:
I must've shown what was done so far (research). Here it is...
Tried to trace a simple sleep command.
$ strace sleep 100 &
[1] 22824
$ execve("/bin/sleep", ["sleep", "100"], [/* 26 vars */]) = 0
brk(0) = 0x89bd000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=92360, ...}) = 0
mmap2(NULL, 92360, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f56000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0`G\0004\0\0\0"..., 512) = 512
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f55000
fstat64(3, {st_mode=S_IFREG|0755, st_size=1706232, ...}) = 0
mmap2(0x460000, 1426884, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x460000
mmap2(0x5b7000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x156) = 0x5b7000
mmap2(0x5ba000, 9668, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x5ba000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f54000
...
munmap(0xb7f56000, 92360) = 0
...
Then checked the /proc/pid/maps file for this process;
$ cat /proc/22824/maps
00441000-0045c000 r-xp 00000000 fd:00 2622360 /lib/ld-2.5.so
...
00460000-005b7000 r-xp 00000000 fd:00 2622361 /lib/libc-2.5.so
...
00e3e000-00e3f000 r-xp 00e3e000 00:00 0 [vdso]
08048000-0807c000 r-xp 00000000 fd:00 5681559 /usr/bin/strace
...
Here it was seen that the addr argument for mmap2() of libc.so.6 with PROT_READ|PROT_EXEC was at a specific address. This lead me to believe that the shared library mapping in physical memory was somehow managed by loader.
Shared libraries are loaded in by the mmap() syscall, and the Linux kernel is smart. It has an internal data structure, which maps the file descriptors (containing the mount instance and the inode number) to the mapped pages in it.
The dynamic linker (its code is somewhere /lib/ld-linux.so or similar) only uses this mmap() call to map the libraries (and then relocates their symbol tables), this page-level deduplication is done entirely by the kernel.
The mappings happen with PROT_READ|PROT_EXEC|PROT_SHARED flags, what you can easily check by stracing any tool (like strace /bin/echo).

mmap /dev/fb0 fails with "Invalid argument"

I have an embedded system and want to use /dev/fb0 directly. As a first test, I use some code based on example-code found everywhere in the net and SO. Opening succeeds, also fstat and similar. But mmap fails with EINVAL.
Source:
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <linux/fb.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
int main() {
int fbfd = 0;
struct fb_var_screeninfo vinfo;
struct fb_fix_screeninfo finfo;
long int screensize = 0;
char *fbp = 0;
int x = 0, y = 0;
long int location = 0;
// Open the file for reading and writing
fbfd = open("/dev/fb0", O_RDWR);
if (fbfd == -1) {
perror("Error: cannot open framebuffer device");
exit(1);
}
printf("The framebuffer device was opened successfully.\n");
struct stat stat;
fstat(fbfd, &stat);
printf("/dev/mem -> size: %u blksize: %u blkcnt: %u\n",
stat.st_size, stat.st_blksize, stat.st_blocks);
// Get fixed screen information
if (ioctl(fbfd, FBIOGET_FSCREENINFO, &finfo) == -1) {
perror("Error reading fixed information");
exit(2);
}
// Get variable screen information
if (ioctl(fbfd, FBIOGET_VSCREENINFO, &vinfo) == -1) {
perror("Error reading variable information");
exit(3);
}
printf("%dx%d, %dbpp\n", vinfo.xres, vinfo.yres, vinfo.bits_per_pixel);
// Figure out the size of the screen in bytes
screensize = vinfo.xres * vinfo.yres * vinfo.bits_per_pixel / 8;
const int PADDING = 4096;
int mmapsize = (screensize + PADDING - 1) & ~(PADDING-1);
// Map the device to memory
fbp = (char *)mmap(0, mmapsize, PROT_READ | PROT_WRITE, MAP_SHARED, fbfd, 0);
if ((int)fbp == -1) {
perror("Error: failed to map framebuffer device to memory");
exit(4);
}
printf("The framebuffer device was mapped to memory successfully.\n");
munmap(fbp, screensize);
close(fbfd);
return 0;
}
Output:
The framebuffer device was opened successfully.
/dev/mem -> size: 0 blksize: 4096 blkcnt: 0
640x480, 4bpp
Error: failed to map framebuffer device to memory: Invalid argument
strace:
...
open("/dev/fb0", O_RDWR) = 3
write(1, "The framebuffer device was opene"..., 48The framebuffer device was opened successfully.
) = 48
fstat64(3, {st_mode=S_IFCHR|0640, st_rdev=makedev(29, 0), ...}) = 0
write(1, "/dev/mem -> size: 0 blksize: 409"..., 44/dev/mem -> size: 0 blksize: 4096 blkcnt: 0
) = 44
ioctl(3, FBIOGET_FSCREENINFO or FBIOPUT_CONTRAST, 0xbfca6564) = 0
ioctl(3, FBIOGET_VSCREENINFO, 0xbfca6600) = 0
write(1, "640x480, 4bpp\n", 14640x480, 4bpp
) = 14
old_mmap(NULL, 155648, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = -1 EINVAL (Invalid argument)
write(2, "Error: failed to map framebuffer"..., 49Error: failed to map framebuffer device to memory) = 49
write(2, ": ", 2: ) = 2
write(2, "Invalid argument", 16Invalid argument) = 16
write(2, "\n", 1
) = 1
The boot-screen with console and tux is visible. And cat /dev/urandom > /dev/fb0 fills the screen with noise. The pagesize is 4096 on the system (`getconf PAGESIZE). So, 155648 (0x26000) is a multiple. Offset and pointer are both zero. Mapping and filemode are both RW .. what am I missing?
This is for an embedded device build with uClibc and busybox running a single application and I have to port it from an ancient kernel. There is code for linedrawing and such and no need for multiprocessing/ windowing .. please no hints to directfb ;).
The kernel driver that presents the framebuffer doesn't support the legacy direct mmap() of the framebuffer device; you need to use the newer DRM and KMS interface.

Allocating Memory to a recursive function

I wrote a simple program as below and straced it.
#include<stdio.h>
int foo(int i)
{
int k=9;
if(i==10)
return 1;
else
foo(++i);
open("1",1);
}
int main()
{
foo(1);
}
My intention in doing so was to checkout how is memory allocated for the variables (int k in this case) in a function on a stack. I used an open system call as a marker. The output of strace was as below:
execve("./a.out", ["./a.out"], [/* 25 vars */]) = 0
brk(0) = 0x8653000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb777e000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=95172, ...}) = 0
mmap2(NULL, 95172, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7766000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000\226\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1734120, ...}) = 0
mmap2(NULL, 1743580, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb75bc000
mmap2(0xb7760000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a4) = 0xb7760000
mmap2(0xb7763000, 10972, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7763000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb75bb000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb75bb900, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xb7760000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0xb77a1000, 4096, PROT_READ) = 0
munmap(0xb7766000, 95172) = 0
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
exit_group(-1) = ?
Towards the end of the strace output you can see that no system call is being called in between the open system calls. So how is the memory allocated on to the stack , for the function being called , without a system call?
Stack memory for the main thread is allocated by the kernel during the execve() system call. During this call, other mappings defined in the executable file (and possibly also for the dynamic linker specified in the executable) are also setup. For ELF files, this is done in fs/binfmt_elf.c.
Stack memory for other threads is mmap()ed by the thread support library, which is usually part of the C runtime library.
You should also note that on virtual memory systems, the main thread stack is grown by the kernel in response to page faults, up to a configurable limit (shown by ulimit -s).
Your (single threaded) program stack size is fixed so there is no further allocation to expect.
You can query and increase this size with the ulimit -s command.
Note that even if you set this limit to "unlimited", there always will be a practical limit:
With 32 bit processes, unless you are low on RAM/swap, the virtual memory space limitation will cause address collisions
With 64 bit processes, memory (RAM + swap) exhaustion will thrash your system and eventually crash your program.
Whatever the case, there are never explicit system calls to expect that would increase the stack size, it is only set when the program starts.
Note also that the stack memory is handled exactly like heap memory, i.e. only the part of it that has been accessed is mapped to real memory (either RAM or swap). This means the stack kind of grows on demand but no other mechanism than standard virtual memory management is handling that.
Stack usage and allocation (at least on Linux) works this way:
A little bit of stack is allocated.
A guard range is setup after the "other" part of the program, at about 1/4 of the address space.
If the stack is used up to its top and above, the stack gets automatically increased.
This happens either if the ulimit limit is reached (and SIGSEGVs) or, if none such exists, until it hits the guard range (and then gets a SIGBUS).
Your program doesn't begin to make any open calls until the recursion "bottoms out". At that point, the stack is allocated, and it's just popping out of the nesting.
Why don't you step through it with a debugger.
Do you want to find out where variables are allocated to 'stack frames' created for functions?
I have revised your program to show you the memory address of your stack variable k, and a parameter variable kk,
//Show stack location for a variable, k
#include <stdio.h>
int foo(int i)
{
int k=9;
if(i>=10) //relax the condition, safer
return 1;
else
foo(++i);
open("1",1);
//return i;
}
int bar(int kk, int i)
{
int k=9;
printf("&k: %x, &kk: %x\n",&k,&kk); //address variable on stack, parameter
if(i<10) //relax the condition, safer
bar(k,++i);
else
return 1;
return k;
}
int main()
{
//foo(1);
bar(0,1);
}
And the output, on my system,
$ ./foo
&k: bfa8064c, &kk: bfa80660
&k: bfa8061c, &kk: bfa80630
&k: bfa805ec, &kk: bfa80600
&k: bfa805bc, &kk: bfa805d0
&k: bfa8058c, &kk: bfa805a0
&k: bfa8055c, &kk: bfa80570
&k: bfa8052c, &kk: bfa80540
&k: bfa804fc, &kk: bfa80510
&k: bfa804cc, &kk: bfa804e0
&k: bfa8049c, &kk: bfa804b0

How process table is updated by the Linux?

Let's say I have a c program name hello.out. It prints a hello world after every 5 seconds forever. When I execute that program. The instance of that program is called process. Now, when I execute it on a terminal - foreground. What happens to the process page table? Who fills up the process table, and adds an entry in it. How it happens? I understand there is struct task to maintains it. Who fills up this task struct, is it loader?
strace ./hello.out
execve("./hello.out", ["./hello.out"], [/* 54 vars */]) = 0
brk(0) = 0x9e3e000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7713000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=176783, ...}) = 0
mmap2(NULL, 176783, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb76e7000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\20\250\366K4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=2012656, ...}) = 0
mmap2(0x4bf51000, 1772124, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4bf51000
mprotect(0x4c0fb000, 4096, PROT_NONE) = 0
mmap2(0x4c0fc000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1aa) = 0x4c0fc000
mmap2(0x4c0ff000, 10844, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4c0ff000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb76e6000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb76e66c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0x4c0fc000, 8192, PROT_READ) = 0
mprotect(0x4bf49000, 4096, PROT_READ) = 0
munmap(0xb76e7000, 176783) = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 7), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7712000
write(1, "Hello world\r\n", 13Hello world
) = 13
exit_group(0) = ?
+++ exited with 0 +++
Where is fork here, what is it happening here. I can see these are system calls. I can make out that printf does a write call with stdout ( the last line). But where is the fork call? What is this brk? Why execve first? what is mmap2 doing here? Also, fstat? I am trying to decode this, and I am not able to understand it in details? Please help me.
When you invoke a program on a terminal, actually you are asking shell to run your program. Once you run your program, for example,
./hello.out
Now shell will read this command and creates a process using the fork() function. Then the fork() invokes the system call sys_clone(Since Version 2.3.3, passes arguments as it behaves like fork() system call). sys_clone() invokes do_fork(). do_fork() creates the process with all resources(most of them copied from the parent) and stores all information of the process in an object of type struct task_struct.
After the fork() returns the shell program now invokes another system call execve() to replace the address space of the newly created process with the program hello.out in the child process context. execve() system call reads the contents of the hello.out with the help ELF code and fills in the memory and invokes the entry point of the program. In this process the name of the program hello.out is also written into the process structure. You can call who loads this program as a loader(a part of kernel). this is done with the help of dynamic linker if there are any shared libraries that the program hello.out uses.

How to trace per-file IO operations in Linux?

I need to track read system calls for specific files, and I'm currently doing this by parsing the output of strace. Since read operates on file descriptors I have to keep track of the current mapping between fd and path. Additionally, seek has to be monitored to keep the current position up-to-date in the trace.
Is there a better way to get per-application, per-file-path IO traces in Linux?
You could wait for the files to be opened so you can learn the fd and attach strace after the process launch like this:
strace -p pid -e trace=file -e read=fd
First, you probably don't need to keep track because mapping between fd and path is available in /proc/PID/fd/.
Second, maybe you should use the LD_PRELOAD trick and overload in C open, seek and read system call. There are some article here and there about how to overload malloc/free.
I guess it won't be too different to apply the same kind of trick for those system calls. It needs to be implemented in C, but it should take far less code and be more precise than parsing strace output.
systemtap - a kind of DTrace reimplementation for Linux - could be of help here.
As with strace you only have the fd, but with the scripting ability it is easy to maintain the filename for an fd (unless with fun stuff like dup). There is the example script iotime that illustates it.
#! /usr/bin/env stap
/*
* Copyright (C) 2006-2007 Red Hat Inc.
*
* This copyrighted material is made available to anyone wishing to use,
* modify, copy, or redistribute it subject to the terms and conditions
* of the GNU General Public License v.2.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*
* Print out the amount of time spent in the read and write systemcall
* when each file opened by the process is closed. Note that the systemtap
* script needs to be running before the open operations occur for
* the script to record data.
*
* This script could be used to to find out which files are slow to load
* on a machine. e.g.
*
* stap iotime.stp -c 'firefox'
*
* Output format is:
* timestamp pid (executabable) info_type path ...
*
* 200283135 2573 (cupsd) access /etc/printcap read: 0 write: 7063
* 200283143 2573 (cupsd) iotime /etc/printcap time: 69
*
*/
global start
global time_io
function timestamp:long() { return gettimeofday_us() - start }
function proc:string() { return sprintf("%d (%s)", pid(), execname()) }
probe begin { start = gettimeofday_us() }
global filehandles, fileread, filewrite
probe syscall.open.return {
filename = user_string($filename)
if ($return != -1) {
filehandles[pid(), $return] = filename
} else {
printf("%d %s access %s fail\n", timestamp(), proc(), filename)
}
}
probe syscall.read.return {
p = pid()
fd = $fd
bytes = $return
time = gettimeofday_us() - #entry(gettimeofday_us())
if (bytes > 0)
fileread[p, fd] += bytes
time_io[p, fd] <<< time
}
probe syscall.write.return {
p = pid()
fd = $fd
bytes = $return
time = gettimeofday_us() - #entry(gettimeofday_us())
if (bytes > 0)
filewrite[p, fd] += bytes
time_io[p, fd] <<< time
}
probe syscall.close {
if ([pid(), $fd] in filehandles) {
printf("%d %s access %s read: %d write: %d\n",
timestamp(), proc(), filehandles[pid(), $fd],
fileread[pid(), $fd], filewrite[pid(), $fd])
if (#count(time_io[pid(), $fd]))
printf("%d %s iotime %s time: %d\n", timestamp(), proc(),
filehandles[pid(), $fd], #sum(time_io[pid(), $fd]))
}
delete fileread[pid(), $fd]
delete filewrite[pid(), $fd]
delete filehandles[pid(), $fd]
delete time_io[pid(),$fd]
}
It only works up to a certain number of files because the hash map is size limited.
strace now has new options to track file descriptors:
--decode-fds=set
Decode various information associated with file descriptors. The default is decode-fds=none. set can include the following elements:
path Print file paths.
socket Print socket protocol-specific information,
dev Print character/block device numbers.
pidfd Print PIDs associated with pidfd file descriptors.
This is useful as file descriptors are reused after being closed, and /proc/$PID/fd only provides one snapshot in time, which is useless when debugging something in realtime.
Sample output, note how file names are displayed in angular brackets and FD 3 is reused for all of /etc/ld.so.cache, /lib/x86_64-linux-gnu/libc.so.6, /usr/lib/locale/locale-archive, /home/florian/hello.
$ strace -e trace=desc --decode-fds=all cat hello 1>/dev/null
execve("/usr/bin/cat", ["cat", "hello"], 0x7fff42e20710 /* 102 vars */) = 0
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3</etc/ld.so.cache>
newfstatat(3</etc/ld.so.cache>, "", {st_mode=S_IFREG|0644, st_size=167234, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 167234, PROT_READ, MAP_PRIVATE, 3</etc/ld.so.cache>, 0) = 0x7f22edeee000
close(3</etc/ld.so.cache>) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>
read(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\206\2\0\0\0\0\0"..., 832) = 832
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\6\0\0\0\4\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 48, 848) = 48
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0+H)\227\201T\214\233\304R\352\306\3379\220%"..., 68, 896) = 68
newfstatat(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "", {st_mode=S_IFREG|0755, st_size=1983576, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22edeec000
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\6\0\0\0\4\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2012056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0) = 0x7f22edd00000
mmap(0x7f22edd26000, 1486848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0x26000) = 0x7f22edd26000
mmap(0x7f22ede91000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0x191000) = 0x7f22ede91000
mmap(0x7f22ededd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0x1dc000) = 0x7f22ededd000
mmap(0x7f22edee3000, 33688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f22edee3000
close(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22edcfe000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3</usr/lib/locale/locale-archive>
newfstatat(3</usr/lib/locale/locale-archive>, "", {st_mode=S_IFREG|0644, st_size=6055600, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 6055600, PROT_READ, MAP_PRIVATE, 3</usr/lib/locale/locale-archive>, 0) = 0x7f22ed737000
close(3</usr/lib/locale/locale-archive>) = 0
fstat(1</dev/null<char 1:3>>, {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}) = 0
openat(AT_FDCWD, "hello", O_RDONLY) = 3</home/florian/hello>
fstat(3</home/florian/hello>, {st_mode=S_IFREG|0664, st_size=6, ...}) = 0
fadvise64(3</home/florian/hello>, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22edef5000
read(3</home/florian/hello>, "world\n", 131072) = 6
write(1</dev/null<char 1:3>>, "world\n", 6) = 6
read(3</home/florian/hello>, "", 131072) = 0
close(3</home/florian/hello>) = 0
close(1</dev/null<char 1:3>>) = 0
close(2</dev/pts/5<char 136:5>>) = 0
+++ exited with 0 +++
I think overloading open, seek and read is a good solution. But just FYI if you want to parse and analyze the strace output programmatically, I did something similar before and put my code in github: https://github.com/johnlcf/Stana/wiki
(I did that because I have to analyze the strace result of program ran by others, which is not easy to ask them to do LD_PRELOAD.)
Probably the least ugly way to do this is to use fanotify. Fanotify is a Linux kernel facility that allows cheaply watching filesystem events. I'm not sure if it allows filtering by PID, but it does pass the PID to your program so you can check if it's the one you're interested in.
Here's a nice code sample:
http://bazaar.launchpad.net/~pitti/fatrace/trunk/view/head:/fatrace.c
However, it seems to be under-documented at the moment. All the docs I could find are http://www.spinics.net/lists/linux-man/msg02302.html and http://lkml.indiana.edu/hypermail/linux/kernel/0811.1/01668.html
Parsing command-line utils like strace is cumbersome; you could use ptrace() syscall instead. See man ptrace for details.

Resources