Reading from a Closed File Descriptor - linux

I traced open, read, close and dup system calls in gimp-2.8.22 using strace, with the following command:
strace -eread,openat,open,close,dup,dup2 gimp
In gimp, I opened an image named babr.jpg. The trace shows that this image was opened (file descriptor is 14), read and closed. But, immediately after that, the same file descriptor (14 is not opened after the last close) is used for reading. How is it possible?
Here is the relevant portion of trace:
read(14, "\371\331\25\233M\311j\261b\271\332\240\33\315d\234\340y\236\217\323\206(\214\270x2\303S\212\252\254"..., 4096) = 4096
read(14, "t\260\265fv<\243.5A\324\17\221+\36\207\265&+rL\247\343\366\372\236\353\353'\226\27\27"..., 4096) = 318
close(14) = 0
openat(AT_FDCWD, "/home/ahmad/Pictures/babr.jpg", O_RDONLY) = 14
read(14, "\377\330\377\340\0\20JFIF\0\1\1\1\1,\1,\0\0\377\355(\212Photosho"..., 4096) = 4096
close(14) = 0
openat(AT_FDCWD, "/opt/gimp-2.8.22/lib/gimp/2.0/plug-ins/file-jpeg", O_RDONLY) = 19
read(19, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P[\0\0\0\0\0\0"..., 4096) = 4096
close(19) = 0
close(20) = 0
read(19, "", 8) = 0
close(19) = 0
close(17) = 0
close(16) = 0
read(4, "\2\0\0\0\0\0\0\0", 16) = 8
Gtk-^[[1;32mMessage^[[0m: ^[[34m15:09:02.956^[[0m: Failed to load module "canberra-gtk-module"
read(14, "\0\0\0\5", 4) = 4
read(14, "\0\0\0\23", 4) = 4
read(14, "gimp-progress-init\0", 19) = 19
read(14, "\0\0\0\2", 4) = 4
I also checked this using Pin and found the same result.

The second file descriptor #14 is very likely a pipe between the plugin and Gimp (the handle being free has been reused). And you don't trace the creation of pipes.
From gimpplugin.c:
/* Open two pipes. (Bidirectional communication).
*/
if ((pipe (my_read) == -1) || (pipe (my_write) == -1))
{
gimp_message (plug_in->manager->gimp, NULL, GIMP_MESSAGE_ERROR,
"Unable to run plug-in \"%s\"\n(%s)\n\npipe() failed: %s",
gimp_object_get_name (plug_in),
gimp_file_get_utf8_name (plug_in->file),
g_strerror (errno));
return FALSE;
}

Related

Process killed by SIGHUP after read returns ERESTARTSYS

We have some application which calls a PHP script which connects to an Oracle DB to do certain things. :) This does not work out well sometimes.
We are now running the PHP part via strace from the beginning.
This is how it looks when everything works ok (everything works out, the DB connection is built, the query executed, the DB is again disconnected, etc.):
10:30:17.935486 connect(8, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.1.1.55")}, 16) = -1 EINPROGRESS (Operation now in progress)
10:30:17.935546 times(NULL) = 2908590046
10:30:17.935569 brk(0xda4000) = 0xda4000
10:30:17.935594 poll([{fd=8, events=POLLOUT}], 1, 60000) = 1 ([{fd=8, revents=POLLOUT}])
10:30:17.940338 getsockopt(8, SOL_SOCKET, SO_ERROR, [519270883345301504], [4]) = 0
10:30:17.940368 fcntl(8, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
10:30:17.940388 fcntl(8, F_SETFL, O_RDWR) = 0
10:30:17.940408 getsockname(8, {sa_family=AF_INET, sin_port=htons(62498), sin_addr=inet_addr("192.168.22.30")}, [16]) = 0
10:30:17.940437 getsockopt(8, SOL_SOCKET, SO_SNDBUF, [-4193870156763480064], [4]) = 0
10:30:17.940458 getsockopt(8, SOL_SOCKET, SO_RCVBUF, [-4193870156763409068], [4]) = 0
10:30:17.940483 setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0
10:30:17.940506 fcntl(8, F_SETFD, FD_CLOEXEC) = 0
10:30:17.940652 rt_sigaction(SIGPIPE, {0x1, ~[ILL ABRT BUS FPE SEGV USR2 TERM XCPU XFSZ SYS RTMIN RT_1], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7f7198b2b920}, {0x1, [PIPE], SA_RESTORER|SA_RESTART, 0x7f7198b2b920}, 8) = 0
10:30:17.940725 write(8, "\x00\xe8\x00\x00\x01\x00\x00\x00\x01\x3b\x01\x2c\x0c\x41\x20\x00\xff\xff\x7f\x08\x00\x00\x01\x00\x00\xa2\x00\x46\x00\x00\x08\x00"..., 232) = 232
10:30:17.940781 read(8, "\x00\x08\x00\x00\x0b\x00\x00\x00", 8208) = 8
10:30:17.974177 write(8, "\x00\xe8\x00\x00\x01\x00\x00\x00\x01\x3b\x01\x2c\x0c\x41\x20\x00\xff\xff\x7f\x08\x00\x00\x01\x00\x00\xa2\x00\x46\x00\x00\x08\x00"..., 232) = 232
10:30:17.974247 read(8, "\x00\x29\x00\x00\x02\x00\x00\x00\x01\x3b\x0c\x41\x00\x00\x00\x00\x01\x00\x00\x00\x00\x29\x51\x41\x00\x00\x00\x00\x00\x00\x00\x00"..., 8208) = 41
10:30:17.976465 write(8, "\x00\x00\x00\xa4\x06\x20\x00\x00\x00\x00\xde\xad\xbe\xef\x00\x9a\x00\x00\x00\x00\x00\x04\x00\x00\x04\x00\x03\x00\x00\x00\x00\x00"..., 164) = 164
....
This is how it looks when everything does not work ok:
10:23:24.888170 connect(8, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.1.1.55")}, 16) = -1 EINPROGRESS (Operation now in progress)
10:23:24.888241 times(NULL) = 2908548738
10:23:24.888263 brk(0xda4000) = 0xda4000
10:23:24.888287 poll([{fd=8, events=POLLOUT}], 1, 60000) = 1 ([{fd=8, revents=POLLOUT}])
10:23:24.889769 getsockopt(8, SOL_SOCKET, SO_ERROR, [519270883345301504], [4]) = 0
10:23:24.889807 fcntl(8, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
10:23:24.889827 fcntl(8, F_SETFL, O_RDWR) = 0
10:23:24.889845 getsockname(8, {sa_family=AF_INET, sin_port=htons(62473), sin_addr=inet_addr("192.168.22.30")}, [16]) = 0
10:23:24.889873 getsockopt(8, SOL_SOCKET, SO_SNDBUF, [-8374476973480591360], [4]) = 0
10:23:24.889892 getsockopt(8, SOL_SOCKET, SO_RCVBUF, [-8374476973480520364], [4]) = 0
10:23:24.889915 setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0
10:23:24.889936 fcntl(8, F_SETFD, FD_CLOEXEC) = 0
10:23:24.890062 rt_sigaction(SIGPIPE, {0x1, ~[ILL ABRT BUS FPE SEGV USR2 TERM XCPU XFSZ SYS RTMIN RT_1], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7f2ee24b4920}, {0x1, [PIPE], SA_
RESTORER|SA_RESTART, 0x7f2ee24b4920}, 8) = 0
10:23:24.890129 write(8, "\x00\xe8\x00\x00\x01\x00\x00\x00\x01\x3b\x01\x2c\x0c\x41\x20\x00\xff\xff\x7f\x08\x00\x00\x01\x00\x00\xa2\x00\x46\x00\x00\x08\x00"..., 232) = 232
10:23:24.890186 read(8, 0xd705a6, 8208) = ? ERESTARTSYS (To be restarted)
10:23:24.907853 --- SIGHUP (Hangup) # 0 (0) ---
10:23:24.908708 +++ killed by SIGHUP +++
This happens sometimes and the application (or at least the PHP script and the connection to the DB) just gets killed. That's bad.
What do you make of the above straces?
Can we tell who is killed by who?
Why would read() return ERESTARTSYS?
What does SIGHUP (Hangup) # 0 (0) tell us exactly?
Your process got sent a SIGHUP, which caused the normal action of exiting.
Can't tell who did it. Try a newer version of strace. From what I can tell, going all the way back to version 4.6 from 2011 it should display more information. The version of strace you are using is from prior to 2011 and the # 0 (0) supplies the PC of the process when the signal was received and the address associated with the signal from siginfo_t. Neither will tell you anything about this problem.
A newer version will supply something like this:
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=25064, si_uid=1000} ---
--- SIGHUP {si_signo=SIGHUP, si_code=SI_KERNEL} ---
This first is another process sending the SIGHUP. The second is one sent automatically because of certain events.
The latter can happen when the controlling terminal of the process closes or when the session leader exits because its terminal closed. If you determine it's the kernel sending the signal, then I'd look at your process while it's running and examine the "sid" and "tty" columns in the ps output. That will tell you the session leader and terminal responsible for causing the SIGHUP to be sent. Maybe sometimes your script has a controlling terminal and sometimes not?
The session leader would usually be the parent process that started your script, or the parent of that process, or the parent of that, etc. Looking at ps output and "sid" will tell you. If that leader process exits and has a controlling terminal, everything under it gets a SIGHUP. The way to solve this would be either have the leader not exit until the PHP process is finished, or at some point detach from that session or terminal. Usually a daemon or server process should not associated with a terminal. See daemon() and setsid().

Program locks up but NOT when run through strace

I am doing server-side rendering with wkhtmltoimage and each run is locking up at "88% loading" for 1-2 minutes. I decided to debug what was happening through strace but in a completely bizarre twist the program did NOT lock up. I've found this to be completely repeatable and consistent. Why on earth would strace make a program faster when by all rights the program should be slower ?!
Run without strace:
user#server:~/public_html/shapes$ time wkhtmltoimage --disable-smart-width --width 970 --format jpg '[THE URL]' '[THE PATH].jpg'
Loading page (1/2)
Rendering (2/2)
Done
real 1m45.724s
user 1m42.887s
sys 0m0.623s
Run WITH strace:
user#server:~/public_html/shapes$ time strace wkhtmltoimage --disable-smart-width --width 970 --format jpg '[THE URL]' '[THE PATH].jpg'
execve("/usr/local/bin/wkhtmltoimage", ["wkhtmltoimage", "--disable-smart-width", "--width", "970", "--format", "jpg", "[THE URL]"..., "[THE PATH]"...], [/* 21 vars */]) = 0
brk(0) = 0x311a000
...
exit_group(0) = ?
+++ exited with 0 +++
real 0m6.526s
user 0m0.582s
sys 0m0.377s
Server is private so I've redacted the URL and PATH but otherwise the output is correct. Also both runs create the correct output and I've cleared up temporary files to ensure it's not a caching issue. I've done these runs 10 times each to ensure it's not a random artifact but it happens consistently, THEREFORE the only logical conclusion is that strace is somehow changing the behaviour of wkhtmltoimage and I was really hoping somebody could tell me what. If I knew why strace makes the program not lockup I could probably find a solution.
Here is the hung process:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
772 sigapp 20 0 1485600 45760 14524 R 99.8 2.2 0:57.02 wkhtmltoimage
As root I can attach to the hung image using strace -p $(pidof wkhtmltoimage) and the result is:
gettimeofday({1436692542, 446246}, NULL) = 0
gettimeofday({1436692542, 556958}, NULL) = 0
gettimeofday({1436692542, 557161}, NULL) = 0
gettimeofday({1436692542, 659238}, NULL) = 0
gettimeofday({1436692542, 771381}, NULL) = 0
gettimeofday({1436692542, 771686}, NULL) = 0
gettimeofday({1436692542, 875783}, NULL) = 0
gettimeofday({1436692542, 987490}, NULL) = 0
gettimeofday({1436692542, 987781}, NULL) = 0
gettimeofday({1436692543, 84764}, NULL) = 0
...
System: Linux server 3.13.0-30-generic #55-Ubuntu SMP Fri Jul 4 21:40:53 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux (Running on 2 CPU Xen Guest)
Software: wkhtmltoimage 0.12.2.1 (with patched qt) <- installed from official website via .deb file
I've read this but not sure if relevant: Hung processes resume if attached to strace
UPDATE:
Tried with another program webkit2pdf and seeing lockups as well. This time strace output is different. Both programs use webkit.:
root#server:~# strace -p `pidof webkit2pdf`
Process 4699 attached
restart_syscall(<... resuming interrupted call ...>
On a successful run of strace wkhtmltoimage ... I can see similar calls but not NEARLY as many preceeded by an mmap:
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f9dbbbc1000
gettimeofday({1436696839, 361942}, NULL) = 0
gettimeofday({1436696839, 362254}, NULL) = 0
gettimeofday({1436696839, 362505}, NULL) = 0
gettimeofday({1436696839, 362787}, NULL) = 0
gettimeofday({1436696839, 363172}, NULL) = 0
gettimeofday({1436696839, 363568}, NULL) = 0
gettimeofday({1436696839, 363913}, NULL) = 0
mmap(NULL, 28672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f9db5000000
gettimeofday({1436696839, 364701}, NULL) = 0
mmap(NULL, 28672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f9db4ff9000
gettimeofday({1436696839, 365612}, NULL) = 0
gettimeofday({1436696839, 365956}, NULL) = 0
So per pvg's comment below this is most probably the locking mechanism for setAttribute working correctly under strace vs. not working normally.

odd delay in strace log when apache serving static content

Using Firefox's developer tools I noticed that static image loads from my site spent between 60-200ms in the "Connect" phase. I took an strace log and see where the delay happens, but am having trouble understanding why it's happening. Here is what I see in strace (log slightly edited to remove actual file paths):
30727 22:20:52.526972 getsockname(12, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "::ffff:192.168.100.62", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
30727 22:20:52.527203 fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
30727 22:20:52.527275 fcntl(12, F_SETFL, O_RDWR|O_NONBLOCK) = 0
30727 22:20:52.527419 read(12, "GET /images/blah/bla"..., 8000) = 540
30727 22:20:52.527737 brk(0x7f72665b7000) = 0x7f72665b7000
30727 22:20:52.527972 stat("/websites/imagesrepository/blah/blah/image.png", {st_mode=S_IFREG|0775, st_size=$
30727 22:20:52.528160 open("/.htaccess", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
30727 22:20:52.528254 open("/websites/.htaccess", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
30727 22:20:52.528335 open("/websites/imagesrepository/.htaccess", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
30727 22:20:52.528417 open("/websites/imagesrepository/blah/.htaccess", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
30727 22:20:52.528501 open("/websites/imagesrepository/blah/blah/.htaccess", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
30727 22:20:52.529090 stat("/websites/somefile.ext", 0x7f7266582d90) = -1 ENOENT (No such file or directory)
30727 22:20:52.529144 stat("/websites/someotherfile.ext", 0x7f72665812d0) = -1 ENOENT (No such file or directory)
30727 22:20:52.529431 open("/websites/imagesrepository/blah/blah/image.png", O_RDONLY|O_CLOEXEC) = 13
30727 22:20:52.529489 fcntl(13, F_GETFD) = 0x1 (flags FD_CLOEXEC)
30727 22:20:52.529519 fcntl(13, F_SETFD, FD_CLOEXEC) = 0
30727 22:20:52.529582 close(13) = 0
30727 22:20:52.529669 writev(12, [{"HTTP/1.1 304 Not Modified\r\nDate:"..., 209}], 1) = 209
30727 22:20:52.529773 write(10, "104.189.168.66 - - [04/Mar/2015:"..., 270) = 270
30727 22:20:52.529842 shutdown(12, 1 /* send */) = 0
30727 22:20:52.529905 poll([{fd=12, events=POLLIN}], 1, 2000) = 1 ([{fd=12, revents=POLLIN|POLLHUP}])
30727 22:20:52.687759 read(12, "", 512) = 0
30727 22:20:52.687849 close(12) = 0
30727 22:20:52.687951 read(8, 0x7fffc3eb081f, 1) = -1 EAGAIN (Resource temporarily unavailable)
30727 22:20:52.688062 semop(3145795, {{0, -1, SEM_UNDO}}, 1 <unfinished ...>
It looks like the delay occurs here:
30727 22:20:52.529905 poll([{fd=12, events=POLLIN}], 1, 2000) = 1 ([{fd=12, revents=POLLIN|POLLHUP}])
30727 22:20:52.687759 read(12, "", 512) = 0
File descriptor 12 is the socket the request came in on. So it looks like before closing the socket Apache polls for some event and then tries a read? Why? I have KeepAlive off in my config file (due to incompatibility with certain older mobile clients) if that matters.
EDITED TO ADD:
I dug into the Apache 2.2 source and this behavior seems to be intentional. The relevant method is ap_lingering_close() in server/connection.c:
/* we now proceed to read from the client until we get EOF, or until
* MAX_SECS_TO_LINGER has passed. the reasons for doing this are
* documented in a draft:
*
* http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt
*
* in a nutshell -- if we don't make this effort we risk causing
* TCP RST packets to be sent which can tear down a connection before
* all the response data has been sent to the client.
*/
#define SECONDS_TO_LINGER 2
AP_DECLARE(void) ap_lingering_close(conn_rec *c)
{
char dummybuf[512];
apr_size_t nbytes;
apr_time_t timeup = 0;
apr_socket_t *csd = ap_get_module_config(c->conn_config, &core_module);
if (!csd) {
return;
}
ap_update_child_status(c->sbh, SERVER_CLOSING, NULL);
#ifdef NO_LINGCLOSE
ap_flush_conn(c); /* just close it */
apr_socket_close(csd);
return;
#endif
/* Close the connection, being careful to send out whatever is still
* in our buffers. If possible, try to avoid a hard close until the
* client has ACKed our FIN and/or has stopped sending us data.
*/
/* Send any leftover data to the client, but never try to again */
ap_flush_conn(c);
if (c->aborted) {
apr_socket_close(csd);
return;
}
/* Shut down the socket for write, which will send a FIN
* to the peer.
*/
if (apr_socket_shutdown(csd, APR_SHUTDOWN_WRITE) != APR_SUCCESS
|| c->aborted) {
apr_socket_close(csd);
return;
}
/* Read available data from the client whilst it continues sending
* it, for a maximum time of MAX_SECS_TO_LINGER. If the client
* does not send any data within 2 seconds (a value pulled from
* Apache 1.3 which seems to work well), give up.
*/
apr_socket_timeout_set(csd, apr_time_from_sec(SECONDS_TO_LINGER));
apr_socket_opt_set(csd, APR_INCOMPLETE_READ, 1);
/* The common path here is that the initial apr_socket_recv() call
* will return 0 bytes read; so that case must avoid the expensive
* apr_time_now() call and time arithmetic. */
do {
nbytes = sizeof(dummybuf);
if (apr_socket_recv(csd, dummybuf, &nbytes) || nbytes == 0)
break;
if (timeup == 0) {
/*
* First time through;
* calculate now + 30 seconds (MAX_SECS_TO_LINGER).
*
* If some module requested a shortened waiting period, only wait
* for 2s (SECONDS_TO_LINGER). This is useful for mitigating
* certain DoS attacks.
*/
if (apr_table_get(c->notes, "short-lingering-close")) {
timeup = apr_time_now() + apr_time_from_sec(SECONDS_TO_LINGER);
}
else {
timeup = apr_time_now() + apr_time_from_sec(MAX_SECS_TO_LINGER);
}
continue;
}
} while (apr_time_now() < timeup);
apr_socket_close(csd);
return;
}
Still: is it normal for this approach to add 50-150ms to every request? That seems excessive.
EDITED AGAIN TO ADD:
Based on this ancient discussion among Apache devs:
http://webmail.dev411.com/t/apache/dev/9727khxsk3/apache-1-2b7-dev-performance/oldest
and specifically this comment from Roy T. Fielding:
http://webmail.dev411.com/t/apache/dev/9727khxsk3/apache-1-2b7-dev-performance/oldest#19970209aph7g21hj2jepa04nk28ywj6cr
it sounds like this may be a bug. If KeepAlive is Off then it sounds like there's no need to do a lingering close because the client has already been told the connection won't be persistent.

Allocating Memory to a recursive function

I wrote a simple program as below and straced it.
#include<stdio.h>
int foo(int i)
{
int k=9;
if(i==10)
return 1;
else
foo(++i);
open("1",1);
}
int main()
{
foo(1);
}
My intention in doing so was to checkout how is memory allocated for the variables (int k in this case) in a function on a stack. I used an open system call as a marker. The output of strace was as below:
execve("./a.out", ["./a.out"], [/* 25 vars */]) = 0
brk(0) = 0x8653000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb777e000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=95172, ...}) = 0
mmap2(NULL, 95172, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7766000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000\226\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1734120, ...}) = 0
mmap2(NULL, 1743580, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb75bc000
mmap2(0xb7760000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a4) = 0xb7760000
mmap2(0xb7763000, 10972, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7763000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb75bb000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb75bb900, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xb7760000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0xb77a1000, 4096, PROT_READ) = 0
munmap(0xb7766000, 95172) = 0
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
open("1", O_WRONLY) = -1 ENOENT (No such file or directory)
exit_group(-1) = ?
Towards the end of the strace output you can see that no system call is being called in between the open system calls. So how is the memory allocated on to the stack , for the function being called , without a system call?
Stack memory for the main thread is allocated by the kernel during the execve() system call. During this call, other mappings defined in the executable file (and possibly also for the dynamic linker specified in the executable) are also setup. For ELF files, this is done in fs/binfmt_elf.c.
Stack memory for other threads is mmap()ed by the thread support library, which is usually part of the C runtime library.
You should also note that on virtual memory systems, the main thread stack is grown by the kernel in response to page faults, up to a configurable limit (shown by ulimit -s).
Your (single threaded) program stack size is fixed so there is no further allocation to expect.
You can query and increase this size with the ulimit -s command.
Note that even if you set this limit to "unlimited", there always will be a practical limit:
With 32 bit processes, unless you are low on RAM/swap, the virtual memory space limitation will cause address collisions
With 64 bit processes, memory (RAM + swap) exhaustion will thrash your system and eventually crash your program.
Whatever the case, there are never explicit system calls to expect that would increase the stack size, it is only set when the program starts.
Note also that the stack memory is handled exactly like heap memory, i.e. only the part of it that has been accessed is mapped to real memory (either RAM or swap). This means the stack kind of grows on demand but no other mechanism than standard virtual memory management is handling that.
Stack usage and allocation (at least on Linux) works this way:
A little bit of stack is allocated.
A guard range is setup after the "other" part of the program, at about 1/4 of the address space.
If the stack is used up to its top and above, the stack gets automatically increased.
This happens either if the ulimit limit is reached (and SIGSEGVs) or, if none such exists, until it hits the guard range (and then gets a SIGBUS).
Your program doesn't begin to make any open calls until the recursion "bottoms out". At that point, the stack is allocated, and it's just popping out of the nesting.
Why don't you step through it with a debugger.
Do you want to find out where variables are allocated to 'stack frames' created for functions?
I have revised your program to show you the memory address of your stack variable k, and a parameter variable kk,
//Show stack location for a variable, k
#include <stdio.h>
int foo(int i)
{
int k=9;
if(i>=10) //relax the condition, safer
return 1;
else
foo(++i);
open("1",1);
//return i;
}
int bar(int kk, int i)
{
int k=9;
printf("&k: %x, &kk: %x\n",&k,&kk); //address variable on stack, parameter
if(i<10) //relax the condition, safer
bar(k,++i);
else
return 1;
return k;
}
int main()
{
//foo(1);
bar(0,1);
}
And the output, on my system,
$ ./foo
&k: bfa8064c, &kk: bfa80660
&k: bfa8061c, &kk: bfa80630
&k: bfa805ec, &kk: bfa80600
&k: bfa805bc, &kk: bfa805d0
&k: bfa8058c, &kk: bfa805a0
&k: bfa8055c, &kk: bfa80570
&k: bfa8052c, &kk: bfa80540
&k: bfa804fc, &kk: bfa80510
&k: bfa804cc, &kk: bfa804e0
&k: bfa8049c, &kk: bfa804b0

How to trace per-file IO operations in Linux?

I need to track read system calls for specific files, and I'm currently doing this by parsing the output of strace. Since read operates on file descriptors I have to keep track of the current mapping between fd and path. Additionally, seek has to be monitored to keep the current position up-to-date in the trace.
Is there a better way to get per-application, per-file-path IO traces in Linux?
You could wait for the files to be opened so you can learn the fd and attach strace after the process launch like this:
strace -p pid -e trace=file -e read=fd
First, you probably don't need to keep track because mapping between fd and path is available in /proc/PID/fd/.
Second, maybe you should use the LD_PRELOAD trick and overload in C open, seek and read system call. There are some article here and there about how to overload malloc/free.
I guess it won't be too different to apply the same kind of trick for those system calls. It needs to be implemented in C, but it should take far less code and be more precise than parsing strace output.
systemtap - a kind of DTrace reimplementation for Linux - could be of help here.
As with strace you only have the fd, but with the scripting ability it is easy to maintain the filename for an fd (unless with fun stuff like dup). There is the example script iotime that illustates it.
#! /usr/bin/env stap
/*
* Copyright (C) 2006-2007 Red Hat Inc.
*
* This copyrighted material is made available to anyone wishing to use,
* modify, copy, or redistribute it subject to the terms and conditions
* of the GNU General Public License v.2.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*
* Print out the amount of time spent in the read and write systemcall
* when each file opened by the process is closed. Note that the systemtap
* script needs to be running before the open operations occur for
* the script to record data.
*
* This script could be used to to find out which files are slow to load
* on a machine. e.g.
*
* stap iotime.stp -c 'firefox'
*
* Output format is:
* timestamp pid (executabable) info_type path ...
*
* 200283135 2573 (cupsd) access /etc/printcap read: 0 write: 7063
* 200283143 2573 (cupsd) iotime /etc/printcap time: 69
*
*/
global start
global time_io
function timestamp:long() { return gettimeofday_us() - start }
function proc:string() { return sprintf("%d (%s)", pid(), execname()) }
probe begin { start = gettimeofday_us() }
global filehandles, fileread, filewrite
probe syscall.open.return {
filename = user_string($filename)
if ($return != -1) {
filehandles[pid(), $return] = filename
} else {
printf("%d %s access %s fail\n", timestamp(), proc(), filename)
}
}
probe syscall.read.return {
p = pid()
fd = $fd
bytes = $return
time = gettimeofday_us() - #entry(gettimeofday_us())
if (bytes > 0)
fileread[p, fd] += bytes
time_io[p, fd] <<< time
}
probe syscall.write.return {
p = pid()
fd = $fd
bytes = $return
time = gettimeofday_us() - #entry(gettimeofday_us())
if (bytes > 0)
filewrite[p, fd] += bytes
time_io[p, fd] <<< time
}
probe syscall.close {
if ([pid(), $fd] in filehandles) {
printf("%d %s access %s read: %d write: %d\n",
timestamp(), proc(), filehandles[pid(), $fd],
fileread[pid(), $fd], filewrite[pid(), $fd])
if (#count(time_io[pid(), $fd]))
printf("%d %s iotime %s time: %d\n", timestamp(), proc(),
filehandles[pid(), $fd], #sum(time_io[pid(), $fd]))
}
delete fileread[pid(), $fd]
delete filewrite[pid(), $fd]
delete filehandles[pid(), $fd]
delete time_io[pid(),$fd]
}
It only works up to a certain number of files because the hash map is size limited.
strace now has new options to track file descriptors:
--decode-fds=set
Decode various information associated with file descriptors. The default is decode-fds=none. set can include the following elements:
path Print file paths.
socket Print socket protocol-specific information,
dev Print character/block device numbers.
pidfd Print PIDs associated with pidfd file descriptors.
This is useful as file descriptors are reused after being closed, and /proc/$PID/fd only provides one snapshot in time, which is useless when debugging something in realtime.
Sample output, note how file names are displayed in angular brackets and FD 3 is reused for all of /etc/ld.so.cache, /lib/x86_64-linux-gnu/libc.so.6, /usr/lib/locale/locale-archive, /home/florian/hello.
$ strace -e trace=desc --decode-fds=all cat hello 1>/dev/null
execve("/usr/bin/cat", ["cat", "hello"], 0x7fff42e20710 /* 102 vars */) = 0
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3</etc/ld.so.cache>
newfstatat(3</etc/ld.so.cache>, "", {st_mode=S_IFREG|0644, st_size=167234, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 167234, PROT_READ, MAP_PRIVATE, 3</etc/ld.so.cache>, 0) = 0x7f22edeee000
close(3</etc/ld.so.cache>) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>
read(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\206\2\0\0\0\0\0"..., 832) = 832
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\6\0\0\0\4\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 48, 848) = 48
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0+H)\227\201T\214\233\304R\352\306\3379\220%"..., 68, 896) = 68
newfstatat(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "", {st_mode=S_IFREG|0755, st_size=1983576, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22edeec000
pread64(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, "\6\0\0\0\4\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0#\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2012056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0) = 0x7f22edd00000
mmap(0x7f22edd26000, 1486848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0x26000) = 0x7f22edd26000
mmap(0x7f22ede91000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0x191000) = 0x7f22ede91000
mmap(0x7f22ededd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3</usr/lib/x86_64-linux-gnu/libc-2.33.so>, 0x1dc000) = 0x7f22ededd000
mmap(0x7f22edee3000, 33688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f22edee3000
close(3</usr/lib/x86_64-linux-gnu/libc-2.33.so>) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22edcfe000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3</usr/lib/locale/locale-archive>
newfstatat(3</usr/lib/locale/locale-archive>, "", {st_mode=S_IFREG|0644, st_size=6055600, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 6055600, PROT_READ, MAP_PRIVATE, 3</usr/lib/locale/locale-archive>, 0) = 0x7f22ed737000
close(3</usr/lib/locale/locale-archive>) = 0
fstat(1</dev/null<char 1:3>>, {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}) = 0
openat(AT_FDCWD, "hello", O_RDONLY) = 3</home/florian/hello>
fstat(3</home/florian/hello>, {st_mode=S_IFREG|0664, st_size=6, ...}) = 0
fadvise64(3</home/florian/hello>, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22edef5000
read(3</home/florian/hello>, "world\n", 131072) = 6
write(1</dev/null<char 1:3>>, "world\n", 6) = 6
read(3</home/florian/hello>, "", 131072) = 0
close(3</home/florian/hello>) = 0
close(1</dev/null<char 1:3>>) = 0
close(2</dev/pts/5<char 136:5>>) = 0
+++ exited with 0 +++
I think overloading open, seek and read is a good solution. But just FYI if you want to parse and analyze the strace output programmatically, I did something similar before and put my code in github: https://github.com/johnlcf/Stana/wiki
(I did that because I have to analyze the strace result of program ran by others, which is not easy to ask them to do LD_PRELOAD.)
Probably the least ugly way to do this is to use fanotify. Fanotify is a Linux kernel facility that allows cheaply watching filesystem events. I'm not sure if it allows filtering by PID, but it does pass the PID to your program so you can check if it's the one you're interested in.
Here's a nice code sample:
http://bazaar.launchpad.net/~pitti/fatrace/trunk/view/head:/fatrace.c
However, it seems to be under-documented at the moment. All the docs I could find are http://www.spinics.net/lists/linux-man/msg02302.html and http://lkml.indiana.edu/hypermail/linux/kernel/0811.1/01668.html
Parsing command-line utils like strace is cumbersome; you could use ptrace() syscall instead. See man ptrace for details.

Resources