How should strace be used?

A colleague once told me that the last resort, when everything else has failed while debugging on Linux, is to use strace.
I tried to learn the science behind this strange tool, but I am not a system admin guru and I didn't get very far.
So,
What is it exactly and what does it do?
How and in which cases should it be used?
How should the output be understood and processed?
In brief, in simple words, how does this stuff work?

Strace Overview
strace can be seen as a lightweight debugger. It allows a programmer or user to quickly find out how a program is interacting with the OS. It does this by monitoring system calls and signals.
Uses
Good for when you don't have source code or don't want to be bothered to really go through it.
Also, useful for your own code if you don't feel like opening up GDB, but are just interested in understanding external interaction.
A good little introduction
Here is a gentle introduction to using strace to debug process hangs: strace introduction

In simple words, strace traces all system calls issued by a program along with their return codes. Think file and socket operations, plus many more obscure ones.
It is most useful if you have some working knowledge of C, since the system calls correspond closely to their standard C library wrappers.
Let's say your program is /usr/local/bin/cough. Simply use:
strace /usr/local/bin/cough <any required argument for cough here>
or
strace -o <out_file> /usr/local/bin/cough <any required argument for cough here>
to write into 'out_file'.
All strace output goes to stderr (beware, the sheer volume of it often calls for redirection to a file). In the simplest cases, your program will abort with an error and you'll be able to see what its last interactions with the OS were in the strace output.
More information should be available with:
man strace

strace lists all system calls done by the process it's applied to. If you don't know what system calls mean, you won't be able to get much mileage from it.
Nevertheless, if your problem involves files, paths, or environment values, running strace on the problematic program, redirecting the output to a file, and then grepping that file for your path/file/env string may help you see what your program is actually attempting to do, as distinct from what you expected it to do.

Strace stands out as a tool for investigating production systems where you can't afford to run these programs under a debugger. In particular, we have used strace in the following two situations:
Program foo seems to be in deadlock and has become unresponsive. This could be a target for gdb; however, we haven't always had the source code, or sometimes we were dealing with scripting languages that weren't straightforward to run under a debugger. In these cases, you run strace on the already running program and get the list of system calls being made. This is particularly useful if you are investigating a client/server application or an application that interacts with a database.
Investigating why a program is slow. In particular, we had just moved to a new distributed file system and the new throughput of the system was very slow. You can specify strace with the '-T' option which will tell you how much time was spent in each system call. This helped to determine why the file system was causing things to slow down.
For an example of analyzing using strace see my answer to this question.

I use strace all the time to debug permission issues. The technique goes like this:
$ strace -e trace=open,stat,read,write gnome-calculator
Where gnome-calculator is the command that you want to run.

strace -tfp PID will monitor the system calls of the process with that PID (with timestamps and following forks), so we can debug/monitor our process/program status.

Strace can be used as a debugging tool, or as a primitive profiler.
As a debugger, you can see how given system calls were called, executed, and what they return. This is very important, as it allows you to see not only that a program failed, but WHY it failed. Usually it's just a result of lousy coding that doesn't catch all the possible outcomes of a program. Other times it's hardcoded paths to files. Without strace you get to guess what went wrong, where, and how. With strace you get a breakdown of each syscall; usually just looking at the return value tells you a lot.
Profiling is another use. You can use it to time the execution of each syscall individually, or as an aggregate. While this might not be enough to fix your problems, it will at least greatly narrow down the list of potential suspects. If you see a lot of fopen/close pairs on a single file, you are probably opening and closing that file unnecessarily on every iteration of a loop, instead of opening and closing it once outside the loop, as sketched below.
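For example (a made-up sketch, not taken from any particular program), the open-per-iteration pattern and its fix look roughly like this:
#include <stdio.h>

/* wasteful: ltrace/strace shows an open/close pair on the same file
   for every iteration of the loop */
static void log_items_bad(const char *path, int n) {
    for (int i = 0; i < n; i++) {
        FILE *f = fopen(path, "a");
        if (!f)
            return;
        fprintf(f, "item %d\n", i);
        fclose(f);
    }
}

/* better: one open and one close around the whole loop */
static void log_items_good(const char *path, int n) {
    FILE *f = fopen(path, "a");
    if (!f)
        return;
    for (int i = 0; i < n; i++)
        fprintf(f, "item %d\n", i);
    fclose(f);
}

int main(void) {
    log_items_bad("/tmp/demo.log", 3);  /* hypothetical path */
    log_items_good("/tmp/demo.log", 3);
    return 0;
}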
ltrace is strace's close cousin, and also very useful. You must learn to identify where your bottleneck is. If the total execution takes 8 seconds and only 0.05 s is spent on system calls, then stracing the program is not going to do you much good; the problem is in your code, which is usually a logic problem, or the program actually needs to take that long to run.
The biggest problem with strace/ltrace is reading their output. If you don't know how the calls are made, or at least the names of the syscalls/functions, it's going to be difficult to decipher the meaning. Knowing what the functions return can also be very beneficial, especially for the different error codes. While it's a pain to decipher, the output sometimes really returns a pearl of knowledge; once I saw a situation where I had run out of inodes but not out of free space, so all the usual utilities gave me no warning, yet I couldn't create a new file. Reading the error code from strace's output pointed me in the right direction.

Minimal runnable example
If a concept is not clear, there is a simpler example that you haven't seen that explains it.
In this case, that example is the Linux x86_64 assembly freestanding (no libc) hello world:
hello.S
.text
.global _start
_start:
    /* write */
    mov $1, %rax   /* syscall number (1 = write) */
    mov $1, %rdi   /* stdout */
    mov $msg, %rsi /* buffer */
    mov $len, %rdx /* buffer length */
    syscall

    /* exit */
    mov $60, %rax  /* syscall number (60 = exit) */
    mov $0, %rdi   /* exit status */
    syscall
msg:
    .ascii "hello\n"
len = . - msg
GitHub upstream.
Assemble and run:
as -o hello.o hello.S
ld -o hello.out hello.o
./hello.out
Outputs the expected:
hello
Now let's use strace on that example:
env -i ASDF=qwer strace -o strace.log -s999 -v ./hello.out arg0 arg1
cat strace.log
We use:
env -i ASDF=qwer to control the environment variables: https://unix.stackexchange.com/questions/48994/how-to-run-a-program-in-a-clean-environment-in-bash
-s999 -v to show fuller information on the logs
strace.log now contains:
execve("./hello.out", ["./hello.out", "arg0", "arg1"], ["ASDF=qwer"]) = 0
write(1, "hello\n", 6) = 6
exit(0) = ?
+++ exited with 0 +++
With such a minimal example, every single character of the output is self evident:
execve line: shows how strace executed hello.out, including CLI arguments and environment as documented at man execve
write line: shows the write system call that we made. 6 is the length of the string "hello\n".
= 6 is the return value of the system call, which as documented in man 2 write is the number of bytes written.
exit line: shows the exit system call that we've made. There is no return value, since the program quit!
More complex examples
The application of strace is of course to see which system calls complex programs are actually making, to help debug / optimize your program.
Notably, most system calls that you are likely to encounter in Linux have glibc wrappers, many of them from POSIX.
Internally, the glibc wrappers use inline assembly more or less like this: How to invoke a system call via sysenter in inline assembly?
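As a small aside (my own sketch, not part of the upstream examples), glibc also exposes a generic syscall() wrapper, so you can issue the same system call without its dedicated wrapper and watch it show up identically in strace:
#define _GNU_SOURCE
#include <sys/syscall.h> /* SYS_write */
#include <unistd.h>      /* write, syscall */

int main(void) {
    /* both of these appear in strace as: write(1, "hi\n", 3) = 3 */
    write(1, "hi\n", 3);                      /* dedicated glibc wrapper */
    syscall(SYS_write, 1, "hi\n", (size_t)3); /* generic wrapper */
    return 0;
}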
The next example you should study is a POSIX write hello world:
main.c
#define _XOPEN_SOURCE 700
#include <unistd.h>

int main(void) {
    char *msg = "hello\n";
    write(1, msg, 6);
    return 0;
}
Compile and run:
gcc -std=c99 -Wall -Wextra -pedantic -o main.out main.c
./main.out
This time, you will see that a bunch of system calls are made by glibc before main to set up a nice environment for main.
This is because we are now not using a freestanding program, but rather a more common glibc program, which allows for libc functionality.
Then, at the very end, strace.log contains:
write(1, "hello\n", 6) = 6
exit_group(0) = ?
+++ exited with 0 +++
So we conclude that the write POSIX function uses, surprise!, the Linux write system call.
We also observe that return 0 leads to an exit_group call instead of exit. Ha, I didn't know about this one! This is why strace is so cool. man exit_group then explains:
This system call is equivalent to exit(2) except that it terminates not only the calling thread, but all threads in the calling process's thread group.
And here is another example where I studied which system call dlopen uses: https://unix.stackexchange.com/questions/226524/what-system-call-is-used-to-load-libraries-in-linux/462710#462710
Tested in Ubuntu 16.04, GCC 6.4.0, Linux kernel 4.4.0.

Strace is a tool that tells you how your application interacts with your operating system.
It does this by telling you what OS system calls your application uses and with what parameters it calls them.
So, for instance, you can see which files your program tries to open and whether each call succeeds.
You can debug all sorts of problems with this tool. For instance, if an application says that it cannot find a library that you know you have installed, strace will tell you where the application is looking for that file.
And that is just the tip of the iceberg.

strace is a good tool for learning how your program makes various system calls (requests to the kernel); it also reports the ones that have failed, along with the error value associated with that failure. Not all failures are bugs. For example, code that is trying to look up a file may get an ENOENT (No such file or directory) error, but that may be an acceptable scenario in the logic of the code.
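For illustration (the path below is made up), an ENOENT that the code handles deliberately shows up in strace as a failed call even though it is perfectly fine:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* strace would show something like:
       openat(AT_FDCWD, "/etc/myapp/override.conf", O_RDONLY) = -1 ENOENT (...) */
    int fd = open("/etc/myapp/override.conf", O_RDONLY); /* hypothetical path */
    if (fd == -1 && errno == ENOENT) {
        puts("no override file, using built-in defaults"); /* expected, not a bug */
        return 0;
    }
    if (fd == -1) {
        perror("open");
        return 1;
    }
    /* ... read the configuration ... */
    close(fd);
    return 0;
}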
One good use case for strace is debugging race conditions during temporary file creation. For example, a program that creates files by appending the process ID (PID) to some predetermined string may face problems in multi-threaded scenarios. [Using PID+TID (process ID + thread ID), or better, a function such as mkstemp, will fix this.]
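Here is a minimal sketch of the mkstemp approach (the template prefix is made up):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* the trailing XXXXXX is replaced with a unique suffix and the file is
       created exclusively, so concurrent threads/processes cannot race on
       the same name the way a "prefix + PID" scheme can */
    char path[] = "/tmp/myapp-XXXXXX"; /* hypothetical prefix */
    int fd = mkstemp(path);
    if (fd == -1) {
        perror("mkstemp");
        return 1;
    }
    write(fd, "scratch data\n", 13);
    close(fd);
    unlink(path); /* clean up when done */
    return 0;
}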
It is also good for debugging crashes. You may find this (my) article on strace and debugging crashes useful.

Here are some examples of how I use strace to dig into websites. I hope this is helpful.
Check for time to first byte like so:
time php index.php > timeTrace.txt
See what percentage of actions are doing what. Lots of lstat and fstat could be an indication that it's time to clear the cache:
strace -s 200 -c php index.php > traceLstat.txt
Output a full trace to Fulltrace.txt so you can see exactly which calls are being made:
strace -Tt -o Fulltrace.txt php index.php
Use this to check whether anything took between 0.1 and 0.9 seconds to load:
cat Fulltrace.txt | grep "[<]0.[1-9]" > traceSlowest.txt
See what missing files or directories got caught in the strace. This will output a lot of stuff involving our system - the only relevant bits involve the customer's files:
strace -vv php index.php 2>&1 | sed -n '/= -1/p' > traceFailures.txt

I liked some of the answers where it says strace shows you how a program interacts with the operating system.
This is exactly what we can see: the system calls. If you compare strace and ltrace, the difference becomes more obvious.
$>strace -c cd
Desktop Documents Downloads examples.desktop Music Pictures Public Templates Videos
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
0.00 0.000000 0 7 read
0.00 0.000000 0 1 write
0.00 0.000000 0 11 close
0.00 0.000000 0 10 fstat
0.00 0.000000 0 17 mmap
0.00 0.000000 0 12 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 3 brk
0.00 0.000000 0 2 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 2 ioctl
0.00 0.000000 0 8 8 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 2 getdents
0.00 0.000000 0 2 2 statfs
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 9 openat
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00 0.000000 93 10 total
On the other hand, there is ltrace, which traces library functions.
$>ltrace -c cd
Desktop Documents Downloads examples.desktop Music Pictures Public Templates Videos
% time seconds usecs/call calls function
------ ----------- ----------- --------- --------------------
15.52 0.004946 329 15 memcpy
13.34 0.004249 94 45 __ctype_get_mb_cur_max
12.87 0.004099 2049 2 fclose
12.12 0.003861 83 46 strlen
10.96 0.003491 109 32 __errno_location
10.37 0.003303 117 28 readdir
8.41 0.002679 133 20 strcoll
5.62 0.001791 111 16 __overflow
3.24 0.001032 114 9 fwrite_unlocked
1.26 0.000400 100 4 __freading
1.17 0.000372 41 9 getenv
0.70 0.000222 111 2 fflush
0.67 0.000214 107 2 __fpending
0.64 0.000203 101 2 fileno
0.62 0.000196 196 1 closedir
0.43 0.000138 138 1 setlocale
0.36 0.000114 114 1 _setjmp
0.31 0.000098 98 1 realloc
0.25 0.000080 80 1 bindtextdomain
0.21 0.000068 68 1 opendir
0.19 0.000062 62 1 strrchr
0.18 0.000056 56 1 isatty
0.16 0.000051 51 1 ioctl
0.15 0.000047 47 1 getopt_long
0.14 0.000045 45 1 textdomain
0.13 0.000042 42 1 __cxa_atexit
------ ----------- ----------- --------- --------------------
100.00 0.031859 244 total
Although I have checked the manuals several times, I haven't found the origin of the name strace, but it is likely "system-call trace", since that is the obvious reading.
There are three bigger points worth noting about strace.
Note 1: Both strace and ltrace use the ptrace system call, so ptrace is effectively how strace works.
The ptrace() system call provides a means by which one process (the
"tracer") may observe and control the execution of another process
(the "tracee"), and examine and change the tracee's memory and
registers. It is primarily used to implement breakpoint debugging
and system call tracing.
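To make that concrete, here is a minimal sketch (my own, not taken from the strace sources) of the PTRACE_TRACEME / PTRACE_SYSCALL loop that tools like strace are built around:
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s COMMAND [ARGS...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);     /* ask to be traced */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }
    int status;
    long stops = 0;
    waitpid(child, &status, 0);                    /* initial stop after exec */
    while (!WIFEXITED(status)) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL); /* run to next syscall boundary */
        waitpid(child, &status, 0);
        if (WIFSTOPPED(status))
            stops++;  /* a real tracer would read the registers here
                         (e.g. PTRACE_GETREGS) to decode the syscall */
    }
    printf("%s made roughly %ld system calls\n", argv[1], stops / 2);
    return 0;
}
Each system call stops the tracee twice (on entry and on exit), which is why the count is halved at the end; strace decodes the registers at each stop to print the call name, its arguments, and the return value.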
Note 2: There are different parameters you can use with strace, since its output can be very verbose. I like to experiment with -c, which gives a summary of everything. You can also restrict tracing to a single system call with -e trace=open, so that you see only that call. This can be interesting if you are examining which files are opened during the command you are tracing.
And of course, you can use grep for the same purpose, but note that strace writes to stderr, so you need to redirect it like 2>&1 | grep etc. to see, for example, which config files are read when the command is issued.
Note 3: I find this a very important note: you are not limited to a specific architecture. strace will blow your mind, since it can trace binaries built for different architectures.

Related

Measure LLC/L3 Cache Miss Rate on AMD Zen2 CPU

I have a question related to this one.
I want to (programatically) measure L3 Hits (Accesses) and Misses on an AMD EPYC 7742 CPU (Zen2). I run Linux Kernel 5.4.0-66-generic on Ubuntu Server 20.04.2 LTS. According to the question linked above, the events rFF04 (L3LookupState) and r0106 (L3CombClstrState) should represent the L3 accesses and misses, respectively. Furthermore, Kernel 5.4 should support these events.
However, when measuring it with perf, I run into issues. Similar to the question linked above, if I run numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, I only measure 0 values. If I try to use numactl -C 0 -m 0 perf stat -e instructions,cycles,amd_l3/r8001/,amd_l3/r0106/, perf complains about "unknown terms". If I use the perf event names, i.e. numactl -C 0 -m 0 perf stat -e instructions,cycles,l3_request_g1.caching_l3_cache_accesses,l3_comb_clstr_state.request_miss, perf outputs <not supported> for these events.
Furthermore, I actually want to measure this using perf's C API. Currently, I dispatch a perf_event_attr with type PERF_TYPE_RAW and config set to, e.g., 0x8001. How do I get the amd_l3 PMU stuff into my perf_event_attr object? Otherwise, it would be equivalent to numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, which is measuring undefined values.
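Roughly, the raw-event setup described above looks like the sketch below (my own reconstruction, not verified on Zen 2; the sysfs lookup mentioned in the comment is the generic mechanism for dynamic PMUs documented in perf_event_open(2)):
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no wrapper for perf_event_open */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;  /* the setup currently used in the question */
    attr.config = 0x8001;       /* raw encoding, e.g. r8001 */
    attr.disabled = 1;
    /* For a dynamic PMU such as amd_l3, attr.type would instead be the integer
       read from /sys/bus/event_source/devices/amd_l3/type (see perf_event_open(2)). */

    int fd = perf_event_open(&attr, 0, -1, -1, 0); /* this thread, any CPU */
    if (fd == -1) {
        perror("perf_event_open");
        return 1;
    }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the code to be measured ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count;
    read(fd, &count, sizeof(count));
    printf("count = %lld\n", count);
    close(fd);
    return 0;
}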
Thank you so much for your help.

Minimal executable size now 10x larger after linking than 2 years ago, for tiny programs?

For a university course, I like to compare the code sizes of functionally similar programs written and compiled with gcc/clang versus assembly. In the process of re-evaluating how to further shrink the size of some executables, I couldn't trust my eyes when the very same assembly code I assembled/linked 2 years ago has now grown more than 10x in size after building it again (which is true for multiple programs, not only helloworld):
$ make
as -32 -o helloworld-asm-2020.o helloworld-asm-2020.s
ld -melf_i386 -o helloworld-asm-2020 helloworld-asm-2020.o
$ ls -l
-rwxr-xr-x 1 xxx users 708 Jul 18 2018 helloworld-asm-2018*
-rwxr-xr-x 1 xxx users 8704 Nov 25 15:00 helloworld-asm-2020*
-rwxr-xr-x 1 xxx users 4724 Nov 25 15:00 helloworld-asm-2020-n*
-rwxr-xr-x 1 xxx users 4228 Nov 25 15:00 helloworld-asm-2020-n-sstripped*
-rwxr-xr-x 1 xxx users 604 Nov 25 15:00 helloworld-asm-2020.o*
-rw-r--r-- 1 xxx users 498 Nov 25 14:44 helloworld-asm-2020.s
The assembly code is:
.code32
.section .data
msg: .ascii "Hello, world!\n"
len = . - msg

.section .text
.globl _start
_start:
    movl $len, %edx  # EDX = message length
    movl $msg, %ecx  # ECX = address of message
    movl $1, %ebx    # EBX = file descriptor (1 = stdout)
    movl $4, %eax    # EAX = syscall number (4 = write)
    int $0x80        # call kernel by interrupt

    # and exit
    movl $0, %ebx    # return code is zero
    movl $1, %eax    # exit syscall number (1 = exit)
    int $0x80        # call kernel again
The same hello world program, compiled using GNU as and GNU ld (always using 32-bit assembly) was 708 bytes then, and has grown to 8.5K now. Even when telling the linker to turn off page alignment (ld -n), it still has almost 4.2K. stripping/sstripping doesn't pay off either.
readelf tells me that the start of section headers is much later in the code (byte 468 vs 8464), but I have no idea why. It's running on the same arch system as in 2018, the Makefile is the same and I'm not linking against any libraries (especially not libc). I guess something regarding ld has changed due to the fact that the object file is still quite small, but what and why?
Disclaimer: I'm building 32-bit executables on an x86-64 machine.
Edit: I'm using GNU binutils (as & ld) version 2.35.1. Here is a base64-encoded archive which includes the source and both executables (the small old one and the large new one):
cat << EOF | base64 -d | tar xj
QlpoOTFBWSZTWVaGrEQABBp////xebj/7//Xf+a8RP/v3/rAAEVARARAeEADBAAAoCAI0AQ+NAam
ytMpCGmpDVPU0aNpGmh6Rpo9QAAeoBoADQaNAADQ09IAACSSGUwaJpTNQGE9QZGhoADQPUAA0AAA
AA0aA4AAAABoAAAAA0GgAAAAZAGgAHAAAAANAAAAAGg0AAAADIA0AASJCBIyE8hHpqPVPUPU/VAa
fqn6o0ep6BB6TQaNGj0j1ABobU00yeU9JYiuVVZKYE+dKNa3wls6x81yBpGAN71NoylDUvNryWiW
E4ER8XkfpaJcPb6ND12ULEqkQX3eaBHP70Apa5uFhWNDy+U3Ekj+OLx5MtDHxQHQLfMcgCHrGayE
Dc76F4ZC4rcRkvTW4S2EbJAsbBGbQxSbx5o48zkyk5iPBBhJowtCSwDBsQBc0koYRSO6SgJNL0Bg
EmCoxCDAs5QkEmTGmQUgqZNIoxsmwDmDQe0NIDI0KjQ64leOr1fVk6AaVhjOAJjLrEYkYy4cDbyS
iXSuILWohNh+PA9Izk0YUM4TQQGEYNgn4oEjGmAByO+kzmDIxEC3Txni6E1WdswBJLKYiANdiQ2K
00jU/zpMzuIhjTbgiBqE24dZWBcNBBAAioiEhCQEIfAR8Vir4zNQZFgvKZa67Jckh6EHZWAWuf6Q
kGy1lOtA2h9fsyD/uPPI2kjvoYL+w54IUKBEEYFBIWRNCNpuyY86v3pNiHEB7XyCX5wDjZUSF2tO
w0PVlY2FQNcLQcbZjmMhZdlCGkVHojuICHMMMB5kQQSZRwNJkYTKz6stT/MTWmozDCcj+UjtB9Cf
CUqAqqRlgJdREtMtSO4S4GpJE2I/P8vuO9ckqCM2+iSJCLRWx2Gi8VSR8BIkVX6stqIDmtG8xSVU
kk7BnC5caZXTIynyI0doXiFY1+/Csw2RUQJroC0lCNiIqVVUkTqTRMYqKNVGtCJ5yfo7e3ZpgECk
PYUEihPU0QVgfQ76JA8Eb16KCbSzP3WYiVApqmfDhUk0aVc+jyBJH13uKztUuva8F4YdbpmzomjG
kSJmP+vCFdKkHU384LdRoO0LdN7VJlywJ2xJdM+TMQ0KhMaicvRqfC5pHSu+gVDVjfiss+S00ikI
DeMgatVKKtcjsVDX09XU3SzowLWXXunnFZp/fP3eN9Rj1ubiLc0utMl3CUUkcYsmwbKKrWhaZiLO
u67kMSsW20jVBcZ5tZUKgdRtu0UleWOs1HK2QdMpyKMxTRHWhhHwMnVEsWIUEjIfFEbWhRTRMJXn
oIBSEa2Q0llTBfJV0LEYEQTBTFsDKIxhgqNwZB2dovl/kiW4TLp6aGXxmoIpVeWTEXqg1PnyKwux
caORGyBhTEPV2G7/O3y+KeAL9mUM4Zjl1DsDKyTZy8vgn31EDY08rY+64Z/LO5tcRJHttMYsz0Fh
CRN8LTYJL/I/4u5IpwoSCtDViIA=
EOF
Update:
When using ld.gold instead of ld.bfd (to which /usr/bin/ld is symlinked to by default), the executable size becomes as small as expected:
$ cat Makefile
TARGET=helloworld
all:
as -32 -o ${TARGET}-asm.o ${TARGET}-asm.s
ld.bfd -melf_i386 -o ${TARGET}-asm-bfd ${TARGET}-asm.o
ld.gold -melf_i386 -o ${TARGET}-asm-gold ${TARGET}-asm.o
rm ${TARGET}-asm.o
$ make -q
$ ls -l
total 68
-rw-r--r-- 1 eso eso 200 Dec 1 13:57 Makefile
-rwxrwxr-x 1 eso eso 8700 Dec 1 13:57 helloworld-asm-bfd
-rwxrwxr-x 1 eso eso 732 Dec 1 13:57 helloworld-asm-gold
-rw-r--r-- 1 eso eso 498 Dec 1 13:44 helloworld-asm.s
Maybe I just used gold previously without being aware.
It's not 10x in general; it's page alignment of a couple of sections, as Jester says, due to changes to ld's default linker script for security reasons:
First change: Making sure data from .data isn't present in any of the mapping of .text, so none of that static data is available for ROP / Spectre gadgets in an executable page. (In older ld, that meant the program-headers mapped the same disk-block twice, also into a RW-without-exec segment for the actual .data section. The executable mapping was still read-only.)
More recent change: Separate .rodata from .text into separate segments, again so static data isn't mapped into an executable page. Previously, const char code[]= {...} could be cast to a function pointer and called, without needing mprotect or gcc -z execstack or other tricks, if you wanted to test shellcode that way. (A separate Linux kernel change made -z execstack only apply to the actual stack, not READ_IMPLIES_EXEC.)
See Why an ELF executable could have 4 LOAD segments? for this history, including the strange fact that .rodata is in a separate segment from the read-only mapping for access to the ELF metadata.
That extra space is just 00 padding and will compress well in a .tar.gz or whatever.
So it has a worst-case upper bound of about 2x 4k extra pages of padding, and tiny executables are close to that worst case.
gcc -Wl,--nmagic will turn off page-alignment of sections if you want that for some reason. (see the ld(1) man page) I don't know why that doesn't pack everything down to the old size. Perhaps checking the default linker script would shed some light, but it's pretty long. Run ld --verbose to see it.
stripping won't help for padding that's part of a section; I think it can only remove whole sections.
ld -z noseparate-code uses the old layout, only 2 total segments to cover the .text and .rodata sections, and the .data and .bss sections. (And the ELF metadata that dynamic linking wants access to.)
Related:
Linking with gcc instead of ld
This question is about ld, but note that if you're using gcc -nostdlib, that used to also default to making a static executable. But modern Linux distros configure GCC with -pie as the default, and GCC won't make a static-pie by default even if there aren't any shared libraries being linked, unlike -no-pie mode, where it will simply make a static executable in that case. (A static-pie still needs startup code to apply relocations for any absolute addresses.)
So the equivalent of ld directly is gcc -nostdlib -static (which implies -no-pie). Or gcc -nostdlib -no-pie should let it default to -static when there are no shared libs being linked. You can combine this with -Wl,--nmagic and/or -Wl,-z -Wl,noseparate-code.
Also:
A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux - eventually making a 45 byte executable, with the machine code for an _exit syscall stuffed into the ELF program header itself.
FASM can make quite small executables, using its mode where it outputs a static executable (not object file) directly with no ELF section metadata, just program headers. (It's a pain to debug with GDB or disassemble with objdump; most tools assume there will be section headers, even though they're not needed to run static executables.)
What is a reasonable minimum number of assembly instructions for a small C program including setup?
What's the difference between "statically linked" and "not a dynamic executable" from Linux ldd? (static vs. static-pie vs. (dynamic) PIE that happens to have no shared libraries.)

How to monitor number of syscalls executed by kernel?

I need to monitor the number of system calls executed by Linux.
I'm aware that vmstat is able to show this for BSD and AIX systems, but not for Linux (according to the man page).
Is there any counter in /proc? Or is there any other way to monitor it?
I wrote a simple SystemTap script (based on syscalls_by_pid.stp).
It produces output like this:
ProcessName #SysCalls
munin-graph 38609
munin-cron 8160
fping 4502
check_http_demo 2584
check_nrpe 2045
sh 1836
nagios 886
sendmail 747
smokeping 649
check_http 571
check_nt 376
pcscd 216
ping 108
check_ping 100
crond 87
stapio 69
init 56
syslog-ng 27
sshd 17
ntpd 9
hp-asrd 8
hald-addon-stor 7
automount 6
httpd 4
stap 3
flow-capture 2
gam_server 2
Total 61686
The script itself:
#! /usr/bin/env stap
#
# Print the system call count by process name in descending order.
#
global syscalls

probe begin {
    print ("Collecting data... Type Ctrl-C to exit and display results\n")
}

probe syscall.* {
    syscalls[execname()]++
}

probe end {
    printf ("%-20s %-s\n\n", "ProcessName", "#SysCalls")
    summary = 0
    foreach (procname in syscalls-) {
        printf("%-20s %-10d\n", procname, syscalls[procname])
        summary = summary + syscalls[procname]
    }
    printf ("\n%-20s %-d\n", "Total", summary)
}
You can use ptrace, as Jeff Foster said, to trace the system calls.
Also, you can use strace and ltrace
strace - trace system calls and signals
ltrace - A library call tracer
You can use ptrace to monitor all syscalls (see here)
I believe OProfile can do this.
I am not aware of a centralized way to monitor syscalls throughout the entire OS. Maybe do a ptrace on the init process and follow all children? But I don't know if that will work.
Your best bet is to write a patch to the kernel itself to do this. The closest thing to this that I've seen is a cgroup implementation for enforcing permissions on what syscalls can be executed at runtime. You can find the patch here:
https://github.com/luksow/syscalls-cgroup
It shouldn't be too much more work to throw a counter in there, from a kernel programming perspective.

Visualising strace output

Is there a simple tool, or maybe a method, to turn strace output into something that can be visualised or is otherwise easier to sift through? I have to figure out where an application is going wrong, but stracing it produces massive amounts of data. Trying to trace what this application and its threads are doing (or trying to do) on a larger scale is proving very difficult to do by reading every system call.
I have no budget for anything, and we are a pure Linux shop.
If your problem is a network one, you could try to limit the strace output to the network related syscalls with a
strace -e trace=network your_program
Yes, use the -c parameter to count the time, calls, and errors for each syscall and report a summary in the form of a table, e.g.
$ strace -c -fp $(pgrep -n php)
Process 11208 attached
^CProcess 11208 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
83.78 0.112292 57 1953 152 stat
7.80 0.010454 56 188 lstat
7.79 0.010439 28 376 access
0.44 0.000584 0 5342 32 recvfrom
0.15 0.000203 0 3985 sendto
0.04 0.000052 0 27184 gettimeofday
0.00 0.000000 0 6 write
0.00 0.000000 0 3888 poll
------ ----------- ----------- --------- --------- ----------------
100.00 0.134024 42922 184 total
This can identify the problem without having to analyse a large amount of data.
Another way is to filter by specific syscalls (such as recvfrom/sendto) to visualize the data received and sent, for example when debugging a PHP process:
strace -e recvfrom,sendto -fp $(pgrep -n php) -s 1000 2>&1 | while read -r line; do
    printf "%b" $line;
done | strings
Related: How to parse strace in shell into plain text?

What is the significance of the numbers in the name of the flush processes for newer linux kernels?

I am running kernel 2.6.33.7.
Previously, I was running v2.6.18.x. On 2.6.18, the flush processes were named pdflush.
After upgrading to 2.6.33.7, the flush processes have a format of "flush-<major>:<minor>".
For example, currently I see flush process "flush-8:32" popping up in top.
In doing a google search to try to determine an answer to this question, I saw examples of "flush-8:38", "flush-8:64" and "flush-253:0" just to name a few.
I understand what the flush process itself does, my question is what is the significance of the numbers on the end of the process name? What do they represent?
Thanks
Those are the device numbers used to identify block devices. A kernel thread may be spawned to handle a particular device.
(On one of my systems, block devices are currently numbered as shown below. They may change from boot to boot or hotplug to hotplug.)
$ grep ^ /sys/class/block/*/dev
/sys/class/block/dm-0/dev:254:0
/sys/class/block/dm-1/dev:254:1
/sys/class/block/dm-2/dev:254:2
/sys/class/block/dm-3/dev:254:3
/sys/class/block/dm-4/dev:254:4
/sys/class/block/dm-5/dev:254:5
/sys/class/block/dm-6/dev:254:6
/sys/class/block/dm-7/dev:254:7
/sys/class/block/dm-8/dev:254:8
/sys/class/block/dm-9/dev:254:9
/sys/class/block/loop0/dev:7:0
/sys/class/block/loop1/dev:7:1
/sys/class/block/loop2/dev:7:2
/sys/class/block/loop3/dev:7:3
/sys/class/block/loop4/dev:7:4
/sys/class/block/loop5/dev:7:5
/sys/class/block/loop6/dev:7:6
/sys/class/block/loop7/dev:7:7
/sys/class/block/md0/dev:9:0
/sys/class/block/md1/dev:9:1
/sys/class/block/sda/dev:8:0
/sys/class/block/sda1/dev:8:1
/sys/class/block/sda2/dev:8:2
/sys/class/block/sdb/dev:8:16
/sys/class/block/sdb1/dev:8:17
/sys/class/block/sdb2/dev:8:18
/sys/class/block/sdc/dev:8:32
/sys/class/block/sdc1/dev:8:33
/sys/class/block/sdc2/dev:8:34
/sys/class/block/sdd/dev:8:48
/sys/class/block/sdd1/dev:8:49
/sys/class/block/sdd2/dev:8:50
/sys/class/block/sde/dev:8:64
/sys/class/block/sdf/dev:8:80
/sys/class/block/sdg/dev:8:96
/sys/class/block/sdh/dev:8:112
/sys/class/block/sdi/dev:8:128
/sys/class/block/sr0/dev:11:0
/sys/class/block/sr1/dev:11:1
/sys/class/block/sr2/dev:11:2
You should also be able to figure this out by searching for those numbers in /proc/self/mountinfo, e.g.:
$ grep 8:32 /proc/self/mountinfo
25 22 8:32 / /var rw,relatime - ext4 /dev/mapper/sysvg-var rw,barrier=1,data=ordered
This has the side benefit of working with nfs as well:
$ grep 0:73 /proc/self/mountinfo
108 42 0:73 /foo /mnt/foo rw,relatime - nfs host.domain.com:/volume/path rw, ...
Note, the data I included here is fabricated, but the mechanism works just fine.
