Measuring syscall overhead in Linux

I have been searching for an appropriate method to measure the cost of various syscalls on Linux. Many questions have been raised about this topic in the past, but none provides a detailed description of how to measure the cost accurately. Most answers simply assert that a syscall costs 1-2 µs, or a few hundred cycles if the relevant code and data are hot in the CPU caches.
System calls overhead
Syscall overhead
The naive way I can think of to measure a syscall's cost is to bracket a cheap syscall such as getpid() with the rdtscp instruction. However, this is not sufficient for measuring the cost of open(), read() or write() accurately. I could modify the kernel and insert timer code around these functions, but that would require kernel changes, which I want to avoid. I wonder if there is a simpler solution that would let me measure it from user space.
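For concreteness, here is a minimal sketch of that naive approach (my illustration, not code from the original question): it brackets one direct getpid() syscall with rdtscp, assuming x86-64 and the GCC/Clang intrinsic.

    /* Hypothetical sketch: bracket one syscall with rdtscp (x86-64, GCC/Clang). */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <x86intrin.h>          /* __rdtscp */

    int main(void)
    {
        unsigned aux;
        unsigned long long t0 = __rdtscp(&aux);  /* waits for earlier instructions to finish */
        syscall(SYS_getpid);                     /* direct syscall, bypasses any libc caching */
        unsigned long long t1 = __rdtscp(&aux);
        printf("getpid() took roughly %llu reference cycles\n", t1 - t0);
        return 0;
    }

A single measurement like this is noisy; as the answer below recommends, it is better to average over a loop.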
Update (July 14):
After a lot of searching, I found the libMicro benchmark suite from Red Hat: https://github.com/redhat-performance/libMicro
However, it was created quite a while ago and I am wondering how good it still is. Of course, it does not use rdtscp, which adds some measurement error. Is there anything else missing from this benchmark suite?

strace and perf are the tools generally used to trace and measure these kinds of (kernel) operations. More specifically, perf can be used to generate flame graphs, letting you see detailed in-kernel call chains. However, remember that the appropriate permissions need to be configured via /proc/sys/kernel/perf_event_paranoid.
I advise you to put the syscall in a loop, since precisely measuring the cost of a single syscall, with possibly delayed/asynchronous work handed off to kernel threads, is either very hard to do from user space or simply inaccurate (on a non-customized kernel); see the sketch at the end of this answer.
Additional information:
strace works at microsecond granularity. Some of the POSIX clocks (see clock_gettime) can reach a granularity of about 100 ns. Beyond that, rdtscp is, AFAIK, one of the most accurate options (though you need to take care with the reference frequency). As for perf, it makes use of hardware performance counters and kernel events. You may need to configure your kernel so that trace-points can be generated and properly tracked by perf. perf can track one specific process or the whole system.
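As a concrete illustration of the loop-based approach suggested above, here is a minimal sketch (mine, not the answerer's): it times N zero-byte read() calls on /dev/null with clock_gettime(CLOCK_MONOTONIC) and reports the average per call. The iteration count and the target syscall are arbitrary choices.

    /* Hypothetical sketch: average syscall cost over a loop, using clock_gettime. */
    #include <stdio.h>
    #include <time.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define N 1000000L

    int main(void)
    {
        char buf[1];
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            read(fd, buf, 0);        /* zero-byte read: mostly pure syscall overhead */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("average read() cost: %.1f ns\n", ns / N);
        close(fd);
        return 0;
    }

Note that the loop average includes the loop overhead itself and hides per-call variance, which is exactly the trade-off described above.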

Related

Is linux perf accurate for measuring cache misses for multithread C program?

Can Linux perf measure cache misses for a multithreaded program, or can it only report the result for the main thread? I used it on a C program using pthreads, and the cache-miss count seemed lower than expected.
Yes, perf stat is an accurate total across all threads. (Unless your CPU has an erratum where a certain PMU event over or under-counts. These do happen, more often than correctness bugs for actual architectural state, so check the errata sheet, aka "spec update" for Intel CPUs.)
Make sure you understand exactly what each cache event counts, though, e.g. L1d-misses counts l1d.replacement on a modern Intel like Skylake, so multiple misses on the same line are only one replacement. (How does Linux perf calculate the cache-references and cache-misses events).
Also note that HW prefetch can avoid a lot of misses for sequential access, if memory can keep up. Also related: L2 instruction fetch misses much higher than L1 instruction fetch misses
Also related: Difference Between mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram Events goes into some detail about what exactly those specific events count.
Performance Counters for DRAM Accesses
What is the meaning of Perf events: dTLB-loads and dTLB-stores?
Hardware cache events and perf

What options do I have for running recurring events at microsecond resolution from a kernel driver?

I want to create a simulation of an actual device on an x86 Linux kernel. Part of this involves simulating timings as closely as I can. Based on some research, it seems I will need at least microsecond-resolution timing. I understand that on a non-realtime system it won't be possible to get perfect timing, but I don't need perfect, just as close as I can get, perhaps by hacking around with thread scheduling/preemption options.
What I actually want to do is perform an action at a fixed interval, i.e. run some code every X µs. I've been researching the best ways to do this from a kernel driver, as well as whether it's possible to do it reasonably accurately from user mode (keeping the above paragraph in mind). One of the first things that caught my eye was the HPET timer, which can be programmed to generate interrupts based on programmable comparators. Unfortunately, it seems to have been rather buggy on many chipsets in the past, and there is not much information on using it for anything other than obtaining a timestamp or using it as the main clock source. The Linux kernel provides an HPET driver that used to expose both kernel- and user-mode interfaces, but in more recent kernel versions seems to provide only a barely documented user-mode interface. I've also read about various other kernel functions and interfaces, such as the hrtimer interface and the various delay functions, though I'm having a bit of trouble understanding them and whether they are suited to my purpose.
Given my use case, what are the best options for running recurring events at µs resolution from, say, a kernel driver? Accuracy is probably my biggest criterion, with ease of use second.
Well, it's possible to achieve that accuracy in userspace -- clock_nanosleep is one good option, and it has both relative and absolute modes. Since clock_nanosleep is based on hrtimers in kernel mode, you may want to use the hrtimer interface if you'd like to implement this in kernel space.
However, to make the timer work accurately, there are two important things worth mentioning.
You should set the timer slack of your process to a small value (either by writing a nonzero value, in nanoseconds, to /proc/self/timerslack_ns or via prctl(PR_SET_TIMERSLACK, ...)). This value is treated as the 'tolerance' of the timer.
CPU power management also matters here. The CPU has several C-states, each with a different exit latency, so you need to configure the cpuidle subsystem not to use any C-state other than C0. On an Intel CPU, for example, you can simply write 1 to /sys/devices/system/cpu/cpu$c/cpuidle/state$s/disable to disable state $s of CPU $c, or add idle=poll to your kernel command line so the CPU stays active (in C0) while the kernel is idle. NOTE that this significantly increases the machine's power consumption and may make the cooling fans noisy.
If the two items above are configured correctly, you can get a timer whose delays stay under 10 microseconds. There is a trade-off between latency and power consumption that you will have to make.
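Here is a minimal userspace sketch of the points above (my own illustration, not the answerer's code): it shrinks the process timer slack with prctl(PR_SET_TIMERSLACK) and then sleeps on absolute deadlines with clock_nanosleep; the 100 µs period is an arbitrary example.

    /* Hypothetical sketch: periodic work every 100 us using absolute clock_nanosleep. */
    #include <time.h>
    #include <sys/prctl.h>

    #define PERIOD_NS 100000L        /* 100 us period (example value) */

    int main(void)
    {
        prctl(PR_SET_TIMERSLACK, 1L, 0, 0, 0);   /* 1 ns slack instead of the 50 us default */

        struct timespec next;
        clock_gettime(CLOCK_MONOTONIC, &next);

        for (int i = 0; i < 10000; i++) {
            next.tv_nsec += PERIOD_NS;
            if (next.tv_nsec >= 1000000000L) {   /* carry into seconds */
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            /* ... periodic work goes here ... */
        }
        return 0;
    }

Sleeping with TIMER_ABSTIME against precomputed deadlines avoids the drift that a relative sleep in a loop would accumulate. The C-state configuration from the list above is done outside the program (via sysfs or the kernel command line).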

Perf Imprecise Call-Graph Report

Recent Intel processors provide a hardware feature, a.k.a. Precise Event-Based Sampling (PEBS), to access precise information about the CPU state at occurrences of certain sampled CPU events (call such an event e). Here is an extract from the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3:
18.15.7 Processor Event-Based Sampling (PEBS)
The debug store (DS) mechanism in processors based on Intel NetBurst microarchitecture allows two types of information to be collected for use in debugging and tuning programs: PEBS records and BTS records.
Based on Chapter 17 of the same reference, the DS format for x86-64 architecture is as follows:
The BTS buffer records the last N executed branches (N depends on the microarchitecture), while the PEBS buffer records a snapshot of the architectural registers (the exact register fields are listed in the manual).
IIUC, a counter is set up and each occurrence of the event e increments its value. When the counter overflows, an entry is added to both of these buffers. Finally, when the buffers reach a certain size (BTS Absolute Maximum and PEBS Absolute Maximum), an interrupt is generated and the contents of the two buffers are dumped to disk. This happens periodically. It seems that the --call-graph dwarf backtrace data is also extracted in the same handler, right?
1) Does this mean that the LBR state and the PEBS state (--call-graph --lbr) match each other perfectly?
2) What about the --call-graph dwarf output, which is not part of PEBS (as seems clear from the above reference)? (Some RIP/RSP values do not match the backtrace.)
Specifically, here is an LKML thread where Milian Wolff shows that the answer to the second question is no, but I do not fully understand the reason.
The answer to the first question is also no (stated by Andi Kleen in the later messages of the thread), which I do not understand at all.
3) Does this mean that the whole DWARF call-graph information is completely corrupted?
The above thread does not show this, and in my experiments I do not see any RIP that fails to match its backtrace. In other words, can I trust the majority of the backtraces?
I would prefer not to use the LBR method, which may itself be imprecise and is also limited in backtrace depth. There is a patch to overcome the size limit, but it is recent and may be buggy.
UPDATE:
Is it possible to force perf to store only a single record in the PEBS buffer? Or can this configuration only be forced indirectly, e.g., when call-graph information is requested for a PEBS event?
The section of the manual you quoted talks about BTS, not LBR: they are not the same thing. Later in the same thread you quoted, Andi Kleen seems to indicate that the LBR snapshot time is actually the moment of the PMI (the interrupt that runs the handler) and not the PEBS moment. So I think all three stack approaches have the same problem.
DWARF stack captures definitely do not correspond exactly to the PEBS entry. The PEBS event is recorded by the hardware at runtime, and then only some time later is the CPU interrupted, at which point the stack is unwound. If the PEBS buffer is configured to hold only a single entry, these two things should at least be close and if you are lucky, the PEBS IP will be in the same function that is still at the top of the stack when the handler runs. In that case, the stack is basically correct. Since perf shows you the actual PEBS IP at the top, plus the frames below that from the capture, this ends up working in that case.
If you aren't lucky, the function will have changed between the PEBS capture and the handler running. In this case you get a franken-stack that doesn't make sense: the top function may not be callable from the second-from-the-top function (or something). It is not totally corrupted: it's just that everything except the top frame comes from a point after the PEBS stack was captured, and the top frame comes from PEBS, or something like that. This applies also to --call-graph fp, for the same reasons.
Most likely you never saw an invalid IP because perf shows the IP from the PEBS sample (that's the theme of that whole thread). I think if you look into the raw sample, you can see both the PEBS IP, and the handler IP, and you can see they usually won't match.
Overall, you can trust the backtraces for "time" or "cycles" profiling, since they are in some sense an accurate sampling representation of execution time: they just don't correspond to the PEBS moment but to some time slightly later (and why would that later time be any worse than the PEBS time?). Basically, for this type of profiling you don't really need PEBS at all.
If you are using a different type of event, and you want fine-grained accounting of where the event took place, that's what PEBS is for. You often don't need a stack trace: just the top frame is enough. If you want stack traces, use them, but know they come from a moment in time a bit later, or use --lbr (if that works).

Implementation of clock()

I was going through Stack Overflow threads on various mechanisms for computing the CPU time of a process.
How is clock() implemented internally? Does it use rdtsc()? (If so, it would be sensitive to migration between cores.)
Also, how is getrusage() implemented? Does it also depend on the TSC?
Thanks in advance
The kernel keeps track of CPU utilization for processes in units of ticks.
Both clock() and getrusage() are based on these.
Ticks are accumulated against processes by the kernel using a sampling method: the kernel receives a hardware clock interrupt and executes the clock handler, which charges the tick to the currently running process. At least, this is how it worked the last time I looked.
So rdtsc does not come into play at all, which is a good thing since rdtsc does not measure accurately across CPUs.
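To make this concrete, here is a small sketch (mine, not the answerer's) that burns some CPU time and then reads it back through both tick-based interfaces, clock() and getrusage(RUSAGE_SELF):

    /* Hypothetical sketch: reading process CPU time via clock() and getrusage(). */
    #include <stdio.h>
    #include <time.h>
    #include <sys/resource.h>

    int main(void)
    {
        volatile double x = 0.0;
        for (long i = 1; i < 50000000; i++)      /* burn some CPU time */
            x += 1.0 / (double)i;

        printf("clock():     %.3f s of CPU time\n", (double)clock() / CLOCKS_PER_SEC);

        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        printf("getrusage(): user %ld.%06ld s, sys %ld.%06ld s\n",
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
               (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        return 0;
    }

Both numbers come from the kernel's per-process accounting, which (as described above) was traditionally tick-sampled; newer kernels can use finer-grained accounting.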
You could easily glance at some libc code. Here is the time/ directory of musl-libc
In several libc implementations, some low-level timing syscalls go through the vDSO to avoid paying the cost of a real syscall (the round trip from user space to the kernel and back), and that path may in turn use RDTSC.
But I am surprised that you ask. If it is curiosity, just study the source code of a free-software implementation. Otherwise, trust the specifications and the implementations.
The gory details can be complex, since they are implementation- and system-specific. The real implementation can even be tuned dynamically at run time (e.g., through the vDSO set up by the kernel).

What is the lowest time resolution for somewhat accurate measurements of cpu usage?

Some of the things I want to measure are very short, and I can only repeat them so many times if I don't run any of the setup/teardown code in between.
Note: on Linux, I'm reading /proc/stat.
Not very portable, and you'll have to take great care to make it reliable, but the Time Stamp Counter definitely has the highest resolution available (it increments on every CPU clock tick).
The time stamp counter has, until recently, been an excellent high-resolution, low-overhead way of getting CPU timing information. With the advent of multi-core/hyperthreaded CPUs, systems with multiple CPUs, and "hibernating" operating systems, the TSC cannot be relied on to provide accurate results - unless great care is taken to correct the possible flaws: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. In such cases, programmers can only get reliable results by locking their code to a single CPU. Even then, the CPU speed may change due to power-saving measures taken by the OS or BIOS, or the system may be hibernated and later resumed (resetting the time stamp counter). In those latter cases, to stay relevant, the counter must be recalibrated periodically (according to the time resolution your application requires).
There are some notes about Linux-specific solutions on that page, too:
Under Linux, similar functionality is provided by reading the value of CLOCK_MONOTONIC clock using POSIX clock_gettime function.
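If you want to check what a given clock actually advertises, a small sketch like the following (my addition, not part of the quoted text) queries the resolution with clock_getres:

    /* Hypothetical sketch: print the advertised resolution of two POSIX clocks. */
    #include <stdio.h>
    #include <time.h>

    static void show(const char *name, clockid_t id)
    {
        struct timespec res;
        if (clock_getres(id, &res) == 0)
            printf("%-26s resolution: %ld ns\n", name,
                   (long)res.tv_sec * 1000000000L + res.tv_nsec);
    }

    int main(void)
    {
        show("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
        show("CLOCK_PROCESS_CPUTIME_ID", CLOCK_PROCESS_CPUTIME_ID);
        return 0;
    }

Keep in mind that the advertised resolution is not the same thing as the overhead of reading the clock or the accuracy of very short measurements.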

Resources