Recent Intel processors provide a hardware feature, Precise Event-Based Sampling (PEBS), to access precise information about the CPU state when a sampled CPU event (call it e) occurs. Here is an extract from the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3:
18.15.7 Processor Event-Based Sampling (PEBS)
The debug store (DS) mechanism in processors based on Intel NetBurst microarchitecture allow two types of information to be collected for use in debugging and tuning programs: PEBS records and BTS records.
Based on Chapter 17 of the same reference, the DS save area for the x86-64 architecture contains both a BTS buffer and a PEBS buffer.
The BTS buffer records the last N executed branches (N depends on the microarchitecture), while the PEBS buffer records a snapshot of the register state (RFLAGS, RIP and the general-purpose registers).
IIUC, a counter is set and each occurrence of the event (e) increments its value. When the counter overflows, an entry is added to both of these buffers. Finally, when the buffers reach a certain size (BTS Absolute Maximum and PEBS Absolute Maximum), an interrupt is generated and their contents are dumped to disk. This happens periodically. It seems that the --call-graph dwarf backtrace data is also extracted in the same handler, right?
1) Does this mean that the LBR and PEBS (--call-graph lbr) state match perfectly?
2) How about the --call-graph dwarf output, which is not part of PEBS (as seems clear from the above reference)? (Some RIP/RSP values do not match the backtrace.)
To be precise, here is an LKML thread where Milian Wolff shows that the answer to the second question is NO, but I do not fully understand the reason.
The answer to the first question is also NO (stated by Andi Kleen in later messages of the thread), which I do not understand at all.
3) Does this mean that the whole DWARF call-graph information is completely corrupted?
The above thread does not show this, and in my experiments I do not see any RIP not matching the backtrace. In other words, can I trust the majority of the backtraces?
I would rather not use the LBR method, which may itself be imprecise and is also limited in the size of the backtrace. (Here is a patch to overcome the size issue, but it is recent and may be buggy.)
UPDATE:
How can perf be forced to store only a single record in the PEBS buffer? Or is it only possible to force this configuration indirectly, e.g., when call-graph information is requested for a PEBS event?
The section of the manual you quoted talks about BTS, not LBR: they are not the same thing. Later in that same thread you quoted, Andi Kleen seems to indicate that the LBR snap time is actually the moment of the PMI (the interrupt that runs the handler) and not the PEBS moment. So I think all three stack approaches have the same problem.
DWARF stack captures definitely do not correspond exactly to the PEBS entry. The PEBS event is recorded by the hardware at runtime, and then only some time later is the CPU interrupted, at which point the stack is unwound. If the PEBS buffer is configured to hold only a single entry, these two things should at least be close and if you are lucky, the PEBS IP will be in the same function that is still at the top of the stack when the handler runs. In that case, the stack is basically correct. Since perf shows you the actual PEBS IP at the top, plus the frames below that from the capture, this ends up working in that case.
If you aren't lucky, the function will have changed between the PEBS capture and the handler running. In this case you get a franken-stack that doesn't make sense: the top function may not be callable from the second-from-the-top function (or something). It is not totally corrupted: it's just that everything except the top frame comes from a point after the PEBS stack was captured, and the top frame comes from PEBS, or something like that. This applies also to --call-graph fp, for the same reasons.
Most likely you never saw an invalid IP because perf shows the IP from the PEBS sample (that's the theme of that whole thread). I think if you look into the raw sample, you can see both the PEBS IP, and the handler IP, and you can see they usually won't match.
Overall, you can trust the backtraces for "time" or "cycle" profiling, since they are in some sense an accurate sampling representation of execution time: it's just that they don't correspond to the PEBS moment but to some time later (and why would that later time be any worse than the PEBS time?). Basically, for this type of profiling you don't really need PEBS at all.
If you are using a different type of event, and you want fine-grained accounting of where the event took place, that's what PEBS is for. You often don't need a stack trace: just the top frame is enough. If you want stack traces, use them, but know they come from a moment in time a bit later, or use --call-graph lbr (if that works).
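For reference, the precision that perf's :pp suffix requests is exposed through the perf_event_open(2) API as the precise_ip field of struct perf_event_attr, and PERF_SAMPLE_CALLCHAIN asks the kernel to record a call chain with each sample. Below is a minimal sketch of such a request (this is not the code perf itself uses, and the ring-buffer handling needed to actually read the samples is omitted):

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a "precise" cycles event: precise_ip selects the allowed skid
 * (2 corresponds to perf's ":pp"), and PERF_SAMPLE_CALLCHAIN asks the
 * kernel to store a call chain alongside each sample. */
int open_precise_cycles(pid_t pid)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100000;
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
    attr.precise_ip = 2;                 /* request zero skid (PEBS on Intel) */
    attr.exclude_kernel = 1;
    /* cpu = -1: any CPU; group_fd = -1: no group; flags = 0 */
    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

int main(void)
{
    int fd = open_precise_cycles(0);     /* 0 = the calling process */
    if (fd < 0)
        perror("perf_event_open");
    /* Reading the samples would require mmap()ing the ring buffer; omitted. */
    return 0;
}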
Related
This line appears under memory events in the perf tool.
CPU: Intel Xeon Gold
"Precise" events mean using PEBS instead of the traditional firing an interrupt when the counter overflows. Instead it writes a sample in a buffer to be collected later, so it can attribute it to the right instruction without pipeline / retirement effects delaying it (e.g. waiting until the currently-last instruction retires, I think to ensure forward progress, causing a "skid").
The PEBS buffer also gives it a place to put additional data, like an address associated with the event that triggered recording a sample.
https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR#processor-event-based-sampling-pebs
Also related, with discussion of PEBS details and how perf uses it for event:pp:
Good resources on how to program PEBS (Precise event based sampling) counters?
What is the difference between "cpu/mem-loads/pp" and "cpu/mem-loads/"?
Which perf events can use PEBS?
Perf shows L1-dcache-load-misses in a block with no memory access
Below is a block of code that perf record flags as responsible for 10% of all L1-dcache misses, but the block is entirely movement between zmm registers. This is the perf command string:
perf record -e L1-dcache-load-misses -c 10000 -a -- ./Program_to_Test.exe
The code block:
Round:
vmulpd zmm1,zmm0,zmm28
VCVTTPD2QQ zmm0{k7},zmm1
VCVTUQQ2PD zmm2{k7},zmm0
vsubpd zmm3,zmm1,zmm2
vmulpd zmm4,zmm3,zmm27
VCVTTPD2QQ zmm5{k7}{z},zmm4
VPCMPGTQ k2,zmm5,zmm26
VPCMPEQQ k3 {k7},zmm5,zmm26
KADDQ k1,k2,k3
VCVTQQ2PD zmm2{k7},zmm0
VDIVPD zmm1{k7},zmm2,zmm28 ; Divide by 100
VPXORQ zmm2{k7},zmm2,zmm2
vmovupd zmm2,zmm1
VADDPD zmm2{k1},zmm1,zmm25
I get similar results for that code block with other L1 measures such as l1d.replacement.
My question is, how can a block that is only zmm register movement generate L1 cache misses? I didn't think registers go to memory at all. In fact, the last memory access is 10 instructions above this block of code; the other 9 instructions are all register-to-register instructions.
The event L1-dcache-load-misses is mapped to L1D.REPLACEMENT on Sandy Bridge and later microarchitectures (or mapped to a similar event on older microarchitectures). This event doesn't support precise sampling, which means that a sample can point to an instruction that couldn't have generated the event being sampled. (Note that L1-dcache-load-misses is not supported on any current Atom.)
Starting with Linux 3.11 running on a Haswell+ or Silvermont+ microarchitecture, samples can be captured with eventing instruction pointers by specifying a sampling event that meets the following two conditions:
The event supports precise sampling. You can use, for example, any of the events that represent memory uop or instruction retirement. The exact names and meanings of the events depend on the microarchitecture. Refer to the Intel SDM Volume 3 for more information. There is no event that supports precise sampling and has the exact same meaning as L1D.REPLACEMENT. On processors that support Extended PEBS, only a subset of PEBS events support precise sampling.
The precise sampling level is enabled on the event. In Linux perf, this can be done by appending ":pp" to the event name or raw event encoding, or "pp" after the terminating slash of a raw event specified in the PMU syntax. For example, on Haswell, the event mem_load_uops_retired.l1_miss:pp can be specified to Linux perf.
With such an event, when the event counter overflows, the PEBS hardware is armed, which means that it's now looking for the earliest possible opportunity to collect a precise sample. When there is at least one instruction that will cause an event during this window of time, the PEBS hardware will eventually be triggered by one of these instructions, with a bias toward high-latency instructions. When the instruction that triggers PEBS retires, the PEBS microcode routine executes and captures a PEBS record, which contains, among other things, the IP of the instruction that triggered PEBS (which is different from the architectural IP). The instruction pointer (IP) used by perf to display the results is this eventing IP. (I noticed there can be a negligible number of samples pointing to instructions that couldn't have caused the event.)
On older microarchitectures (before Haswell and Silvermont), the "pp" precise sampling level is also supported. PEBS on these processors only captures the architectural IP, which points to the static instruction that immediately follows the PEBS triggering instruction in program order. Linux perf uses the LBR, if possible, which contains source-target IP pairs, to determine whether that captured IP is the target of a jump. If it is, it adds the source IP as the eventing IP to the sample record.
Some microarchitectures support one or more events with better sampling distribution (how much better depends on the microarchitecture, the event, the counter, and the instructions being executed at the time in which the counter is about to overflow). In Linux perf, precise distribution can be enabled, if supported, by specifying the precise level "ppp."
I have been searching for an appropriate method to measure the cost of various syscalls in Linux. Many questions related to this topic have been raised in the past, but none provides a detailed description of how to measure the cost accurately. Most of the answers arbitrarily claim that a syscall costs 1-2 us, or a few hundred cycles if it hits in the CPU caches.
System calls overhead
Syscall overhead
The naive way I can think of to measure the syscall cost is to use the rdtscp instruction around a syscall such as getpid(). However, this is insufficient for measuring the cost of open(), read() or write() calls accurately. I could modify the kernel and insert specific timer code around these functions to measure them, but that would require changes to the kernel, which I don't want to make. I wonder if there is a simpler solution that would allow me to measure it from userspace itself.
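For concreteness, the naive rdtscp measurement would look roughly like this (a sketch using GCC's __rdtscp intrinsic from x86intrin.h; a single measurement of this kind also includes the timing overhead itself and is very noisy):

#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>   /* __rdtscp */

int main(void)
{
    unsigned int aux;
    unsigned long long start, end;

    start = __rdtscp(&aux);   /* rdtscp waits for earlier instructions to finish */
    getpid();                 /* the syscall being "measured" */
    end = __rdtscp(&aux);

    printf("getpid() took ~%llu TSC ticks\n", end - start);
    return 0;
}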
Update: July 14:
After a lot of searching, I found the libMicro benchmark suite from Red Hat: https://github.com/redhat-performance/libMicro
However, it was created a while ago and I am wondering how good it still is. Of course, it does not use rdtscp, and that adds some measurement error. Is there anything else missing from this benchmark suite?
strace and perf are generally used to track and measure this kind of (kernel) operation. More specifically, perf can be used to generate flame graphs, enabling you to see detailed in-kernel function calls. However, keep in mind that the proper permissions need to be set in /proc/sys/kernel/perf_event_paranoid.
I advise you to put the syscall in a loop, since precisely measuring the cost of one syscall (with possibly delayed/asynchronous work offloaded to kernel threads) is either very hard to do from user-space or simply inaccurate (on a non-customized kernel).
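A minimal sketch of that loop-based approach (not a rigorous benchmark: it ignores warm-up, loop overhead and statistics, and uses syscall(SYS_getpid) to make sure a real syscall is executed on every iteration):

#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long iters = 1000000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);                 /* one real syscall per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per getpid syscall (averaged over %ld calls)\n",
           ns / iters, iters);
    return 0;
}

Compiling with optimization (e.g. gcc -O2) and pinning the process to one core (e.g. with taskset) helps reduce the noise.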
Additional information:
strace works at microsecond granularity. Some of the POSIX clocks (see clock_gettime) can reach a granularity of about 100 ns. Beyond that, rdtscp is AFAIK one of the most accurate options (though one should take care with the reference frequency). As for perf, it makes use of hardware performance counters and kernel events. You may need to configure your kernel so that tracepoints can be generated and properly tracked by perf. perf can track one specific process or the complete system.
How does the compiler make sure that the red zone is not clobbered? Is there any over-allocation of space?
And what factors lead to choosing 128 byte as the size of red zone?
The compiler doesn't, it just takes advantage of the guarantee that space below RSP won't be asynchronously clobbered (e.g. by signal handlers). Making a function call will of course synchronously clobber it.
In fact, on Linux only signal handlers run asynchronously in user-space code. (The kernel stack gets interrupts: Why can't kernel code use a Red Zone)
The kernel implements the red-zone when delivering signals to user-space. I think that's about it; it's really pretty easy to implement.
The other thing that's relevant is when a debugger runs a function when you do something like print foo(123) in GDB. GDB will actually run that function using the stack of the current thread. In an ABI with a red zone, GDB (or any other debugger) has to respect it when invoking that function, by doing rsp -= 128 after saving the register state that it will restore when the user does continue or single-step.
In i386 System V, print foo(123) will use space right below the current ESP, stepping on whatever was below ESP. (I think; not tested).
And what factors lead to choosing 128 byte as the size of red zone?
A signed byte displacement in an addressing mode like [rsp - 128] can reach that far. IIRC, the amd64.org mailing archive I was looking through while answering Why does Windows64 use a different calling convention from all other OSes on x86-64? actually included a message citing that as the reason for that specific choice.
You want it to be large enough that many simple leaf functions don't need to move RSP. e.g. at least 16 or 32 bytes, like the 32-byte shadow space in MS's Windows x64 calling convention.
You want it to be small enough that skipping over it to invoke a signal handler doesn't need to touch huge amounts more space, like new pages. So much less than 4kB.
A leaf function that needs more than 128 bytes of locals is probably big enough that moving RSP is a drop in the bucket. And then the +-disp8 addressing mode benefit comes into play, giving access to a whole 256 bytes of space with compact addressing modes from byte [rsp+127] to byte [rsp-128] or in dword/qword chunks.
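As an illustration (a sketch; the exact code generation depends on the compiler and options), a small leaf function is the typical beneficiary: for the x86-64 System V ABI, compilers can keep its spill slots in the red zone instead of reserving stack space with sub rsp.

/* A leaf function: it calls nothing, so nothing can synchronously clobber
 * the area below rsp. Compilers targeting x86-64 System V typically keep
 * any spill slots for these locals at negative offsets from rsp/rbp,
 * relying on the red zone, with no "sub rsp, N" in the prologue.
 * (Compare the output of gcc -O0 -S for this file with a version that
 * calls another function.) */
int sum3(int a, int b, int c)
{
    int t1 = a + b;
    int t2 = t1 + c;
    return t1 * t2;
}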
Further reading
Reading why it's not safe to use space below ESP on Windows, or Linux without a red-zone, is illuminating.
Raymond Chen's blog: Why do we even need to define a red zone? Can’t I just use my stack for anything?
Also my SO answer covers some of the same ground: Is it valid to write below ESP? (but with more guesswork and less interesting Windows details than Raymond.)
Reading Intel's SDM about Memory Protection Keys (MPK) doesn't suggest that the wrpkru instruction is serializing or enforces memory ordering implicitly.
First, it is surprising if it does not enforce some sort of ordering, as one would expect that the programmer doesn't want memory accesses around a wrpkru to be executed out of order.
Second, does that mean wrpkru needs to be surrounded by lfence?
Linux and glibc don't use any sort of fence after the write. But shouldn't that be included in the SDM?
I'd assume that the CPU preserves the illusion of running a single thread in program order, as always. That's the cardinal rule of out-of-order execution. Accesses before wrpkru are done with the old PKRU, accesses after are done with the new PKRU.
Just like how modifying the MXCSR affects later FP instructions but not earlier instructions, or modifying a segment register affects later but not earlier loads/stores.
It's up to the implementation whether it wants to rename the PKRU, the MXCSR, or segment registers. If it doesn't rename the PKRU, then it has to complete all pending loads/stores before changing the PKRU and allowing later loads/stores to execute. (i.e. the microcode for wrpkru could include the uops for lfence if that's how it's implemented.)
All memory accesses have a dependency on the last wrpkru instruction, and the last write to the relevant segment register, and the last write to cr3 (the top-level page table), and the last change of privilege level (syscall / iret / whatever). Also on the last store to that location, and you never need a fence to see your own most recent stores. It's up to the CPU architects to build hardware that runs fast while preserving the illusion of program order.
e.g. Intel CPUs since at least Core2 have renamed the x87 FP control word, so old binaries that implement (int)fp_var by changing the x87 rounding mode to truncate and then back to nearest don't serialize the FPU. Some CPUs do rename segment registers according to Agner Fog's testing, but my testing shows that Skylake doesn't: Is a mov to a segmentation register slower than a mov to a general purpose register?.
I'm not familiar with MPK, but why would it be a problem for memory accesses to happen out of order as long as they all use the correct PKRU value, and they don't violate any of x86's normal memory-ordering rules?
(Only StoreLoad reordering is allowed to be visible to other threads. Internally a CPU can execute loads earlier than they're "supposed to", but verify that the cache line wasn't invalidated before the point where it was architecturally allowed to load. This is what the Memory Order Buffer does.)
In C/C++, of course you need some kind of barrier against compile-time reordering of accesses around the wrapper function. Normally a non-inline function call is sufficient, like for pthread_mutex_lock(). See How does a mutex lock and unlock functions prevents CPU reordering?
The earlier part of this answer is about ordering in assembly.
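For context, here is roughly what the user-space side of MPK looks like through glibc's wrappers (a sketch assuming glibc 2.27+ and a kernel/CPU with pkeys support; pkey_set() is the call that ends up executing wrpkru, and being an out-of-line library call it also serves as the compile-time barrier discussed above, like the pthread_mutex_lock() example):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Allocate a protection key and a page tagged with it (assume 4 KiB pages). */
    int pkey = pkey_alloc(0, 0);
    if (pkey < 0) { perror("pkey_alloc"); return 1; }

    long *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }
    if (pkey_mprotect(page, 4096, PROT_READ | PROT_WRITE, pkey)) {
        perror("pkey_mprotect"); return 1;
    }

    *page = 42;                                   /* allowed: full access */

    pkey_set(pkey, PKEY_DISABLE_ACCESS);          /* wrpkru: revoke access */
    /* Any load/store to *page here would fault (SEGV_PKUERR). Per the answer
     * above, accesses before this pkey_set() are done with the old PKRU and
     * accesses after it with the new one; no explicit fence is needed. */
    pkey_set(pkey, 0);                            /* wrpkru: restore access */

    printf("%ld\n", *page);                       /* allowed again */
    return 0;
}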