Linux perf reporting cache misses for unexpected instruction

I'm trying to apply some performance engineering techniques to an implementation of Dijkstra's algorithm. In an attempt to find bottlenecks in the (naive and unoptimised) program, I'm using the perf command to record the number of cache misses. The snippet of code that is relevant is the following, which finds the unvisited node with the smallest distance:
for (int i = 0; i < count; i++) {
    if (!visited[i]) {
        if (tmp == -1 || dist[i] < dist[tmp]) {
            tmp = i;
        }
    }
}
For the LLC-load-misses metric, perf report shows the following annotation of the assembly:
       │          for (int i = 0; i < count; i++) {
  1.19 │ ff:      add    $0x1,%eax
  0.03 │102:      cmp    0x20(%rsp),%eax
       │        ↓ jge    135
       │              if (!visited[i]) {
  0.07 │          movslq %eax,%rdx
       │          mov    0x18(%rsp),%rdi
  0.70 │          cmpb   $0x0,(%rdi,%rdx,1)
  0.53 │        ↑ jne    ff
       │                  if (tmp == -1 || dist[i] < dist[tmp]) {
  0.07 │          cmp    $0xffffffff,%r13d
       │        ↑ je     fc
  0.96 │          mov    0x40(%rsp),%rcx
  0.08 │          movslq %r13d,%rsi
       │          movsd  (%rcx,%rsi,8),%xmm0
  0.13 │          ucomis (%rcx,%rdx,8),%xmm0
 57.99 │        ↑ jbe    ff
       │                      tmp = i;
       │          mov    %eax,%r13d
       │        ↑ jmp    ff
       │                  }
       │              }
       │          }
My question then is the following: why does the jbe instruction produce so many cache misses? This instruction should not have to retrieve anything from memory at all if I am not mistaken. I figured it might have something to do with instruction cache misses, but even measuring only L1 data cache misses using L1-dcache-load-misses shows that there are a lot of cache misses in that instruction.
This stumps me somewhat. Could anyone explain this (in my eyes) odd result? Thank you in advance.

About your example:
There are several instructions before and at the high counter:
       │          movsd  (%rcx,%rsi,8),%xmm0
  0.13 │          ucomis (%rcx,%rdx,8),%xmm0
 57.99 │        ↑ jbe    ff
"movsd" loads word from (%rcx,%rsi,8) (some array access) into xmm0 register, and "ucomis" loads another word from (%rcx,%rdx,8) and compares it with just loaded value in xmm0 register. "jbe" is conditional jump which depends on compare outcome.
Many modern Intel CPUs (and probably AMD too) can and will fuse (combine) some combinations of operations together (realworldtech.com/nehalem/5: "into a single uop, CMP+JCC"), and cmp + conditional jump is a very common instruction combination to be fused (you can check it with the Intel IACA simulation tool; use version 2.1 for your CPU). A fused pair may be reported incorrectly in perf/PMUs/PEBS, with most events skewed towards one of the two instructions.
This probably means that the expression "dist[i] < dist[tmp]" generates two memory accesses, and both values are used by the ucomis instruction, which is (partially?) fused with the jbe conditional jump. Either dist[i] or dist[tmp], or both, generates a high number of misses. Any such miss blocks ucomis from producing its result and blocks jbe from handing the next instruction on to execute (or from retiring predicted instructions). So jbe may get all the fame of the high counters instead of the real memory-access instructions (and for a "far" event like a cache response there is some skew towards the last blocked instruction).
You may try to merge the visited[N] and dist[N] arrays into one array[N] of struct { int visited; float dist; }, to force prefetching of array[i].dist when you access array[i].visited; or you may try to change the order of vertex access, renumber the graph vertices, or do some software prefetch of the next one or more elements (?)
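A minimal sketch of that layout change (the struct and function names are illustrative; judging by the annotated asm, visited looks like a byte array and dist like an array of 8-byte doubles, so those types are used here):

struct node {
    char   visited;
    double dist;       /* struct is padded to 16 bytes because of the double's alignment */
};

/* same search as above, over the merged layout */
int find_min_unvisited(const struct node *nodes, int count)
{
    int tmp = -1;
    for (int i = 0; i < count; i++) {
        if (!nodes[i].visited) {
            /* visited and dist of the same vertex usually share a cache line,
               so this load is likely already satisfied after the visited check */
            if (tmp == -1 || nodes[i].dist < nodes[tmp].dist)
                tmp = i;
        }
    }
    return tmp;
}

You trade a somewhat larger working set (padding) for better locality between the flag and the distance of the same vertex.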
About problems with perf's generic event names, and possible uncore skew:
The perf (perf_events) tool in Linux uses a predefined set of events when called as perf list, and some of the listed hardware events may not be implemented; others are mapped to the current CPU's capabilities (and some mappings are not fully correct). Some basic info about the real PMU is in https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (but it has more details for the related Nehalem-EP variant).
For your Nehalem (Intel Core i5 750 with 8 MB of L3 cache and without multi-CPU/multi-socket/NUMA support), perf will map the standard ("generic cache events") LLC-load-misses event to "OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS", as written in the best (and only) documentation of perf event mappings - the kernel source code:
http://elixir.free-electrons.com/linux/v4.8/source/arch/x86/events/intel/core.c#L1103
u64 nehalem_hw_cache_event_ids ...
    [ C(LL  ) ] = {
        [ C(OP_READ) ] = {
            /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
            [ C(RESULT_ACCESS) ] = 0x01b7,
            /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
            [ C(RESULT_MISS)   ] = 0x01b7,
...
/*
 * Nehalem/Westmere MSR_OFFCORE_RESPONSE bits;
 * See IA32 SDM Vol 3B 30.6.1.3
 */
#define NHM_DMND_DATA_RD    (1 << 0)
#define NHM_DMND_READ       (NHM_DMND_DATA_RD)
#define NHM_L3_MISS         (NHM_NON_DRAM|NHM_LOCAL_DRAM|NHM_REMOTE_DRAM|NHM_REMOTE_CACHE_FWD)
...
u64 nehalem_hw_cache_extra_regs ...
    [ C(LL  ) ] = {
        [ C(OP_READ) ] = {
            [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS,
            [ C(RESULT_MISS)   ] = NHM_DMND_READ|NHM_L3_MISS,
I think this event is not precise: the CPU pipeline posts the load request to the cache hierarchy (out of order) and keeps executing other instructions. After some time (around 10 cycles to reach and get a response from L2, and 40 cycles for L3) the response comes back with a miss flag, and the corresponding (offcore?) PMU counter is incremented. On counter overflow, a profiling interrupt is generated from this PMU. It takes several CPU clock cycles for that interrupt to reach and stop the pipeline; the perf_events subsystem's handler then records the current (interrupted) EIP/RIP instruction pointer and resets the PMU counter back to some negative value (for example, -100000 to get an interrupt for every 100000 L3 misses counted; use perf record -e LLC-load-misses -c 100000 to set an exact count, or perf will auto-tune the limit to get some default frequency). The recorded EIP/RIP is not the IP of the load instruction, and it may not even be the EIP/RIP of the instruction that wants to use the loaded data.
But if your CPU is the only socket in the system and you access normal memory (not some mapped PCI-Express space), an L3 miss will in fact be served by a local memory access, and there are counters for this... (https://software.intel.com/en-us/node/596851 - "Any memory requests missing here must be serviced by local or remote DRAM").
There are some listings of PMU events for your CPU:
Official Intel's "Intel® 64 and IA-32 Architectures Software Developer Manuals" (SDM): https://software.intel.com/en-us/articles/intel-sdm, Volume 3, Appendix A
3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "18.8 PERFORMANCE MONITORING FOR PROCESSORS BASED ON INTEL® MICROARCHITECTURE CODE NAME NEHALEM" from page 213 "Vol 3B 18-35"
3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "19.8 - Processors based on Intel® microarchitecture code name Nehalem" from page 365 and "Vol. 3B 19-61")
Some other volume for Offcore response encoding? Vol. 3A 18-26?
from oprofile http://oprofile.sourceforge.net/docs/intel-corei7-events.php
from libpfm4's showevtinfo: http://www.bnikolic.co.uk/blog/hpc-prof-events.html (note, this page shows a Sandy Bridge list; get libpfm4 and run it on your PC to get your list). There is also a check_events tool in libpfm4 to help encode an event as a raw event for perf.
from VTune documentation: http://www.hpc.ut.ee/dokumendid/ips_xe_2015/vtune_amplifier_xe/documentation/en/help/reference/pmw_sp/events/offcore_response.html
from Nehalem PMU guide: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf
the ocperf tool from Intel's perf developer Andi Kleen, part of his pmu-tools: https://github.com/andikleen/pmu-tools. ocperf is just a wrapper for perf; the package downloads event descriptions, and any supported event name is converted into the correct raw encoding for perf.
There should be some information about the ANY_LLC_MISS offcore PMU event implementation and a list of PEBS events for Nehalem, but I can't find it now.
I can recommend using ocperf from https://github.com/andikleen/pmu-tools with any PMU event of your CPU, without needing to encode it manually. There are some PEBS events in your CPU, and there is latency profiling / perf mem for some kinds of memory-access profiling (some random perf mem PDFs: the 2012 post "perf: add memory access sampling support", RH 2013 - pg 26-30, still not documented in 2015 - sowa pg 19, ls /sys/devices/cpu/events). For newer CPUs there are newer tools like ucevent.
I can also recommend trying the cachegrind profiler/cache-simulator tool of the valgrind program, with the kcachegrind GUI to view profiles. Valgrind-based profilers can help you get a basic idea of how the code behaves: they collect exact execution counts for every instruction, and cachegrind also simulates an abstract multi-level cache. But a real CPU executes several instructions per cycle (so the callgrind/cachegrind cost model of 1 instruction = 1 CPU clock cycle has some error, and the cachegrind cache model does not follow the same logic as a real cache). And all valgrind tools are dynamic binary instrumentation tools, which slow your program down 20-30 times compared to a native run.

When you read a memory location, the processor will try to prefetch the adjacent memory locations and cache them.
That works well if you are reading an array of objects which are all allocated in memory in contiguous blocks.
However, if for example you have an array of pointers which live in the heap, it is less likely that you will be iterating over contiguous portions of memory unless you are using some sort of custom allocator specifically designed for this.
Because of this, dereferencing should be seen as a cost. An array of structs can be more efficient than an array of pointers to structs.
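For illustration, a small sketch of the two layouts being contrasted (type and function names are made up here):

#include <stddef.h>

struct point { double x, y; };

/* Contiguous array of structs: elements sit next to each other, so the
   hardware prefetcher can stream them in ahead of the loop. */
double sum_x_aos(const struct point *pts, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += pts[i].x;
    return s;
}

/* Array of pointers to heap objects: each dereference may land on an
   unrelated cache line, so every iteration risks a miss. */
double sum_x_ptrs(const struct point *const *pts, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += pts[i]->x;
    return s;
}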
Herb Sutter (member of the C++ committee) speaks about this in this talk: https://youtu.be/TJHgp1ugKGM?t=21m31s

Related

How to enable performance counters in Rocket Core

I am trying to enable performance counters in the Rocket core.
I don't see the implementation of mhpmevent3, mhpmevent4, mhpmcounter3, mhpmcounter4 in the RTL of the defaultConfig.
I see that the following parameter is set to true by default, but none of the mentioned counters gets generated in the defaultConfig.
haveBasicCounters: Boolean = true
https://github.com/chipsalliance/rocket-chip/blob/b503f8ac28a497b2463ffbac84bfe66533ace0bb/src/main/scala/rocket/RocketCore.scala#L37
Can someone please explain how to enable these basic performance counters in Rocket?
I tried increasing nPerfCounters: Int = 0 to 8 and used the following code to configure and read hpmcounter3, but the counter never changed; it had a random value in each run.
addi x14, zero, 0x04        # x14 = 0x4
slli x14, x14, 8            # x14 = 0x400
addi x14, x14, 2            # x14 = 0x402 (event selector value, per the SiFive encoding referenced below)
csrrw zero, mhpmevent3, x14 # program the event into mhpmevent3
...
csrr x28, mhpmcounter3 #Read counters in two different places
...
csrr x28, mhpmcounter3
I am using the configuration mentioned in https://static.dev.sifive.com/U54-MC-RVCoreIP.pdf#page=16 (section 3.11), as suggested in rocket-chip GitHub issues.

What is C-state Cx in cpupower monitor

I am profiling an application for execution time on an x86-64 processor running Linux. Before starting to benchmark the application, I want to make sure that dynamic frequency scaling and idle states are disabled.
Check on Frequency scaling
$ cat /sys/devices/system/cpu/cpufreq/boost
0
This tells me that frequency boosting (Intel's Turbo Boost or AMD's Turbo Core) is disabled. In fact, we set the frequency to a constant 2 GHz, which is evident from the next exercise.
Check on CPU idling
$ cpupower --cpu 0-63 idle-info
CPUidle driver: none
CPUidle governor: menu
analyzing CPU 0:
CPU 0: No idle states
analyzing CPU 1:
CPU 1: No idle states
analyzing CPU 2:
CPU 2: No idle states
...
So, the idle states are disabled.
Now that I am sure both of the "features" which can meddle with benchmarking are disabled, I go ahead and monitor the application using cpupower.
But when I run my application while monitoring the C-states, I see that more than 99% of the time is spent in the C0 state, which should be the case. However, I also see something called the Cx state, in which the cores spend 0.01-0.02% of the time.
$ cpupower monitor -c ./my_app
./my_app took 32.28017 seconds and exited with status 0
|Mperf
CPU | C0 | Cx | Freq
0| 99.98| 0.02| 1998
32| 99.98| 0.02| 1998
1|100.00| 0.00| 1998
33| 99.99| 0.01| 1998
2|100.00| 0.00| 1998
34| 99.99| 0.01| 1998
3|100.00| 0.00| 1998
35| 99.99| 0.01| 1998
...
So, I would be glad to understand the following.
What is the Cx state? And should I not worry about such low numbers?
Are there any other features apart from frequency scaling and CPU idling that I should care about (from a benchmarking perspective)?
Bonus Question
What does CPUidle driver: none mean?
Edit 1
For the 2nd question on additional concerns during benchmarking: I recently found out that the local timer interrupts on a CPU core (used for scheduling) can skew the measurements, so CONFIG_NO_HZ_FULL can be enabled in the Linux kernel to get tickless mode.
The beauty of open source software is that you can always go and check :)
cpupower monitor uses different monitors; the mperf monitor defines this array:
static cstate_t mperf_cstates[MPERF_CSTATE_COUNT] = {
    {
        .name              = "C0",
        .desc              = N_("Processor Core not idle"),
        .id                = C0,
        .range             = RANGE_THREAD,
        .get_count_percent = mperf_get_count_percent,
    },
    {
        .name              = "Cx",
        .desc              = N_("Processor Core in an idle state"),
        .id                = Cx,
        .range             = RANGE_THREAD,
        .get_count_percent = mperf_get_count_percent,
    },
    {
        .name              = "Freq",
        .desc              = N_("Average Frequency (including boost) in MHz"),
        .id                = AVG_FREQ,
        .range             = RANGE_THREAD,
        .get_count         = mperf_get_count_freq,
    },
};
Quite logically, Cx means any C-state other than C0, i.e. any idle state. (Note that these states are not the ACPI states, though a higher number is a deeper sleep state; for ACPI, "off" is C6.)
Note how Cx is computed:
if (id == Cx)
    *percent = 100.0 - *percent;
Cx is simply the complement of C0.
This is because the IA32_MPERF/APERF counters used do not count in any C-state other than C0:
C0 TSC Frequency Clock Count
Increments at fixed interval (relative to TSC freq.) when the logical processor is in C0.
A similar definition for IA32_APERF is present in the manuals.
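To make that concrete, here is a minimal userspace sketch of the same computation (this is not how cpupower itself is structured; it assumes an x86 machine with the msr kernel module loaded, root access so /dev/cpu/0/msr is readable, and the process pinned to CPU 0, e.g. with taskset; MSR 0xE7 is IA32_MPERF):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

static uint64_t read_msr(int fd, uint32_t reg)
{
    uint64_t val = 0;
    pread(fd, &val, sizeof(val), reg);   /* /dev/cpu/N/msr is addressed by MSR number */
    return val;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open msr"); return 1; }

    uint64_t mperf0 = read_msr(fd, 0xE7);   /* IA32_MPERF: ticks only while in C0 */
    uint64_t tsc0   = __rdtsc();            /* TSC: ticks in all states           */
    sleep(1);                               /* measurement interval                */
    uint64_t mperf1 = read_msr(fd, 0xE7);
    uint64_t tsc1   = __rdtsc();

    double c0 = 100.0 * (double)(mperf1 - mperf0) / (double)(tsc1 - tsc0);
    printf("C0 = %6.2f%%   Cx = %6.2f%%\n", c0, 100.0 - c0);  /* Cx is just the complement */
    return 0;
}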
There are a lot of things to consider when benchmarking, probably more than can be listed in a secondary answer.
In general, later runs of the tested code will find at least part of the data hot in the caches (the same goes for TLBs and any internal caching).
Interrupt affinity is also something to consider depending on the benchmarked program.
However I'd say that with turbo boost and scaling disabled you are pretty much ready to test.
A CPUIdle driver is a component of the kernel that controls the platform-dependent part of entering and exiting idle states.
For Intel CPUs (and AMD ones?) the kernel can use either the ACPI processor_idle driver (if enabled) or the intel_idle driver (which uses mwait).

How to calculate time for an asm delay loop on x86 linux?

I was going through this link, delay in assembly, about adding a delay in assembly. I want to perform some experiments by adding different delay values.
The useful code to generate a delay:
; start delay
mov bp, 43690
mov si, 43690
delay2:
dec bp
nop
jnz delay2
dec si
cmp si,0
jnz delay2
; end delay
What I understood from the code is that the delay is proportional to the time spent executing the nop instructions (43690 x 43690 of them). So the delay will be different on different systems and different OS versions. Am I right?
Can anyone explain how to calculate the amount of delay, in nanoseconds, that the following assembly code generates, so that I can conclude my experiment with respect to the delay I added in my experimental setup?
This is the code I am using to generate the delay, without understanding the logic behind the value 43690 (I used only one loop instead of the two loops in the original source code). To generate different delays (without knowing their values), I just varied the number 43690 to 403690 or some other value.
Code in 32bit OS
movl $43690, %esi ; ---> if I vary this 4003690 then delay value ??
.delay2:
dec %esi
nop
jnz .delay2
How much delay is generated by this assembly code ?
If I want to generate 100nsec or 1000nsec or any other delay in microsec, what will be initial value I need to load in register?
I am using ubuntu 16.04 (both 32bit as well as 64bit ), in Intel(R) Core(TM) i5-7200U CPU # 2.50GHz and Core-i3 CPU 3470 # 3.20GHz processor.
Thank you in advance.
There is no very good way to get accurate and predictable timing from fixed counts for delay loops on a modern x86 PC, especially in user-space under a non-realtime OS like Linux. (But you could spin on rdtsc for very short delays; see below). You can use a simple delay-loop if you need to sleep at least long enough and it's ok to sleep longer when things go wrong.
Normally you want to sleep and let the OS wake your process, but this doesn't work for delays of only a couple microseconds on Linux. nanosleep can express it, but the kernel doesn't schedule with such precise timing. See How to make a thread sleep/block for nanoseconds (or at least milliseconds)?. On a kernel with Meltdown + Spectre mitigation enabled, a round-trip to the kernel takes longer than a microsecond anyway.
(Or are you doing this inside the kernel? I think Linux already has a calibrated delay loop. In any case, it has a standard API for delays: https://www.kernel.org/doc/Documentation/timers/timers-howto.txt, including ndelay(unsigned long nsecs) which uses the "jiffies" clock-speed estimate to sleep for at least long enough. IDK how accurate that is, or if it sometimes sleeps much longer than needed when clock speed is low, or if it updates the calibration as the CPU freq changes.)
Your (inner) loop is totally predictable at 1 iteration per core clock cycle on recent Intel/AMD CPUs, whether or not there's a nop in it. It's under 4 fused-domain uops, so you bottleneck on the 1-per-clock loop throughput of your CPUs. (See Agner Fog's x86 microarch guide, or time it yourself for large iteration counts with perf stat ./a.out.) Unless there's competition from another hyperthread on the same physical core...
Or unless the inner loop spans a 32-byte boundary, on Skylake or Kaby Lake (loop buffer disabled by microcode updates to work around a design bug). Then your dec / jnz loop could run at 1 per 2 cycles because it would require fetching from 2 different uop-cache lines.
I'd recommend leaving out the nop to have a better chance of it being 1 per clock on more CPUs, too. You need to calibrate it anyway, so a larger code footprint isn't helpful (so leave out extra alignment, too). (Make sure calibration happens while CPU is at max turbo, if you need to ensure a minimum delay time.)
If your inner loop wasn't quite so small (e.g. more nops), see Is performance reduced when executing loops whose uop count is not a multiple of processor width? for details on front-end throughput when the uop count isn't a multiple of 8. SKL / KBL with disabled loop buffers run from the uop cache even for tiny loops.
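As a rough back-of-envelope under that 1-iteration-per-cycle assumption: the single-loop 32-bit version with %esi = 43690 runs for about 43690 core cycles, i.e. roughly 17.5 microseconds at 2.5 GHz or about 13.7 microseconds at 3.2 GHz; a count of 4003690 would be around 1.6 milliseconds at 2.5 GHz. Treat these as estimates only, for the reasons that follow.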
But x86 doesn't have a fixed clock frequency (and transitions between frequency states stop the clock for ~20k clock cycles (8.5us), on a Skylake CPU).
If running this with interrupts enabled, then interrupts are another unpredictable source of delays. (Even in kernel mode, Linux usually has interrupts enabled. An interrupts-disabled delay loop for tens of thousands of clock cycles seems like a bad idea.)
If running in user-space, then I hope you're using a kernel compiled with realtime support. But even then, Linux isn't fully designed for hard-realtime operation, so I'm not sure how good you can get.
System management mode interrupts are another source of delay that even the kernel doesn't know about. PERFORMANCE IMPLICATIONS OF SYSTEM MANAGEMENT MODE from 2013 says that 150 microseconds is considered an "acceptable" latency for an SMI, according to Intel's test suite for PC BIOSes. Modern PCs are full of voodoo. I think/hope that the firmware on most motherboards doesn't have much SMM overhead, and that SMIs are very rare in normal operation, but I'm not sure. See also Evaluating SMI (System Management Interrupt) latency on Linux-CentOS/Intel machine
Extremely low-power Skylake CPUs stop their clock with some duty-cycle, instead of clocking lower and running continuously. See this, and also Intel's IDF2015 presentation about Skylake power management.
Spin on RDTSC until the right wall-clock time
If you really need to busy-wait, spin on rdtsc waiting for the current time to reach a deadline. You need to know the reference frequency, which is not tied to the core clock, so it's fixed and nonstop (on modern CPUs; there are CPUID feature bits for invariant and nonstop TSC. Linux checks this, so you could look in /proc/cpuinfo for constant_tsc and nonstop_tsc, but really you should just check CPUID yourself on program startup and work out the RDTSC frequency (somehow...)).
I wrote such a loop as part of a silly-computer-tricks exercise: a stopwatch in the fewest bytes of x86 machine code. Most of the code size is for the string manipulation to increment a 00:00:00 display and print it. I hard-coded the 4GHz RDTSC frequency for my CPU.
For sleeps of less than 2^32 reference clocks, you only need to look at the low 32 bits of the counter. If you do your compare correctly, wrap-around takes care of itself. For the 1-second stopwatch, a 4.3GHz CPU would have a problem, but for nsec / usec sleeps there's no issue.
;;; Untested, NASM syntax
default rel
section .data
; RDTSC frequency in counts per 2^16 nanoseconds
; 3200000000 would be for a 3.2GHz CPU like your i3-3470
ref_freq_fixedpoint: dd 3200000000 * (1<<16) / 1000000000
; The actual integer value is 0x033333
; which represents a fixed-point value of 3.1999969482421875 GHz
; use a different shift count if you like to get more fractional bits.
; I don't think you need 64-bit operand-size
; nanodelay(unsigned nanos /*edi*/)
; x86-64 System-V calling convention
; clobbers EAX, ECX, EDX, and EDI
global nanodelay
nanodelay:
; take the initial clock sample as early as possible.
; ideally even inline rdtsc into the caller so we don't wait for I$ miss.
rdtsc ; edx:eax = current timestamp
mov ecx, eax ; ecx = start
; lea ecx, [rax-30] ; optionally bias the start time to account for overhead. Maybe make this a variable stored with the frequency.
; then calculate edi = ref counts = nsec * ref_freq
imul edi, [ref_freq_fixedpoint] ; counts * 2^16
shr edi, 16 ; actual counts, rounding down
.spinwait: ; do{
pause ; optional but recommended.
rdtsc ; edx:eax = reference cycles since boot
sub eax, ecx ; delta = now - start. This may wrap, but the result is always a correct unsigned 0..n
cmp eax, edi ; } while(delta < sleep_counts)
jb .spinwait
ret
To avoid floating-point for the frequency calculation, I used fixed-point like uint32_t ref_freq_fixedpoint = 3.2 * (1<<16);. This means we just use an integer multiply and shift inside the delay loop. Use C code to set ref_freq_fixedpoint during startup with the right value for the CPU.
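A minimal sketch of that startup step (the names here are illustrative, not from the original code); the TSC frequency itself could come from CPUID leaf 0x15/0x16, from calibrating rdtsc against clock_gettime, or simply be hard-coded per machine:

#include <stdint.h>

/* counts per 2^16 nanoseconds, read by the asm delay loop above */
uint32_t ref_freq_fixedpoint;

void init_nanodelay(uint64_t tsc_hz)
{
    /* e.g. tsc_hz = 3200000000 gives 0x33333: ~3.2 TSC ticks per ns in 16.16 fixed point */
    ref_freq_fixedpoint = (uint32_t)((tsc_hz << 16) / 1000000000ull);
}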
If you recompile this for each target CPU, the multiply constant can be an immediate operand for imul instead of loading from memory.
pause sleeps for ~100 clocks on Skylake, but only for ~5 clocks on previous Intel uarches. So it hurts timing precision a bit, maybe sleeping up to 100 ns past a deadline when the CPU frequency is clocked down to ~1GHz. Or at a normal ~3GHz speed, more like up to +33ns.
Running continuously, this loop heated up one core of my Skylake i7-6700k at ~3.9GHz by ~15 degrees C without pause, but only by ~9 C with pause. (From a baseline of ~30C with a big CoolerMaster Gemini II heatpipe cooler, but low airflow in the case to keep fan noise low.)
Adjusting the start-time measurement to be earlier than it really is will let you compensate for some of the extra overhead, like branch-misprediction when leaving the loop, as well as the fact that the first rdtsc doesn't sample the clock until probably near the end of its execution. Out-of-order execution can let rdtsc run early; you might use lfence, or consider rdtscp, to stop the first clock sample from happening out-of-order ahead of instructions before the delay function is called.
Keeping the offset in a variable will let you calibrate the constant offset, too. If you can do this automatically at startup, that could be good to handle variations between CPUs. But you need some high-accuracy timer for that to work, and this is already based on rdtsc.
Inlining the first RDTSC into the caller and passing the low 32 bits as another function arg would make sure the "timer" starts right away even if there's an instruction-cache miss or other pipeline stall when calling the delay function. So the I$ miss time would be part of the delay interval, not extra overhead.
The advantage of spinning on rdtsc:
If anything happens that delays execution, the loop still exits at the deadline, unless execution is currently blocked when the deadline passes (in which case you're screwed with any method).
So instead of using exactly n cycles of CPU time, you use CPU time until the current time is n * freq nanoseconds later than when you first checked.
With a simple counter delay loop, a delay that's long enough at 4GHz would make you sleep more than 4x too long at 0.8GHz (typical minimum frequency on recent Intel CPUs).
This does run rdtsc twice, so it's not appropriate for delays of only a couple nanoseconds. (rdtsc itself is ~20 uops, and has a throughput of one per 25 clocks on Skylake/Kaby Lake.) I think this is probably the least bad solution for a busy-wait of hundreds or thousands of nanoseconds, though.
Downside: a migration to another core with an unsynced TSC could result in sleeping for the wrong time. But unless your delays are very long, the migration time will be longer than the intended delay. The worst case is sleeping for the delay-time again after the migration. The way I do the compare, (now - start) < count instead of waiting for a specific target count, means that unsigned wraparound makes the compare false (and the loop exit) when now - start becomes a huge number. You can't get stuck sleeping for nearly a whole second while the counter wraps around.
Downside: maybe you want to sleep for a certain number of core cycles, or to pause the count when the CPU is asleep.
Downside: old CPUs may not have a non-stop / invariant TSC. Check these CPUID feature bits at startup, and maybe use an alternate delay loop, or at least take it into account when calibrating. See also Get CPU cycle count? for my attempt at a canonical answer about RDTSC behaviour.
Future CPUs: use tpause on CPUs with the WAITPKG CPUID feature.
(I don't know which future CPUs are expected to have this.)
It's like pause, but puts the logical core to sleep until the TSC = the value you supply in EDX:EAX. So you could rdtsc to find out the current time, add / adc the sleep time scaled to TSC ticks to EDX:EAX, then run tpause.
Interestingly, it takes another input register where you can put a 0 for a deeper sleep (more friendly to the other hyperthread, probably drops back to single-thread mode), or 1 for faster wakeup and less power-saving.
You wouldn't want to use this to sleep for seconds; you'd want to hand control back to the OS. But you could do an OS sleep to get close to your target wakeup if it's far away, then mov ecx,1 or xor ecx,ecx / tpause ecx for whatever time is left.
Semi-related (also part of the WAITPKG extension) are the even more fun umonitor / umwait, which (like privileged monitor/mwait) can have a core wake up when it sees a change to memory in an address range. For a timeout, it has the same wakeup on TSC = EDX:EAX as tpause.
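For completeness, a hedged C sketch of a tpause-based delay (the helper name and structure are mine, not from any library; it needs a CPU with the WAITPKG feature and a compiler that provides the intrinsic, e.g. GCC/Clang with -mwaitpkg):

#include <immintrin.h>   /* _tpause, __rdtsc */
#include <stdint.h>

/* Wait until the TSC reaches 'deadline'.
   ctrl = 1 requests the lighter C0.1 state (faster wakeup);
   ctrl = 0 would request the deeper, more power-friendly C0.2 state. */
static void tpause_until(uint64_t deadline)
{
    while (__rdtsc() < deadline)
        _tpause(1, deadline);   /* may wake early (interrupts, OS time limit), hence the loop */
}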

What is the difference between "cpu/mem-loads/pp" and "cpu/mem-loads/"?

I read the manual of perf list, and find the following PMU event definitions of memory load/store:
mem-loads OR cpu/mem-loads/ [Kernel PMU event]
mem-stores OR cpu/mem-stores/ [Kernel PMU event]
But I keep reading perf scripts that use "cpu/mem-loads/pp" instead of "cpu/mem-loads/". What is the difference between them? Are they the same thing? I have tried to google for answers, but can't find an explanation.
The p modifier stands for precise level. When sampling, it indicates the skid you tolerate: how far the reported instruction may be from the instruction that actually generated the sample. pp means that SAMPLE_IP is requested to have 0 skid. In other words, when you sample memory accesses, you want to know exactly which instruction generated the access.
See man perf list:
p - precise level
....
The p modifier can be used for specifying how precise the instruction address should be. The p modifier can be specified multiple times:
0 - SAMPLE_IP can have arbitrary skid
1 - SAMPLE_IP must have constant skid
2 - SAMPLE_IP requested to have 0 skid
3 - SAMPLE_IP must have 0 skid
For Intel systems precise event sampling is implemented with PEBS which supports up to precise-level 2.
On AMD systems it is implemented using IBS (up to precise-level 2). The precise modifier works with event types 0x76 (cpu-cycles, CPU clocks not halted) and 0xC1 (micro-ops retired). Both events map to IBS execution sampling (IBS op) with the IBS Op Counter Control bit (IbsOpCntCtl) set respectively (see AMD64 Architecture Programmer's Manual Volume 2: System Programming, 13.3 Instruction-Based Sampling). Examples to use IBS:
perf record -a -e cpu-cycles:p ... # use ibs op counting cycles
perf record -a -e r076:p ... # same as -e cpu-cycles:p
perf record -a -e r0C1:p ... # use ibs op counting micro-ops
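Under the hood, those suffixes set the precise_ip field of struct perf_event_attr. A minimal sketch of opening a sampling event with "pp" precision via perf_event_open (error handling trimmed; the config value is a placeholder, since the real encoding behind cpu/mem-loads/ is CPU-specific and visible in /sys/bus/event_source/devices/cpu/events/mem-loads):

#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_sampler_pp(uint64_t raw_config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.type           = PERF_TYPE_RAW;   /* raw, CPU-specific event encoding */
    attr.config         = raw_config;      /* placeholder, see sysfs for the real value */
    attr.sample_period  = 10007;
    attr.sample_type    = PERF_SAMPLE_IP;
    attr.precise_ip     = 2;               /* "pp": SAMPLE_IP requested to have 0 skid (PEBS) */
    attr.exclude_kernel = 1;

    /* current process, any CPU, no group leader, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}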

What are stalled-cycles-frontend and stalled-cycles-backend in 'perf stat' result?

Does anybody know the meaning of stalled-cycles-frontend and stalled-cycles-backend in perf stat results? I searched on the internet but did not find the answer. Thanks
$ sudo perf stat ls
 Performance counter stats for 'ls':

          0.602144 task-clock                #    0.762 CPUs utilized
                 0 context-switches          #    0.000 K/sec
                 0 CPU-migrations            #    0.000 K/sec
               236 page-faults               #    0.392 M/sec
            768956 cycles                    #    1.277 GHz
            962999 stalled-cycles-frontend   #  125.23% frontend cycles idle
            634360 stalled-cycles-backend    #   82.50% backend cycles idle
            890060 instructions              #    1.16  insns per cycle
                                             #    1.08  stalled cycles per insn
            179378 branches                  #  297.899 M/sec
              9362 branch-misses             #    5.22% of all branches         [48.33%]

       0.000790562 seconds time elapsed
The theory:
Let's start from this: nowadays CPUs are superscalar, which means that they can execute more than one instruction per cycle (IPC). The latest Intel architectures can go up to 4 IPC (4 x86 instruction decoders). Let's not bring macro/micro fusion into the discussion, to avoid complicating things further :).
Typically, workloads do not reach IPC=4 due to various resource contentions. This means that the CPU is wasting cycles (the number of instructions is given by the software, and the CPU has to execute them in as few cycles as possible).
We can divide the total cycles spent by the CPU into 3 categories:
Cycles where instructions get retired (useful work)
Cycles being spent in the Back-End (wasted)
Cycles spent in the Front-End (wasted).
To get an IPC of 4, the number of cycles retiring has to be close to the total number of cycles. Keep in mind that in this stage, all the micro-operations (uOps) retire from the pipeline and commit their results into registers / caches. At this stage you can have even more than 4 uOps retiring, because this number is given by the number of execution ports. If you have only 25% of the cycles retiring 4 uOps then you will have an overall IPC of 1.
The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or for long-latency instructions to finish (e.g. transcendentals - sqrt, reciprocals, divisions, etc.).
The cycles stalled in the front-end are a waste because the front-end is not feeding the back-end with micro-operations. This can mean that you have misses in the instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually exhibits this behaviour.
Another stall reason is a branch-prediction miss. That is called bad speculation. In that case uOps are issued but they are discarded because the branch predictor predicted wrongly.
The implementation in profilers:
How do you interpret the BE and FE stalled cycles?
Different profilers have different approaches to these metrics. In VTune, categories 1 to 3 add up to 100% of the cycles. That seems reasonable, because either your CPU is stalled (no uOps are retiring) or it performs useful work (uOps retiring). See more here: https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm
In perf this usually does not add up. That's a problem, because when you see 125% of cycles stalled in the front-end, you don't know how to interpret it. You could link the >1 metric to the fact that there are 4 decoders, but if you continue that reasoning, the IPC won't match.
Worse, you don't know how big the problem is. 125% out of what? What do the #cycles mean then?
I am personally a bit suspicious of perf's BE and FE stalled-cycle counts and hope this will get fixed.
Probably we will get the final answer by debugging the code from here: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-stat.c
To convert the generic events exported by perf into the raw events of your CPU's documentation, you can run:
more /sys/bus/event_source/devices/cpu/events/stalled-cycles-frontend
It will show you something like
event=0x0e,umask=0x01,inv,cmask=0x01
According to the Intel documentation SDM volume 3B (I have a core i5-2520):
UOPS_ISSUED.ANY:
Increments each cycle the # of Uops issued by the RAT to RS.
Set Cmask = 1, Inv = 1, Any= 1 to count stalled cycles of this core.
For the stalled-cycles-backend event translating to event=0xb1,umask=0x01 on my system the same documentation says:
UOPS_DISPATCHED.THREAD:
Counts total number of uops to be dispatched per- thread each cycle
Set Cmask = 1, INV =1 to count stall cycles.
Usually, stalled cycles are cycles where the processor is waiting for something (memory to arrive after a load operation, for example) and doesn't have any other work to do. Moreover, the front-end part of the CPU is the piece of hardware responsible for fetching and decoding instructions (converting them to uOps), whereas the back-end part is responsible for effectively executing the uOps.
A CPU cycle is “stalled” when the pipeline doesn't advance during it.
Processor pipeline is composed of many stages: the front-end is a group of these stages which is responsible for the fetch and decode phases, while the back-end executes the instructions. There is a buffer between front-end and back-end, so when the former is stalled the latter can still have some work to do.
Taken from http://paolobernardi.wordpress.com/2012/08/07/playing-around-with-perf/
According to the author of these events, they are defined loosely and are approximated by the available CPU performance counters. As far as I know, perf doesn't support formulas to compute a synthetic event from several hardware events, so it can't use the front-end/back-end stall-bound method from Intel's Optimization Manual (implemented in VTune): http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf "B.3.2 Hierarchical Top-Down Performance Characterization Methodology"
%FE_Bound        = 100 * (IDQ_UOPS_NOT_DELIVERED.CORE / N);
%Bad_Speculation = 100 * ((UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / N);
%Retiring        = 100 * (UOPS_RETIRED.RETIRE_SLOTS / N);
%BE_Bound        = 100 * (1 - (FE_Bound + Retiring + Bad_Speculation));
N = 4 * CPU_CLK_UNHALTED.THREAD   (for SandyBridge)
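For reference, a direct transcription of those formulas into C (assuming you have already collected the raw counts, e.g. with perf stat; nothing here beyond the formulas above):

struct topdown { double fe_bound, bad_spec, retiring, be_bound; };   /* all in percent */

static struct topdown topdown_level1(double idq_uops_not_delivered_core,
                                     double uops_issued_any,
                                     double uops_retired_retire_slots,
                                     double int_misc_recovery_cycles,
                                     double cpu_clk_unhalted_thread)
{
    struct topdown t;
    double N = 4.0 * cpu_clk_unhalted_thread;            /* 4 issue slots per cycle */
    t.fe_bound = 100.0 * idq_uops_not_delivered_core / N;
    t.bad_spec = 100.0 * (uops_issued_any - uops_retired_retire_slots
                          + 4.0 * int_misc_recovery_cycles) / N;
    t.retiring = 100.0 * uops_retired_retire_slots / N;
    t.be_bound = 100.0 - (t.fe_bound + t.retiring + t.bad_spec);
    return t;
}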
The right formulas can be used with some external scripting, as is done in Andi Kleen's pmu-tools (toplev.py): https://github.com/andikleen/pmu-tools (source), http://halobates.de/blog/p/262 (description):
% toplev.py -d -l2 numademo 100M stream
...
perf stat --log-fd 4 -x, -e
{r3079,r19c,r10401c3,r100030d,rc5,r10e,cycles,r400019c,r2c2,instructions}
{r15e,r60006a3,r30001b1,r40004a3,r8a2,r10001b1,cycles}
numademo 100M stream
...
BE Backend Bound: 72.03%
This category reflects slots where no uops are being delivered due to a lack
of required resources for accepting more uops in the Backend of the pipeline.
.....
FE Frontend Bound: 54.07%
This category reflects slots where the Frontend of the processor undersupplies
its Backend.
Commit which introduced stalled-cycles-frontend and stalled-cycles-backend events instead of original universal stalled-cycles:
http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=8f62242246351b5a4bc0c1f00c0c7003edea128a
author Ingo Molnar <mingo#el...> 2011-04-29 11:19:47 (GMT)
committer Ingo Molnar <mingo#el...> 2011-04-29 12:23:58 (GMT)
commit 8f62242246351b5a4bc0c1f00c0c7003edea128a (patch)
tree 9021c99956e0f9dc64655aaa4309c0f0fdb055c9
parent ede70290046043b2638204cab55e26ea1d0c6cd9 (diff)
perf events: Add generic front-end and back-end stalled cycle event definitions
Add two generic hardware events: front-end and back-end stalled cycles.
These events measure conditions when the CPU is executing code but its
capabilities are not fully utilized. Understanding such situations and
analyzing them is an important sub-task of code optimization workflows.
Both events limit performance: most front end stalls tend to be caused
by branch misprediction or instruction fetch cachemisses, backend
stalls can be caused by various resource shortages or inefficient
instruction scheduling.
Front-end stalls are the more important ones: code cannot run fast
if the instruction stream is not being kept up.
An over-utilized back-end can cause front-end stalls and thus
has to be kept an eye on as well.
The exact composition is very program logic and instruction mix
dependent.
We use the terms 'stall', 'front-end' and 'back-end' loosely and
try to use the best available events from specific CPUs that
approximate these concepts.
Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
Link: http://lkml.kernel.org/n/tip-7y40wib8n000io7hjpn1dsrm#git.kernel.org
Signed-off-by: Ingo Molnar
/* Install the stalled-cycles event: UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 */
- intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES] = 0x1803fb1;
+ intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x1803fb1;
- PERF_COUNT_HW_STALLED_CYCLES = 7,
+ PERF_COUNT_HW_STALLED_CYCLES_FRONTEND = 7,
+ PERF_COUNT_HW_STALLED_CYCLES_BACKEND = 8,

Resources