Does anyone know how to find the maximum event period value (or the value that the kernel actually passes to the PMU) for a perf event?
I'm using perf to measure my program as follows:
perf record -d -e cpu/event=0xd0,umask=0x81/ppu,cpu/event=0xd0,umask=0x82/ppu -c 5
cpu/event=0xd0,umask=0x81/ppu measures all loads in the CPU, and cpu/event=0xd0,umask=0x82/ppu measures all stores.
I tried to trace how perf passes these arguments using strace, but found nothing.
If the PMU receives a value beyond its capability, will it still try to reach it? If so, where can I find the related code, and what is the maximum event period for those events?
Thanks everyone.
The perf record command accepts period values much larger than 255. Internally, the processor maintains a counter for recording all the memory loads and memory stores (or, for that matter, any other supported event). Once the counter overflows, the processor records information about the memory load/store you are trying to record (architectural state, registers, etc.).
Once the counter overflows, it must also be reset. Usually the counter is reset to a value less than zero; since it increments from there, it will overflow again once it reaches 0.
This counter reset value is the period value you asked about. That is, if the period is specified with -c 1, the counter reset value is set to -1, so the next memory load/store increments the counter to 0 (causing an overflow) and the event is recorded.
Thus, if you set the period to 1, there is a counter overflow on every memory load/store and you record all of them (this is only conceptual, however; the hardware usually cannot keep up with this).
What this means is that the period value can be as large as the width of the hardware counter for these events. In modern microarchitectures such as Broadwell/Haswell/Skylake, these counters are 48 bits wide, so the period can go as high as 2^48 - 1. However, using such large values is not recommended.
Usually, the period value should be kept at or below 2^32 - 1 on 32-bit systems, and that is the usual norm on other systems too.
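To make the mapping concrete, below is a minimal, hedged sketch (not perf's actual source) of how the same load event and period could be requested directly through the perf_event_open() system call; perf record -c 5 effectively fills in the sample_period field of struct perf_event_attr like this, and the kernel programs the PMU counter to the negated period.
/* Minimal sketch; assumes Intel raw encoding (umask << 8) | event, as perf uses. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x81d0;        /* event=0xd0, umask=0x81: all retired loads */
    attr.sample_period = 5;      /* what "-c 5" requests */
    attr.sample_type = PERF_SAMPLE_IP;
    attr.exclude_kernel = 1;     /* the "u" in /ppu */
    attr.precise_ip = 2;         /* the "pp" in /ppu (PEBS) */
    int fd = syscall(__NR_perf_event_open, &attr, 0 /* this process */, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }
    /* ... mmap the fd and consume samples, or simply let perf record do this ... */
    close(fd);
    return 0;
}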
Sources:
Chapter 18 of this book
Please also read the topic "Sampling with perf record" in this link
If you want, you can read the answer to this question too.
Linux kernel: 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu: 17.04
I have been trying to collect stats of memory accesses using perf stat. I can collect stats for memory-stores, but the count for memory-loads returns a 0 value.
Below are the details for memory-stores:
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a 0 count, as can be seen below:
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event to get proper data?
The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event, and it only makes sense to count it at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is used that can keep track of a data load from issue to completion. This is more complicated than simply counting instances of an event (as with normal event-based sampling), and so only some loads are tracked. Loads are randomly chosen, the latency determined for each, and the correct event(s) incremented (latency >4, >8, >16, etc). Due to the nature of the sampling for this event, only a small percentage of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads; it is not designed for that purpose at all.
If you want to determine which instructions in your code issue load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, which achieves it by using this event.
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These map to the raw events you used in the answer you posted, but they are more portable because they also work on AMD processors.
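As an illustration only (a counting-mode sketch, not the perf tool's implementation), the following reads the two generic hardware-cache events that L1-dcache-loads and L1-dcache-stores resolve to, via perf_event_open(); it is roughly what a perf stat run over those two events does.
/* Sketch: count retired loads and stores in counting mode (no sampling). */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>
static int open_cache_counter(unsigned op)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = PERF_COUNT_HW_CACHE_L1D |
                  (op << 8) |                                 /* read or write */
                  (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16);
    attr.exclude_kernel = 1;   /* user space only, like the /u modifier */
    attr.disabled = 1;
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}
int main(void)
{
    int loads  = open_cache_counter(PERF_COUNT_HW_CACHE_OP_READ);
    int stores = open_cache_counter(PERF_COUNT_HW_CACHE_OP_WRITE);
    ioctl(loads,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(stores, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the code you want to measure here ... */
    ioctl(loads,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(stores, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t nloads = 0, nstores = 0;
    read(loads,  &nloads,  sizeof(nloads));
    read(stores, &nstores, sizeof(nstores));
    printf("loads=%llu stores=%llu\n",
           (unsigned long long)nloads, (unsigned long long)nstores);
    return 0;
}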
I used a Broadwell server machine (CPU E5-2620) to collect all of the events below.
To collect memory-load events, I had to use a numeric event value. I basically ran the command below:
./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
Here r81d0 is the raw event for counting "memory loads amongst all instructions retired", and the u modifier restricts counting to user space.
The command below, on the other hand,
./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
has "r82d0:u" as a raw event representing "memory stores amongst all instructions retired in userspace".
I have been trying to log the memory accesses made by a program using perf and PEBS counters. My intention was to log all of the memory accesses made by a program (I chose programs from SPEC CPU2006). By tweaking certain parameters, I seem to record many more samples than the program actually generates. I know, as has been said previously, that it is tough to record all of the memory-access samples, but leaving that aside, I want to know how PEBS can record more samples than there actually are.
I followed the steps below:
First, I modified the /proc/sys/kernel/perf_cpu_time_max_percent value from its initial 25% to 95%, because I wanted to see whether I could record the maximum number of memory-access samples. This also allows me to use a much higher perf_event_max_sample_rate, which normally tops out at 100,000, without it being lowered back down.
I set perf_event_max_sample_rate to 244,500, instead of the maximum allowable value of 100,000.
I then used perf stat to record the total count of memory-store events in the program and got the data below:
./perf stat -e cpu/mem-stores/u ../../.././libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for '../../.././libquantum_base.arnab 100':
158,115,509 cpu/mem-stores/u
0.591718162 seconds time elapsed
perf stat reports roughly 158 million events, which should be a correct indicator, since the number comes directly from the hardware counter.
But now, when I run perf record and use the PEBS counters to capture every possible memory-store event:
./perf record -e cpu/mem-stores/upp -c 1 ../../.././libquantum_base.arnab 100
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.
Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.
Samples in kernel modules won't be resolved at all.
If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
[ perf record: Woken up 32 times to write data ]
[ perf record: Captured and wrote 7.827 MB perf.data (254125 samples) ]
I can see 254,125 samples being recorded. This is much, much less than what perf stat returned. I am recording these accesses in user space only (the u modifier is used in both cases).
Why does this happen? Am I recording the memory-store events in the wrong way, or is there a problem with the CPU's behavior?
More precisely, how does the perf tool associate PMU events with functions?
I already realized that when the kernel perf subsystem records the event counters, it also records the program counter (PC) so it can associate the count with a function.
However, to get really fine-grained results, you need to sample the counters at a very high rate; otherwise you may associate counts with a whole group of functions.
But reading the counters and writing the sampled data (counters, PC, call-stack) to the perf mmap space is very intrusive.
I read in some sources that this sampling only happens when the PMU counters overflow, but that can be very coarse unless I set the counters to overflow very quickly.
What am I missing here?
perf record is a statistical profiling tool. It either programs the hardware performance monitoring unit (PMU) to overflow after some number of counts (for example, with -e cycles -c 1000000 it writes -1000000 to the counter and enables counting of cycles; with -F, or without a freq/period argument, it autotunes the value) and, on each overflow interrupt, reprograms it for the next count, so it gets several hundred to a few thousand events per second; or it uses an OS timer interrupt (-e task-clock) to take periodic samples. On every sample (or on every interrupt from the hardware PMU), perf records the current PC (EIP) and/or the call stack; it does not record the current value of the counter (check a full dump of the data stored in perf.data with perf script or perf script -D, or the sample-event dumping code: there is sample->ip but no current PMU count).
perf report then parses perf.data to get all the PCs recorded in it and counts how many times each PC was sampled, building a histogram [PC] -> sample_count. Every PC is associated with the exact function it belongs to (perf report parses the memory map, since mmap events are recorded in perf.data too, opens every binary used, and finds the symbol table of each binary).
The actual code of perf report is in linux/tools/perf/builtin-report.c: cmd_report/__cmd_report -> perf_session__process_events -> some magic -> process_sample_event, which records every ip (PC) value mentioned in perf.data with hist_entry_iter__add(&iter, &al, rep->max_stack, rep); into the histogram via hist_iter__report_callback:
hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
. . . (perf/util/annotate.c) __symbol__inc_addr_samples
h->addr[offset]++;
Then it outputs the collected histogram with report__browse_hists -> perf_evlist__tty_browse_hists -> hists__fprintf_nr_sample_events(hists, rep, evname, stdout).
Every sample is thus already associated with an exact function (and an only approximately exact instruction inside it, because of the out-of-order nature of CPUs and the imprecise PMU overflow event); this is how statistical profiling works. When your program runs for a short time (less than a second) and/or you have too low a sampling frequency, you may get only a few samples recorded in perf.data. But if you have more than several hundred samples, you can find the most CPU-heavy functions (they probably follow the Pareto rule and run for around several dozen percent of the program's run time). When you want to see smaller functions (around several percent of running time), collect thousands or tens of thousands of samples and do some statistical estimation (you will not get a correct percentage for a function that runs for 0.1% of the time when you have only 100 or 1000 samples).
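To illustrate the histogram step only (a toy sketch with made-up addresses and symbols, not perf report's real data structures), resolving sampled PCs against symbol address ranges and counting per function looks conceptually like this:
/* Toy illustration: bucket sampled instruction pointers by function. */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
struct sym { const char *name; uint64_t start, end; unsigned long samples; };
static struct sym syms[] = {
    /* hypothetical symbol table, normally parsed from the mmap'ed binaries */
    { "compute_kernel", 0x400500, 0x4009ff, 0 },
    { "main",           0x400a00, 0x400bff, 0 },
};
static void account_sample(uint64_t ip)
{
    for (size_t i = 0; i < sizeof(syms) / sizeof(syms[0]); i++)
        if (ip >= syms[i].start && ip <= syms[i].end)
            syms[i].samples++;            /* histogram [PC] -> sample_count */
}
int main(void)
{
    /* hypothetical sampled PCs, normally read from perf.data */
    uint64_t ips[] = { 0x400512, 0x400530, 0x400a10, 0x400512 };
    for (size_t i = 0; i < sizeof(ips) / sizeof(ips[0]); i++)
        account_sample(ips[i]);
    for (size_t i = 0; i < sizeof(syms) / sizeof(syms[0]); i++)
        printf("%-16s %lu samples\n", syms[i].name, syms[i].samples);
    return 0;
}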
I want to use snd_pcm_delay() to query the delay until the samples I am about to write to the ALSA buffer become audible. I expect this value to vary between individual calls, but on two systems it is constant: the function returns a value that is always equal to the period size on one platform, and on the other platform it is equal to the buffer size (two times the period size in my code).
Is my understanding of snd_pcm_delay() wrong? Is it a driver problem?
The delay is the number of samples in the buffer (the complement of snd_pcm_avail()), plus a driver-dependent amount of time that describes how long it takes to move samples from the buffer to the speakers; that latter part might not be implemented.
If the device consumes samples one entire period at a time (some DMA controllers have no finer granularity for reporting the current position), then the delay value will appear to stay constant for a while and then jump by an entire period, and you will see that jump only before you have refilled the buffer.
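A minimal sketch to observe this, assuming the "default" playback device, S16_LE stereo at 48 kHz, and a 500 ms buffer (build with -lasound); the exact values you see still depend on how coarsely the driver reports its position, as described above:
/* Sketch: watch snd_pcm_delay() vs snd_pcm_avail() around writes. */
#include <alsa/asoundlib.h>
#include <stdio.h>
int main(void)
{
    snd_pcm_t *pcm;
    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0) < 0)
        return 1;
    /* 500 ms of overall latency; the driver picks period/buffer sizes */
    if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                           SND_PCM_ACCESS_RW_INTERLEAVED,
                           2, 48000, 1, 500000) < 0)
        return 1;
    short silence[2 * 480] = { 0 };                /* 10 ms of stereo silence */
    for (int i = 0; i < 100; i++) {
        snd_pcm_writei(pcm, silence, 480);         /* return value ignored in this sketch */
        snd_pcm_sframes_t delay = 0;
        snd_pcm_delay(pcm, &delay);                /* frames until audible */
        snd_pcm_sframes_t avail = snd_pcm_avail(pcm); /* free space in the buffer */
        printf("delay=%ld avail=%ld\n", (long)delay, (long)avail);
    }
    snd_pcm_drain(pcm);
    snd_pcm_close(pcm);
    return 0;
}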
I am doing some performance profiling for part of my program, and I tried to measure its execution with the following four methods. Interestingly, they show different results, and I don't fully understand the differences. My CPU is an Intel(R) Core(TM) i7-4770 and the system is Ubuntu 14.04. Thanks in advance for any explanation.
Method 1:
Use the gettimeofday() function; the result is in seconds
Method 2:
Use the rdtsc instruction similar to https://stackoverflow.com/a/14019158/3721062
Methods 3 and 4 use Intel's Performance Counter Monitor (PCM) API
Method 3:
Use PCM's
uint64 getCycles(const CounterStateType & before, const CounterStateType &after)
Its description (which I don't quite understand):
Computes the number of core clock cycles when the clock signal on a specific core is running (not halted)
Returns the number of used cycles (halted cycles are not counted). The counter does not advance in the following conditions:
an ACPI C-state is other than C0 for normal operation
HLT
STPCLK+ pin is asserted
being throttled by TM1
during the frequency switching phase of a performance state transition
The performance counter for this event counts across performance state transitions using different core clock frequencies
Method 4:
Use PCM's
uint64 getInvariantTSC (const CounterStateType & before, const CounterStateType & after)
Its description:
Computes number of invariant time stamp counter ticks.
This counter counts irrespectively of C-, P- or T-states
Two sample runs generate results as follows:
(Method 1 is in seconds; Methods 2-4 are divided by the same number to show a per-item cost.)
0.016489 0.533603 0.588103 4.15136
0.020374 0.659265 0.730308 5.15672
Some observations:
The ratio of Method 1 to Method 2 is very consistent, while the others are not, i.e., 0.016489/0.533603 ≈ 0.020374/0.659265. Assuming gettimeofday() is sufficiently accurate, the rdtsc method exhibits the "invariant" property. (Yes, I read on the Internet that the current generation of Intel CPUs has this feature for rdtsc.)
Method 3 reports a higher number than Method 2. I guess it is somehow different from the TSC, but what is it?
Method 4 is the most confusing one. It reports a number an order of magnitude larger than Methods 2 and 3. Shouldn't it also be some kind of cycle count, especially since it carries the "invariant" name?
gettimeofday() is not designed for measuring time intervals. Don't use it for that purpose.
If you need wall time intervals, use the POSIX monotonic clock. If you need CPU time spent by a particular process or thread, use the POSIX process time or thread time clocks. See man clock_gettime.
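For reference, a minimal sketch of measuring a wall-clock interval with the monotonic clock (swap in CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID for CPU time; on older glibc you may need to link with -lrt):
/* Sketch: wall-clock interval with the POSIX monotonic clock. */
#include <stdio.h>
#include <time.h>
static double elapsed_sec(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
}
int main(void)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... code under measurement ... */
    volatile double x = 0;
    for (int i = 0; i < 10000000; i++) x += i;
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("elapsed: %.6f s (x=%f)\n", elapsed_sec(&start, &end), x);
    return 0;
}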
The PCM API is great for fine-tuned performance measurement when you know exactly what you are doing, which generally means obtaining a variety of separate memory, core, cache, low-power, ... performance figures. Don't start messing with it if you are not sure which services you need from it that you cannot get from clock_gettime.