What timeframe does the "perf sched record" use? - linux

I've been trying to analyze the output of perf sched record, but I don't understand what frame of reference the timestamps, such as "20624.983302 secs", use. It isn't Unix time for sure, so what is it? How would I go about converting this into Unix time?
*A0 20624.983302 secs A0 => migration/0:12
*. 20624.983311 secs . => swapper:0
*B0 20624.983318 secs B0 => IPC I/O Child:33924
*. 20624.983355 secs
*C0 20624.983485 secs C0 => WRScene~lder#15:39974
*. 20624.983581 secs
*D0 20624.983972 secs D0 => IPC I/O Parent:33780

These timestamps are captured using the kernel scheduler clock, which counts in nanoseconds since boot. The exact details depend on the compile-time parameters chosen to build a particular Linux distribution and the target architecture.
In general, the timestamp of a sample is captured at around the same time the sample is recorded. Timestamps on the same core are guaranteed to be monotonically increasing as long as the core remains in an active state. The samples you've shown were all captured on the same core, and the core remained active from the first sample to the last. So the timestamps are guaranteed to be monotonic in this case, irrespective of the platform and distribution. When profiling on multiple cores, there is no guarantee that the clocks on all cores are in sync.
All perf tools use the same clock to capture timestamps, but they may differ in the way timestamps are printed and it may happen that two tools print timestamps from the same sample file differently. This depends on the kernel version.
It's possible to specify a clock source when calling perf_event_open() by setting use_clockid to 1 and setting clockid to one of the clock sources defined in linux/time.h, such as CLOCK_MONOTONIC. perf record provides the -k or --clockid option to specify the clock source for capturing timestamps.
Modern distributions on x86 typically use TSC as the source for the scheduler clock (check /sys/devices/system/clocksource/clocksource0/current_clocksource). So if you're on an x86 processor, the TSC of the profiled core was most probably used to capture the current value in TSC cycles, which is internally converted into nanoseconds. When a timestamp is printed, it may be converted to a different unit. In this case, timestamps are printed in the format "seconds.microseconds". A summary of the behavior of TSC on Intel processors can be found in the question "Can constant non-invariant tsc change frequency across cpu states?".
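As a rough sketch of the "convert to Unix time" part: if the timestamps are on the CLOCK_MONOTONIC timeline (an assumption; that holds when that clock is selected with -k, while the default scheduler clock is not guaranteed to match any POSIX clock), you can measure the offset between CLOCK_REALTIME and CLOCK_MONOTONIC on the same machine during the same boot and add it to each timestamp. The function names below are made up for illustration.

```python
import time

def realtime_minus_monotonic():
    """Estimate the offset in seconds between CLOCK_REALTIME and CLOCK_MONOTONIC."""
    # Read the monotonic clock before and after the realtime clock and use
    # the midpoint, which reduces the error caused by the gap between calls.
    m1 = time.clock_gettime(time.CLOCK_MONOTONIC)
    rt = time.clock_gettime(time.CLOCK_REALTIME)
    m2 = time.clock_gettime(time.CLOCK_MONOTONIC)
    return rt - (m1 + m2) / 2.0

def to_unix_time(trace_ts, offset):
    """Shift a trace timestamp (assumed to be CLOCK_MONOTONIC seconds) to Unix time."""
    return trace_ts + offset

if __name__ == "__main__":
    offset = realtime_minus_monotonic()
    # 20624.983302 is the first timestamp from the question.
    print(to_unix_time(20624.983302, offset))
```

The offset is only meaningful on the machine and boot where the trace was recorded, and it shifts whenever the wall clock is stepped (for example by NTP or a manual date change).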

Related

What is the sampling rate for intel_pt event i.e., perf record -e intel_pt//?

The sampling rate for the perf record command can be set using -F. I want to know what the sampling rate is for the intel_pt event, i.e., for the command
perf record -e intel_pt// -- ./a.out
With -F, the maximum sampling rate allowed in user mode is 8000. perf record may well store the trace a few thousand times per second, but the trace events recorded with perf record -e intel_pt// occur at a much higher frequency.
In other words, with the intel_pt event a trace of the application's execution is collected. Does perf record work differently when recording with the intel_pt event, i.e. in some non-sampling mode?
Yes, the intel_pt mode of perf record is different; it is not the same as sampling (statistical) profiling with software (cpu-clock) or hardware (cycles) events. Sampling takes about 4000 samples of the current EIP per second and gives you a basic, inexact view of code execution. intel_pt is a hardware-based tracing technique that generates a lot of data about every control flow instruction (in the default perf intel_pt mode), which allows the full control flow to be reconstructed, but it has a bigger overhead. So the frequency of Intel PT events equals the number of calls, branches and returns executed per second by the program code (hundreds of millions).
With sampling on hardware events, perf record asks the hardware PMU to count some event, such as CPU cycles, and to generate an overflow interrupt after, for example, 2 million such events. On each such interrupt, the perf_events subsystem in the kernel records the current OS timestamp, the pid/tid of the current thread, and the EIP instruction pointer into a ring buffer, and resets the PMU counter to a new value. The perf subsystem limits the maximum interrupt frequency by autotuning that value, and the -F option can be used to change the desired frequency of interrupts. When the ring buffer (several megabytes in size) fills up, the perf user-space tool dumps its contents into the perf.data file, and you can view the raw data with perf script or perf script -D, or build histograms with perf report (which sorts EIPs by how often an interrupt landed on that instruction address, which is proportional to the time taken by that code). This mode produces around 4 thousand events per second of thread execution (perf report --header | grep sample_freq), at 48 bytes per sample, or 192 kilobytes per second. The overhead is low, but the sampling is not exact.
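The data-rate figure above is just the product of the sample frequency and the sample size. A tiny sketch of that arithmetic follows; the helper name is made up, and the 4000 samples/s and 48 bytes/sample figures are the ones quoted in this answer, not values read from a real perf.data file.

```python
def sample_data_rate(samples_per_second, bytes_per_sample):
    """Bytes of sample data written per second of thread execution."""
    return samples_per_second * bytes_per_sample

rate = sample_data_rate(4000, 48)
print(rate, "bytes/s")       # 192000 bytes/s
print(rate / 1000, "kB/s")   # 192.0 kB/s
```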
perf wiki has separate page for intel processor trace (intel_pt) - https://perf.wiki.kernel.org/index.php/Perf_tools_support_for_Intel%C2%AE_Processor_Trace
Control flow tracing is different from other kinds of performance analysis and debugging. It provides fine-grained information on branches taken in a program, but that means there can be a vast amount of trace data. Such an enormous amount of trace data creates a number of challenges, but it raises the central question: how to reduce the amount of trace data that needs to be captured. That inverts the way performance analysis is normally done. Instead of taking a test case and creating a trace of it, you need first to create a test case that is suitable for tracing.
So, intel_pt is a tracing (logging) module integrated into the CPU hardware, and when armed it generates "hundreds of megabytes of trace data per CPU per second", depending on the settings used. With some settings it may even generate tracing data (the packet log) faster than it can be written to disk or even to RAM ("overflow packets"). According to the article at https://lwn.net/Articles/648154/, perf_events (kernel mode) in intel_pt mode just saves the full packet log into a separate (bigger?) ring buffer, and the perf tool (user space) periodically saves data from that ring buffer into a file for offline filtering, parsing and decoding. (The period at which the aux or ring mmap is saved to the file is not the same as the overflow interrupt frequency set with -F.) The PT decoder is then used to reconstruct perf-compatible samples from the PT packet log. The log data volume is huge, and the overhead ranges from about 1% to 10% or more, depending on the branch frequency of the executed code.
The documentation of intel_pt is the manpage (man perf-intel-pt) and a long text stored inside the Linux kernel source code at
https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/perf-intel-pt.txt
Intel PT is first supported in Intel Core M and 5th generation Intel Core
processors that are based on the Intel micro-architecture code name Broadwell.
Trace data is collected by 'perf record' and stored within the perf.data file. ... Trace data must be 'decoded' which involves walking the object code and matching the trace data packets. ... Decoding is done on-the-fly. The decoder outputs samples in the same format as
samples output by perf hardware events, for example as though the "instructions"
or "branches" events had been recorded. Presently 3 tools support this:
'perf script', 'perf report' and 'perf inject'. ... The main distinguishing feature of Intel PT is that the decoder can determine
the exact flow of software execution. Intel PT can be used to understand why
and how did software get to a certain point, or behave a certain way. ... A limitation of Intel PT is that it produces huge amounts of trace data
(hundreds of megabytes per second per core) which takes a long time to decode
By default, perf record -e intel_pt// is the same as -e intel_pt/tsc=1,noretcomp=0/. The config terms section of the manpage (man perf-intel-pt) describes the default settings:
tsc Always supported.
Produces TSC timestamp packets to provide timing information. In some cases it is possible to decode without timing information, for example a per-thread context that does not overlap executable memory maps.
noretcomp Always supported. Disables "return compression" so a TIP
packet is produced when a function returns. Causes more packets to be
produced but might make decoding more reliable.
pt Specifies pass-through which enables the branch config term.
branch Enable branch tracing. Branch tracing is enabled by default
To represent software control flow, "branches" samples are produced.
By default a branch sample is synthesized for every single branch.
As it says, intel_pt in default mode is used to produce a control flow log by asking the hardware to generate log packets for every control flow instruction, such as call, branch and return, and to add timestamps to synchronize the PT log with some service perf samples (like exec or mmap, to find the actual code being loaded into memory). It tries not to generate too much data; for example, [a single bit is used per conditional branch (tnt)](https://conference.hitb.org/hitbsecconf2017ams/materials/D1T1 - Richard Johnson - Harnessing Intel Processor Trace on Windows for Vulnerability Discovery.pdf#page=12) and several bytes per indirect branch, but there are hundreds of millions of branches per second in many programs.
Some useful and short slides on perf + intel_pt:
Andi Kleen, 2015 https://halobates.de/pt-tracing-summit15.pdf (PT modes current: Full trace mode, Snapshot mode; Upcoming: Sampling mode, Core dump, System crash mode)
Andi Kleen's posts on PT: https://halobates.de/blog/p/category/pt
Suchakrapani Datt Sharma, POLYTECHNIQUE MONTREAL, 2015 https://hsdm.dorsal.polymtl.ca/system/files/10Dec2015_0.pdf (trace packets overview - PSB (Packet Stream Boundary), TNT (Taken Not-Taken), TIP (Target IP) at branches, non-default CYC Packets : Cycle counter data for IPC, MTC (Mini Timestamp Counter), ...)
Jack Henschel, 2017 about design and use-cases https://blog.cubieserver.de/publications/Henschel_Intel-PT_2017.pdf
Alexander Shishkin, Intel, 2013, Efficient and Large Scale Program Flow Tracing in Linux https://events.static.linuxfound.org/sites/events/files/slides/lcna13_kleen.pdf ("What is it good for? •Profiling / performance measurement •Functional debugging •Code coverage analysis")
About generic difference between sampling and (software) tracing: https://danluu.com/perf-tracing/
Update: While the Intel PT trace log holds the full trace (it contains packets for every branch/call/return), perf report converts the PT log into a sample set like the one in a classic perf.data, and that sample set has a sampling rate. This is configured with the --itrace option of perf report (iNNTT, where NN is the amount and TT is the unit: i/t/ms/us/ns), as described in the man page of perf-report:
--itrace
Options for decoding instruction tracing data. The options are:
i synthesize instructions events
g synthesize a call chain (use with i or x)
The default is all events i.e. the same as --itrace=ibxwpe,
In addition, the period (default 100000, ...)
for instructions events can be specified in units of:
i instructions
t ticks
ms milliseconds
us microseconds
ns nanoseconds (default)
So it seems that by default perf report converts the full trace log into instruction samples with a period of 100000 in the default unit, which the excerpt above lists as nanoseconds, i.e. one perf sample per 100 microseconds of trace. The period can be made smaller to get a higher sampling rate, but processing time will increase.
Manpage of perf-intel-pt gives more examples of itrace option usage:
Because samples are synthesized after-the-fact, the sampling period
can be selected for reporting. e.g. sample every microsecond
sudo perf report pt_ls --itrace=i1usge
See the sections below for more information about the --itrace
option.
Beware the smaller the period, the more samples that are produced,
and the longer it takes to process them.
Also note that the coarseness of Intel PT timing information will
start to distort the statistical value of the sampling as the
sampling period becomes smaller.
To see every possible IPC value, "instructions" events can be used
e.g. --itrace=i0ns
--itrace=i10us
sets the period to 10us i.e. one instruction sample is synthesized
for each 10 microseconds of trace. Alternatives to "us" are "ms"
(milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i"
(instructions).
For Intel PT, the default period is 100us.
Setting it to a zero period means "as often as possible".
In the case of Intel PT that is the same as a period of 1 and a unit
of instructions (i.e. --itrace=i1i).
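To make the period syntax concrete, here is a small illustrative parser for the period fragment of the i option only (for example i10us). It is a sketch written from the manpage text quoted above, not perf's actual parser; the function names are made up, and combined option strings such as i1usge are deliberately not handled.

```python
import re

# Time units from the manpage excerpt, expressed in nanoseconds.
_UNIT_TO_NS = {"ms": 1_000_000, "us": 1_000, "ns": 1}

def parse_instr_period(spec):
    """Parse a fragment like 'i10us' into (value, unit); unit defaults to 'ns'."""
    m = re.fullmatch(r"i(\d+)(i|t|ms|us|ns)?", spec)
    if m is None:
        raise ValueError(f"unsupported period spec: {spec!r}")
    value = int(m.group(1))
    unit = m.group(2) or "ns"   # the excerpt lists ns as the default unit
    return value, unit

def period_in_ns(spec):
    """Return the period in nanoseconds, or None for non-time units (i, t)."""
    value, unit = parse_instr_period(spec)
    if unit not in _UNIT_TO_NS:
        return None
    return value * _UNIT_TO_NS[unit]

for spec in ("i10us", "i0ns", "i1i", "i100000"):
    print(spec, parse_instr_period(spec), period_in_ns(spec))
```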
http://halobates.de/blog/p/410 has some additional examples of complex conversions:
perf script --ns --itrace=cr
Record program execution and display function call graph.
perf script by default “samples” the data (only dumps a sample every
100us). This can be configured using the --itrace option (see
reference below)
perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64
Show every assembly instruction executed with disassembler.
perf report --itrace=g32l64i100us --branch-history
Print hot paths every 100us as call graph histograms
perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg
Generate flame graph from execution, sampled every 100us

PERF STAT does not count memory-loads but counts memory-stores

Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu : 17.04
I have been trying to collect memory access statistics using perf stat. I am able to collect stats for memory-stores, but the count for memory-loads comes back as 0.
Below are the details for memory-stores:
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a 0 count as can be seen below :-
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event to get proper data?
The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is
used that can keep track of a data load from issue to completion.
This is more complicated than simply counting instances of an event
(as with normal event-based sampling), and so only some loads are
tracked. Loads are randomly chosen, the latency determined for each,
and the correct event(s) incremented (latency >4, >8, >16, etc). Due
to the nature of the sampling for this event, only a small percentage
of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads, and it is not designed for that purpose at all.
If you want to determine which instructions in your code issue load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, which achieves it by using this event.
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, only more portable because they also work on AMD processors.
I used a Broadwell (CPU e5-2620) server machine to collect all of the events below.
To collect memory-load events, I had to use a numeric event value. I ran the following command:
./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
Here r81d0 represents the raw event for counting "memory loads amongst all instructions retired". "u" as can be understood represents user-space.
The below command, on the other hand,
./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
has "r82d0:u" as a raw event representing "memory stores amongst all instructions retired in userspace".

How does perf associate events to functions?

More precisely, how does the perf tool associate PMU events with functions?
I already realized that when the kernel perf subsystem records the event counters, it also records the program counter (PC) so that it can associate the count with a function.
However, to get really fine-grained results you need to sample the counters at a very high rate; otherwise you may end up associating counters with a group of functions.
But reading the counters and writing the sampled data (counters, PC, call stack) to the perf mmap space is very intrusive.
I read in some sources that this sampling only happens when the PMU counters overflow, but this can be very coarse unless I set the counters to overflow very quickly.
What am I missing here?
perf record is a statistical profiling tool. It either programs the hardware performance monitoring unit (PMU) to overflow after some number of events (for example, with -e cycles -c 1000000 it writes -1000000 to the counter and enables cycle counting; with -F, or without any frequency/period argument, it autotunes the value), and on each overflow interrupt perf reprograms the counter for the next period, so there are several hundred to a few thousand events per second; or it uses the OS timer interrupt (-e task-clock) to get periodic samples. On every sample (that is, on each interrupt from the hardware PMU) perf records the current PC (EIP) and/or call stack; it does not record the current value of the counter (check the full dump of the data stored in perf.data with perf script or perf script -D, or the code that dumps sample events: there is sample->ip but no current PMU count).
perf report parses perf.data to get all the PCs recorded in it. It counts how many times each PC was sampled to build a histogram [PC] -> sample_count. Every PC is associated with the exact function it belongs to (perf report parses the memory map, since mmap events are recorded in perf.data too, opens every binary used, and reads the symbol table of each binary).
The actual code of perf report is in linux/tools/perf/builtin-report.c: cmd_report/__cmd_report -> perf_session__process_events -> some magic -> process_sample_event, which records all the ip (PC) values found in perf.data with hist_entry_iter__add(&iter, &al, rep->max_stack, rep); into a histogram via hist_iter__report_callback:
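The [PC] -> sample_count idea can be sketched in a few lines. This is only an illustration of the technique, not how perf report is implemented: the symbol table, addresses and function names below are invented, and a real tool would read them from the ELF symbol tables of the mapped binaries.

```python
import bisect
from collections import Counter

# Hypothetical symbol table: (start address, size, name), sorted by start.
SYMBOLS = [
    (0x1000, 0x40, "main"),
    (0x1040, 0x100, "compute"),
    (0x1140, 0x20, "helper"),
]
STARTS = [start for start, _, _ in SYMBOLS]

def resolve(pc):
    """Map a sampled PC to the function whose address range contains it."""
    i = bisect.bisect_right(STARTS, pc) - 1
    if i < 0:
        return "[unknown]"
    start, size, name = SYMBOLS[i]
    return name if pc < start + size else "[unknown]"

def histogram(sampled_pcs):
    """Count how many samples landed in each function."""
    return Counter(resolve(pc) for pc in sampled_pcs)

# Made-up sampled PCs, standing in for the ip field of each recorded sample.
samples = [0x1044, 0x1050, 0x1010, 0x10F0, 0x1150, 0x2000]
for name, count in histogram(samples).most_common():
    print(f"{count / len(samples):6.1%}  {name}")
```

The share of samples per function is what ends up as the percentage column in a report; no counter values are needed, only the sampled addresses.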
hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
. . . (perf/util/annotate.c) __symbol__inc_addr_samples
h->addr[offset]++;
Then it will output collected histogram with report__browse_hists -> perf_evlist__tty_browse_hists -> hists__fprintf_nr_sample_events(hists, rep, evname, stdout);.
Every sample is already associated with an exact function (and with a slightly inexact instruction inside it, because of the out-of-order nature of CPUs and the imprecise PMU overflow event), and this is how statistical profiling works. When your program runs for a short time (less than a second) and/or your sampling frequency is too low, you may have only a few samples recorded in perf.data. But if you have more than several hundred samples, you can find the most CPU-heavy functions (they probably follow the Pareto rule and account for several dozen percent of the program's run time). When you want to see smaller functions (around several percent of the running time), use thousands or tens of thousands of samples and do some statistical estimation (you will not get a correct percentage for a function that runs for 0.1% of the time when you have 100 or 1000 samples).
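The "statistical estimation" remark can be made concrete with a simple binomial model. This is a sketch under the assumption that samples are independent and each one lands in a given function with probability equal to that function's share of run time; real samples are not perfectly independent, so treat the numbers as rough guidance. The function names are made up.

```python
import math

def expected_samples(fraction, n_samples):
    """Expected number of samples that land in a function."""
    return fraction * n_samples

def relative_std_error(fraction, n_samples):
    """Standard error of the estimated fraction, relative to the fraction.

    For a binomial proportion SE = sqrt(p * (1 - p) / n), so SE / p
    simplifies to sqrt((1 - p) / (p * n)).
    """
    return math.sqrt((1.0 - fraction) / (fraction * n_samples))

for n in (100, 1000, 100_000):
    for p in (0.3, 0.01, 0.001):
        print(f"n={n:>6}  share={p:6.1%}  "
              f"expected hits={expected_samples(p, n):8.1f}  "
              f"relative error={relative_std_error(p, n):7.1%}")
```

With 1000 samples, a function that takes 0.1% of the time is expected to be hit about once, so its reported percentage is mostly noise, while a 30% function is estimated to within a few percent of its value.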

What do the ALSA timestamping functions return and how do the results relate to each other?

There are several "hi-res" timestamping functions in ALSA:
snd_pcm_status_get_trigger_htstamp
snd_pcm_status_get_audio_htstamp
snd_pcm_status_get_driver_htstamp
snd_pcm_status_get_htstamp
I would like to understand what points in time the resulting timestamps represent.
My current understanding is that trigger_htstamp represents the time when stream was started/stopped/paused. snd_pcm_status_get_trigger_htstamp returns a constant value and when I add audio_htstamp to that value the result is very close to the current system time.
audio_htstamp seems to start from zero on my system and it is incremented by a value that is equal to the period size I use. Hence on my system it is a simple frame counter. If I understand ALSA correctly audio_htstamp can also work in different more accurate way depending on the system capabilities.
driver_htstamp I guess by the name is a timestamp generated by the audio driver.
Question 1: When is the timestamp driver_htstamp usually generated?
With htstamp I am really unsure where and when it is generated. I have a hunch that it may be related to DMA.
Question 2: Where is htstamp generated?
Question 3: When is htstamp generated?
Question 4: Is the assumption audio_htstamp < htstamp < driver_htstamp generally correct?
It seems like this with a little test program I wrote, but I want to verify my assumption.
I can not find this information in the ALSA documentation.
I just dug through the code for this stuff for my own purposes, so I figured I would share what I found.
The purpose of these timestamps is to allow you to determine subtle differences in the rate of different clocks; most importantly in this case the main system clock that Linux uses for general timekeeping compared with the different clock that determines the rate at which samples move in and out of the sound device. This can be very important for applications that need to keep audio from different hardware devices in sync, since the rates of different physical clocks are never exactly the same.
The technique used is sometimes called "cross-timestamping"; you capture timestamps from the clocks you want to compare as close to simultaneously as possible, and repeat this at regular intervals. There is usually some measurement error introduced, but some relatively simple filtering can get you a good characterization of the difference in the rate at which the clocks count.
The core PCM driver arranges to take a system clock timestamp as closely as possible to when an audio stream starts, and then it does a cross-timestamp between the system clock and audio clock (which can be measured in different ways) whenever it is asked to check the state of the hardware pointers for the DMA engine that moves samples around.
The default method of measuring the audio clock is via DMA hardware pointer comparison. This isn't terribly precise, but over longer periods of time you can still get a good measure of the rate difference. At the start of snd_pcm_update_hw_ptr0, a system timestamp is captured; this will end up being htstamp. The DMA pointers are then checked, and if it's determined that they've moved since the last check, audio_htstamp is calculated based on the number of frames DMA has copied and the nominal frequency of the audio clock. Then, once the DMA pointer update is done and right before snd_pcm_update_hw_ptr0 returns, another system timestamp is captured in driver_htstamp. This one isn't meant to be used when you're using the DMA hw_ptr method of calculating the audio_htstamp, though.
If you happen to have an audio device using the HDAudio driver, you can use an alternate and much more precise method of measuring the audio clock. It supplies an extra operation callback called get_time_info that is used instead of the default method of capturing the system and audio timestamps. In the HDAudio case, it takes a system timestamp for htstamp as close as possible to when it reads an internal counter driven by the same clock source as the audio clock; this forms the audio_htstamp. Afterwards, the same DMA hw_ptr bookkeeping is done, but the code that translates the pointer movement into time is skipped. The driver_htstamp is still taken right before the routine ends, though; this is "to let apps detect if the reference tstamp read by low-level hardware was provided with a delay", as the comment in the code says. This is because there's no guarantee that the get_time_info callback is going to take a new system timestamp; it may have previously recorded an audio timestamp along with a system timestamp as part of an interrupt handler. In this case, the timestamps you get might not match the available frames and delay frames counts calculated by the hw_ptr bookkeeping, but the driver_htstamp will let you know the closest system time to when those calculations were made.
In both cases, the code is designed to capture htstamp and audio_htstamp as closely together as possible, and for htstamp - trigger_htstamp to represent the amount of system time that passed during the period measured by audio_htstamp of the audio clock. You mostly shouldn't need to use driver_htstamp, but I guess it might be used with the USB Audio driver, as I think it and HDAudio are the only ones that do anything special with these interfaces right now.
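As an illustration of the cross-timestamping idea described above, here is a minimal sketch that estimates the rate ratio between two clocks with a least-squares fit. It does not call ALSA: the pairs stand in for (htstamp - trigger_htstamp, audio_htstamp) readings, and the data is synthetic, with a made-up 50 ppm drift and a made-up alternating 20 microsecond measurement error.

```python
def rate_ratio(pairs):
    """Least-squares slope of audio time against system time.

    pairs: iterable of (system_elapsed_seconds, audio_elapsed_seconds).
    A result above 1.0 means the audio clock runs fast relative to the
    system clock.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two cross-timestamps")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0:
        raise ValueError("system timestamps must not all be equal")
    return sxy / sxx

# Synthetic cross-timestamps taken every 100 ms.
pairs = []
for k in range(1, 101):
    system_elapsed = k * 0.1
    jitter = 20e-6 if k % 2 else -20e-6
    pairs.append((system_elapsed, system_elapsed * (1 + 50e-6) + jitter))

ratio = rate_ratio(pairs)
print(f"audio clock offset: {(ratio - 1) * 1e6:+.1f} ppm relative to system clock")
```

Fitting a slope over many readings is one simple form of the filtering mentioned above: the error of a single reading matters less the longer the observation window is.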
The documentation for this, although it doesn't contain all the details you might want to know, is part of the kernel documentation: http://lxr.free-electrons.com/source/Documentation/sound/alsa/timestamping.txt?v=4.9

Measuring time: differences among gettimeofday, TSC and clock ticks

I am doing some performance profiling for part of my program. And I try to measure the execution with the following four methods. Interestingly they show different results and I don't fully understand their differences. My CPU is Intel(R) Core(TM) i7-4770. System is Ubuntu 14.04. Thanks in advance for any explanation.
Method 1:
Use the gettimeofday() function, result is in seconds
Method 2:
Use the rdtsc instruction similar to https://stackoverflow.com/a/14019158/3721062
Methods 3 and 4 use Intel's Performance Counter Monitor (PCM) API
Method 3:
Use PCM's
uint64 getCycles(const CounterStateType & before, const CounterStateType &after)
Its description (I don't quite understand):
Computes the number core clock cycles when signal on a specific core is running (not halted)
Returns number of used cycles (halted cycles are not counted). The counter does not advance in the following conditions:
an ACPI C-state is other than C0 for normal operation
HLT
STPCLK+ pin is asserted
being throttled by TM1
during the frequency switching phase of a performance state transition
The performance counter for this event counts across performance state transitions using different core clock frequencies
Method 4:
Use PCM's
uint64 getInvariantTSC (const CounterStateType & before, const CounterStateType & after)
Its description:
Computes number of invariant time stamp counter ticks.
This counter counts irrespectively of C-, P- or T-states
Two sample runs generate the following results:
(Method 1 is in seconds. Methods 2~4 are divided by the same number to show a per-item cost.)
0.016489 0.533603 0.588103 4.15136
0.020374 0.659265 0.730308 5.15672
Some observations:
The ratio of Method 1 to Method 2 is very consistent, while the others are not; i.e., 0.016489/0.533603 ≈ 0.020374/0.659265. Assuming gettimeofday() is sufficiently accurate, the rdtsc method exhibits the "invariant" property. (Yes, I read on the Internet that the current generation of Intel CPUs has this feature for rdtsc.)
Method 3 reports higher values than Method 2. I guess it's somehow different from the TSC. But what is it?
Method 4 is the most confusing one. It reports numbers an order of magnitude larger than Methods 2 and 3. Shouldn't it also be some kind of cycle count? Not to mention that it carries the "Invariant" name.
gettimeofday() is not designed for measuring time intervals. Don't use it for that purpose.
If you need wall time intervals, use the POSIX monotonic clock. If you need CPU time spent by a particular process or thread, use the POSIX process time or thread time clocks. See man clock_gettime.
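A minimal sketch of that advice, using Python's wrapper around clock_gettime (available on Unix-like systems). The helper and the workload are placeholders made up for illustration.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    # CLOCK_MONOTONIC is unaffected by wall-clock adjustments, so it is
    # suitable for intervals; CLOCK_PROCESS_CPUTIME_ID counts CPU time
    # consumed by this process only.
    wall_start = time.clock_gettime(time.CLOCK_MONOTONIC)
    cpu_start = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID)
    result = func(*args)
    cpu_end = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID)
    wall_end = time.clock_gettime(time.CLOCK_MONOTONIC)
    return result, wall_end - wall_start, cpu_end - cpu_start

def workload(n):
    """Placeholder for the code being profiled."""
    return sum(i * i for i in range(n))

result, wall, cpu = timed(workload, 200_000)
print(f"wall: {wall:.6f} s, cpu: {cpu:.6f} s")
```

The same two clock IDs are available from C through clock_gettime() with a struct timespec.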
The PCM API is great for fine-tuned performance measurement when you know exactly what you are doing, which generally means obtaining a variety of separate memory, core, cache, low-power, ... performance figures. Don't start messing with it if you are not sure what exact services you need from it that you can't get from clock_gettime.

Resources