When to filter function details in linux perf? - performance-testing

When recording a trace with Linux perf and the intel_pt event, it is possible to restrict the trace to a particular function (func1):
perf record -e intel_pt/branch_type=call/u --filter 'filter func1 @ a.out' -- ./a.out
An alternative approach could be:
perf record -e intel_pt/branch_type=call/u -T --switch-events -- ./a.out
followed by
perf script --itrace -c | grep 'func1'
Here --itrace -c selects only those branches that are function calls.
My question is whether the timestamps recorded with the first approach are more accurate than those recorded with the second approach.
They could be, because the first approach filters and records the trace of one particular function only, while the second approach records everything.
They could also be worse, because the first approach requires more processing while recording the trace (online filtering overhead), whereas in the second approach all the filtering is done offline.

In theory, filtering can be done at different levels: in hardware, in the OS kernel, or in user-space code. For recent versions of Intel Processor Trace (intel_pt), the filtering is done in hardware by the trace PMU:
Manpage of perf-record
--filter=<filter>
Event filter. This option should follow an event selector (-e)
which selects either tracepoint event(s) or a hardware trace PMU
(e.g. Intel PT or CoreSight). ...
· address filters
A hardware trace PMU advertises its ability to accept a number of
address filters by specifying a non-zero value in
/sys/bus/event_source/devices/<pmu>/nr_addr_filters.
Address filters have the format:
filter|start|stop|tracestop <start> [/ <size>] [@<file name>]
Where:
- 'filter': defines a region that will be traced.
So the first approach is better in terms of trace log size. It generates tracing output only for the instruction pointer (EIP) range of the func1 function, and the hardware does not generate trace packets for other addresses. The second approach traces everything and emits packets for every call executed, which can amount to hundreds of megabytes per second.
Filter processing is done in hardware, so there is no huge overhead. I think the tracing hardware adds no overhead while executing code outside the filtered address range, and adds some 1-5% overhead while executing code inside the filtered (required) range.
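Whether the hardware can filter at all can be checked before choosing between the two approaches. The following is a minimal Python sketch, assuming the sysfs path from the man page excerpt above; the helper names are hypothetical and not part of perf.

```python
from pathlib import Path

# Location quoted in the perf-record man page; "intel_pt" as the PMU name is assumed.
EVENT_SOURCE_ROOT = "/sys/bus/event_source/devices"


def nr_addr_filters(pmu="intel_pt", root=EVENT_SOURCE_ROOT):
    """Number of address filters the PMU advertises (0 if absent or unreadable)."""
    try:
        return int((Path(root) / pmu / "nr_addr_filters").read_text().strip())
    except (OSError, ValueError):
        return 0


def can_filter_in_hardware(pmu="intel_pt", root=EVENT_SOURCE_ROOT):
    """True when perf record --filter can rely on hardware address filters."""
    return nr_addr_filters(pmu, root) > 0


if __name__ == "__main__":
    if can_filter_in_hardware():
        print("address filters available: filter while recording")
    else:
        print("no address filters: record everything and filter offline")
```

On a machine without Intel PT the helper simply reports 0 filters, which corresponds to the second (offline filtering) approach from the question.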
http://halobates.de/blog/p/406 links to the hardware description in the Intel SDM, vol. 3, chapter 35 (Intel Processor Trace), section "35.2.4.3 Filtering by IP" (Table 35-6, IA32_RTIT_CTL MSR, lists 4 hardware filtering ranges, so up to 4 address ranges can be defined):
Trace packet generation with configurable filtering by IP is supported if CPUID.(EAX=14H, ECX=0):EBX[bit 2] = 1. Intel PT can be configured to enable the generation of packets containing architectural states only when the processor is executing code within certain IP ranges. If the IP is outside of these ranges, generation of some packets is blocked. ... When ADDRn_CFG is set to enable IP filtering (see Section 35.3.1), tracing will commence when a taken branch or event is seen whose target address is in the ADDRn range. ... Note that some packets, such as MTC (Section 35.3.7) and other timing packets, do not depend on FilterEn.
35.2.5.5 Filter Enable (FilterEn) Filter Enable indicates that the Instruction Pointer (IP) is within the range of IPs that Intel PT is configured to watch. ... When FilterEn is 0, control flow packets are not generated (e.g., TNT, TIP). However, some packets, such as PIP, MTC, and PSB, may still be generated while FilterEn is clear.
For exact timestamps you need to configure the intel_pt options correctly. In intel_pt tracing mode, timestamps are generated by the hardware according to the intel_pt config terms passed to perf record. The perf-intel-pt man page lists the tsc, mtc and cyc terms for timing:
tsc Always supported. Produces TSC timestamp packets to provide
timing information. ...
mtc Produces MTC timing packets.
MTC packets provide finer grain timestamp information than TSC
packets. MTC packets record time using the hardware crystal
clock (CTC) which is related to TSC packets using a TMA packet.
Support for this feature is indicated by:
/sys/bus/event_source/devices/intel_pt/caps/mtc
which contains "1" if the feature is supported and
"0" otherwise.
The frequency of MTC packets can also be specified - see
mtc_period below.
cyc Produces CYC timing packets.
CYC packets provide even finer grain timestamp information than
MTC and TSC packets. A CYC packet contains the number of CPU
cycles since the last CYC packet. Unlike MTC and TSC packets,
CYC packets are only sent when another packet is also sent.
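The choice of timing terms can be automated from the capability files. The following is a rough Python sketch, not a perf API: the caps/mtc file is the one named in the man page excerpt above, while using caps/psb_cyc as the indicator for cyc support is an assumption taken from the perf-intel-pt documentation, and the function names are hypothetical.

```python
from pathlib import Path

CAPS_DIR = "/sys/bus/event_source/devices/intel_pt/caps"


def cap_enabled(name, caps_dir=CAPS_DIR):
    """True if the capability file exists and contains "1"."""
    try:
        return (Path(caps_dir) / name).read_text().strip() == "1"
    except OSError:
        return False


def timing_event(caps_dir=CAPS_DIR, modifier="u"):
    """Build an intel_pt event string with the finest timing terms available."""
    terms = ["tsc=1"]  # tsc is always supported
    if cap_enabled("mtc", caps_dir):
        terms.append("mtc=1")
    # Assumption: cyc support is advertised through caps/psb_cyc.
    if cap_enabled("psb_cyc", caps_dir):
        terms.append("cyc=1")
    return f"intel_pt/{','.join(terms)}/{modifier}"


if __name__ == "__main__":
    # The printed string is meant to be passed to: perf record -e <event> ...
    print(timing_event())
```

The resulting string is intended for perf record -e; whether the extra timing packets are worth the larger trace depends on how fine the timestamps need to be.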

Related

Counting L3 cache access event on Amd Zen 2 processors

I am trying to figure out the event to use with the perf stat command to count L3 cache accesses on an AMD Zen 2 processor. As per the PPR (http://developer.amd.com/wordpress/media/2017/11/54945_PPR_Family_17h_Models_00h-0Fh.pdf), section 2.1.13.4.1, page 168, the event is x01 and the umask is x80 for "[L3 Cache Accesses] (L3RequestG1)". From what I understand, the event to use in perf stat command would thus be r8001. But the following command always returns the count as zero no matter what load I run:
perf stat -a -e r8001 -- sleep 10
Performance counter stats for 'system wide':
0 r8001
10.001105322 seconds time elapsed
Am I misinterpreting the PPR or does [L3 Cache Accesses] (L3RequestG1) mean something else?
Also, is there a way to specify the slice of L3 cache to monitor for events in perf, as most of the newer architectures with high core counts have multiple L3 slices?
The L3 cache events can only be counted on the L3 PMU as clearly specified in both the physical mnemonic (L3PMCx01) and the logical mnemonic (Core::X86::Pmc::L3::L3RequestG1) of the event you want to measure. The L3 PMU is formally called L3PMC. This is similar to the cbox PMUs on Intel processors.
The default PMU in perf for raw events is cpu, which is the name the perf_events subsystem gives to the core PMU. An event specified using a raw event code without an explicit PMU, such as r8001, is equivalent to cpu/r8001/. The core event 0x001 represents the event Core::X86::Pmc::Core::FpSchedEmpty and the umask 0x80 is undefined for this event (see Section 2.1.15.4.1). So you're counting an undefined event. In this case, if the event happened to be implemented but not documented, then the event count may not be zero depending on whether it occurs during the execution of the program being profiled. Otherwise, the event count would be zero. perf_events doesn't stop you from counting undefined events.
Starting with upstream kernel version v5.4-rc1, the L3PMC is supported in perf_events under the name amd_l3. To determine whether you're using a kernel that supports this PMU, check whether it's enumerated using the command ls /sys/devices/*/format. If not supported, then you can't measure the L3 events on that kernel through perf.
If amd_l3 is supported, you have to explicitly specify the PMU as in amd_l3/r8001/ or amd_l3/event=0x01,umask=0x80/ to have the event counted on the right PMU. Or you can just use the perf event name l3_request_g1.caching_l3_cache_accesses.
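The check and the explicit PMU syntax can be combined in a small script. This is a sketch in Python under stated assumptions: it looks for the PMU under /sys/bus/event_source/devices (the answer uses ls /sys/devices/*/format instead), and the helper names are made up for illustration.

```python
from pathlib import Path

EVENT_SOURCE_ROOT = "/sys/bus/event_source/devices"


def available_pmus(root=EVENT_SOURCE_ROOT):
    """Names of the PMUs that perf_events enumerates in sysfs."""
    try:
        return sorted(p.name for p in Path(root).iterdir())
    except OSError:
        return []


def pmu_event(pmu, event, umask):
    """Format an event with an explicit PMU, e.g. amd_l3/event=0x01,umask=0x80/."""
    return f"{pmu}/event=0x{event:02x},umask=0x{umask:02x}/"


def l3_access_event(root=EVENT_SOURCE_ROOT):
    """Event string for L3RequestG1, or None if the kernel lacks amd_l3."""
    if "amd_l3" not in available_pmus(root):
        return None
    return pmu_event("amd_l3", 0x01, 0x80)


if __name__ == "__main__":
    print(l3_access_event() or "amd_l3 PMU not available on this kernel")
```

When the function returns a string, it can be passed to perf stat -a -e in place of the bare r8001.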
Do you know what the event L3RequestG1 represents? The documentation only describes it as "Caching: L3 cache accesses," which isn't very meaningful. It seems to me that the types of transactions it counts are a subset of those covered by the event L3LookupState. Table 19 in Section 2.1.15.2 says that L3 accesses and misses should be counted using rFF04 (L3LookupState) and r0106 (L3CombClstrState), respectively. Don't blindly expect that any of these events actually count whatever you want to measure.
The PPR you linked does not cover any Zen 2 processors; it is for some Zen and Zen+ processors (specifically models 00h-0Fh). You need to know the processor family and model to locate the right PPR.

What is the sampling rate for intel_pt event i.e., perf record -e intel_pt//?

The sampling rate of the perf record command can be set with -F. I want to know what the sampling rate is for the intel_pt event, i.e., for the command
perf record -e intel_pt// -- ./a.out
With -F in user mode, the maximum sampling rate allowed is 8000. perf record may store the trace a few thousand times per second, but the trace events recorded with perf record -e intel_pt// occur at a much higher frequency.
In other words, with the intel_pt event a trace of the application's execution is collected. Does perf record work differently when recording with the intel_pt event, i.e., in some non-sampling mode?
Yes, the intel_pt mode of perf record is different; it is not the sampling (statistical) profiling done with software (cpu-clock) or hardware (cycles) events. Sampling records about 4000 samples of the current EIP per second and gives a basic, inexact view of code execution. intel_pt is a hardware-based tracing technique that generates a lot of data about every control flow instruction (in the default perf intel_pt mode), which allows the full control flow to be reconstructed, but it has a bigger overhead. So the frequency of Intel PT equals the number of calls, branches and returns that the program executes per second (hundreds of millions).
With sampling on hardware events, perf record asks the hardware PMU to count some event, such as CPU cycles, and to generate an overflow interrupt after, for example, 2 million such events. On each interrupt, the perf_events subsystem in the kernel records the current OS timestamp, the pid/tid of the current thread and the EIP instruction pointer into a ring buffer, and resets the PMU counter to a new value. The perf subsystem limits the maximum interrupt frequency by autotuning that value, and the -F option changes the desired interrupt frequency. When the ring buffer (several megabytes in size) fills up, the perf user-space tool dumps its contents into the perf.data file. You can view the raw data with perf script or perf script -D, or build histograms with perf report (it sorts EIPs by how often an interrupt hit that instruction address, which is proportional to the time taken by that code). This mode produces around 4 thousand events per second of thread execution (perf report --header | grep sample_freq), at 48 bytes per sample, or 192 kilobytes per second. The overhead is low, but the sampling is not exact.
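The data-rate arithmetic for sampling can be written out as a short Python sketch. The 4000 samples per second and 48 bytes per sample come from the paragraph above; the 100 MB/s figure used for comparison is only an illustrative lower bound for "hundreds of megabytes per second" and is an assumption, not a measurement.

```python
def sampling_bytes_per_second(samples_per_second=4000, bytes_per_sample=48):
    """Data rate of classic sampling as described above."""
    return samples_per_second * bytes_per_sample


def volume_ratio(trace_bytes_per_second, samples_per_second=4000, bytes_per_sample=48):
    """How many times more data a trace produces than sampling does."""
    return trace_bytes_per_second / sampling_bytes_per_second(
        samples_per_second, bytes_per_sample
    )


if __name__ == "__main__":
    print(sampling_bytes_per_second(), "bytes/s from sampling")
    # Assumed, illustrative trace rate of 100 MB/s.
    print(round(volume_ratio(100_000_000)), "times more data from tracing")
```

Even at the assumed low end, tracing produces several hundred times more data than sampling, which is why the two modes are handled so differently.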
perf wiki has separate page for intel processor trace (intel_pt) - https://perf.wiki.kernel.org/index.php/Perf_tools_support_for_Intel%C2%AE_Processor_Trace
Control flow tracing is different from other kinds of performance analysis and debugging. It provides fine-grained information on branches taken in a program, but that means there can be a vast amount of trace data. Such an enormous amount of trace data creates a number of challenges, but it raises the central question: how to reduce the amount of trace data that needs to be captured. That inverts the way performance analysis is normally done. Instead of taking a test case and creating a trace of it, you need first to create a test case that is suitable for tracing.
So intel_pt is a tracing (logging) module integrated into the CPU hardware, and when armed it generates "hundreds of megabytes of trace data per CPU per second", depending on the settings used. With some settings it may even generate tracing data (the packet log) faster than it can be written to disk or even to RAM ("overflow packets"). According to the article at https://lwn.net/Articles/648154/, perf_events (kernel mode) in intel_pt mode just saves the full packet log into a separate (bigger?) ring buffer, and the perf tool (user space) periodically saves data from that ring buffer into a file for offline filtering, parsing and decoding. (The period of saving the aux or ring mmap into the file is not the same as the overflow interrupt frequency set by -F.) The PT decoder is then used to reconstruct perf-compatible samples from the PT packet log. The log data volume is huge, and the overhead ranges from about 1% to 10% or more, depending on the branch frequency of the executed code.
The documentation of intel_pt is the man page (man perf-intel-pt), a long text stored inside the Linux kernel source tree at
https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/perf-intel-pt.txt
Intel PT is first supported in Intel Core M and 5th generation Intel Core
processors that are based on the Intel micro-architecture code name Broadwell.
Trace data is collected by 'perf record' and stored within the perf.data file. ... Trace data must be 'decoded' which involves walking the object code and matching the trace data packets. ... Decoding is done on-the-fly. The decoder outputs samples in the same format as
samples output by perf hardware events, for example as though the "instructions"
or "branches" events had been recorded. Presently 3 tools support this:
'perf script', 'perf report' and 'perf inject'. ... The main distinguishing feature of Intel PT is that the decoder can determine
the exact flow of software execution. Intel PT can be used to understand why
and how did software get to a certain point, or behave a certain way. ... A limitation of Intel PT is that it produces huge amounts of trace data
(hundreds of megabytes per second per core) which takes a long time to decode
By default, perf record -e intel_pt// is the same as -e intel_pt/tsc=1,noretcomp=0/. The config terms section of man perf-intel-pt lists the default settings:
tsc Always supported.
Produces TSC timestamp packets to provide timing information. In some cases it is possible to decode without timing information, for example a per-thread context that does not overlap executable memory maps.
noretcomp Always supported. Disables "return compression" so a TIP
packet is produced when a function returns. Causes more packets to be
produced but might make decoding more reliable.
pt Specifies pass-through which enables the branch config term.
branch Enable branch tracing. Branch tracing is enabled by default
To represent software control flow, "branches" samples are produced.
By default a branch sample is synthesized for every single branch.
As it says, intel_pt in its default mode produces a control flow log: it asks the hardware to generate log packets for every control flow instruction (call, branch, return) and adds timestamps to synchronize the PT log with auxiliary perf records (such as exec or mmap, which are needed to find the actual code loaded into memory). It tries not to generate too much data, for example [a single bit is used per conditional branch (tnt)](https://conference.hitb.org/hitbsecconf2017ams/materials/D1T1 - Richard Johnson - Harnessing Intel Processor Trace on Windows for Vulnerability Discovery.pdf#page=12) and several bytes per indirect branch, but many programs execute hundreds of millions of branches per second.
Some useful and short slides on perf + intel_pt:
Andi Kleen, 2015 https://halobates.de/pt-tracing-summit15.pdf (PT modes current: Full trace mode, Snapshot mode; Upcoming: Sampling mode, Core dump, System crash mode)
Andi Kleen's posts on PT: https://halobates.de/blog/p/category/pt
Suchakrapani Datt Sharma, POLYTECHNIQUE MONTREAL, 2015 https://hsdm.dorsal.polymtl.ca/system/files/10Dec2015_0.pdf (trace packets overview - PSB (Packet Stream Boundary), TNT (Taken Not-Taken), TIP (Target IP) at branches, non-default CYC Packets : Cycle counter data for IPC, MTC (Mini Timestamp Counter), ...)
Jack Henschel, 2017 about design and use-cases https://blog.cubieserver.de/publications/Henschel_Intel-PT_2017.pdf
Alexander Shishkin, Intel, 2013, Efficient and Large Scale Program Flow Tracing in Linux https://events.static.linuxfound.org/sites/events/files/slides/lcna13_kleen.pdf ("What is it good for? •Profiling / performance measurement •Functional debugging •Code coverage analysis")
About generic difference between sampling and (software) tracing: https://danluu.com/perf-tracing/
Update: While the Intel PT trace log holds the full trace (it contains packets for every branch/call/return), perf report converts the PT log into a sample set like the one in a classic perf.data, and that sample set has a sampling rate. This is configured with the --itrace option of perf report (iNNTT, where NN is the amount and TT is the type: i/t/ms/us/ns), as described in the man page of perf-report:
--itrace
Options for decoding instruction tracing data. The options are:
i synthesize instructions events
g synthesize a call chain (use with i or x)
The default is all events i.e. the same as --itrace=ibxwpe,
In addition, the period (default 100000, ...)
for instructions events can be specified in units of:
i instructions
t ticks
ms milliseconds
us microseconds
ns nanoseconds (default)
So it seems that by default perf report converts the full trace log into instruction samples with a period of 100000 ns (nanoseconds are the default unit), i.e. one perf sample is generated per 100 us of trace. A shorter period can be selected, but processing time will increase.
Manpage of perf-intel-pt gives more examples of itrace option usage:
Because samples are synthesized after-the-fact, the sampling period
can be selected for reporting. e.g. sample every microsecond
sudo perf report pt_ls --itrace=i1usge
See the sections below for more information about the --itrace
option.
Beware the smaller the period, the more samples that are produced,
and the longer it takes to process them.
Also note that the coarseness of Intel PT timing information will
start to distort the statistical value of the sampling as the
sampling period becomes smaller.
To see every possible IPC value, "instructions" events can be used
e.g. --itrace=i0ns
--itrace=i10us
sets the period to 10us i.e. one instruction sample is synthesized
for each 10 microseconds of trace. Alternatives to "us" are "ms"
(milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i"
(instructions).
For Intel PT, the default period is 100us.
Setting it to a zero period means "as often as possible".
In the case of Intel PT that is the same as a period of 1 and a unit
of instructions (i.e. --itrace=i1i).
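How the period part of an --itrace string is read can be illustrated with a small parser. This is a simplified Python sketch based only on the rules quoted above (default 100000 ns, unit suffixes i/t/ms/us/ns, zero meaning one instruction); it is not perf's actual parser, it ignores every option other than i, and the function name is hypothetical.

```python
import re

# 'i' optionally followed by a number and a unit, e.g. i, i10us, i0ns, i1i.
_PERIOD_RE = re.compile(r"i(?:(\d+)(ms|us|ns|i|t)?)?")

DEFAULT_PERIOD = (100000, "ns")  # 100us, the Intel PT default quoted above


def instruction_period(itrace):
    """Return (period, unit) for the 'i' option, or None if 'i' is not given."""
    match = _PERIOD_RE.search(itrace)
    if match is None:
        return None  # no instructions events are synthesized
    if match.group(1) is None:
        return DEFAULT_PERIOD
    period = int(match.group(1))
    unit = match.group(2) or "ns"  # nanoseconds are the default unit
    if period == 0:
        return (1, "i")  # zero means "as often as possible", same as i1i
    return (period, unit)


if __name__ == "__main__":
    for spec in ("i10us", "i0ns", "i1usge", "g32l64i100us", "cr"):
        print(spec, "->", instruction_period(spec))
```

The sketch takes the first i it finds, so option strings where another option carries its own period in instruction units would need a fuller parser.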
http://halobates.de/blog/p/410 has some additional examples of complex conversions:
perf script --ns --itrace=cr
Record program execution and display function call graph.
perf script by defaults “samples” the data (only dumps a sample every
100us). This can be configured using the --itrace option (see
reference below)
perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64
Show every assembly instruction executed with disassembler.
perf report --itrace=g32l64i100us --branch-history
Print hot paths every 100us as call graph histograms
perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg
Generate flame graph from execution, sampled every 100us

How does perf associate events to functions?

More precisely, how does the perf tool associate PMU events with functions?
I already realized that when the kernel perf subsystem records the event counters, it also records the Program Counter (PC), so it can associate the count with a function.
However, to really get fine-grained results, you need to sample the counters at a very high rate; otherwise you may associate counters with a group of functions.
But reading the counters and writing the sampled data (counters, PC, call stack) to the perf mmap space is very intrusive.
I read in some sources that this sampling only happens when the PMU counters overflow, but this can be very coarse unless I set the counters to overflow very quickly.
What am I missing here?
perf record is a statistical profiling tool. It either programs the hardware performance monitoring unit (PMU) to overflow after some number of counts (for example, with -e cycles -c 1000000 it writes -1000000 to the counter and enables cycle counting; with -F, or without any freq/period argument, it autotunes the value), and on each overflow interrupt perf reprograms the counter for the next period, which gives several hundred to a few thousand events per second. Or it uses the OS timer interrupt (-e task-clock) to get periodic samples. On every sample (or on each interrupt from the hardware PMU) perf records the current PC (EIP) and/or call stack; it does not record the current value of the counter (check the full dump of the data stored in perf.data with perf script or perf script -D, or the code that dumps sample events: there is sample->ip but no current PMU count).
perf report parses perf.data to get all the PCs recorded in it. It counts how many times each PC was sampled to build a histogram [PC] -> sample_count. Every PC is associated with the exact function it belongs to (perf report parses the memory map, since mmap events are recorded in perf.data too, opens every binary used, and reads the symbol table of every binary).
The actual code of perf report is in linux/tools/perf/builtin-report.c: cmd_report/__cmd_report -> perf_session__process_events -> some magic -> process_sample_event, which records all the ip (PC) values mentioned in perf.data with hist_entry_iter__add(&iter, &al, rep->max_stack, rep); into the histogram via hist_iter__report_callback:
hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
. . . (perf/util/annotate.c) __symbol__inc_addr_samples
h->addr[offset]++;
Then it outputs the collected histogram with report__browse_hists -> perf_evlist__tty_browse_hists -> hists__fprintf_nr_sample_events(hists, rep, evname, stdout);.
Every sample is already associated with the exact function (and with a slightly inexact instruction inside it, because of the out-of-order nature of CPUs and the imprecise PMU overflow event); this is how statistical profiling works. When your program runs for a short time (less than a second) and/or the sampling frequency is too low, only a few samples may be recorded in perf.data. But if you have more than several hundred samples, you can find the most CPU-heavy functions (they probably follow the Pareto rule and each account for several tens of percent of the program's run time). When you want to see smaller functions (around several percent of the running time), use thousands or tens of thousands of samples and do some statistical estimation (you will not get a correct percentage for a function that runs 0.1% of the time when you have only 100 or 1000 samples).
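The histogram step described above can be sketched in a few lines of Python. This is a simplified model and not perf's implementation: the symbol table is a hypothetical list of (start, end, name) ranges that perf report would normally build from the mmap events and the symbol tables of the binaries.

```python
from bisect import bisect_right
from collections import Counter


def resolve(pc, symbols):
    """Map a sampled PC to a function name.

    symbols: list of (start, end, name) sorted by start, end exclusive.
    """
    starts = [start for start, _end, _name in symbols]
    index = bisect_right(starts, pc) - 1
    if index >= 0 and pc < symbols[index][1]:
        return symbols[index][2]
    return "[unknown]"


def histogram(sample_pcs, symbols):
    """Count samples per function and return (name, count, percent) rows."""
    counts = Counter(resolve(pc, symbols) for pc in sample_pcs)
    total = sum(counts.values())
    return [(name, n, 100.0 * n / total) for name, n in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical symbol ranges and sampled PCs, for illustration only.
    table = [(0x1000, 0x1100, "main"), (0x1100, 0x1200, "hot_loop")]
    pcs = [0x1104, 0x1150, 0x1008, 0x11F0]
    for name, n, pct in histogram(pcs, table):
        print(f"{pct:6.2f}%  {n:4d}  {name}")
```

The percentages are only as good as the number of samples behind them, which is the statistical caveat made at the end of the answer.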

Using perf to monitor raw event counters

I am trying to measure certain hardware events on a (Intel Xeon) machine with multiple (physical) processors. Specifically, I wish to know how many requests are issued for reading 'offcore' data.
I found the OFFCORE_REQUESTS hardware event in Intel's documentation; it gives the event descriptor 0xB0 and, for data demands, the additional mask 0x01.
Would it then be correct to tell perf to record the event 0xB1 (i.e. 0xB0 | 0x01) and to call it as:
perf record -e r0B1 ./mytestapp someargs
Or is this incorrect?
Because perf report shows no output for events entered like this.
The perf documentation is rather sparse in this area, apart from a tutorial entry which does not say which event it was (though this one works for me), or how it was encoded...
Any help is greatly appreciated.
Ok, so I guess I figured it out.
For the Intel machine I use, the format is as follows:
<umask><eventselector>, where both are hexadecimal values. The leading zeros of the umask can be dropped, but those of the event selector cannot.
So for the event 0xB0 with the mask 0x01 I can call:
perf record -e r1B0 ./mytestapp someargs
I could not manage to find the exact parsing of it in the perf kernel code (any kernel hacker here?), but I found these sources:
A description of the use of perf with raw events in the c't magazine 13/03 (subscription required), which describes some raw events with their description from the Intel Architecture Software Developer's Manual (Vol 3b)
A patch on the kernel mailing list, discussing the proper way to document it. It stated that the pattern above "... was x86 specific and imcomplete at that"
(Updated) The man page of newer versions shows an example on Intel machines: man perf-list
Update:
As pointed out in the comments (thank you!), the libpfm translator can be used to obtain the proper event descriptor. The website linked in the comments (Bojan Nikolic: How to monitor the full range of CPU performance events), discovered by user 'osgx' explains it in further detail.
It seems you can use as well:
perf record -e cpu/event=0xB0,umask=0x1/u ./mytestapp someargs
I don't know where this syntax is documented.
You can probably use the other arguments (edge, inv, cmask) as well.
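The two spellings of the same event can be generated from the event selector and umask. The following Python sketch assumes the usual x86 core PMU config layout (event in bits 0-7, umask in bits 8-15, edge in bit 18, inv in bit 23, cmask in bits 24-31), which is not stated in the text above; the function names are hypothetical.

```python
def raw_event(event, umask, cmask=0, inv=False, edge=False):
    """Encode an x86 core PMU event as perf's rNNN raw string.

    Assumed layout: event bits 0-7, umask bits 8-15, edge bit 18,
    inv bit 23, cmask bits 24-31.
    """
    for name, value in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    code = event | (umask << 8) | (int(edge) << 18) | (int(inv) << 23) | (cmask << 24)
    return f"r{code:X}"


def named_event(event, umask, pmu="cpu", modifier=""):
    """Spell the same event with explicit field names, e.g. cpu/event=0xB0,umask=0x1/u."""
    return f"{pmu}/event=0x{event:X},umask=0x{umask:X}/{modifier}"


if __name__ == "__main__":
    # OFFCORE_REQUESTS with the data-demand mask from the question.
    print(raw_event(0xB0, 0x01))
    print(named_event(0xB0, 0x01, modifier="u"))
```

Either string can be passed to perf record -e; the named form is easier to read and makes it harder to swap the umask and the event selector.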
There are several libraries which can be helpful to work with raw PMU events.
perf's own wiki https://perf.wiki.kernel.org/index.php/Tutorial#Events recommends the perf-list man page (perf list --help) for information about raw event encoding. Modern perf versions also list raw events as part of the perf list output ("... if linked against the libpfm4 library, provides some short description of the events."), and perf list --details also prints the raw ids and masks of events.
Bojan Nikolic's blog article "How to monitor the full range of CPU performance events" covers using the libpfm4 (perfmon2) library to encode raw events for perf with the showevtinfo and check_events tools, which ship with that library.
There is also ocperf, a Python wrapper around perf that accepts Intel's event names. It was written by Andi Kleen (Intel Open Source Technology Center) as part of the pmu-tools set of utilities (LWN post from 2013, event lists by Intel at https://download.01.org/perfmon/). There is a demo of ocperf (2011) at http://halobates.de/modern-pmus-yokohama.pdf:
ocperf
•Perf wrapper to support Intel specific events
•Allows symbolic events and some additional events
ocperf record -a -e offcore_response.any_data.remote_dram_0 sleep 10
The PAPI library also has a tool for exploring raw events with some descriptions: papi_native_avail.

Linux outgoing packet rate

Is there any way in Linux to measure the number of outgoing packets from a machine in a certain amount of time, let's say per second or per minute?
There are quite a few programs that can do this (most of the good ones are not part of a standard base distribution). The one I highly recommend is iptraf. Another one is ntop. Other than that, a custom shell script that reads the TX packets count from the ifconfig output for an interface in a loop at the desired interval can also do the trick.
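The polling idea from the last sentence can be sketched in Python instead of a shell script. Note the swap: this sketch reads the per-interface counters from /proc/net/dev rather than parsing ifconfig output, and it assumes the usual layout of that file (two header lines, transmitted packets in the tenth numeric column); the function names are hypothetical.

```python
import time


def tx_packets(text):
    """Parse /proc/net/dev content into {interface: transmitted packet count}."""
    counts = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, sep, fields = line.partition(":")
        if not sep:
            continue
        values = fields.split()
        if len(values) >= 10:
            counts[name.strip()] = int(values[9])  # 8 receive columns, then TX bytes, TX packets
    return counts


def tx_rate(interface, interval=1.0, path="/proc/net/dev"):
    """Outgoing packets per second on one interface, measured over `interval` seconds."""

    def read():
        with open(path) as f:
            return tx_packets(f.read())[interface]

    before = read()
    time.sleep(interval)
    return (read() - before) / interval
```

Calling tx_rate("eth0", interval=60) would give an average over one minute; the interface name here is only an example.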
