How to use the Linux `perf` tool to generate an "Off-CPU" profile

Brendan D. Gregg (author of the DTrace book) has an interesting variant of profiling: "Off-CPU" profiling (and the Off-CPU Flame Graph; slides 2013, p112-137), which shows where a thread or application was blocked (not executing on a CPU, but waiting for I/O, a page-fault handler, or descheduled due to a shortage of CPU resources):
This time reveals which code-paths are blocked and waiting while off-CPU, and for how long exactly. This differs from traditional profiling, which often samples the activity of threads at a given interval, and (usually) only examines threads if they are executing work on-CPU.
He also shows how to combine Off-CPU and On-CPU profile data together: http://www.brendangregg.com/FlameGraphs/hotcoldflamegraphs.html
The examples given by Gregg are made with DTrace, which is not usually available on Linux. But there are similar tools (ktap, SystemTap, perf), and perf, I think, has the widest installed base. Usually perf generates On-CPU profiles (showing which functions were executed most often on the CPU).
How can I translate Gregg's Off-CPU examples to the perf profiling tool on Linux?
PS: There is a link to a SystemTap variant of off-CPU flame graphs in the slides from LISA13, p124: "Yichun Zhang created these, and has been using them on Linux with SystemTap to collect the profile data. See: • http://agentzh.org/misc/slides/off-cpu-flame-graphs.pdf" (CloudFlare Beer Meeting on 23 August 2013)

The perf technique I published[1] is a high-overhead workaround, until perf has BPF support for doing this.
Right now, the lowest-cost way of generating an off-CPU flame graph on Linux is on a 4.6+ kernel (which has BPF stack trace support), with bcc/BPF. I wrote a tool for it, offcputime[2], which can be run with a -f option for "folded output", suitable for feeding into flamegraph.pl. This offcputime tool does the timing and stack counting all in kernel context, and dumps a report that is then printed with symbols.
One day, I expect that perf itself will be able to do this as well: run a BPF program that does the in-kernel counting and dumps a report.
In the meantime, we can use bcc/BPF. If for some reason you can't use bcc, you can, right now, take that offcputime program and write it in C. A more complicated version is available in the Linux source, as samples/bpf/offwaketime*. With the new BPF features on Linux, if there's a will, there's a way.
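For example, with bcc installed, the whole pipeline can look roughly like this (a rough sketch only: the bcc tool path varies by distribution, the FlameGraph scripts are assumed to be checked out locally, and offcputime's folded output counts microseconds):
/usr/share/bcc/tools/offcputime -f 30 > out.stacks        # trace blocked time for 30s, folded output (us)
git clone https://github.com/brendangregg/FlameGraph      # for flamegraph.pl, if not already available
./FlameGraph/flamegraph.pl --colors=io --countname=us \
    --title="Off-CPU Time Flame Graph" < out.stacks > offcpu.svg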
[1] http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
[2] https://github.com/iovisor/bcc/blob/master/tools/offcputime_example.txt

Brendan Gregg published instructions on generating off-CPU flame graphs:
http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
and https://github.com/brendangregg/FlameGraph/issues/47#
Off-CPU time flame graphs may solve (say) 60% of the issues, with the remainder requiring walking the thread wakeups to find root cause. I explained off-CPU time flame graphs, this wakeup issue, and additional work, in my LISA13 talk on flame graphs (slides, youtube).
Here I'll show one way to do off-CPU time flame graphs using Linux perf_events.
# perf record -e sched:sched_stat_sleep -e sched:sched_switch \
    -e sched:sched_process_exit -a -g -o perf.data.raw sleep 1
# perf inject -v -s -i perf.data.raw -o perf.data
# perf script -f comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
    NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
    NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
    ./stackcollapse.pl | \
    ./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg
stackcollapse.pl and flamegraph.pl from Gregg's FlameGraph repository are used to draw the flame graph.
The perf options used here are available in kernels 3.17 and newer.
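Before recording, you can check that the scheduler tracepoints used above are actually available on your kernel (a quick sanity check; sched_stat_sleep in particular may need scheduler statistics to be enabled before it fires):
perf list 'sched:*' | grep -E 'sched_stat_sleep|sched_switch|sched_process_exit'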

Related

What is the sampling rate for the intel_pt event, i.e., perf record -e intel_pt//?

The sampling rate for the perf record command can be set using -F. I want to know what the sampling rate is for the intel_pt event, i.e., for the command
perf record -e intel_pt// -- ./a.out
With -F in user mode the maximum sampling rate allowed is 8000. While perf record may store the trace a few thousand times per second, the trace events recorded with perf record -e intel_pt// have a much higher frequency.
In other words, with the intel_pt event a trace of the application's execution is collected. Does perf record work differently when recording with the intel_pt event, i.e., in some non-sampling mode?
Yes, the intel_pt mode of perf record is different; it is not the same statistical (sampling) profiling you get with software (cpu-clock) or hardware (cycles) events. Sampling gives you around 4000 samples of the current EIP per second and a basic, inexact view of code execution. intel_pt is a hardware-based tracing technique which generates a lot of data about every control-flow instruction (in the default perf intel_pt mode), allowing the full control flow to be reconstructed, but it has bigger overhead. So the "frequency" of Intel PT is the same as the number of calls, branches, and returns executed per second by the program code (hundreds of millions).
With sampling on hardware events, perf record asks the hardware PMU to count some event, like CPU cycles, and to generate an overflow interrupt after, for example, 2 million of such events. On such an interrupt the perf_events subsystem in the kernel records the current OS timestamp, the pid/tid of the current thread, and the EIP (instruction pointer) into a ring buffer, and resets the PMU counter with a new value. The perf subsystem limits the maximum frequency of interrupts by autotuning the value, and the -F option can be used to set the desired interrupt frequency. When the ring buffer (around several megabytes in size) is filled, the user-space perf tool dumps its contents into the perf.data file; you can view the raw data with perf script or perf script -D, or just make histograms with perf report (which sorts EIPs by how often an interrupt landed on that instruction address, which is proportional to the time taken by that code). This mode gives around 4 thousand events per second of thread execution (perf report --header | grep sample_freq), at 48 bytes per sample, or about 192 kilobytes per second. The overhead is low, but the sampling is not exact.
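For comparison, a classic sampling session looks like this (a sketch; ./a.out stands in for whatever workload is being profiled):
perf record -F 4000 -e cycles -g -- ./a.out    # ~4000 overflow interrupts per second of running thread
perf report --header | grep sample_freq        # the requested/autotuned sampling frequency
perf script | head                             # raw samples: comm, pid/tid, time, event, IP/symbol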
The perf wiki has a separate page for Intel Processor Trace (intel_pt): https://perf.wiki.kernel.org/index.php/Perf_tools_support_for_Intel%C2%AE_Processor_Trace
Control flow tracing is different from other kinds of performance analysis and debugging. It provides fine-grained information on branches taken in a program, but that means there can be a vast amount of trace data. Such an enormous amount of trace data creates a number of challenges, but it raises the central question: how to reduce the amount of trace data that needs to be captured. That inverts the way performance analysis is normally done. Instead of taking a test case and creating a trace of it, you need first to create a test case that is suitable for tracing.
So, intel_pt is a tracing (logging) module integrated into the CPU hardware; when armed, it generates "hundreds of megabytes of trace data per CPU per second", depending on the settings used. With some settings it may even generate trace data (the packet log) faster than it can be written to disk or even to RAM ("overflow packets"). According to the https://lwn.net/Articles/648154/ article, perf_events (kernel mode) in intel_pt mode just saves the full packet log into a separate (bigger?) ring buffer, and the perf tool (user space) periodically saves data from that ring buffer into the file for offline filtering, parsing, and decoding. (The period of saving the aux ring mmap into the file is not the same as the overflow interrupt frequency option -F.) A PT decoder is then used to reconstruct the PT packet log into perf-compatible samples. The log data volume is huge, and the overhead is 1% - 5% - 10% or more, depending on the branch frequency of the code being executed.
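A crude way to see the difference in data volume (a sketch; it requires a PT-capable CPU and kernel, and ./a.out is a placeholder workload):
perf record -e cycles -- ./a.out && ls -lh perf.data          # sampling: typically well under a megabyte
perf record -e intel_pt//u -- ./a.out && ls -lh perf.data     # PT trace: often tens of megabytes or more for the same run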
Documentation for intel_pt is in the man perf-intel-pt man page and in a long text file inside the Linux kernel source at
https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/perf-intel-pt.txt
Intel PT is first supported in Intel Core M and 5th generation Intel Core
processors that are based on the Intel micro-architecture code name Broadwell.
Trace data is collected by 'perf record' and stored within the perf.data file. ... Trace data must be 'decoded' which involves walking the object code and matching the trace data packets. ... Decoding is done on-the-fly. The decoder outputs samples in the same format as samples output by perf hardware events, for example as though the "instructions" or "branches" events had been recorded. Presently 3 tools support this: 'perf script', 'perf report' and 'perf inject'. ... The main distinguishing feature of Intel PT is that the decoder can determine the exact flow of software execution. Intel PT can be used to understand why and how did software get to a certain point, or behave a certain way. ... A limitation of Intel PT is that it produces huge amounts of trace data (hundreds of megabytes per second per core) which takes a long time to decode
By default perf record -e intel_pt// is the same as -e intel_pt/tsc=1,noretcomp=0/. The config terms section of the man perf-intel-pt man page describes the default settings:
tsc: Always supported. Produces TSC timestamp packets to provide timing information. In some cases it is possible to decode without timing information, for example a per-thread context that does not overlap executable memory maps.
noretcomp: Always supported. Disables "return compression" so a TIP packet is produced when a function returns. Causes more packets to be produced but might make decoding more reliable.
pt: Specifies pass-through which enables the branch config term.
branch: Enable branch tracing. Branch tracing is enabled by default.
To represent software control flow, "branches" samples are produced. By default a branch sample is synthesized for every single branch.
As it says, intel_pt in its default mode is used to produce a control-flow log, by asking the hardware to generate log packets for every control-flow instruction (calls, branches, returns) and to add timestamps to synchronize the PT log with some service perf samples (like exec or mmap, to find the actual code being loaded into memory). It tries not to generate too much data, for example [a single bit is used per conditional branch (TNT)](https://conference.hitb.org/hitbsecconf2017ams/materials/D1T1 - Richard Johnson - Harnessing Intel Processor Trace on Windows for Vulnerability Discovery.pdf#page=12) and several bytes per indirect branch, but many programs execute hundreds of millions of branches per second.
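On a PT-capable machine, the config terms exposed by the intel_pt PMU driver can be listed through sysfs (paths assume the standard sysfs layout that perf itself reads):
ls /sys/bus/event_source/devices/intel_pt/format/      # term names, e.g. tsc, noretcomp, mtc, cyc, ...
cat /sys/bus/event_source/devices/intel_pt/format/tsc  # shows which config bit(s) the term maps to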
Some useful and short slides on perf + intel_pt:
Andi Kleen, 2015 https://halobates.de/pt-tracing-summit15.pdf (PT modes current: Full trace mode, Snapshot mode; Upcoming: Sampling mode, Core dump, System crash mode)
Andi Kleen's posts on PT: https://halobates.de/blog/p/category/pt
Suchakrapani Datt Sharma, POLYTECHNIQUE MONTREAL, 2015 https://hsdm.dorsal.polymtl.ca/system/files/10Dec2015_0.pdf (trace packets overview - PSB (Packet Stream Boundary), TNT (Taken Not-Taken), TIP (Target IP) at branches, non-default CYC Packets : Cycle counter data for IPC, MTC (Mini Timestamp Counter), ...)
Jack Henschel, 2017 about design and use-cases https://blog.cubieserver.de/publications/Henschel_Intel-PT_2017.pdf
Alexander Shishkin, Intel, 2013, Efficient and Large Scale Program Flow Tracing in Linux: https://events.static.linuxfound.org/sites/events/files/slides/lcna13_kleen.pdf ("What is it good for? •Profiling / performance measurement •Functional debugging •Code coverage analysis")
About generic difference between sampling and (software) tracing: https://danluu.com/perf-tracing/
Update: While the Intel PT trace log is a full trace (it contains packets for every branch/call/return), perf report converts the PT log into a sample set, like in classic perf.data, and that sample set has a sampling rate. This is configured with the --itrace option of perf report (iNNTT, where NN is the amount and TT is the unit: i/t/ms/us/ns), as described in the perf-report man page:
--itrace
Options for decoding instruction tracing data. The options are:
i synthesize instructions events
g synthesize a call chain (use with i or x)
The default is all events i.e. the same as --itrace=ibxwpe,
In addition, the period (default 100000, ...)
for instructions events can be specified in units of:
i instructions
t ticks
ms milliseconds
us microseconds
ns nanoseconds (default)
So it seems that by default perf report converts the full trace log into synthesized instruction samples with a period of 100000 ns, i.e., one perf sample per 100 microseconds of traced execution (see the man page excerpt below: "For Intel PT, the default period is 100us"). The period can be made smaller, but processing time will increase.
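If you prefer a period expressed in instruction counts rather than in trace time, the unit can be given explicitly (a sketch):
perf report --itrace=i100000i    # synthesize one instruction sample per 100,000 executed instructions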
The perf-intel-pt man page gives more examples of --itrace option usage:
Because samples are synthesized after-the-fact, the sampling period
can be selected for reporting. e.g. sample every microsecond
sudo perf report pt_ls --itrace=i1usge
See the sections below for more information about the --itrace
option.
Beware the smaller the period, the more samples that are produced,
and the longer it takes to process them.
Also note that the coarseness of Intel PT timing information will
start to distort the statistical value of the sampling as the
sampling period becomes smaller.
To see every possible IPC value, "instructions" events can be used
e.g. --itrace=i0ns
--itrace=i10us
sets the period to 10us i.e. one instruction sample is synthesized
for each 10 microseconds of trace. Alternatives to "us" are "ms"
(milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i"
(instructions).
For Intel PT, the default period is 100us.
Setting it to a zero period means "as often as possible".
In the case of Intel PT that is the same as a period of 1 and a unit
of instructions (i.e. --itrace=i1i).
http://halobates.de/blog/p/410 has some additional examples of complex conversions:
perf script --ns --itrace=cr
Record program execution and display function call graph.
perf script by default “samples” the data (only dumps a sample every
100us). This can be configured using the --itrace option (see
reference below)
perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64
Show every assembly instruction executed with disassembler.
perf report --itrace=g32l64i100us --branch-history
Print hot paths every 100us as call graph histograms
perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg
Generate flame graph from execution, sampled every 100us

Is there any alternative to wait3 to get the rusage structure in shell scripting?

I was trying to monitor the peak memory usage of a child process. time -v is an option, but it does not work on Solaris. So is there any way to get the details that are in the rusage structure from shell scripting?
You can use /usr/bin/timex
From the /usr/bin/timex man page:
The given command is executed; the elapsed time, user time and system
time spent in execution are reported in seconds. Optionally, process
accounting data for the command and all its children can be listed or
summarized, and total system activity during the execution interval
can be reported.
...
-p List process accounting records for command and all its children. This option works only if the process accounting software is installed. Suboptions f, h, k, m, r, and t modify the data items reported. The options are as follows:
...
Start with the man page for acctadm to get process accounting enabled.
Note that on Solaris, getrusage() and wait3() do not return memory usage statistics. See the (somewhat dated) getrusage() source code at http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/syscall/rusagesys.c and the wait3() source code at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libbc/libc/sys/common/wait.c#158 (That's actually OpenSolaris source, which Oracle dropped support for, and it may not represent the current Solaris implementation, although a few tests on Solaris 11.2 show that the RSS data is in fact still zero.)
Also, from the Solaris getrusage() man page:
The ru_maxrss, ru_ixrss, ru_idrss, and ru_isrss members of the
rusage structure are set to 0 in this implementation.
There are almost certainly other ways to get the data, such as dtrace.
Edit:
dtrace doesn't look to be much help, unfortunately. Attempting to run this dtrace script with dtrace -s memuse.d -c bash
#!/usr/sbin/dtrace -s

#pragma D option quiet

profile:::profile-1001hz
/ pid == $target /
{
    @pct[ pid ] = max( curpsinfo->pr_pctmem );
}

dtrace:::END
{
    printa( "pct: %@u %a\n", @pct );
}
resulted in the following error message:
dtrace: failed to compile script memuse.d: line 8: translator does not define conversion for member: pr_pctmem
dtrace on Solaris doesn't appear to provide access to process memory usage. In fact, the Solaris 11.2 /usr/lib/dtrace/procfs.d translator for procfs data has this comment in it:
/*
* Translate from the kernel's proc_t structure to a proc(4) psinfo_t struct.
* We do not provide support for pr_size, pr_rssize, pr_pctcpu, and pr_pctmem.
* We also do not fill in pr_lwp (the lwpsinfo_t for the representative LWP)
* because we do not have the ability to select and stop any representative.
* Also, for the moment, pr_wstat, pr_time, and pr_ctime are not supported,
* but these could be supported by DTrace in the future using subroutines.
* Note that any member added to this translator should also be added to the
* kthread_t-to-psinfo_t translator, below.
*/
Browsing the Illumos.org source code, searching for pr_rssize, indicates that the procfs data is computed only when needed, and not updated continually as the process runs. (See http://src.illumos.org/source/search?q=pr_rssize&defs=&refs=&path=&hist=&project=illumos-gate)
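Given that, one crude fallback from a shell script is simply to poll the child's RSS while it runs and keep the maximum (a rough sketch only; ps -o rss reports kilobytes on Solaris, and polling can miss short-lived peaks):
#!/bin/sh
# Rough sketch: run a command and report the peak RSS (KB) observed by polling ps.
"$@" &
pid=$!
peak=0
while kill -0 "$pid" 2>/dev/null; do
    rss=`ps -o rss= -p "$pid" 2>/dev/null | tr -d ' '`
    [ -n "$rss" ] && [ "$rss" -gt "$peak" ] && peak=$rss
    sleep 1
done
wait "$pid"
echo "peak RSS observed: ${peak} KB"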

Using perf probe to monitor performance stats during a particular function

I'm trying to monitor performance stats during a particular function using the Linux perf tool.
I was following the instructions given at https://perf.wiki.kernel.org/index.php/Jolsa_Features_Togle_Event#Example_-_using_u.28ret.29probes
I tried to get the instruction count of a simple C program, as shown below.
1) My simple C code
#include <stdio.h>

int sum = 0;
int i = 0;

void func(void)
{
    for (i = 0; i < 100; i++)
    {
        sum = sum + i;
    }
}

int main(void)
{
    func();
    return 0;
}
2) Compiling and adding probes
root@sunimal-laptop:/home/sunimal/temp# gcc -o ex source.c
root@sunimal-laptop:/home/sunimal/temp# perf probe -x ./ex entry=func
Added new event:
probe_ex:entry (on 0x4ed)
You can now use it in all perf tools, such as:
perf record -e probe_ex:entry -aR sleep 1
root@sunimal-laptop:/home/sunimal/temp# perf probe -x ./ex exit=func%return
Added new event:
probe_ex:exit (on 0x4ed%return)
You can now use it in all perf tools, such as:
perf record -e probe_ex:exit -aR sleep 1
3) Trying to use perf stat to measure the instruction count within the func() function. This leads to an error.
root@sunimal-laptop:/home/sunimal/temp# perf stat -e instructions:u,probe_ex:entry/on=instructions/,probe_ex:exit/off=instructions/ ./ex
invalid or unsupported event: 'instructions:u,probe_ex:entry/on=instructions/,probe_ex:exit/off=instructions/'
Run 'perf list' for a list of valid events
Could someone point out where I went wrong?
[I'm using Linux kernel 3.11.0-12-generic]
I think that the instructions you are following are not yet included in the mainline Linux kernel. As a consequence, perf is telling you that the events are not supported: perf doesn't know about the "toggle" mechanism mentioned on that page.
I can see two workarounds:
If you have access to the source code you want to profile you can use the perf_event_open system call directly from your source code to start and stop counting on function entry and exit.
Clone jolsa's repository (git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/jolsa/perf), switch to the core_toggle branch (git co remotes/origin/perf/core_toggle), and then compile and run the kernel with this support.
Regarding 2, I am not familiar at all with kernel versions and development, and I think that this solution may be quite complex to use and maintain. Maybe you should ask on the perf users mailing list if there are any plans for the toggle feature to be integrated into the mainline kernel.

Using perf to monitor raw event counters

I am trying to measure certain hardware events on an (Intel Xeon) machine with multiple (physical) processors. Specifically, I wish to know how many requests are issued for reading 'offcore' data.
I found the OFFCORE_REQUESTS hardware event in Intel's documentation; it gives the event descriptor 0xB0 and, for data demands, the additional mask 0x01.
Would it then be correct to tell perf to record the event 0xB1 (i.e. 0xB0 | 0x01) and to call it as:
perf record -e r0B1 ./mytestapp someargs
Or is this incorrect?
Because perf report shows no output for events entered like this.
The perf documentation is rather sparse in this area, apart from a tutorial entry which does not say which event it was (though this one works for me), or how it was encoded...
Any help is greatly appreciated.
Ok, so I guess I figured it out.
For the Intel machine I use, the format is as follows:
r<umask><eventselector>, where both are hexadecimal values. The leading zeros of the umask can be dropped, but not those of the event selector.
So for the event 0xB0 with the mask 0x01 I can call:
perf record -e r1B0 ./mytestapp someargs
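To sanity-check that the counter actually increments, running perf stat with the same raw descriptor is convenient (a sketch; the raw code is specific to the CPU model):
perf stat -e r1B0 ./mytestapp someargs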
I could not manage to find the exact parsing of it in the perf kernel code (any kernel hacker here?), but I found these sources:
A description of the use of perf with raw events in the c't magazine 13/03 (subscription required), which describes some raw events with their descriptions from the Intel Architecture Software Developer's Manual (Vol 3b)
A patch on the kernel mailing list, discussing the proper way to document it. It noted that the pattern above "... was x86 specific and incomplete at that".
(Updated) The man page of newer perf versions shows an example for Intel machines: man perf-list
Update:
As pointed out in the comments (thank you!), the libpfm translator can be used to obtain the proper event descriptor. The website linked in the comments (Bojan Nikolic: How to monitor the full range of CPU performance events), discovered by user 'osgx', explains it in further detail.
It seems you can use as well:
perf record -e cpu/event=0xB1,umask=0x1/u ./mytestapp someargs
I don't know where this syntax is documented.
You can probably use the other arguments (edge, inv, cmask) as well.
There are several libraries which can be helpful to work with raw PMU events.
perf's own wiki https://perf.wiki.kernel.org/index.php/Tutorial#Events recommends the perf list --help man page for info about raw event encoding. Modern perf versions will also list raw events as part of the perf list output ("... if linked against the libpfm4 library, provides some short description of the events."), and perf list --details will print the raw ids and masks of events.
Bojan Nikolic has a blog article, "How to monitor the full range of CPU performance events", about using the libpfm4 (perfmon2) library to encode raw events for perf with the help of the showevtinfo and check_events tools, which are provided with the same library.
There is also the perf Python wrapper ocperf, which accepts Intel's event names. It is written by Andi Kleen (Intel Open Source Technology Center) as part of the pmu-tools set of utilities (LWN post from 2013; event lists by Intel at https://download.01.org/perfmon/). There is a demo of ocperf (2011) at http://halobates.de/modern-pmus-yokohama.pdf:
ocperf
•Perf wrapper to support Intel specific events
•Allows symbolic events and some additional events
ocperf record -a -e offcore_response.any_data.remote_dram_0 sleep 10
The PAPI library also has a tool to explore raw events with some descriptions: papi_native_avail.

Get CPU usage in shell script?

I'm running some JMeter tests against a Java process to determine how responsive a web application is under load (500+ users). JMeter will give the response time for each web request, and I've written a script to ping the Tomcat Manager every X seconds which will get me the current size of the JVM heap.
I'd like to collect stats on the server of the % of CPU being used by Tomcat. I tried to do it in a shell script using ps like this:
PS_RESULTS=`ps -o pcpu,pmem,nlwp -p $PID`
...running the command every X seconds and appending the results to a text file. (for anyone wondering, pmem = % mem usage and nlwp is number of threads)
However I've found that this gives a different definition of "% of CPU Utilization" than I'd like - according to the manpages for ps, pcpu is defined as:
cpu utilization of the process in "##.#" format. It is the CPU time used divided by the time the process has been running (cputime/realtime ratio), expressed as a percentage.
In other words, pcpu gives me the % CPU utilization for the process for the lifetime of the process.
Since I want to take a sample every X seconds, I'd like to be collecting the CPU utilization of the process at the current time only - similar to what top would give me
(CPU utilization of the process since the last update).
How can I collect this from within a shell script?
Use top -b (and other switches if you want different outputs). It will just dump to stdout instead of jumping into a curses window.
The most useful tool I've found for monitoring a server while performing a test such as JMeter on it is dstat. It not only gives you a range of stats from the server, it outputs to csv for easy import into a spreadsheet and lets you extend the tool with modules written in Python.
User load: top -b -n 2 |grep Cpu |tail -n 1 |awk '{print $2}' |sed 's/.[^.]*$//'
System load: top -b -n 2 |grep Cpu |tail -n 1 |awk '{print $3}' |sed 's/.[^.]*$//'
Idle load: top -b -n 1 |grep Cpu |tail -n 1 |awk '{print $5}' |sed 's/.[^.]*$//'
Each result is a whole number (the sed strips everything from the decimal point onward).
Off the top of my head, I'd use the /proc filesystem view of the system state - Look at man 5 proc to see the format of the entry for /proc/PID/stat, which contains total CPU usage information, and use /proc/stat to get global system information. To obtain "current time" usage, you probably really mean "CPU used in the last N seconds"; take two samples a short distance apart to see the current rate of CPU consumption. You can then munge these values into something useful. Really though, this is probably more a Perl/Ruby/Python job than a pure shell script.
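That said, the two-sample /proc approach can be done in plain shell with a little awk (a rough sketch, Linux only; field positions follow proc(5)):
#!/bin/sh
# Rough sketch: %CPU of a process over a short window, read from /proc/<pid>/stat.
# Usage: cpupct.sh <pid> [interval_seconds]
pid=$1
interval=${2:-1}
hz=`getconf CLK_TCK`        # kernel clock ticks per second (usually 100)
ticks() {
    # strip "pid (comm) " so a comm containing spaces cannot shift the fields,
    # then add utime (12th remaining field) and stime (13th)
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}
t1=`ticks "$pid"`
sleep "$interval"
t2=`ticks "$pid"`
awk -v d="$((t2 - t1))" -v hz="$hz" -v s="$interval" \
    'BEGIN { printf "%.1f%% of one CPU\n", 100 * d / (hz * s) }'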
You might be able to get the rough data you're after with /proc/PID/status, which gives a Sleep average for the process. Pretty coarse data though.
Also use 1 as the iteration count, so you get the current snapshot without waiting for another one after $delay time:
top -b -n 1
This will not give you a per-process metric, but the Stress Terminal UI is super useful to know how badly you're punishing your boxes. Add the -c flag to make it dump the data to a CSV file.
