Measure CPU usage of multithreaded program

Measure CPU usage of multithreaded program - linux

In a GCC bug report (Bug 51617), the poster times his asynchronous (C++11) program's execution with an output looking like this:
/tmp/tst 81.54s user 0.23s system 628% cpu 13.001 total
What could the poster be using which gives that (or similar) output?
NB: An inspection of my man entry for time doesn't suggest anything useful to this end.

The time(1) man page on my (Ubuntu) Linux system says:
-f FORMAT, --format FORMAT
Use FORMAT as the format string that controls the output of
time. See the below more information.
:
FORMATTING THE OUTPUT
The format string FORMAT controls the contents of the time output. The
format string can be set using the `-f' or `--format', `-v' or `--ver‐
bose', or `-p' or `--portability' options. If they are not given, but
the TIME environment variable is set, its value is used as the format
string. Otherwise, a built-in default format is used. The default
format is:
%Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
%Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps
:
The resource specifiers, which are a superset of those recognized by
the tcsh(1) builtin `time' command, are:
% A literal `%'.
C Name and command line arguments of the command being
timed.
D Average size of the process's unshared data area, in
Kilobytes.
E Elapsed real (wall clock) time used by the process, in
[hours:]minutes:seconds.
F Number of major, or I/O-requiring, page faults that oc‐
curred while the process was running. These are faults
where the page has actually migrated out of primary memo‐
ry.
I Number of file system inputs by the process.
K Average total (data+stack+text) memory use of the
process, in Kilobytes.
M Maximum resident set size of the process during its life‐
time, in Kilobytes.
O Number of file system outputs by the process.
P Percentage of the CPU that this job got. This is just
user + system times divided by the total running time. It
also prints a percentage sign.
R Number of minor, or recoverable, page faults. These are
pages that are not valid (so they fault) but which have
not yet been claimed by other virtual pages. Thus the
data in the page is still valid but the system tables
must be updated.
S Total number of CPU-seconds used by the system on behalf
of the process (in kernel mode), in seconds.
U Total number of CPU-seconds that the process used direct‐
ly (in user mode), in seconds.
W Number of times the process was swapped out of main memo‐
ry.
X Average amount of shared text in the process, in Kilo‐
bytes.
Z System's page size, in bytes. This is a per-system con‐
stant, but varies between systems.
c Number of times the process was context-switched involun‐
tarily (because the time slice expired).
e Elapsed real (wall clock) time used by the process, in
seconds.
k Number of signals delivered to the process.
p Average unshared stack size of the process, in Kilobytes.
r Number of socket messages received by the process.
s Number of socket messages sent by the process.
t Average resident set size of the process, in Kilobytes.
w Number of times that the program was context-switched
voluntarily, for instance while waiting for an I/O opera‐
tion to complete.
x Exit status of the command.
So you can get CPU percentage as %P in the format.
Note that this is for the /usr/bin/time binary -- the shell time builtin is usually different (and less capable)

Related

How to know max event period of perf-record

does anyone know how to get the maximum event period value (or the value that kernel actually passes to PMU) of Perf event?
I'm using perf to measure my program as follow:
perf record -d -e cpu/event=0xd0,umask=0x81/ppu,cpu/event=0xd0,umask=0x82/ppu -c 5
cpu/event=0xd0,umask=0x81/ppu means measure all loads in cpu, and cpu/event=0xd0,umask=0x82/ppu is all stores.
I tried to understand how arguments passing in perf by strace, but found nothing.
Is the PMU received a value that over its ability, will still try to reach it? If so, where can find related code and what is its maximum event period of those events?
Thanks everyone.

The perf record command accepts period values much larger than 255. Internally, the processor maintains a counter for recording all the memory loads and memory stores(or for that matter, any other supported event). Once the counter overflows, the processor will record all the information about the memory load/store that you are trying to record(information about architectural state/registers etc.) .
Also once the counter overflows, it must be reset again. Usually the counter is reset to a value less than 0. Since it is set to a value less than zero and it increments, the counter will overflow once it hits 0 again.
This counter reset value that I was talking about is the period value that you asked for. What I mean is that, if the period is specified by -c 1 , it means that the counter reset value will be set to -1, so the next memory load/store will increment the counter to 0(leading to a counter overflow) and you will record the events.
Thus, if you set the period to 1, there will be a counter overflow on each memory load/store event and you will record all of them (this is only conceptual however, the hardware usually cannot do this).
What this means is that, the period value can go as large as the size of a hardware counter for these events. Usually in modern microarchitectures , like Broadwell/Haswell/Skylake, these counters are 48-bits in size. So the period might go as large as 2^48-1. However, usage of such large values are not recommended.
Usually, the period value should be kept to a maximum of 2^32-1 in 32-bit systems and is usually the norm in other systems too.
Sources :
Chapter 18 of this book
Please read the topic Sampling with perf record in this link too
If you want you can read the answer to this question too.

PERF STAT does not count memory-loads but counts memory-stores

Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu : 17.04
I have been trying to collect stats of memory-accesses using perf stat. I am able to collect stats for memory-stores but the count for memory-loads return me a 0 value.
The below is the details for memory-stores :-
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a 0 count as can be seen below :-
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event in any way to get proper data ?

The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is
used that can keep track of a data load from issue to completion.
This is more complicated than simply counting instances of an event
(as with normal event-based sampling), and so only some loads are
tracked. Loads are randomly chosen, the latency determined for each,
and the correct event(s) incremented (latency >4, >8, >16, etc). Due
to the nature of the sampling for this event, only a small percentage
of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means count the total number of loads and it is not designed for that purpose at all.
If you want to to determine which instructions in your code are issuing load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem and it achieves its purpose by using this event.
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, only more portable because they also work on AMD processors.

I have used a Broadwell(CPU e5-2620) server machine to collect all of the below events.
To collect memory-load events, I had to use a numeric event value. I basically ran the below command -
./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
Here r81d0 represents the raw event for counting "memory loads amongst all instructions retired". "u" as can be understood represents user-space.
The below command, on the other hand,
./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
has "r82d0:u" as a raw event representing "memory stores amongst all instructions retired in userspace".

PEBS records much less memory-access samples than actually present

I have been trying to log memory accesses that are made by a program using Perf and PEBS counters. My intention was to log all of the memory accesses made by a program (I chose programs from SpecCPU2006). By tweaking certain parameters, I seem to record much more samples than there actually is for the program. I know, as has been said previously, that it is tough to record all of the memory access samples but leaving that aside, I want to know how can PEBS record more samples than there actually is?
I followed the below steps :-
First of all, I modified the /proc/sys/kernel/perf_cpu_time_max_percent value. Initially it was 25%, I changed it to 95%. This was because I wanted to see if I can record the maximum number of memory access samples. This would also allow me to probably use a much higher perf_event_max_sample_rate, which is usually 100,000 at a maximum but now I can set it to a higher value without it being lowered down.
I used a much higher value for perf_event_max_sample_rate which is 244,500, instead of the maximum allowable value of 100,000.
Now what I did was I used perf-stat to record the total count of the memory-stores information in a program. I got the below data :-
./perf stat -e cpu/mem-stores/u ../../.././libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for '../../.././libquantum_base.arnab 100':
158,115,509 cpu/mem-stores/u
0.591718162 seconds time elapsed
There are roughly ~158 million events as indicated by perf-stat, which should be a correct indicator, since this is directly coming from the hardware counter values.
But now, as I run the perf record -e command and use PEBS counters to calculate all of the memory store events that are possible :-
./perf record -e cpu/mem-stores/upp -c 1 ../../.././libquantum_base.arnab 100
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.
Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.
Samples in kernel modules won't be resolved at all.
If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
[ perf record: Woken up 32 times to write data ]
[ perf record: Captured and wrote 7.827 MB perf.data (254125 samples) ]
I can see 254125 samples being recorded. This is much much less than what was returned by perf stat. I am recording all of these accesses in the userspace only (I am using -u in both cases).
Why does this happen ? Am I recording the memory-store events in any wrong way ? Or is there a problem with the CPU behavior ?

What do the numbers in /proc/loadavg mean on Linux?

When issuing this command on Linux:
# cat /proc/loadavg
0.75 0.35 0.25 1/25 1747
The first three numbers are load averages. What are the last 2 numbers?
The last one keeps increasing by 2 every second, should I be worried?

/proc/loadavg
The first three fields in this file are load average figures giving
the number of jobs in the run queue (state R) or waiting for disk
I/O (state D) averaged over 1, 5, and 15 minutes. They are the
same as the load average numbers given by uptime(1) and other
programs.
The fourth field consists of two numbers separated by a
slash (/). The first of these is the number of currently executing
kernel scheduling entities (processes, threads); this will be less
than or equal to the number of CPUs. The value after the slash is the
number of kernel scheduling entities that currently exist on the
system.
The fifth field is the PID of the process that was most
recently created on the system.

I would like to comment the accepted answer.
The fourth field consists of two numbers separated by a slash (/). The
first of these is the number of currently executing kernel scheduling
entities (processes, threads); this will be less than or equal to the
number of CPUs.
I did a test program that reads integer N from input and then creates N threads and their run them forever. On RHEL 6.5 computer I have 8 processor and each processor has hyper threading. Anyway if I run my test and it creates 128 threads I see in the fourth field values that are greater than 128, for example 135. It is clearly greater than the number of CPU. This post supports my observation: http://juliano.info/en/Blog:Memory_Leak/Understanding_the_Linux_load_average
It is worth noting that the current explanation in proc(5) manual page
(as of man-pages version 3.21, March 2009) is wrong. It reports the
first number of the forth field as the number of currently executing
scheduling entities, and so predicts it can't be greater than the
number of CPUs. That doesn't match the real implementation, where this
value reports the current number of runnable threads.

The first three columns measure CPU and I/O utilization of the last one, five, and 15 minute periods. The fourth column shows the number of currently running processes and the total number of processes. The last column displays the last process ID used.
https://docs.fedoraproject.org/en-US/Fedora/17/html/System_Administrators_Guide/s2-proc-loadavg.html

The following page explains these in detail:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
Some interpretations:
If the averages are 0.0, then your system is idle.
If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
If they are higher than your CPU count, then you might have a performance problem (it depends).

You can consult the proc manual page for /proc/loadavg :
$ man proc | sed -n '/loadavg/,/^$/ p'
/proc/loadavg
The first three fields in this file are load average figures giving the number of jobs in the run queue
(state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes. They are the same as
the load average numbers given by uptime(1) and other programs. The fourth field consists of two num‐
bers separated by a slash (/). The first of these is the number of currently runnable kernel schedul‐
ing entities (processes, threads). The value after the slash is the number of kernel scheduling enti‐
ties that currently exist on the system. The fifth field is the PID of the process that was most
recently created on the system.
For that, you need to install the man-pages package on CentOS7/RedHat7 or the manpages package on Ubuntu 20.04/22.04 LTS.

profile program's speed on Linux

I have a couple variants of a program that I want to compare on performance. Both perform essentially the same task.
One does it all in C and memory. The other calls an external utility and does file IO.
How do I reliably compare them?
1) Getting "time on CPU" using "time" favors the second variant for calling system() and doing IO. Even if I add "system" time to "user" time, it'll still not count for time spent blocked on wait().
2) I can't just clock them for they run on a server and can be pushed off the CPU any time. Averaging across 1000s of experiments is a soft option, since I have no idea how my server is utilized - it's a VM on a cluster, it's kind of complicated.
3) profilers do not help since they'll give me time spent in the code, which again favors the version that does system()
I need to add up all CPU time that these programs consume, including user, kernel, IO, and children's recursively.
I expected this to be a common problem, but still don't seem to find a solution.
(Solved with times() - see below. Thanks everybody)

If I've understood, typing "time myapplication" on a bash command line is not what you are looking for.
If you want accuracy, you must use a profiler... You have the source, yes?
Try something like Oprofile or Valgrind, or take a look at this for a more extended list.
If you haven't the source, honestly I don't know...

/usr/bin/time (not built-in "time" in bash) can give some interesting stats.
$ /usr/bin/time -v xeyes
Command being timed: "xeyes"
User time (seconds): 0.00
System time (seconds): 0.01
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.57
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 9
Minor (reclaiming a frame) page faults: 517
Voluntary context switches: 243
Involuntary context switches: 0
Swaps: 0
File system inputs: 1072
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Run them a thousand times, measure actual time taken, then average the results. That should smooth out any variances due to other applications running on your server.

I seem to have found it at last.
NAME
times - get process times
SYNOPSIS
#include
clock_t times(struct tms *buf);
DESCRIPTION
times() stores the current process times in the struct tms that buf
points to. The struct tms is as defined in :
struct tms {
clock_t tms_utime; /* user time */
clock_t tms_stime; /* system time */
clock_t tms_cutime; /* user time of children */
clock_t tms_cstime; /* system time of children */
};
The children's times are a recursive sum of all waited-for children.
I wonder why it hasn't been made a standard CLI utility yet. Or may be I'm just ignorant.

I'd probably lean towards adding "time -o somefile" to the front of the system command, and then adding it to the time given by time'ing your main program to get a total. Unless I had to do this lots of times, then I'd find a way to take two time outputs and add them up to the screen (using awk or shell or perl or something).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string