Measure LLC/L3 Cache Miss Rate on AMD Zen2 CPU - linux

I have a question related to this one.
I want to (programmatically) measure L3 hits (accesses) and misses on an AMD EPYC 7742 CPU (Zen2). I run Linux kernel 5.4.0-66-generic on Ubuntu Server 20.04.2 LTS. According to the question linked above, the events rFF04 (L3LookupState) and r0106 (L3CombClstrState) should represent the L3 accesses and misses, respectively. Furthermore, kernel 5.4 should support these events.
However, when measuring with perf, I run into issues. Similar to the question linked above, if I run numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, I only measure 0 values. If I try numactl -C 0 -m 0 perf stat -e instructions,cycles,amd_l3/r8001/,amd_l3/r0106/, perf complains about "unknown terms". If I use the perf event names, i.e. numactl -C 0 -m 0 perf stat -e instructions,cycles,l3_request_g1.caching_l3_cache_accesses,l3_comb_clstr_state.request_miss, perf outputs <not supported> for these events.
Furthermore, I actually want to measure this using perf's C API. Currently, I dispatch a perf_event_attr with type PERF_TYPE_RAW and config set to, e.g., 0x8001. How do I get the amd_l3 PMU stuff into my perf_event_attr object? Otherwise, it would be equivalent to numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, which is measuring undefined values.
Thank you so much for your help.
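For the C API part: dynamic PMUs such as amd_l3 export their type id in sysfs, and perf_event_attr.type is set to that number instead of PERF_TYPE_RAW. The sketch below only shows the lookup and the bit packing, written in Python for brevity. It assumes the PMU exports format fields named event and umask, and it reuses the 0x04/0xFF pair from rFF04 purely as an illustration. As far as I know, uncore PMUs like this one must also be opened per CPU (pid = -1, cpu >= 0) rather than per task, which would be worth checking on your kernel.

```python
import os

SYSFS_PMUS = "/sys/bus/event_source/devices"


def read_pmu_type(pmu, root=SYSFS_PMUS):
    """Return the dynamic PMU type id (the value for perf_event_attr.type)."""
    with open(os.path.join(root, pmu, "type")) as f:
        return int(f.read().strip())


def read_format(pmu, field, root=SYSFS_PMUS):
    """Return the raw format spec of one field, e.g. 'config:0-7'."""
    with open(os.path.join(root, pmu, "format", field)) as f:
        return f.read().strip()


def encode_field(spec, value):
    """Place `value` into the bit ranges named by a spec such as 'config:8-15'."""
    target, _, ranges = spec.partition(":")
    if target != "config":
        raise ValueError(f"this sketch only handles 'config', got {target!r}")
    encoded = 0
    consumed = 0  # number of low bits of `value` already placed
    for part in ranges.split(","):
        lo_text, _, hi_text = part.partition("-")
        lo = int(lo_text)
        hi = int(hi_text) if hi_text else lo
        width = hi - lo + 1
        encoded |= ((value >> consumed) & ((1 << width) - 1)) << lo
        consumed += width
    if value >> consumed:
        raise ValueError(f"value {value:#x} does not fit into {spec!r}")
    return encoded


if __name__ == "__main__":
    try:
        pmu_type = read_pmu_type("amd_l3")
        # Assumption: the PMU has 'event' and 'umask' format files.
        config = (encode_field(read_format("amd_l3", "event"), 0x04)
                  | encode_field(read_format("amd_l3", "umask"), 0xFF))
        print(f"attr.type={pmu_type} attr.config={config:#x}")
    except OSError as err:
        print(f"amd_l3 PMU is not exposed on this machine: {err}")
```

The printed type and config are the two values a C program would copy into its perf_event_attr before calling perf_event_open.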

Related

What causes overhead in QEMU in case of trivial `sleep 1`?

Experiment:
I ran sleep 1 under strace -tt (which reports timestamps of all syscalls) on the host and in a QEMU guest, and noticed that the time required to reach a certain syscall (clock_nanosleep) is almost twice as long in the guest:
1.813 ms on the host vs
3.396 ms in the guest.
Here is full host strace -tt sleep 1 and here is full QEMU strace -tt sleep 1.
Below are excerpts where you can already see the difference:
Host:
Time diff timestamp (as reported by strace)
0.000 / 0.653 ms: 13:13:56.452820 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffded01ecb0 /* 53 vars */) = 0
0.653 / 0.023 ms: 13:13:56.453473 brk(NULL) = 0x5617efdea000
0.676 / 0.063 ms: 13:13:56.453496 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffeb7041b0) = -1 EINVAL (Invalid argument)
QEMU:
Time diff timestamp (as reported by strace)
0.000 / 1.008 ms: 12:12:03.164063 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffd0bd93e50 /* 13 vars */) = 0
1.008 / 0.119 ms: 12:12:03.165071 brk(NULL) = 0x55b78c484000
1.127 / 0.102 ms: 12:12:03.165190 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffcb5dfd850) = -1 EINVAL (Invalid argument)
The questions:
What causes the slowdown and overhead? The command is not using any hardware (like GPU, disks, etc.), so there are no translation layers involved. I also tried running the command several times to ensure everything that can be cached is cached in the guest.
Is there a way to speed it up?
Update:
With cpupower frequency-set --governor performance the timings are:
Host: 0.922ms
Guest: 1.412ms
With image in /dev/shm (-drive file=/dev/shm/root):
Host: 0.922ms
Guest: 1.280ms
PS
I modified the "bare" output of strace so that it includes (1) a time that starts from 0 at the first syscall, followed by (2) the duration of the syscall, for easier understanding. For completeness, the script is here.
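Since the script itself is only linked, here is a minimal sketch of how such a conversion could look. It is not the original script: it assumes every input line starts with an HH:MM:SS.microseconds timestamp as printed by strace -tt, it treats the gap to the next line as the duration, and it does not handle a run that crosses midnight.

```python
from datetime import datetime


def annotate(lines):
    """Prefix each strace -tt line with 'offset / duration ms:'.

    offset is the time since the first syscall; duration is the gap
    until the next line ('?' for the last line, where it is unknown).
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    stamps = [datetime.strptime(line.split(None, 1)[0], "%H:%M:%S.%f")
              for line in lines]
    out = []
    for i, (stamp, line) in enumerate(zip(stamps, lines)):
        offset_ms = (stamp - stamps[0]).total_seconds() * 1000
        if i + 1 < len(stamps):
            gap_ms = (stamps[i + 1] - stamp).total_seconds() * 1000
            duration = f"{gap_ms:.3f}"
        else:
            duration = "?"
        out.append(f"{offset_ms:.3f} / {duration} ms: {line}")
    return out


if __name__ == "__main__":
    # The three host lines quoted in the excerpt above.
    sample = [
        '13:13:56.452820 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffded01ecb0 /* 53 vars */) = 0',
        '13:13:56.453473 brk(NULL) = 0x5617efdea000',
        '13:13:56.453496 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffeb7041b0) = -1 EINVAL (Invalid argument)',
    ]
    print("\n".join(annotate(sample)))
```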
I started qemu in this way:
qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 4G -nodefaults -no-user-config -nographic -no-reboot \
-kernel $HOME/devel/vmlinuz-5.13.0-20-generic \
-append 'earlyprintk=hvc0 console=hvc0 root=/dev/sda rw' \
-drive file=$HOME/devel/images/root,if=ide,index=0,media=disk,format=raw \
-device virtio-serial,id=virtio-serial0 -chardev stdio,mux=on,id=host-io,signal=off -device virtconsole,chardev=host-io,id=console0
It turned out that my (custom-built) kernel was missing the CONFIG_HYPERVISOR_GUEST=y option (and a couple of nested options).
That's expected, considering the way strace is implemented, i.e. via the ptrace(2) system call: every time the traced process performs a system call or gets a signal, the process is forcefully stopped and control is passed to the tracing process, which in the case of strace does all the unpacking and printing synchronously, i.e. while keeping the traced process stopped. That is exactly the kind of path that amplifies any virtualization overhead, since every stop involves extra context switches between the tracer and the traced process.
It would be instructive to strace strace itself: you will see that it does not let the traced process continue (with ptrace(PTRACE_SYSCALL, ...)) until it has processed and written out everything related to the current system call.
Notice that in order to run a "trivial" sleep 1 command, the dynamic linker will perform a couple dozen system calls before even getting to the entry point of the sleep binary.
I don't think that optimizing strace is worth spending time on; if you were planning to run strace as an auditing instead of a debugging tool (by running production tasks under strace or similar), you should reconsider your designs ;-)
Running QEMU on my Mac, I found that 'sleep 1' at the bash command line usually took 10 seconds while 'sleep 2' usually took 5 seconds, at least as measured by time on a 6.0.8 archlinux. Oddly, time seemed to be measuring the passage of time correctly while sleep was not working.
But I had been running
qemu-system-x86_64 \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
Then reading about the -icount parameter, I found the following makes the sleep pretty accurate.
qemu-system-x86_64 \
-icount shift=auto,sleep=on \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
I mention it here because my search for qemu and slow sleep 1 led me here first.

Linux perf record not generating any samples

I am trying to profile my userspace program on an Arria10 FPGA board (with 2 ARM Cortex-A9 CPUs) which has PMU support. I am running Wind River Linux version 9.x. I built my kernel with almost all of the CONFIG_ options people suggested on the internet. Also, my program is compiled with the -fno-omit-frame-pointer and -g options.
What I see is that 'perf record' doesn't generate any samples at all. The 'perf stat true' output looks valid though (I am not sure what to make of it). Does anyone have suggestions or ideas why I am not seeing any samples being generated?
~: perf record --call-graph dwarf -- my_app
^C
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data ]
~: perf report -g graph --no-children
Error:
The perf.data file has no samples!
To display the perf.data header info, please use --header/--header-only options.
~: perf stat true
Performance counter stats for 'true':
1.095300 task-clock (msec) # 0.526 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
22 page-faults # 0.020 M/sec
1088056 cycles # 0.993 GHz
312708 instructions # 0.29 insn per cycle
29159 branches # 26.622 M/sec
16386 branch-misses # 56.20% of all branches
0.002082030 seconds time elapsed
I don't use a VM in this setup. Arria10 is an Intel FPGA with 2 ARM CPUs that support a PMU.
Edit:
1. I realize now that ARM CPU has HW PMU support (opposite to what I mentioned earlier). Even with HW PMU support, I am not able to do 'perf record' successfully.
This is an old question, but for people who find this via search:
perf record -e cpu-clock <command>
works for me. The problem seems to be that the default event (cycles) is not available.

Can I increase linux entropy by using rng-daemon without hardware generator?

I want to continuously replenish /proc/sys/kernel/random/entropy_avail whenever it drops.
I first looked at rngd (https://wiki.archlinux.org/index.php/Rng-tools).
It says /dev/random is very slow since it only collects entropy from device drivers and other (slow) sources and I think that is why we use rngd.
And it says rngd mainly uses hardware random number generators (TRNG), present in modern hardware like recent AMD/Intel processors, VIA Nano or even Raspberry Pi.
However, when I start rngd it says
[root@localhost init.d]# rngd
can't open entropy source(tpm or intel/amd rng)
Maybe RNG device modules are not loaded
I don't have Intel RDRAND, though, as confirmed by cat /proc/cpuinfo | grep rdrand:
[root@localhost init.d]# cat /proc/cpuinfo | grep rdrand | wc -l
0
Are there any other entropy sources that I can use?
Alternatively, is it possible to write a script that increases /proc/sys/kernel/random/entropy_avail?
Try this:
sudo apt-get install haveged
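To see whether a user-space entropy daemon such as haveged actually makes a difference, you can poll the counter before and after starting it. This is only a small sketch: it assumes the counter is readable at /proc/sys/kernel/random/entropy_avail (its location on the Linux systems I know of), and the function names are made up for this example.

```python
import time

ENTROPY_AVAIL = "/proc/sys/kernel/random/entropy_avail"


def read_entropy(path=ENTROPY_AVAIL):
    """Return the kernel's current entropy estimate in bits."""
    with open(path) as f:
        return int(f.read().strip())


def watch_entropy(samples=5, interval=1.0, path=ENTROPY_AVAIL):
    """Take `samples` readings, sleeping `interval` seconds between them."""
    readings = []
    for i in range(samples):
        readings.append(read_entropy(path))
        if i + 1 < samples:
            time.sleep(interval)
    return readings


if __name__ == "__main__":
    try:
        print(watch_entropy(samples=3, interval=0.5))
    except OSError as err:
        print(f"cannot read the entropy counter here: {err}")
```

If the readings stay low with the daemon running, the daemon is probably not feeding the pool.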

Why does perf fail to collect any samples?

sudo perf top shows "Events: 0 cycles".
sudo perf record -ag sleep 10 shows
[ perf record: Woken up 1 time to write data ]
[ perf record: Captured and wrote 0.154 MB perf.data (~6725 samples) ]
However, sudo perf report shows "The perf.data file has no samples!". I also checked the recorded perf.data and confirmed that there are no samples in it.
The system is "3.2.0-86-virtual #123-Ubuntu SMP Sun Jun 14 18:25:12 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux".
perf version 3.2.69
Inputs are appreciated.
There may be no real samples on an idle virtualized system (your Linux kernel version has the "-virtual" suffix); or there may be no access to the hardware counters (-e cycles), which are used by default.
Try to profile some real application like
echo '2^2345678%2'| sudo perf record /usr/bin/bc
Also check software counters like -e cpu-clock:
echo '2^2345678%2'| sudo perf record -e cpu-clock /usr/bin/bc
You may try perf stat (perf stat -d) with the same example to find which basic counters are really incremented on your system:
echo '2^2345678%2'| sudo perf stat /usr/bin/bc
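If you want to automate that check, perf stat -x, prints one CSV line per event (to stderr), and with no interval or per-CPU options the first three fields are the counter value, the unit and the event name. A rough sketch of a parser follows. The function names are hypothetical, the sample text is made up for illustration rather than captured from a real run, and event names are matched literally, so a spelling like cycles:u is not recognized as cycles.

```python
def unusable_events(csv_text, sep=","):
    """Return the names of events that perf stat could not count."""
    bad = []
    for line in csv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(sep)
        if len(fields) < 3:
            continue
        value, _unit, event = fields[:3]
        # perf prints "<not supported>" or "<not counted>" in the value field
        if value.startswith("<"):
            bad.append(event)
    return bad


def pick_sampling_event(csv_text):
    """Fall back to the software cpu-clock event when cycles is unusable."""
    return "cpu-clock" if "cycles" in unusable_events(csv_text) else "cycles"


if __name__ == "__main__":
    # Made-up sample in the shape of `perf stat -x,` output.
    sample = "\n".join([
        "12.34,msec,task-clock,12340000,100.00,,",
        "<not supported>,,cycles,0,100.00,,",
        "<not supported>,,instructions,0,100.00,,",
    ])
    print(unusable_events(sample))
    print("perf record -e", pick_sampling_event(sample))
```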
About the "(~6725 samples)" output: perf record does not count samples in its output; it just estimates their count, and this estimate is always wrong. Every perf.data file has a fixed part that contains no samples, which may take tens of KB in system-wide mode, and the estimate incorrectly counts this part as containing events of mean length.

cpupower utility linux - how to get a list of available frequencies

I am trying to get a list of available frequencies for my cpu using the cpupower tool.
I am executing the following command:
cpupower -c 0,1,4,5 frequency-info
This gives me a lot of information, but I need to see a list of the available frequencies to which I can set these CPUs.
On older versions of Fedora, I used to do this
$ cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_frequencies
2201000 2200000 2100000 2000000 1800000 1700000 1600000 1500000 1300000 1200000 1100000
but on Fedora 20, cpufreq is obsolete. I googled and found that cpupower has the same functionality as cpufreq.
How do you use it to get a list of available frequencies?
cpupower frequency-info
should give you your info
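If you need the list in a script rather than on screen, the sysfs file from the question can still be read directly where the scaling driver provides it. A small sketch, assuming the usual /sys/devices/system/cpu layout; drivers that do not use a fixed frequency table (intel_pstate, for example) do not create the file, in which case the function returns None. The function names are made up for this example.

```python
import os


def parse_frequencies(text):
    """Turn '2201000 2200000 ...' (kHz) into a list of ints."""
    return [int(token) for token in text.split()]


def available_frequencies(cpu, root="/sys/devices/system/cpu"):
    """Return the frequency table for one CPU in kHz, or None if not exposed."""
    path = os.path.join(root, f"cpu{cpu}", "cpufreq",
                        "scaling_available_frequencies")
    try:
        with open(path) as f:
            return parse_frequencies(f.read())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    for cpu in (0, 1, 4, 5):  # the CPUs from the cpupower command above
        print(cpu, available_frequencies(cpu))
```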
