Why doesn't perf report cache misses?

According to perf tutorials, perf stat is supposed to report cache misses using hardware counters. However, on my system (up-to-date Arch Linux), it doesn't:
[joel@panda goog]$ perf stat ./hash
Performance counter stats for './hash':
869.447863 task-clock # 0.997 CPUs utilized
92 context-switches # 0.106 K/sec
4 cpu-migrations # 0.005 K/sec
1,041 page-faults # 0.001 M/sec
2,628,646,296 cycles # 3.023 GHz
819,269,992 stalled-cycles-frontend # 31.17% frontend cycles idle
132,355,435 stalled-cycles-backend # 5.04% backend cycles idle
4,515,152,198 instructions # 1.72 insns per cycle
# 0.18 stalled cycles per insn
1,060,739,808 branches # 1220.015 M/sec
2,653,157 branch-misses # 0.25% of all branches
0.871766141 seconds time elapsed
What am I missing? I already searched the man page and the web, but didn't find anything obvious.
Edit: my CPU is an Intel i5 2300K, if that matters.

On my system, an Intel Xeon X5570 @ 2.93 GHz, I was able to get perf stat to report cache references and misses by requesting those events explicitly, like this:
perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations sleep 5
Performance counter stats for 'sleep 5':
10573 cache-references
1949 cache-misses # 18.434 % of all cache refs
1077328 cycles # 0.000 GHz
715248 instructions # 0.66 insns per cycle
151188 branches
154 faults
0 migrations
5.002776842 seconds time elapsed
The default set of events did not include cache events, which matches your results. I don't know why.
perf stat -B sleep 5
Performance counter stats for 'sleep 5':
0.344308 task-clock # 0.000 CPUs utilized
1 context-switches # 0.003 M/sec
0 CPU-migrations # 0.000 M/sec
154 page-faults # 0.447 M/sec
977183 cycles # 2.838 GHz
586878 stalled-cycles-frontend # 60.06% frontend cycles idle
430497 stalled-cycles-backend # 44.05% backend cycles idle
720815 instructions # 0.74 insns per cycle
# 0.81 stalled cycles per insn
152217 branches # 442.095 M/sec
7646 branch-misses # 5.02% of all branches
5.002763199 seconds time elapsed
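The derived figures that perf stat prints after the `#` are ratios of the raw counters. As a small sketch (the variable names are my own, and the numbers are copied from the explicit-events run of 'sleep 5' above), the miss percentage and instructions-per-cycle values can be recomputed like this:

```python
# Raw counter values copied from the 'sleep 5' run with explicit events.
cache_references = 10573
cache_misses = 1949
cycles = 1077328
instructions = 715248

# perf stat reports cache-misses as a percentage of cache-references.
miss_ratio_percent = 100.0 * cache_misses / cache_references

# "insns per cycle" is instructions divided by cycles.
ipc = instructions / cycles

print(f"{miss_ratio_percent:.3f} % of all cache refs")
print(f"{ipc:.2f} insns per cycle")
```

Because the ratio is computed from cache-references, it is only shown when both cache events are requested.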

In the latest source code, the default event set again omits cache-misses and cache-references:
struct perf_event_attr default_attrs[] = {
So the man page and most web tutorials are out of date so far.

I spent a few minutes trying to understand perf. I found the cache misses by first recording and then reporting the data (both are perf subcommands).
To see a list of events:
perf list
For example, to check the last-level-cache load misses, use the event LLC-load-misses like this:
perf record -e LLC-load-misses ./your_program
then report the results:
perf report -v


Perf stat HW counters

perf stat ./myapp
and the result looks like this (it's just an example):
Performance counter stats for 'myapp':
83723.452481 task-clock:u (msec) # 1.004 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
3,228,188 page-faults:u # 0.039 M/sec
229,570,665,834 cycles:u # 2.742 GHz
313,163,853,778 instructions:u # 1.36 insn per cycle
69,704,684,856 branches:u # 832.559 M/sec
2,078,861,393 branch-misses:u # 2.98% of all branches
83.409183620 seconds time elapsed
74.684747000 seconds user
8.739217000 seconds sys
perf stat prints user time and system time, and a hardware counter is incremented no matter which application the CPU is executing.
For HW counters like cycles or instructions, does perf count them only for "myapp"?
For instance (cs stands for context switch):
myapp cs cs myapp cs cs end
inst 0 10 20 50 80 100
Here 60 instructions belong to "myapp", but the value of the HW counter is 100. Does perf stat then print 60?

Perf output is less than the number of actual instruction

I tried to count the number of instructions executed by a simple add-loop application on a RISC-V FPGA, using a very simple RV32IM core running Linux 5.4.0 built with buildroot.
#include <stdio.h>

int main()
{
    int a = 0;
    /* loop body reconstructed from the disassembly below, which increments both a and i */
    for (int i = 0; i < 1024*1024; i++)
        a++;
    printf("RESULT: %d\n", a);
    return a;
}
I compiled with -O0 so that the loop is not optimized away, and the resulting disassembly is the following:
000103c8 <main>:
103c8: fe010113 addi sp,sp,-32
103cc: 00812e23 sw s0,28(sp)
103d0: 02010413 addi s0,sp,32
103d4: fe042623 sw zero,-20(s0)
103d8: fe042423 sw zero,-24(s0)
103dc: 01c0006f j 103f8 <main+0x30>
103e0: fec42783 lw a5,-20(s0)
103e4: 00178793 addi a5,a5,1 # 12001 <__TMC_END__+0x1>
103e8: fef42623 sw a5,-20(s0)
103ec: fe842783 lw a5,-24(s0)
103f0: 00178793 addi a5,a5,1
103f4: fef42423 sw a5,-24(s0)
103f8: fe842703 lw a4,-24(s0)
103fc: 001007b7 lui a5,0x100
10400: fef740e3 blt a4,a5,103e0 <main+0x18>
10404: fec42783 lw a5,-20(s0)
10408: 00078513 mv a0,a5
1040c: 01c12403 lw s0,28(sp)
10410: 02010113 addi sp,sp,32
10414: 00008067 ret
As you can see, the loop runs from 103e0 to 10400, which is 9 instructions, so the total number of instructions must be at least 9 * 1024^2.
But the result of perf stat is pretty weird:
RESULT: 1048576
Performance counter stats for './add.out':
3170.45 msec task-clock # 0.841 CPUs utilized
20 context-switches # 0.006 K/sec
0 cpu-migrations # 0.000 K/sec
38 page-faults # 0.012 K/sec
156192046 cycles # 0.049 GHz (11.17%)
8482441 instructions # 0.05 insn per cycle (11.12%)
1145775 branches # 0.361 M/sec (11.25%)
3.771031341 seconds time elapsed
0.075933000 seconds user
3.559385000 seconds sys
The total number of instructions perf counted is lower than 9 * 1024^2. The difference is about 10%.
How can this happen? I would expect the perf output to be larger than that, because perf measures not only add.out itself but also the overhead of perf and of context switching.
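The arithmetic behind the "about 10%" figure can be sketched as follows. The constant names are my own; the addresses and the reported count are taken from the disassembly and the perf stat output above, and every RV32IM instruction in the dump is assumed to be 4 bytes wide (which matches the address steps shown).

```python
# Loop spans 0x103e0..0x10400 inclusive in the disassembly above.
LOOP_START = 0x103E0
LOOP_END = 0x10400
INSTRUCTION_BYTES = 4          # assumption: no compressed instructions (RV32IM)
ITERATIONS = 1024 * 1024

instructions_per_iteration = (LOOP_END - LOOP_START) // INSTRUCTION_BYTES + 1
expected_minimum = instructions_per_iteration * ITERATIONS

reported = 8_482_441           # "instructions" line of the perf stat output

shortfall = expected_minimum - reported
shortfall_percent = 100.0 * shortfall / expected_minimum

print(instructions_per_iteration)      # instructions executed per loop iteration
print(expected_minimum)                # lower bound on total instructions
print(f"{shortfall_percent:.1f}%")     # how far perf's figure falls below it
```

Note that the (11.12%) suffix on the instructions line is how perf stat marks an event that was counted for only part of the run and then scaled up, so the printed value may be an estimate rather than an exact count.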

OpenMPI cannot fully utilize 10 GE

I tried to exchange data between two machines connected with 10GE. The data size is large enough (8 GB) to expect network utilization near the maximum, but surprisingly I observed completely different behavior.
To check the throughput I used two different programs, nethogs and nload; both show that network utilization is much lower than expected. Moreover, the results are unpredictable: sometimes the in and out channels are utilized simultaneously, but sometimes transmission and reception are separated, as if the channel were half-duplex. Sample output of nload:
Device enp1s0f0 [] (1/1):
##################### Curr: 0.00 GBit/s
##################### Avg: 2.08 GBit/s
.##################### Min: 0.00 GBit/s
####################### Max: 6.32 GBit/s
####################### Ttl: 57535.38 GByte
#################### Curr: 0.00 GBit/s
#################### Avg: 2.09 GBit/s
.#################### Min: 0.00 GBit/s
#####################. Max: 6.74 GBit/s
###################### Ttl: 57934.64 GByte
The code I use is here:
#include <boost/mpi.hpp>
#include <complex>
#include <cstdint>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    boost::mpi::environment env{};
    boost::mpi::communicator world{};
    boost::mpi::request reqs[2];
    int k = 10;
    if(argc > 1)
        k = std::atoi(argv[1]);
    uint64_t n = (1ul << k);  // number of complex<double> elements (16 bytes each)
    std::vector<std::complex<double>> sv(n, world.rank());
    std::vector<std::complex<double>> rv(n);
    // Two ranks only: each rank sends to and receives from the other one.
    int dest = world.rank() == 0 ? 1 : 0;
    int src = dest;
    reqs[0] = world.irecv(src, 0, rv.data(), n);
    reqs[1] = world.isend(dest, 0, sv.data(), n);
    boost::mpi::wait_all(reqs, reqs + 2);
    return 0;
}
And here is the command I use to run on cluster:
mpirun --mca btl_tcp_if_include --hostfile ./host_file -n 2 --bind-to core /path/to/shared/folder/mpi_exp 29
Here 29 means that 2^(29 + 4) bytes = 8 GBytes will be sent (2^29 elements of 16 bytes each).
What I have done:
Verified that there is no hardware problem by successfully saturating the channel with netcat.
Checked with tcpdump that the size of TCP packets during the communication is unstable and rarely reaches the maximum size (in the netcat case it is stable).
Checked with strace that socket operations are correct.
Checked TCP parameters in sysctl - they are ok.
Could you please advise me why OpenMPI doesn't work as expected?
EDIT (14.08.2018):
Finally I was able to continue to dig into this problem. Below is the output of OSU bandwidth benchmark (it was run without any mca options):
# OSU MPI Bandwidth Test v5.3
# Size Bandwidth (MB/s)
1 0.50
2 0.98
4 1.91
8 3.82
16 6.92
32 10.32
64 22.03
128 43.95
256 94.74
512 163.96
1024 264.90
2048 400.01
4096 533.47
8192 640.02
16384 705.02
32768 632.03
65536 667.29
131072 842.00
262144 743.82
524288 654.09
1048576 775.50
2097152 759.44
4194304 774.81
I now think this poor performance is because the transfer is CPU-bound: each MPI process is single-threaded by default and is simply not able to saturate a 10GE channel.
I know it is possible to communicate from several threads by enabling multithreading when building OpenMPI, but that approach increases complexity at the application level.
So is it possible to have multithreaded sending/receiving in OpenMPI internally on the level responsible for point-to-point data transfer?
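To set the OSU numbers next to the 10 Gbit/s line rate, here is a small conversion sketch. It assumes that the benchmark's "MB/s" means 10^6 bytes per second; the function and variable names are my own, and the two bandwidth values are copied from the table above.

```python
LINK_GBIT_PER_S = 10.0  # nominal 10GE line rate


def mb_per_s_to_gbit_per_s(mb_per_s):
    """Convert MB/s (assumed 10^6 bytes/s) to Gbit/s (10^9 bits/s)."""
    return mb_per_s * 1e6 * 8 / 1e9


peak_mb_per_s = 842.00      # best value in the table (131072-byte messages)
large_msg_mb_per_s = 774.81  # value for the largest message size (4194304 bytes)

for label, value in (("peak", peak_mb_per_s), ("largest message", large_msg_mb_per_s)):
    gbit = mb_per_s_to_gbit_per_s(value)
    share = 100.0 * gbit / LINK_GBIT_PER_S
    print(f"{label}: {gbit:.2f} Gbit/s ({share:.0f}% of line rate)")
```

Under that assumption the peak works out to roughly 6.7 Gbit/s, which is in the same range as the maxima shown by nload.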

Use linux perf utility to report counters every second like vmstat

Linux has the perf command-line utility for accessing hardware performance-monitoring counters; it works using the perf_events kernel subsystem.
perf itself has basically two modes: perf record/perf top, which record a sampling profile (a sample is taken, for example, every 100000th CPU clock cycle or executed instruction), and perf stat, which reports the total count of cycles/executed instructions for the application (or for the whole system).
Is there a perf mode that prints a system-wide or per-CPU summary of total counts every second (or every 3, 5, 10 seconds), as vmstat and the sysstat-family tools do (iostat, mpstat, sar -n DEV... as listed in http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html)? For example, with the cycles and instructions counters I would get the mean IPC for every second for the system (or for every CPU).
Is there any non-perf tool (in https://perf.wiki.kernel.org/index.php/Tutorial or http://www.brendangregg.com/perf.html) which can get such statistics from the perf_events kernel subsystem? What about system-wide per-process IPC calculation with a resolution of seconds?
perf stat has an "interval-print" option, -I N, where N is an interval in milliseconds; it prints the counters every N milliseconds (N>=10): http://man7.org/linux/man-pages/man1/perf-stat.1.html
-I msecs, --interval-print msecs
Print count deltas every N milliseconds (minimum: 10ms) The
overhead percentage could be high in some cases, for instance
with small, sub 100ms intervals. Use with caution. example: perf
stat -I 1000 -e cycles -a sleep 5
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
Results can also be produced in machine-readable form; with -I, the first field is a timestamp:
With -x, perf stat is able to output a not-quite-CSV format output ... optional usec time stamp in fractions of second (with -I xxx)
The periodic printing of vmstat and the sysstat-family tools (iostat, mpstat, etc.) corresponds to perf stat -I 1000 (every second), for example system-wide (add -A to report each CPU's counters separately):
perf stat -a -I 1000
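As a rough sketch of how the interval output could be turned into a per-second IPC figure, the following parses lines in the -x, format. The field order used here (timestamp, counter value, unit, event name, then further fields) is my reading of the man page description for the case without -A, so treat it as an assumption and check it against your perf version. The function name is my own, and the sample lines are invented for illustration, not real measurements.

```python
import csv
from collections import defaultdict


def ipc_per_interval(lines):
    """Return {timestamp: instructions / cycles} from `perf stat -I N -x,` lines."""
    counts = defaultdict(dict)
    for row in csv.reader(lines):
        # Assumed field order (no -A): timestamp, value, unit, event, ...
        if len(row) < 4 or row[0].lstrip().startswith("#"):
            continue
        timestamp, value, event = float(row[0]), row[1], row[3]
        if not value.isdigit():
            continue  # skips markers such as "<not counted>"
        counts[timestamp][event] = int(value)
    return {
        ts: events["instructions"] / events["cycles"]
        for ts, events in sorted(counts.items())
        if events.get("cycles") and "instructions" in events
    }


# Invented sample lines, for illustration only.
sample = [
    "1.000123456,2000000000,,cycles,8000000000,100.00,,",
    "1.000123456,3000000000,,instructions,8000000000,100.00,1.50,insn per cycle",
    "2.000234567,1000000000,,cycles,8000000000,100.00,,",
    "2.000234567,500000000,,instructions,8000000000,100.00,0.50,insn per cycle",
]

for ts, ipc in ipc_per_interval(sample).items():
    print(f"{ts:.3f}s IPC={ipc:.2f}")
```

Real input could come from something like perf stat -a -I 1000 -x, -e cycles,instructions, keeping in mind that perf stat writes its counter output to stderr by default.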
The option is implemented in builtin-stat.c (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8), in the __run_perf_stat function:
static int __run_perf_stat(int argc, const char **argv)
{
    int interval = stat_config.interval;
For perf stat -I 1000 with a program argument (forks=1), for example perf stat -I 1000 sleep 10, there is an interval loop (ts is the millisecond interval converted to struct timespec):
enable_counters();
if (interval) {
    while (!waitpid(child_pid, &status, WNOHANG)) {
        nanosleep(&ts, NULL);
        process_interval();
    }
}
...
disable_counters();
For the system-wide variant of hardware performance counter monitoring (forks=0), there is a different interval loop:
enable_counters();
while (!done) {
    nanosleep(&ts, NULL);
    if (interval)
        process_interval();
}
...
disable_counters();
process_interval() (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8#L347), from the same file, calls read_counters(), which loops over the event list and invokes read_counter(), which in turn loops over all known threads and all CPUs and calls the actual reading function:
for (thread = 0; thread < nthreads; thread++) {
    for (cpu = 0; cpu < ncpus; cpu++) {
        ...
        count = perf_counts(counter->counts, cpu, thread);
        if (perf_evsel__read(counter, cpu, thread, count))
            return -1;
perf_evsel__read performs the actual counter read while the program is still running:
int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
                     struct perf_counts_values *count)
{
    memset(count, 0, sizeof(*count));

    if (FD(evsel, cpu, thread) < 0)
        return -EINVAL;

    if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
        return -errno;

    return 0;
}

fast conversion from string time to milliseconds

For a vector or list of times, I'd like to go from a string time, e.g. 12:34:56.789 to milliseconds from midnight, which would be equal to 45296789.
This is what I do now:
toms = function(time) {
  sapply(strsplit(time, ':', fixed = TRUE),
         function(x) sum(as.numeric(x) * c(3600000, 60000, 1000)))
}
and would like to do it faster.
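As a cross-check of the arithmetic only (not a faster R solution), here is a minimal Python sketch of the same conversion. The function name is my own, and it assumes the input is always in HH:MM:SS or HH:MM:SS.fff form; fractional digits beyond milliseconds are truncated, not rounded.

```python
def to_milliseconds(time_string):
    """Convert 'HH:MM:SS[.fff]' to milliseconds since midnight."""
    hours, minutes, seconds = time_string.split(":")
    whole_seconds, _, fraction = seconds.partition(".")
    # Pad or truncate the fractional part to exactly three digits.
    millis = int((fraction + "000")[:3])
    return (
        int(hours) * 3_600_000
        + int(minutes) * 60_000
        + int(whole_seconds) * 1_000
        + millis
    )


print(to_milliseconds("12:34:56.789"))
```

Working in integers avoids the floating-point rounding that can creep in when multiplying 56.789 by 1000.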
Here's an example data set for benchmarking:
times = rep('12:34:56.789', 1e6)
system.time(toms(times))
# user system elapsed
# 9.00 0.04 9.05
You could use the fasttime package, which seems to be about an order of magnitude faster.
fasttoms <- function(time) {
times <- rep('12:34:56.789', 1e6)
# user system elapsed
# 6.61 0.03 6.68
# user system elapsed
# 0.53 0.00 0.53
# [1] TRUE
