Perf stat HW counters - Linux

perf stat ./myapp
and the result looks like this (it's just an example):
Performance counter stats for 'myapp':
83723.452481 task-clock:u (msec) # 1.004 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
3,228,188 page-faults:u # 0.039 M/sec
229,570,665,834 cycles:u # 2.742 GHz
313,163,853,778 instructions:u # 1.36 insn per cycle
69,704,684,856 branches:u # 832.559 M/sec
2,078,861,393 branch-misses:u # 2.98% of all branches
83.409183620 seconds time elapsed
74.684747000 seconds user
8.739217000 seconds sys
Perf stat prints user time and system time, and a HW counter is incremented regardless of which application the CPU executes.
For HW counters like cycles or instructions, does perf count them only for "myapp"?
For instance (cs = context switch):

  myapp       cs       cs       myapp       cs       cs
|----------|--------|--------|----------|--------|--------|  end
0          10       20       50         80       100          <- cumulative HW instruction count

Say 60 of those instructions belong to "myapp" while the raw HW counter reaches 100 by the end. Does perf stat then print 60?
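In short, yes: perf stat opens its events per task (for a command it counts the child and its threads, not the whole CPU), so the kernel saves and restores the counter state on every context switch and only "myapp"'s instructions accumulate; in the diagram above it would report 60, plus a little startup overhead counted inside the process. Below is a minimal sketch of the same per-task counting, closely following the perf_event_open(2) man page example; the workload loop and the printed label are only illustration:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;   /* user mode only, like the :u suffix above */

    /* pid = 0, cpu = -1: count this task only, on whatever CPU it runs.
       Other tasks scheduled in between are not counted. */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (volatile int i = 0; i < 1000000; i++)   /* workload to measure */
        ;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions for this task: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}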


Perf output is less than the number of actual instructions

I tried to count the number of instructions of an add-loop application on a RISC-V FPGA, using a very simple RV32IM core with a Linux 5.4.0 buildroot.
add.c:
#include <stdio.h>

int main()
{
    int a = 0;
    for (int i = 0; i < 1024*1024; i++)
        a++;
    printf("RESULT: %d\n", a);
    return a;
}
I used the -O0 compile option so that the loop is actually executed, and the resulting dump file is the following:
000103c8 <main>:
103c8: fe010113 addi sp,sp,-32
103cc: 00812e23 sw s0,28(sp)
103d0: 02010413 addi s0,sp,32
103d4: fe042623 sw zero,-20(s0)
103d8: fe042423 sw zero,-24(s0)
103dc: 01c0006f j 103f8 <main+0x30>
103e0: fec42783 lw a5,-20(s0)
103e4: 00178793 addi a5,a5,1 # 12001 <__TMC_END__+0x1>
103e8: fef42623 sw a5,-20(s0)
103ec: fe842783 lw a5,-24(s0)
103f0: 00178793 addi a5,a5,1
103f4: fef42423 sw a5,-24(s0)
103f8: fe842703 lw a4,-24(s0)
103fc: 001007b7 lui a5,0x100
10400: fef740e3 blt a4,a5,103e0 <main+0x18>
10404: fec42783 lw a5,-20(s0)
10408: 00078513 mv a0,a5
1040c: 01c12403 lw s0,28(sp)
10410: 02010113 addi sp,sp,32
10414: 00008067 ret
As you can see, the application loops over 103e0 ~ 10400, which is 9 instructions, so the total instruction count must be at least 9 * 1024^2.
But the result of perf stat is pretty weird:
RESULT: 1048576
Performance counter stats for './add.out':
3170.45 msec task-clock # 0.841 CPUs utilized
20 context-switches # 0.006 K/sec
0 cpu-migrations # 0.000 K/sec
38 page-faults # 0.012 K/sec
156192046 cycles # 0.049 GHz (11.17%)
8482441 instructions # 0.05 insn per cycle (11.12%)
1145775 branches # 0.361 M/sec (11.25%)
3.771031341 seconds time elapsed
0.075933000 seconds user
3.559385000 seconds sys
The total number of instructions perf counted is lower than 9 * 1024^2; the difference is about 10%.
How is this possible? I think the output of perf should be larger than that, because perf measures not only add.out itself but also the overhead of perf and of context switching.
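One hint, not a definitive answer: the percentages in parentheses, e.g. (11.12%) next to instructions, mean the event was scheduled onto a hardware counter for only that fraction of the run. perf multiplexes the requested events over the available counters and scales each count up, so the printed value is an extrapolated estimate rather than an exact count, and on a small core the estimate can come out low. One way to test this is to request a single event so no multiplexing is needed; the :u modifier also restricts counting to user mode, which matches the hand-counted loop:

perf stat -e instructions:u ./add.out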

Use the Linux perf utility to report counters every second like vmstat

There is the perf command-line utility in Linux for accessing hardware performance-monitoring counters; it works using the perf_events kernel subsystem.
perf itself has basically two modes: perf record/perf top to record a sampling profile (a sample is taken, for example, every 100,000th CPU clock cycle or executed instruction), and perf stat mode to report the total count of cycles/executed instructions for an application (or for the whole system).
Is there a mode of perf that prints a system-wide or per-CPU summary of total counts every second (or every 3, 5, 10 seconds), like vmstat and the sysstat-family tools (iostat, mpstat, sar -n DEV... as listed in http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html) do? For example, with the cycles and instructions counters I would get the mean IPC for every second of the system (or of every CPU).
Is there any non-perf tool (among those at https://perf.wiki.kernel.org/index.php/Tutorial or http://www.brendangregg.com/perf.html) which can get such statistics via the perf_events kernel subsystem? What about system-wide per-process IPC calculation with a resolution of seconds?
There is the perf stat option "interval-print", -I N, where N is the interval in milliseconds; it prints the counters every N milliseconds (N >= 10): http://man7.org/linux/man-pages/man1/perf-stat.1.html
-I msecs, --interval-print msecs
Print count deltas every N milliseconds (minimum: 10ms) The
overhead percentage could be high in some cases, for instance
with small, sub 100ms intervals. Use with caution. example: perf
stat -I 1000 -e cycles -a sleep 5
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
Results can also be exported in machine-readable form, and with -I the first field is the timestamp:
With -x, perf stat is able to output a not-quite-CSV format output ... optional usec time stamp in fractions of second (with -I xxx)
The periodic printing of vmstat and the sysstat-family tools (iostat, mpstat, etc.) corresponds to -I 1000 of perf stat (every second), for example system-wide (add -A to separate the per-CPU counters):
perf stat -a -I 1000
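For the mean-IPC-per-second example from the question, it should be enough to limit the event list; perf stat prints the insn-per-cycle ratio next to the instructions count (the 5-second duration here is just an illustration):

perf stat -a -I 1000 -e cycles,instructions sleep 5

Add -A for per-CPU instead of aggregated counts, and -x, for the machine-readable output mentioned above.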
The option is implemented in builtin-stat.c (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8), in the __run_perf_stat function:
531 static int __run_perf_stat(int argc, const char **argv)
532 {
533 int interval = stat_config.interval;
For perf stat -I 1000 with some program argument (forks=1), for example perf stat -I 1000 sleep 10, there is an interval loop (ts is the millisecond interval converted to a struct timespec):
639 enable_counters();
641 if (interval) {
642 while (!waitpid(child_pid, &status, WNOHANG)) {
643 nanosleep(&ts, NULL);
644 process_interval();
645 }
646 }
666 disable_counters();
For the variant of system-wide hardware performance-monitor counting with forks=0, there is another interval loop:
658 enable_counters();
659 while (!done) {
660 nanosleep(&ts, NULL);
661 if (interval)
662 process_interval();
663 }
666 disable_counters();
process_interval() (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8#L347) from the same file uses read_counters(), which loops over the event list and invokes read_counter(), which loops over all known threads and all CPUs and calls the actual reading function:
306 for (thread = 0; thread < nthreads; thread++) {
307 for (cpu = 0; cpu < ncpus; cpu++) {
...
310 count = perf_counts(counter->counts, cpu, thread);
311 if (perf_evsel__read(counter, cpu, thread, count))
312 return -1;
perf_evsel__read does the real counter read while the program is still running:
1207 int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
1208 struct perf_counts_values *count)
1209 {
1210 memset(count, 0, sizeof(*count));
1211
1212 if (FD(evsel, cpu, thread) < 0)
1213 return -EINVAL;
1214
1215 if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
1216 return -errno;
1217
1218 return 0;
1219 }
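The same read() is easy to reproduce outside of perf. Here is a hedged miniature of this interval loop (a sketch, not perf's code): open one hardware event system-wide on a single CPU (pid = -1, cpu = 0, which typically needs root or a permissive /proc/sys/kernel/perf_event_paranoid) and read the fd once per second, printing deltas roughly the way process_interval() reports them:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;

    /* pid = -1, cpu = 0: all tasks on CPU 0; the counter starts running
       immediately since attr.disabled is left at 0. */
    int fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    uint64_t prev = 0;
    for (int i = 0; i < 5; i++) {            /* five one-second intervals */
        sleep(1);
        uint64_t now;
        if (read(fd, &now, sizeof(now)) != sizeof(now)) break;
        printf("cpu0 instructions in last second: %llu\n",
               (unsigned long long)(now - prev));
        prev = now;
    }
    close(fd);
    return 0;
}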

Go server performance is the same when adding more cores

I am trying to understand how a Go server scales when adding more cores, but I can't see an improvement and I don't know why.
Nothing seems to change in any way when increasing the core count. Do I need to do something in the code to tell it to use more than one core? Would that help performance?
The code I am using for the test is a simple server that outputs "Hello World".
package main

import (
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
        w.Write([]byte("Hello World"))
    })
    http.ListenAndServe(":80", nil)
}
I am doing the tests on VirtualBox.
These results are with 1 core:
$ nproc
1
Testing with ab with 1 core:
$ ab -n 10000 -c 1000 http://127.0.0.1/
Result from ab with 1 core:
Concurrency Level: 1000
Time taken for tests: 1.467 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1280000 bytes
HTML transferred: 110000 bytes
Requests per second: 6815.42 [#/sec] (mean)
Time per request: 146.726 [ms] (mean)
Time per request: 0.147 [ms] (mean, across all concurrent requests)
Transfer rate: 851.93 [Kbytes/sec] received
Testing with wrk with 1 core:
$ wrk -t1 -c1000 -d5s http://127.0.0.1:80/
Result from wrk with 1 core:
Running 5s test @ http://127.0.0.1:80/
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 32.32ms 15.79ms 279.10ms 77.01%
Req/Sec 24.61k 1.89k 27.77k 64.58%
121709 requests in 5.01s, 14.86MB read
Requests/sec: 24313.72
Transfer/sec: 2.97MB
Changing to 2 cores:
$ nproc
2
Testing with ab with 2 cores:
$ ab -n 10000 -c 1000 http://127.0.0.1/
Result from ab with 2 cores:
Concurrency Level: 1000
Time taken for tests: 1.247 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1280000 bytes
HTML transferred: 110000 bytes
Requests per second: 8021.12 [#/sec] (mean)
Time per request: 124.671 [ms] (mean)
Time per request: 0.125 [ms] (mean, across all concurrent requests)
Transfer rate: 1002.64 [Kbytes/sec] received
Testing with wrk with 2 cores:
$ wrk -t1 -c1000 -d5s http://127.0.0.1:80/
Result with wrk with 2 cores:
Running 5s test @ http://127.0.0.1:80/
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 37.04ms 5.67ms 64.92ms 79.73%
Req/Sec 26.98k 1.97k 29.71k 66.00%
134040 requests in 5.06s, 16.36MB read
Requests/sec: 26481.38
Transfer/sec: 3.23MB
Testing with wrk with 2 cores and 2 threads:
$ wrk -t2 -c1000 -d5s http://127.0.0.1:80/
Results with wrk with 2 cores and 2 threads:
Running 5s test @ http://127.0.0.1:80/
2 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 29.32ms 29.21ms 505.22ms 98.47%
Req/Sec 13.48k 2.11k 18.16k 63.00%
134121 requests in 5.03s, 16.37MB read
Requests/sec: 26680.46
Transfer/sec: 3.26MB
Changing to 4 cores:
$ nproc
4
Testing with ab with 4 cores:
$ ab -n 10000 -c 1000 http://127.0.0.1/
Result with ab with 4 cores:
Concurrency Level: 1000
Time taken for tests: 1.301 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1280000 bytes
HTML transferred: 110000 bytes
Requests per second: 7683.90 [#/sec] (mean)
Time per request: 130.142 [ms] (mean)
Time per request: 0.130 [ms] (mean, across all concurrent requests)
Transfer rate: 960.49 [Kbytes/sec] received
Testing with wrk with 4 cores:
$ wrk -t1 -c1000 -d5s http://127.0.0.1:80/
Result with wrk with 4 cores:
Running 5s test @ http://127.0.0.1:80/
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 36.84ms 5.78ms 58.23ms 77.43%
Req/Sec 26.69k 2.06k 30.19k 64.00%
132604 requests in 5.06s, 16.19MB read
Requests/sec: 26207.42
Transfer/sec: 3.20MB
Testing with wrk with 4 cores and 4 threads:
$ wrk -t4 -c1000 -d5s http://127.0.0.1:80/
Results with wrk with 4 cores and 4 threads:
Running 5s test @ http://127.0.0.1:80/
4 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 35.58ms 26.65ms 508.77ms 98.44%
Req/Sec 5.82k 2.21k 10.44k 64.85%
117089 requests in 5.10s, 14.29MB read
Requests/sec: 22972.33
Transfer/sec: 2.80MB
I don't know if I can use Go if it "does not scale" at all with multiple cores. I don't understand how Go works compared to other languages. When I run tests with Facebook's HHVM, it scales out of the box with no problem when adding more cores.
What can I do to see a performance gain in the Go server when adding more cores?
EDIT:
After changing the initial code to:
package main

import (
    "net/http"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(4)
    http.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
        w.Write([]byte("Hello World"))
    })
    http.ListenAndServe(":80", nil)
}
The results from wrk were different: changing GOMAXPROCS from 1 to 4 resulted in a significant increase.
Testing 1 thread 4 cores:
$ wrk -t1 -c1000 -d5s http://127.0.0.1:80/
Result for 1 thread and 4 cores:
Running 5s test @ http://127.0.0.1:80/
1 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 11.00ms 4.33ms 53.58ms 83.83%
Req/Sec 48.65k 3.30k 55.18k 81.25%
242131 requests in 5.08s, 29.56MB read
Requests/sec: 47658.92
Transfer/sec: 5.82MB
Testing 4 thread 4 cores:
$ wrk -t4 -c1000 -d5s http://127.0.0.1:80/
Result for 4 thread and 4 cores:
Running 5s test @ http://127.0.0.1:80/
4 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 15.47ms 8.49ms 99.35ms 80.88%
Req/Sec 14.98k 2.98k 27.42k 78.65%
298885 requests in 5.10s, 36.48MB read
Requests/sec: 58639.84
Transfer/sec: 7.16MB
But the tests with ab were the same. Does anyone know why it does not affect ab? When benchmarking HHVM, the ab results are also affected, but with Go I get the same results.
$ ab -n 10000 -c 1000 http://127.0.0.1/
Results:
Concurrency Level: 1000
Time taken for tests: 1.410 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1280000 bytes
HTML transferred: 110000 bytes
Requests per second: 7094.18 [#/sec] (mean)
Time per request: 140.961 [ms] (mean)
Time per request: 0.141 [ms] (mean, across all concurrent requests)
Transfer rate: 886.77 [Kbytes/sec] received
You need to tell the Go runtime to use more cores by setting the environment variable GOMAXPROCS to your desired core count. Alternatively, the runtime.GOMAXPROCS function changes it from code.
By default this is set to one; as of Go 1.5 it defaults to the number of cores in your system.
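A common pattern on pre-1.5 toolchains (where the default is still 1) is to match GOMAXPROCS to the machine at startup using runtime.NumCPU; here is a minimal sketch of the server above with that change:

package main

import (
    "net/http"
    "runtime"
)

func main() {
    // Use every core the OS reports; from Go 1.5 on this is the default.
    runtime.GOMAXPROCS(runtime.NumCPU())
    http.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
        w.Write([]byte("Hello World"))
    })
    http.ListenAndServe(":80", nil)
}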

fast conversion from string time to milliseconds

For a vector or list of times, I'd like to go from a string time, e.g. 12:34:56.789 to milliseconds from midnight, which would be equal to 45296789.
This is what I do now:
toms = function(time) {
  sapply(strsplit(time, ':', fixed = TRUE),
         function(x) sum(as.numeric(x) * c(3600000, 60000, 1000)))
}
and would like to do it faster.
Here's an example data set for benchmarking:
times = rep('12:34:56.789', 1e6)
system.time(toms(times))
# user system elapsed
# 9.00 0.04 9.05
You could use the fasttime package, which seems to be about an order of magnitude faster. Because the pasted date is the Unix epoch day and fastPOSIXct parses in UTC, the unclassed value is simply seconds since midnight, so multiplying by 1000 gives milliseconds.
library(fasttime)
fasttoms <- function(time) {
  1000 * unclass(fastPOSIXct(paste("1970-01-01", time)))
}
times <- rep('12:34:56.789', 1e6)
system.time(toms(times))
# user system elapsed
# 6.61 0.03 6.68
system.time(fasttoms(times))
# user system elapsed
# 0.53 0.00 0.53
identical(fasttoms(times),toms(times))
# [1] TRUE
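If the strings are guaranteed to be fixed-width "HH:MM:SS.mmm" (an assumption; it holds for this benchmark data, though the strsplit version does not require it), a base-R sketch using substr avoids both the split and the extra package; verify it against toms() on your own data before relying on it:

toms2 <- function(time) {
  # fixed positions: 1-2 hours, 4-5 minutes, 7-12 seconds.milliseconds
  as.numeric(substr(time, 1, 2)) * 3600000 +
    as.numeric(substr(time, 4, 5)) * 60000 +
    as.numeric(substr(time, 7, 12)) * 1000
}
identical(toms2(times), toms(times))  # should be TRUE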

Why doesn't perf report cache misses?

According to perf tutorials, perf stat is supposed to report cache misses using hardware counters. However, on my system (up-to-date Arch Linux), it doesn't:
[joel@panda goog]$ perf stat ./hash
Performance counter stats for './hash':
869.447863 task-clock # 0.997 CPUs utilized
92 context-switches # 0.106 K/sec
4 cpu-migrations # 0.005 K/sec
1,041 page-faults # 0.001 M/sec
2,628,646,296 cycles # 3.023 GHz
819,269,992 stalled-cycles-frontend # 31.17% frontend cycles idle
132,355,435 stalled-cycles-backend # 5.04% backend cycles idle
4,515,152,198 instructions # 1.72 insns per cycle
# 0.18 stalled cycles per insn
1,060,739,808 branches # 1220.015 M/sec
2,653,157 branch-misses # 0.25% of all branches
0.871766141 seconds time elapsed
What am I missing? I already searched the man page and the web, but didn't find anything obvious.
Edit: my CPU is an Intel i5 2300K, if that matters.
On my system, an Intel Xeon X5570 @ 2.93 GHz, I was able to get perf stat to report cache references and misses by requesting those events explicitly, like this:
perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations sleep 5
Performance counter stats for 'sleep 5':
10573 cache-references
1949 cache-misses # 18.434 % of all cache refs
1077328 cycles # 0.000 GHz
715248 instructions # 0.66 insns per cycle
151188 branches
154 faults
0 migrations
5.002776842 seconds time elapsed
The default set of events does not include cache events, which matches your results; I don't know why.
perf stat -B sleep 5
Performance counter stats for 'sleep 5':
0.344308 task-clock # 0.000 CPUs utilized
1 context-switches # 0.003 M/sec
0 CPU-migrations # 0.000 M/sec
154 page-faults # 0.447 M/sec
977183 cycles # 2.838 GHz
586878 stalled-cycles-frontend # 60.06% frontend cycles idle
430497 stalled-cycles-backend # 44.05% backend cycles idle
720815 instructions # 0.74 insns per cycle
# 0.81 stalled cycles per insn
152217 branches # 442.095 M/sec
7646 branch-misses # 5.02% of all branches
5.002763199 seconds time elapsed
In the latest source code, the default event list again omits cache-misses and cache-references:
struct perf_event_attr default_attrs[] = {
{ .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK },
{ .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_CONTEXT_SWITCHES },
{ .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_CPU_MIGRATIONS },
{ .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_PAGE_FAULTS },
{ .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CPU_CYCLES },
{ .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_FRONTEND },
{ .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_BACKEND },
{ .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_INSTRUCTIONS },
{ .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
{ .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_MISSES },
};
So the man page and most of the web are out of date in this respect.
I spent some minutes trying to understand perf. I found the cache misses by first recording and then reporting the data (both are perf subcommands).
To see a list of events:
perf list
For example, in order to check last-level-cache load misses, you will need to use the event LLC-load-misses, like this:
perf record -e LLC-load-misses ./your_program
then report the results:
perf report -v
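If you only want an aggregate count rather than samples, the same event names should also work with perf stat (the exact names vary by CPU and perf version; check perf list):

perf stat -e LLC-loads,LLC-load-misses ./your_program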
