Perf does not support some performance events - linux

I want to measure stalled cycles for my application using perf.
When I try: perf stat -B dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 0.218456 s, 2.3 GB/s
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':
218.420011 task-clock # 0.995 CPUs utilized
25 context-switches # 0.000 M/sec
1 CPU-migrations # 0.000 M/sec
255 page-faults # 0.001 M/sec
821,183,099 cycles # 3.760 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,526,427,190 instructions # 1.86 insns per cycle
292,281,624 branches # 1338.163 M/sec
1,013,837 branch-misses # 0.35% of all branches
0.219551862 seconds time elapsed
As you can see, I'm getting <not supported> for the stalled-cycles-* events. I couldn't find a solution or explanation for this online.
My kernel version is 3.2.0-59, perf version is 3.2.54, and my CPU is an i7-3770.

Related

How to keep the default events when using `perf stat` with custom events

When the perf stat command is used, many default events are measured. For example, when I run perf stat ls, I obtain the following output:
Performance counter stats for 'ls':
0,55 msec task-clock # 0,598 CPUs utilized
0 context-switches # 0,000 /sec
0 cpu-migrations # 0,000 /sec
99 page-faults # 179,071 K/sec
2 324 694 cycles # 4,205 GHz
1 851 372 instructions # 0,80 insn per cycle
357 918 branches # 647,403 M/sec
12 897 branch-misses # 3,60% of all branches
0,000923884 seconds time elapsed
0,000993000 seconds user
0,000000000 seconds sys
Now, let's suppose I also want to measure the cache-references and cache-misses events.
If I run perf stat -e cache-references,cache-misses ls, the output is:
Performance counter stats for 'ls':
101 148 cache-references
34 261 cache-misses # 33,872 % of all cache refs
0,000973384 seconds time elapsed
0,001014000 seconds user
0,000000000 seconds sys
Is there a way to add events with the -e flag while also keeping the default events that are shown when -e is not used (without having to list all of them explicitly in the command)?

Why is cpu-cycles much lower than the current CPU frequency?

My CPU's maximum frequency is 2.8 GHz and the frequency governor is set to performance, but perf reports only 0.105 GHz for cpu-cycles. Why?
The cpu-cycles event is 0x3c. Is it CPU_CLK_UNHALTED.THREAD_P or CPU_CLK_THREAD_UNHALTED.REF_XCLK?
Can I read the PMC register directly from perf?
According to mpstat, the usage of CPU 8 currently reaches about 90%:
CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  8    0.00    0.00    0.98    0.00    0.00    0.00    0.00   89.22    0.00    9.80
  8    0.00    0.00    0.99    0.00    0.00    0.00    0.00   88.12    0.00   10.89
The CPU is an Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz.
processor : 8
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
stepping : 4
microcode : 0x428
cpu MHz : 2800.000
cache size : 25600 KB
I wanted to learn more about what CPU 8 is doing, so I ran perf on it:
perf stat -C 8
Performance counter stats for 'CPU(s) 8':
8828.237941 task-clock (msec) # 1.000 CPUs utilized
11,550 context-switches # 0.001 M/sec
0 cpu-migrations # 0.000 K/sec
0 page-faults # 0.000 K/sec
926,167,840 cycles # 0.105 GHz
4,012,135,689 stalled-cycles-frontend # 433.20% frontend cycles idle
473,099,833 instructions # 0.51 insn per cycle
# 8.48 stalled cycles per insn
98,346,040 branches # 11.140 M/sec
1,254,592 branch-misses # 1.28% of all branches
8.828177754 seconds time elapsed
cpu-cycles is only 0.105 GHz, which is really strange.
I tried to understand what cpu-cycles means:
cat /sys/bus/event_source/devices/cpu/events/cpu-cycles
event=0x3c
I looked it up in the "Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3", section 19.6, page 40.
I also checked the CPU frequency setting; the CPU should be running at the maximum frequency.
cat scaling_governor
performance
cat scaling_governor
performance
==============================================
I then tried these commands:
taskset -c 8 stress --cpu 1
perf stat -C 8 sleep 10
Performance counter stats for 'CPU(s) 8':
10000.633899 task-clock (msec) # 1.000 CPUs utilized
1,823 context-switches # 0.182 K/sec
0 cpu-migrations # 0.000 K/sec
8 page-faults # 0.001 K/sec
29,792,267,638 cycles # 2.979 GHz
5,866,181,553 stalled-cycles-frontend # 19.69% frontend cycles idle
54,171,961,339 instructions # 1.82 insn per cycle
# 0.11 stalled cycles per insn
16,356,002,578 branches # 1635.497 M/sec
33,041,249 branch-misses # 0.20% of all branches
10.000592203 seconds time elapsed
Some details about my environment:
I run an application, let's call it 'A', in a virtual machine 'V', on a host 'H'.
The virtual machine is created by qemu-kvm.
The application receives packets from the network and processes them.
cpu-cycles may stop counting while the CPU is in the C1 or C2 idle state, which would explain a cycle rate far below the nominal frequency.
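The GHz figure that perf stat prints next to cycles appears to be the cycle count divided by the task-clock time, so it describes how many cycles were counted per second rather than the clock speed. Here is a minimal sketch of that arithmetic using the numbers from the two runs above. The function names effective_ghz and unhalted_percent are made up for illustration, and unhalted_percent assumes the core runs at the nominal 2.8 GHz whenever it is not halted, which turbo makes only approximate.

```shell
# effective_ghz CYCLES TASK_CLOCK_MSEC
# Cycles divided by elapsed nanoseconds: the "GHz" value shown next to cycles.
effective_ghz() {
    LC_ALL=C awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.3f\n", c / (ms * 1e6) }'
}

# unhalted_percent CYCLES TASK_CLOCK_MSEC NOMINAL_GHZ
# ASSUMPTION: the core runs at NOMINAL_GHZ whenever it is not halted.
# The result is only a rough share of time during which cycles were counted.
unhalted_percent() {
    LC_ALL=C awk -v c="$1" -v ms="$2" -v f="$3" \
        'BEGIN { printf "%.1f\n", 100 * c / (ms * 1e6 * f) }'
}

effective_ghz 926167840 8828.237941        # first run: perf stat -C 8
effective_ghz 29792267638 10000.633899     # second run: stress pinned to CPU 8
unhalted_percent 926167840 8828.237941 2.8
```

The first two calls reproduce the 0.105 GHz and 2.979 GHz values from the outputs above. Under the stated assumption, the third suggests that cycles were counted during only a few percent of the first measurement interval.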

CPU Usage Server and Monitoring info

I need to check the total CPU usage (in %) from a shell script, preferably without installing any extra packages (OS: Linux SUSE 12), so that I can monitor it and make sure it does not pass a critical threshold.
It is a large server with 2x E5-2667 v4 (8 cores each).
Looking through existing questions, I found a couple of approaches and tried them:
1. top -bn1 | grep "Cpu(s)" | \sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | \awk '{print 100 - $1"%"}'
2. CPU_LOAD=$(sar -P ALL 1 2 |grep 'Average.*all' |awk -F" " '{print 100.0 -$NF}')
I also tried computing 100 - idle from iostat.
Is that really correct on a multi-CPU/multi-core system?
Is it correct to calculate the total CPU usage from the load average reported by uptime?
With this code I get a single-core average, while I need the total CPU usage in %.
Regards,
Thanks
You are probably better off implementing the solution completely in awk:
top -bn1 | awk -F, '/id/ { for (i = 1; i <= NF; i++) if ($i ~ /[0-9.]+[[:blank:]]+id/) { split($i, arry, " "); print arry[1] " - idle" } }'
Take the output from top and check for any line containing id. If the condition is met, take each comma-delimited piece of data on the line and pattern match it against a number (digits and a decimal point), one or more blanks, and then id. If it matches, split the field on whitespace into an array and print the first element, which is the idle percentage; subtract it from 100 to get the usage.
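If the goal is a single total-usage percentage without extra packages, another option (a different technique from the top-based one above) is to read the aggregate cpu line of /proc/stat twice and compare the deltas. That line sums the time of all cores, so the result is a percentage of total capacity. Below is a minimal sketch; the function name cpu_usage_percent is made up for illustration, and counting iowait as idle time is a choice, not a requirement.

```shell
# cpu_usage_percent "<first cpu line>" "<second cpu line>"
# Each argument is the aggregate "cpu ..." line of /proc/stat, sampled at two
# points in time. Prints the busy share of the interval in percent.
cpu_usage_percent() {
    printf '%s\n%s\n' "$1" "$2" | LC_ALL=C awk '
        {
            # Columns 2..9 are user, nice, system, idle, iowait, irq, softirq,
            # steal. guest and guest_nice (10, 11) are already included in
            # user and nice, so they are left out of the total.
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6                  # idle + iowait (a design choice)
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Live use on a Linux host (sample twice, one second apart):
#   a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat)
#   echo "total CPU usage: $(cpu_usage_percent "$a" "$b")%"
```

Because the two samples are passed in as arguments, the function can be checked against fixed input, and the sampling interval is left to the caller.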
If you would like to get any detailed stats you might also use perf.
In this example you may see the number of all CPU cycles during 1 second:
-bash-4.1# perf stat -a sleep 1
Performance counter stats for 'system wide':
4002.822144 task-clock (msec) # 3.999 CPUs utilized (100.00%)
22809 context-switches # 0.006 M/sec (100.00%)
1332 cpu-migrations # 0.333 K/sec (100.00%)
23794 page-faults # 0.006 M/sec
5409531172 cycles # 1.351 GHz (100.00%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
3874289082 instructions # 0.72 insns per cycle (100.00%)
715152901 branches # 178.662 M/sec (100.00%)
20583742 branch-misses # 2.88% of all branches
1.001065623 seconds time elapsed
You may also check uptime.
Uptime gives a one line display of the following information. The current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.

Why does perf show that sleep takes all cores?

I am trying to familiarize myself with perf and run it against various programs I wrote.
When I launch it against a program that is 100% single-threaded, perf shows that it uses two cores on the machine (task-clock event).
Here's the example output:
perf stat -a --per-core python3 test.py
Performance counter stats for 'system wide':
S0-C0 1 19004.951263 task-clock (msec) # 1.000 CPUs utilized (100.00%)
S0-C0 1 5,582 context-switches (100.00%)
S0-C0 1 19 cpu-migrations (100.00%)
S0-C0 1 3,746 page-faults
S0-C0 1 <not supported> cycles
S0-C0 1 <not supported> stalled-cycles-frontend
S0-C0 1 <not supported> stalled-cycles-backend
S0-C0 1 <not supported> instructions
S0-C0 1 <not supported> branches
S0-C0 1 <not supported> branch-misses
S0-C1 1 19004.950059 task-clock (msec) # 1.000 CPUs utilized (100.00%)
S0-C1 1 6,752 context-switches (100.00%)
S0-C1 1 25 cpu-migrations (100.00%)
S0-C1 1 935 page-faults
S0-C1 1 <not supported> cycles
S0-C1 1 <not supported> stalled-cycles-frontend
S0-C1 1 <not supported> stalled-cycles-backend
S0-C1 1 <not supported> instructions
S0-C1 1 <not supported> branches
S0-C1 1 <not supported> branch-misses
19.004688019 seconds time elapsed
It even shows that a simple sleep command takes two cores on my computer, and I can't explain this. I understand that the OS scheduler can reassign the active core for any process, but in that case the CPU utilization would reflect that.
Can anyone explain this?
According to the man page of the perf stat subcommand, the -a option you used profiles the full system:
http://man7.org/linux/man-pages/man1/perf-stat.1.html
-a, --all-cpus
system-wide collection from all CPUs (default if no target is
specified)
In this "system-wide" mode, perf stat (and perf record too) counts events on (or, for record, profiles) all CPUs in the system. When used without a command argument, perf runs until interrupted by Ctrl-C. With a command argument, perf counts/profiles until the command exits. Typical usage is:
perf stat -a sleep 10 # Profile counting every CPU for 10 seconds
perf record -a sleep 10 # Profile with cycles every CPU for 10 seconds to perf.data
To get stats for a single command, use single-process profiling (without the -a option):
perf stat python3 test.py
For profiling (perf record), you may run without the -a option, or you may use -a and later do some manual filtering in perf report, focusing only on the pids/tids/dsos of your application. (This can be very useful if the profiled command makes interprocess requests to other daemons that do a lot of the CPU work.)
The --per-core, -A, -C <cpulist>, and --per-socket options are only for system-wide -a mode. Try --per-thread together with the -p pid option to attach to a process.

perf get time elapsed with field separator option

I have a program which parses the output of the Linux command perf. It requires the use of the -x option (the field separator option). I want to extract the elapsed time (not task-clock or cpu-clock) using perf. However, when I use the -x option, the elapsed time is not present in the output, and I cannot find a corresponding perf event. Here are the sample outputs:
perf stat ls
============
Performance counter stats for 'ls':
0.934889 task-clock (msec) # 0.740 CPUs utilized
6 context-switches # 0.006 M/sec
0 cpu-migrations # 0.000 K/sec
261 page-faults # 0.279 M/sec
1,937,910 cycles # 2.073 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,616,944 instructions # 0.83 insns per cycle
317,016 branches # 339.095 M/sec
12,439 branch-misses # 3.92% of all branches
0.001262625 seconds time elapsed //here we have it
Now with field separator option
perf stat -x, ls
================
2.359807,task-clock
6,context-switches
0,cpu-migrations
261,page-faults
1863028,cycles
<not supported>,stalled-cycles-frontend
<not supported>,stalled-cycles-backend
1670644,instructions
325047,branches
12251,branch-misses
Any help is appreciated
# perf stat ls 2>&1 >/dev/null | tail -n 2 | sed 's/ \+//' | sed 's/ /,/'
0.002272536,seconds time elapsed
Starting with kernel 5.2-rc1, a new event called duration_time is exposed by perf stat to solve exactly this problem. The value of this event is exactly equal to the time elapsed value, but the unit is nanoseconds instead of seconds.
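A minimal sketch of how a parser could pick that value out of the -x, output and convert it back to seconds. The function name elapsed_seconds is made up for illustration. The column layout of the CSV output differs between perf versions, so the sketch only assumes that the count is the first field and that the event name duration_time appears in a later field. The sample line below is a hypothetical illustration, not captured output.

```shell
# elapsed_seconds: read `perf stat -x,` output on stdin and print the
# duration_time count (nanoseconds) converted to seconds.
# ASSUMPTION: the count is field 1 and the event name is some later field.
elapsed_seconds() {
    LC_ALL=C awk -F, '
        {
            for (i = 2; i <= NF; i++)
                if ($i == "duration_time") { printf "%.9f\n", $1 / 1e9; exit }
        }'
}

# Hypothetical sample line, used only to exercise the function:
printf '1262625,ns,duration_time,1262625,100.00,,\n' | elapsed_seconds

# With a perf that exposes duration_time (perf stat writes to stderr):
#   perf stat -x, -e duration_time,task-clock ls 2>&1 >/dev/null | elapsed_seconds
```

If no duration_time line is present (for example on an older perf), the function prints nothing, so the caller can detect that case and fall back to the text-mode approach shown earlier.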

Resources