I just did a fresh install of Ubuntu 20 on an AMD Ryzen 5000 series machine. I installed the PAPI 6 tools by following the install instructions. The installation was successful and I verified the installed version:
$ sudo papi_version
PAPI Version: 6.0.0.0
Then, when I checked the available events, I got none:
$ sudo papi_avail
Available PAPI preset and user defined events plus hardware information.
--------------------------------------------------------------------------------
PAPI version : 6.0.0.0
Operating system : Linux 5.8.0-55-generic
Vendor string and code : AuthenticAMD (2, 0x2)
Model string and code : AMD Ryzen 9 5900 12-Core Processor (33, 0x21)
CPU revision : 0.000000
CPUID : Family/Model/Stepping 25/33/0, 0x19/0x21/0x00
CPU Max MHz : 6084
CPU Min MHz : 2200
Total cores : 24
SMT threads per core : 2
Cores per socket : 12
Sockets : 1
Cores per NUMA region : 24
NUMA regions : 1
Running in a VM : no
Number Hardware Counters : 0
Max Multiplex Counters : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------
================================================================================
PAPI Preset Events
================================================================================
Name Code Avail Deriv Description (Note)
PAPI_L1_DCM 0x80000000 No No Level 1 data cache misses
PAPI_L1_ICM 0x80000001 No No Level 1 instruction cache misses
PAPI_L2_DCM 0x80000002 No No Level 2 data cache misses
PAPI_L2_ICM 0x80000003 No No Level 2 instruction cache misses
PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses
PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses
PAPI_L1_TCM 0x80000006 No No Level 1 cache misses
PAPI_L2_TCM 0x80000007 No No Level 2 cache misses
PAPI_L3_TCM 0x80000008 No No Level 3 cache misses
PAPI_CA_SNP 0x80000009 No No Requests for a snoop
PAPI_CA_SHR 0x8000000a No No Requests for exclusive access to shared cache line
PAPI_CA_CLN 0x8000000b No No Requests for exclusive access to clean cache line
PAPI_CA_INV 0x8000000c No No Requests for cache line invalidation
PAPI_CA_ITV 0x8000000d No No Requests for cache line intervention
PAPI_L3_LDM 0x8000000e No No Level 3 load misses
PAPI_L3_STM 0x8000000f No No Level 3 store misses
PAPI_BRU_IDL 0x80000010 No No Cycles branch units are idle
PAPI_FXU_IDL 0x80000011 No No Cycles integer units are idle
PAPI_FPU_IDL 0x80000012 No No Cycles floating point units are idle
PAPI_LSU_IDL 0x80000013 No No Cycles load/store units are idle
PAPI_TLB_DM 0x80000014 No No Data translation lookaside buffer misses
PAPI_TLB_IM 0x80000015 No No Instruction translation lookaside buffer misses
PAPI_TLB_TL 0x80000016 No No Total translation lookaside buffer misses
PAPI_L1_LDM 0x80000017 No No Level 1 load misses
PAPI_L1_STM 0x80000018 No No Level 1 store misses
PAPI_L2_LDM 0x80000019 No No Level 2 load misses
PAPI_L2_STM 0x8000001a No No Level 2 store misses
PAPI_BTAC_M 0x8000001b No No Branch target address cache misses
PAPI_PRF_DM 0x8000001c No No Data prefetch cache misses
PAPI_L3_DCH 0x8000001d No No Level 3 data cache hits
PAPI_TLB_SD 0x8000001e No No Translation lookaside buffer shootdowns
PAPI_CSR_FAL 0x8000001f No No Failed store conditional instructions
PAPI_CSR_SUC 0x80000020 No No Successful store conditional instructions
PAPI_CSR_TOT 0x80000021 No No Total store conditional instructions
PAPI_MEM_SCY 0x80000022 No No Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY 0x80000023 No No Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY 0x80000024 No No Cycles Stalled Waiting for memory writes
PAPI_STL_ICY 0x80000025 No No Cycles with no instruction issue
PAPI_FUL_ICY 0x80000026 No No Cycles with maximum instruction issue
PAPI_STL_CCY 0x80000027 No No Cycles with no instructions completed
PAPI_FUL_CCY 0x80000028 No No Cycles with maximum instructions completed
PAPI_HW_INT 0x80000029 No No Hardware interrupts
PAPI_BR_UCN 0x8000002a No No Unconditional branch instructions
PAPI_BR_CN 0x8000002b No No Conditional branch instructions
PAPI_BR_TKN 0x8000002c No No Conditional branch instructions taken
PAPI_BR_NTK 0x8000002d No No Conditional branch instructions not taken
PAPI_BR_MSP 0x8000002e No No Conditional branch instructions mispredicted
PAPI_BR_PRC 0x8000002f No No Conditional branch instructions correctly predicted
PAPI_FMA_INS 0x80000030 No No FMA instructions completed
PAPI_TOT_IIS 0x80000031 No No Instructions issued
PAPI_TOT_INS 0x80000032 No No Instructions completed
PAPI_INT_INS 0x80000033 No No Integer instructions
PAPI_FP_INS 0x80000034 No No Floating point instructions
PAPI_LD_INS 0x80000035 No No Load instructions
PAPI_SR_INS 0x80000036 No No Store instructions
PAPI_BR_INS 0x80000037 No No Branch instructions
PAPI_VEC_INS 0x80000038 No No Vector/SIMD instructions (could include integer)
PAPI_RES_STL 0x80000039 No No Cycles stalled on any resource
PAPI_FP_STAL 0x8000003a No No Cycles the FP unit(s) are stalled
PAPI_TOT_CYC 0x8000003b No No Total cycles
PAPI_LST_INS 0x8000003c No No Load/store instructions completed
PAPI_SYC_INS 0x8000003d No No Synchronization instructions completed
PAPI_L1_DCH 0x8000003e No No Level 1 data cache hits
PAPI_L2_DCH 0x8000003f No No Level 2 data cache hits
PAPI_L1_DCA 0x80000040 No No Level 1 data cache accesses
PAPI_L2_DCA 0x80000041 No No Level 2 data cache accesses
PAPI_L3_DCA 0x80000042 No No Level 3 data cache accesses
PAPI_L1_DCR 0x80000043 No No Level 1 data cache reads
PAPI_L2_DCR 0x80000044 No No Level 2 data cache reads
PAPI_L3_DCR 0x80000045 No No Level 3 data cache reads
PAPI_L1_DCW 0x80000046 No No Level 1 data cache writes
PAPI_L2_DCW 0x80000047 No No Level 2 data cache writes
PAPI_L3_DCW 0x80000048 No No Level 3 data cache writes
PAPI_L1_ICH 0x80000049 No No Level 1 instruction cache hits
PAPI_L2_ICH 0x8000004a No No Level 2 instruction cache hits
PAPI_L3_ICH 0x8000004b No No Level 3 instruction cache hits
PAPI_L1_ICA 0x8000004c No No Level 1 instruction cache accesses
PAPI_L2_ICA 0x8000004d No No Level 2 instruction cache accesses
PAPI_L3_ICA 0x8000004e No No Level 3 instruction cache accesses
PAPI_L1_ICR 0x8000004f No No Level 1 instruction cache reads
PAPI_L2_ICR 0x80000050 No No Level 2 instruction cache reads
PAPI_L3_ICR 0x80000051 No No Level 3 instruction cache reads
PAPI_L1_ICW 0x80000052 No No Level 1 instruction cache writes
PAPI_L2_ICW 0x80000053 No No Level 2 instruction cache writes
PAPI_L3_ICW 0x80000054 No No Level 3 instruction cache writes
PAPI_L1_TCH 0x80000055 No No Level 1 total cache hits
PAPI_L2_TCH 0x80000056 No No Level 2 total cache hits
PAPI_L3_TCH 0x80000057 No No Level 3 total cache hits
PAPI_L1_TCA 0x80000058 No No Level 1 total cache accesses
PAPI_L2_TCA 0x80000059 No No Level 2 total cache accesses
PAPI_L3_TCA 0x8000005a No No Level 3 total cache accesses
PAPI_L1_TCR 0x8000005b No No Level 1 total cache reads
PAPI_L2_TCR 0x8000005c No No Level 2 total cache reads
PAPI_L3_TCR 0x8000005d No No Level 3 total cache reads
PAPI_L1_TCW 0x8000005e No No Level 1 total cache writes
PAPI_L2_TCW 0x8000005f No No Level 2 total cache writes
PAPI_L3_TCW 0x80000060 No No Level 3 total cache writes
PAPI_FML_INS 0x80000061 No No Floating point multiply instructions
PAPI_FAD_INS 0x80000062 No No Floating point add instructions
PAPI_FDV_INS 0x80000063 No No Floating point divide instructions
PAPI_FSQ_INS 0x80000064 No No Floating point square root instructions
PAPI_FNV_INS 0x80000065 No No Floating point inverse instructions
PAPI_FP_OPS 0x80000066 No No Floating point operations
PAPI_SP_OPS 0x80000067 No No Floating point operations; optimized to count scaled single precision vector operations
PAPI_DP_OPS 0x80000068 No No Floating point operations; optimized to count scaled double precision vector operations
PAPI_VEC_SP 0x80000069 No No Single precision vector/SIMD instructions
PAPI_VEC_DP 0x8000006a No No Double precision vector/SIMD instructions
PAPI_REF_CYC 0x8000006b No No Reference clock cycles
--------------------------------------------------------------------------------
Of 108 possible events, 0 are available, of which 0 are derived.
No events detected! Check papi_component_avail to find out why.
So I followed the instructions and ran papi_component_avail:
papi_component_avail
Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version : 6.0.0.0
Operating system : Linux 5.8.0-55-generic
Vendor string and code : AuthenticAMD (2, 0x2)
Model string and code : AMD Ryzen 9 5900 12-Core Processor (33, 0x21)
CPU revision : 0.000000
CPUID : Family/Model/Stepping 25/33/0, 0x19/0x21/0x00
CPU Max MHz : 6084
CPU Min MHz : 2200
Total cores : 24
SMT threads per core : 2
Cores per socket : 12
Sockets : 1
Cores per NUMA region : 24
NUMA regions : 1
Running in a VM : no
Number Hardware Counters : 0
Max Multiplex Counters : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------
Compiled-in components:
Name: perf_event Linux perf_event CPU counters
\-> Disabled: Unknown libpfm4 related error
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
\-> Disabled: No uncore PMUs or events found
Active components:
--------------------------------------------------------------------------------
I tried applying the fix explained in the Stack Overflow question papi_avail: No events available - 32308175, which in the past worked on my Ubuntu 18 machine with an Intel 6700HQ, but to no avail here. Of course, the problem descriptions were different. I am uncertain how to resolve the "Unknown libpfm4 related error".
I also tried installing the currently available version of PAPI from the Ubuntu repositories with apt-get install papi-tools:
sudo apt-get update
sudo apt-get install papi-tools
The installation went fine but I keep getting the same error messages. What have I missed?
Your CPU is part of the AMD Zen 3 family. libpfm4 seems to support this CPU family starting with v4.12.0. If you have PAPI version 6.0.0.1 or earlier, it comes with a built-in version of libpfm4 that is older than v4.12.0, so I suppose PAPI is not expected to work in that case?
Nevertheless, we are still experiencing the same issue when installing the master branch of PAPI (which comes with a libpfm4 version that should support AMD Zen 3 according to the documentation).
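If the bundled libpfm4 is the suspect, one workaround is to build a recent libpfm4 separately and point PAPI's configure at it. A minimal sketch, assuming the layout and install prefix below (the clone URL, prefix, and exact configure flag are assumptions; check ./configure --help in your PAPI tree):

# Build and install libpfm4 >= 4.12.0 under a private prefix (URL and prefix are assumptions)
git clone https://git.code.sf.net/p/perfmon2/libpfm4 libpfm4
make -C libpfm4
sudo make -C libpfm4 install PREFIX=/opt/libpfm4

# Build PAPI from git master against that libpfm4 instead of the bundled copy
cd papi/src
./configure --with-pfm-prefix=/opt/libpfm4    # flag name may differ between PAPI versions
make -j"$(nproc)"
sudo make install

# Re-check which components are active
papi_component_avail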
Related
I am trying to figure out the event to use with the perf stat command to count L3 cache accesses on an AMD Zen 2 processor. As per the PPR (http://developer.amd.com/wordpress/media/2017/11/54945_PPR_Family_17h_Models_00h-0Fh.pdf), section 2.1.13.4.1, page 168, the event is x01 and the umask is x80 for "[L3 Cache Accesses] (L3RequestG1)". From what I understand, the event to use in the perf stat command would thus be r8001. But the following command always returns a count of zero no matter what load I run:
perf stat -a -e r8001 -- sleep 10
Performance counter stats for 'system wide':
0 r8001
10.001105322 seconds time elapsed
Am I misinterpreting the PPR or does [L3 Cache Accesses] (L3RequestG1) mean something else?
Also, is there a way to specify which slice of the L3 cache to monitor for events in perf, since most newer architectures with high core counts have multiple L3 slices?
The L3 cache events can only be counted on the L3 PMU as clearly specified in both the physical mnemonic (L3PMCx01) and the logical mnemonic (Core::X86::Pmc::L3::L3RequestG1) of the event you want to measure. The L3 PMU is formally called L3PMC. This is similar to the cbox PMUs on Intel processors.
The default PMU in perf for raw events is cpu, which is the name the perf_events subsystem gives to the core PMU. An event specified using a raw event code without an explicit PMU, such as r8001, is equivalent to cpu/r8001/. The core event 0x001 represents the event Core::X86::Pmc::Core::FpSchedEmpty and the umask 0x80 is undefined for this event (see Section 2.1.15.4.1). So you're counting an undefined event. In this case, if the event happened to be implemented but not documented, then the event count may not be zero depending on whether it occurs during the execution of the program being profiled. Otherwise, the event count would be zero. perf_events doesn't stop you from counting undefined events.
Starting with upstream kernel version v5.4-rc1, the L3PMC is supported in perf_events under the name amd_l3. To determine whether you're using a kernel that supports this PMU, check whether it's enumerated using the command ls /sys/devices/*/format. If not supported, then you can't measure the L3 events on that kernel through perf.
If amd_l3 is supported, you have to explicitly specify the PMU as in amd_l3/r8001/ or amd_l3/event=0x01,umask=0x80/ to have the event counted on the right PMU. Or you can just use the perf event name l3_request_g1.caching_l3_cache_accesses.
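Putting those two points together, the check and the corrected commands would look roughly like this (the event code and umask are the ones from the question):

# Check whether the kernel exposes the amd_l3 PMU
ls /sys/bus/event_source/devices/ | grep amd_l3

# Count the L3 event on the L3 PMU instead of the default core PMU
perf stat -a -e amd_l3/event=0x01,umask=0x80/ -- sleep 10

# Equivalent, using the symbolic event name if your perf build knows it
perf stat -a -e l3_request_g1.caching_l3_cache_accesses -- sleep 10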
Do you know what the event L3RequestG1 represents? The documentation only describes it as "Caching: L3 cache accesses," which isn't very meaningful. It seems to me that the types of transactions it counts are a subset of those covered by the event L3LookupState. Table 19 in Section 2.1.15.2 says that L3 accesses and misses should be counted using rFF04 (L3LookupState) and r0106 (L3CombClstrState), respectively. Don't blindly expect that any of these events actually count whatever you want to measure.
The PPR you linked is not for any Zen 2 processor; it is for some Zen and Zen+ processors (specifically models 00h-0Fh). You need to know the processor family and model to locate the right PPR.
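On Linux you can read the family, model and stepping directly from the running system, for example:

# Print the family/model/stepping needed to pick the correct PPR
lscpu | grep -E 'CPU family|Model|Stepping'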
Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu : 17.04
I have been trying to collect stats of memory accesses using perf stat. I am able to collect stats for memory-stores, but the count for memory-loads returns a value of 0.
Below are the details for memory-stores:
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a count of 0, as can be seen below:
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event to get proper data?
The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is
used that can keep track of a data load from issue to completion.
This is more complicated than simply counting instances of an event
(as with normal event-based sampling), and so only some loads are
tracked. Loads are randomly chosen, the latency determined for each,
and the correct event(s) incremented (latency >4, >8, >16, etc). Due
to the nature of the sampling for this event, only a small percentage
of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads, and it is not designed for that purpose at all.
If you want to determine which instructions in your code are issuing load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, and it achieves its purpose by using this event.
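For example, a latency-oriented load profile could be gathered with perf-mem roughly as follows (the program name is simply the one from the question):

# Sample memory accesses with latency and data-source information
perf mem record ./libquantum_base.arnab 100
# Summarize the sampled accesses
perf mem report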
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, but they are more portable because they also work on AMD processors.
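Applied to the program from the question, the counting command would be something like:

# Count retired load and store uops in user space
perf stat -e L1-dcache-loads:u,L1-dcache-stores:u ./libquantum_base.arnab 100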
I used a Broadwell (E5-2620) server machine to collect all of the events below.
To collect memory-load events, I had to use a numeric event value. I basically ran the below command -
./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
Here r81d0 represents the raw event for counting "memory loads amongst all instructions retired". The "u" modifier restricts counting to user space.
The below command, on the other hand,
./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
has "r82d0:u" as a raw event representing "memory stores amongst all instructions retired in userspace".
I want to get into PAPI. I have Version 5.3.2.0 on Debian GNU/Linux. papi_avail just tells me that no hardware events are available:
$ papi_avail
Available events and hardware information.
--------------------------------------------------------------------------------
PAPI Version : 5.3.2.0
Vendor string and code : GenuineIntel (1)
Model string and code : Intel(R) Core(TM) i3-5010U CPU @ 2.10GHz (61)
CPU Revision : 4.000000
CPUID Info : Family: 6 Model: 61 Stepping: 4
CPU Max Megahertz : 2100
CPU Min Megahertz : 500
Hdw Threads per core : 2
Cores per Socket : 2
Sockets : 1
NUMA Nodes : 1
CPUs per Node : 4
Total CPUs : 4
Running in a VM : no
Number Hardware Counters : 11
Max Multiplex Counters : 128
--------------------------------------------------------------------------------
Name Code Avail Deriv Description (Note)
PAPI_L1_DCM 0x80000000 No No Level 1 data cache misses
PAPI_L1_ICM 0x80000001 No No Level 1 instruction cache misses
PAPI_L2_DCM 0x80000002 No No Level 2 data cache misses
PAPI_L2_ICM 0x80000003 No No Level 2 instruction cache misses
PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses
PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses
PAPI_L1_TCM 0x80000006 No No Level 1 cache misses
PAPI_L2_TCM 0x80000007 No No Level 2 cache misses
PAPI_L3_TCM 0x80000008 No No Level 3 cache misses
PAPI_CA_SNP 0x80000009 No No Requests for a snoop
PAPI_CA_SHR 0x8000000a No No Requests for exclusive access to shared cache line
PAPI_CA_CLN 0x8000000b No No Requests for exclusive access to clean cache line
PAPI_CA_INV 0x8000000c No No Requests for cache line invalidation
PAPI_CA_ITV 0x8000000d No No Requests for cache line intervention
PAPI_L3_LDM 0x8000000e No No Level 3 load misses
PAPI_L3_STM 0x8000000f No No Level 3 store misses
PAPI_BRU_IDL 0x80000010 No No Cycles branch units are idle
PAPI_FXU_IDL 0x80000011 No No Cycles integer units are idle
PAPI_FPU_IDL 0x80000012 No No Cycles floating point units are idle
PAPI_LSU_IDL 0x80000013 No No Cycles load/store units are idle
PAPI_TLB_DM 0x80000014 No No Data translation lookaside buffer misses
PAPI_TLB_IM 0x80000015 No No Instruction translation lookaside buffer misses
PAPI_TLB_TL 0x80000016 No No Total translation lookaside buffer misses
PAPI_L1_LDM 0x80000017 No No Level 1 load misses
PAPI_L1_STM 0x80000018 No No Level 1 store misses
PAPI_L2_LDM 0x80000019 No No Level 2 load misses
PAPI_L2_STM 0x8000001a No No Level 2 store misses
PAPI_BTAC_M 0x8000001b No No Branch target address cache misses
PAPI_PRF_DM 0x8000001c No No Data prefetch cache misses
PAPI_L3_DCH 0x8000001d No No Level 3 data cache hits
PAPI_TLB_SD 0x8000001e No No Translation lookaside buffer shootdowns
PAPI_CSR_FAL 0x8000001f No No Failed store conditional instructions
PAPI_CSR_SUC 0x80000020 No No Successful store conditional instructions
PAPI_CSR_TOT 0x80000021 No No Total store conditional instructions
PAPI_MEM_SCY 0x80000022 No No Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY 0x80000023 No No Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY 0x80000024 No No Cycles Stalled Waiting for memory writes
PAPI_STL_ICY 0x80000025 No No Cycles with no instruction issue
PAPI_FUL_ICY 0x80000026 No No Cycles with maximum instruction issue
PAPI_STL_CCY 0x80000027 No No Cycles with no instructions completed
PAPI_FUL_CCY 0x80000028 No No Cycles with maximum instructions completed
PAPI_HW_INT 0x80000029 No No Hardware interrupts
PAPI_BR_UCN 0x8000002a No No Unconditional branch instructions
PAPI_BR_CN 0x8000002b No No Conditional branch instructions
PAPI_BR_TKN 0x8000002c No No Conditional branch instructions taken
PAPI_BR_NTK 0x8000002d No No Conditional branch instructions not taken
PAPI_BR_MSP 0x8000002e No No Conditional branch instructions mispredicted
PAPI_BR_PRC 0x8000002f No No Conditional branch instructions correctly predicted
PAPI_FMA_INS 0x80000030 No No FMA instructions completed
PAPI_TOT_IIS 0x80000031 No No Instructions issued
PAPI_TOT_INS 0x80000032 No No Instructions completed
PAPI_INT_INS 0x80000033 No No Integer instructions
PAPI_FP_INS 0x80000034 No No Floating point instructions
PAPI_LD_INS 0x80000035 No No Load instructions
PAPI_SR_INS 0x80000036 No No Store instructions
PAPI_BR_INS 0x80000037 No No Branch instructions
PAPI_VEC_INS 0x80000038 No No Vector/SIMD instructions (could include integer)
PAPI_RES_STL 0x80000039 No No Cycles stalled on any resource
PAPI_FP_STAL 0x8000003a No No Cycles the FP unit(s) are stalled
PAPI_TOT_CYC 0x8000003b No No Total cycles
PAPI_LST_INS 0x8000003c No No Load/store instructions completed
PAPI_SYC_INS 0x8000003d No No Synchronization instructions completed
PAPI_L1_DCH 0x8000003e No No Level 1 data cache hits
PAPI_L2_DCH 0x8000003f No No Level 2 data cache hits
PAPI_L1_DCA 0x80000040 No No Level 1 data cache accesses
PAPI_L2_DCA 0x80000041 No No Level 2 data cache accesses
PAPI_L3_DCA 0x80000042 No No Level 3 data cache accesses
PAPI_L1_DCR 0x80000043 No No Level 1 data cache reads
PAPI_L2_DCR 0x80000044 No No Level 2 data cache reads
PAPI_L3_DCR 0x80000045 No No Level 3 data cache reads
PAPI_L1_DCW 0x80000046 No No Level 1 data cache writes
PAPI_L2_DCW 0x80000047 No No Level 2 data cache writes
PAPI_L3_DCW 0x80000048 No No Level 3 data cache writes
PAPI_L1_ICH 0x80000049 No No Level 1 instruction cache hits
PAPI_L2_ICH 0x8000004a No No Level 2 instruction cache hits
PAPI_L3_ICH 0x8000004b No No Level 3 instruction cache hits
PAPI_L1_ICA 0x8000004c No No Level 1 instruction cache accesses
PAPI_L2_ICA 0x8000004d No No Level 2 instruction cache accesses
PAPI_L3_ICA 0x8000004e No No Level 3 instruction cache accesses
PAPI_L1_ICR 0x8000004f No No Level 1 instruction cache reads
PAPI_L2_ICR 0x80000050 No No Level 2 instruction cache reads
PAPI_L3_ICR 0x80000051 No No Level 3 instruction cache reads
PAPI_L1_ICW 0x80000052 No No Level 1 instruction cache writes
PAPI_L2_ICW 0x80000053 No No Level 2 instruction cache writes
PAPI_L3_ICW 0x80000054 No No Level 3 instruction cache writes
PAPI_L1_TCH 0x80000055 No No Level 1 total cache hits
PAPI_L2_TCH 0x80000056 No No Level 2 total cache hits
PAPI_L3_TCH 0x80000057 No No Level 3 total cache hits
PAPI_L1_TCA 0x80000058 No No Level 1 total cache accesses
PAPI_L2_TCA 0x80000059 No No Level 2 total cache accesses
PAPI_L3_TCA 0x8000005a No No Level 3 total cache accesses
PAPI_L1_TCR 0x8000005b No No Level 1 total cache reads
PAPI_L2_TCR 0x8000005c No No Level 2 total cache reads
PAPI_L3_TCR 0x8000005d No No Level 3 total cache reads
PAPI_L1_TCW 0x8000005e No No Level 1 total cache writes
PAPI_L2_TCW 0x8000005f No No Level 2 total cache writes
PAPI_L3_TCW 0x80000060 No No Level 3 total cache writes
PAPI_FML_INS 0x80000061 No No Floating point multiply instructions
PAPI_FAD_INS 0x80000062 No No Floating point add instructions
PAPI_FDV_INS 0x80000063 No No Floating point divide instructions
PAPI_FSQ_INS 0x80000064 No No Floating point square root instructions
PAPI_FNV_INS 0x80000065 No No Floating point inverse instructions
PAPI_FP_OPS 0x80000066 No No Floating point operations
PAPI_SP_OPS 0x80000067 No No Floating point operations; optimized to count scaled single precision vector operations
PAPI_DP_OPS 0x80000068 No No Floating point operations; optimized to count scaled double precision vector operations
PAPI_VEC_SP 0x80000069 No No Single precision vector/SIMD instructions
PAPI_VEC_DP 0x8000006a No No Double precision vector/SIMD instructions
PAPI_REF_CYC 0x8000006b No No Reference clock cycles
-------------------------------------------------------------------------
Of 108 possible events, 0 are available, of which 0 are derived.
I could not find anything in the documentation, nor in the FAQ. Does anyone know what is wrong here?
I just had the same problem. In my case it was an overly paranoid kernel flag:
root@*****:/home/*****# ./papi/src/utils/papi_component_avail
Available components and hardware information.
--------------------------------------------------------------------------------
[...]
Compiled-in components:
Name: perf_event Linux perf_event CPU counters
\-> Disabled: perf_event support disabled by Linux with paranoid=3
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
Active components:
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
Native: 2, Preset: 0, Counters: 3
PMUs supported: rapl
My understanding of perf_event_paranoid was that it did not apply to root, but apparently the value 3 is treated differently.
I am not sure whether that was the case for the original question, because the value 3 seems to have been added in 2016, but since this is the question people are likely to find nowadays, I hope this helps.
This fixed it:
sudo sh -c 'echo -1 >/proc/sys/kernel/perf_event_paranoid'
Note that this opens up the counters to everyone, which was fine in my case.
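If you want the setting to survive reboots, or to relax it less drastically than -1, a sysctl-based variant is a common alternative (the file name below is just a convention, and the value is a choice):

# Apply immediately
sudo sysctl -w kernel.perf_event_paranoid=1
# Persist across reboots
echo 'kernel.perf_event_paranoid = 1' | sudo tee /etc/sysctl.d/99-perf.conf
sudo sysctl --system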
Version info:
Kernel 4.13.0-32P
PAPI
Does anybody know the meaning of stalled-cycles-frontend and stalled-cycles-backend in a perf stat result? I searched on the internet but did not find an answer. Thanks
$ sudo perf stat ls
Performance counter stats for 'ls':
0.602144 task-clock # 0.762 CPUs utilized
0 context-switches # 0.000 K/sec
0 CPU-migrations # 0.000 K/sec
236 page-faults # 0.392 M/sec
768956 cycles # 1.277 GHz
962999 stalled-cycles-frontend # 125.23% frontend cycles idle
634360 stalled-cycles-backend # 82.50% backend cycles idle
890060 instructions # 1.16 insns per cycle
# 1.08 stalled cycles per insn
179378 branches # 297.899 M/sec
9362 branch-misses # 5.22% of all branches [48.33%]
0.000790562 seconds time elapsed
The theory:
Let's start from this: today's CPUs are superscalar, which means that they can execute more than one instruction per cycle (IPC). The latest Intel architectures can go up to 4 IPC (4 x86 instruction decoders). Let's not bring macro/micro fusion into the discussion to complicate things further :).
Typically, workloads do not reach IPC=4 due to various resource contentions. This means that the CPU is wasting cycles (number of instructions is given by the software and the CPU has to execute them in as few cycles as possible).
We can divide the total cycles spent by the CPU into 3 categories:
Cycles where instructions get retired (useful work)
Cycles being spent in the Back-End (wasted)
Cycles spent in the Front-End (wasted).
To get an IPC of 4, the number of cycles retiring has to be close to the total number of cycles. Keep in mind that in this stage, all the micro-operations (uOps) retire from the pipeline and commit their results into registers / caches. At this stage you can have even more than 4 uOps retiring, because this number is given by the number of execution ports. If you have only 25% of the cycles retiring 4 uOps then you will have an overall IPC of 1.
The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or to finish long-latency instructions (e.g. transcendentals - sqrt, reciprocals, divisions, etc.).
The cycles stalled in the front-end are a waste because that means the Front-End does not feed the Back-End with micro-operations. This can mean that you have misses in the instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually exhibits this behavior.
Another stall reason is a branch prediction miss. That is called bad speculation. In that case uOps are issued but they are discarded because the branch predictor mispredicted.
The implementation in profilers:
How do you interpret the BE and FE stalled cycles?
Different profilers have different approaches to these metrics. In vTune, categories 1 to 3 add up to 100% of the cycles. That seems reasonable because either your CPU is stalled (no uOps are retiring) or it performs useful work (uOps retiring). See more here: https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm
In perf this usually does not happen. That's a problem because when you see 125% cycles stalled in the front end, you don't know how to really interpret this. You could link the >1 metric with the fact that there are 4 decoders but if you continue the reasoning, then the IPC won't match.
Even better, you don't know how big the problem is. 125% out of what? What do the #cycles mean then?
I am personally a bit suspicious of perf's BE and FE stalled-cycle counts and hope this will get fixed.
Probably we will get the final answer by debugging the code from here: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-stat.c
To convert the generic events exported by perf into the raw events of your CPU's documentation, you can run:
more /sys/bus/event_source/devices/cpu/events/stalled-cycles-frontend
It will show you something like
event=0x0e,umask=0x01,inv,cmask=0x01
According to the Intel documentation SDM volume 3B (I have a core i5-2520):
UOPS_ISSUED.ANY:
Increments each cycle the # of Uops issued by the RAT to RS.
Set Cmask = 1, Inv = 1, Any= 1 to count stalled cycles of this core.
For the stalled-cycles-backend event translating to event=0xb1,umask=0x01 on my system the same documentation says:
UOPS_DISPATCHED.THREAD:
Counts total number of uops to be dispatched per- thread each cycle
Set Cmask = 1, INV =1 to count stall cycles.
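The backend mapping can be inspected the same way (the exact modifier string shown below is illustrative and depends on the CPU):

more /sys/bus/event_source/devices/cpu/events/stalled-cycles-backend
# e.g. event=0xb1,umask=0x01,cmask=0x01,inv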
Usually, stalled cycles are cycles where the processor is waiting for something (memory to be fetched after issuing a load operation, for example) and doesn't have anything else to do. Moreover, the front-end part of the CPU is the piece of hardware responsible for fetching and decoding instructions (converting them to uOps), whereas the back-end part is responsible for effectively executing the uOps.
A CPU cycle is “stalled” when the pipeline doesn't advance during it.
Processor pipeline is composed of many stages: the front-end is a group of these stages which is responsible for the fetch and decode phases, while the back-end executes the instructions. There is a buffer between front-end and back-end, so when the former is stalled the latter can still have some work to do.
Taken from http://paolobernardi.wordpress.com/2012/08/07/playing-around-with-perf/
According to the author of these events, they are defined loosely and are approximated by the available CPU performance counters. As far as I know, perf doesn't support formulas to calculate a synthetic event based on several hardware events, so it can't use the front-end/back-end stall-bound method from Intel's Optimization manual (implemented in VTune) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf "B.3.2 Hierarchical Top-Down Performance Characterization Methodology":
%FE_Bound = 100 * (IDQ_UOPS_NOT_DELIVERED.CORE / N );
%Bad_Speculation = 100 * ( (UOPS_ISSUED.ANY – UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / N) ;
%Retiring = 100 * ( UOPS_RETIRED.RETIRE_SLOTS/ N) ;
%BE_Bound = 100 * (1 – (FE_Bound + Retiring + Bad_Speculation) ) ;
N = 4 * CPU_CLK_UNHALTED.THREAD (for SandyBridge)
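A rough sketch of collecting those counters with perf stat and applying the formulas by hand follows; the symbolic event names, the CSV field order, and the workload name are assumptions about your perf build, so treat it as an illustration only:

# Collect the five SandyBridge events in CSV form (perf stat writes to stderr)
perf stat -x, \
  -e cpu_clk_unhalted.thread -e idq_uops_not_delivered.core \
  -e uops_issued.any -e uops_retired.retire_slots -e int_misc.recovery_cycles \
  -- ./your_workload 2> counts.csv

# Apply the formulas above (in CSV mode field 1 is the value, field 3 the event name)
awk -F, '/^#/ {next} {v[$3] = $1}
END {
  N   = 4 * v["cpu_clk_unhalted.thread"]
  fe  = 100 * v["idq_uops_not_delivered.core"] / N
  bad = 100 * (v["uops_issued.any"] - v["uops_retired.retire_slots"] + 4 * v["int_misc.recovery_cycles"]) / N
  ret = 100 * v["uops_retired.retire_slots"] / N
  be  = 100 - fe - bad - ret
  printf "FE_Bound %.1f%%  Bad_Speculation %.1f%%  Retiring %.1f%%  BE_Bound %.1f%%\n", fe, bad, ret, be
}' counts.csv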
The right formulas can be used with some external scripting, as was done in Andi Kleen's pmu-tools (toplev.py): https://github.com/andikleen/pmu-tools (source), http://halobates.de/blog/p/262 (description):
% toplev.py -d -l2 numademo 100M stream
...
perf stat --log-fd 4 -x, -e
{r3079,r19c,r10401c3,r100030d,rc5,r10e,cycles,r400019c,r2c2,instructions}
{r15e,r60006a3,r30001b1,r40004a3,r8a2,r10001b1,cycles}
numademo 100M stream
...
BE Backend Bound: 72.03%
This category reflects slots where no uops are being delivered due to a lack
of required resources for accepting more uops in the Backend of the pipeline.
.....
FE Frontend Bound: 54.07%
This category reflects slots where the Frontend of the processor undersupplies
its Backend.
Commit which introduced stalled-cycles-frontend and stalled-cycles-backend events instead of original universal stalled-cycles:
http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=8f62242246351b5a4bc0c1f00c0c7003edea128a
author Ingo Molnar <mingo@el...> 2011-04-29 11:19:47 (GMT)
committer Ingo Molnar <mingo@el...> 2011-04-29 12:23:58 (GMT)
commit 8f62242246351b5a4bc0c1f00c0c7003edea128a (patch)
tree 9021c99956e0f9dc64655aaa4309c0f0fdb055c9
parent ede70290046043b2638204cab55e26ea1d0c6cd9 (diff)
perf events: Add generic front-end and back-end stalled cycle event definitions
Add two generic hardware events: front-end and back-end stalled cycles.
These events measure conditions when the CPU is executing code but its
capabilities are not fully utilized. Understanding such situations and
analyzing them is an important sub-task of code optimization workflows.
Both events limit performance: most front end stalls tend to be caused
by branch misprediction or instruction fetch cachemisses, backend
stalls can be caused by various resource shortages or inefficient
instruction scheduling.
Front-end stalls are the more important ones: code cannot run fast
if the instruction stream is not being kept up.
An over-utilized back-end can cause front-end stalls and thus
has to be kept an eye on as well.
The exact composition is very program logic and instruction mix
dependent.
We use the terms 'stall', 'front-end' and 'back-end' loosely and
try to use the best available events from specific CPUs that
approximate these concepts.
Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
Link: http://lkml.kernel.org/n/tip-7y40wib8n000io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar
/* Install the stalled-cycles event: UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 */
- intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES] = 0x1803fb1;
+ intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x1803fb1;
- PERF_COUNT_HW_STALLED_CYCLES = 7,
+ PERF_COUNT_HW_STALLED_CYCLES_FRONTEND = 7,
+ PERF_COUNT_HW_STALLED_CYCLES_BACKEND = 8,
Trying to determine the Processor Queue Length (the number of processes that are ready to run but currently aren't running) on a Linux machine. There is a WMI call in Windows for this metric, but not knowing much about Linux I'm trying to mine /proc and 'top' for the information. Is there a way to determine the queue length for the CPU?
Edit to add: Microsoft's words concerning their metric: "The collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue."
sar -q will report queue length, task list length and three load averages.
Example:
matli@tornado:~$ sar -q 1 0
Linux 2.6.27-9-generic (tornado) 01/13/2009 _i686_
11:38:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:38:33 PM 0 305 1.26 0.95 0.54
11:38:34 PM 4 305 1.26 0.95 0.54
11:38:35 PM 1 306 1.26 0.95 0.54
11:38:36 PM 1 306 1.26 0.95 0.54
^C
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 256368 53764 75980 220564 2 28 60 54 774 1343 15 4 78 2
The first column (r) is the run queue - 2 on my machine right now
Edit: Surprised there isn't a way to just get the number
Quick 'n' dirty way to get the number (might vary a little on different machines):
vmstat|tail -1|cut -d" " -f2
The metrics you seek exist in /proc/schedstat.
The format of this file is described in sched-stats.txt in the kernel source. Specifically, the cpu<N> lines are what you want:
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
In particular, field 8. To find the run queue length, you would:
Observe field 8 for each CPU and record the value.
Wait for some interval.
Observe field 8 for each CPU again, and calculate how much the value has increased.
By Little's Law, dividing that difference by the length of the time interval waited (the documentation says it's in jiffies, but it has actually been nanoseconds since the addition of CFS) yields the mean length of the scheduler run queue over the interval.
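A minimal shell sketch of that procedure (on a cpu<N> line the CPU name is the first token, so field 8 of the statistics is column 9):

# Sample the cumulative "time spent waiting to run" (field 8, in nanoseconds),
# wait for an interval, sample again, and divide the delta by the interval.
INTERVAL=5   # seconds

before=$(awk '/^cpu[0-9]/ {s += $9} END {print s}' /proc/schedstat)
sleep "$INTERVAL"
after=$(awk '/^cpu[0-9]/ {s += $9} END {print s}' /proc/schedstat)

# Mean number of tasks waiting to run over the interval, summed across all CPUs
echo "$before $after $INTERVAL" | awk '{printf "mean number of tasks waiting to run: %.2f\n", ($2 - $1) / ($3 * 1e9)}'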
Unfortunately, I'm not aware of any utility to automate this process which is usually installed or even packaged in a Linux distribution. I've not used it, but the kernel documentation suggests http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c, which unfortunately refers to a domain that is no longer resolvable. Fortunately, it's available on the wayback machine.
Why not sar or vmstat?
These tools report the number of currently runnable processes. Certainly if this number is greater than the number of CPUs, some of them must be waiting. However, processes can still be waiting even when the number of processes is less than the number of CPUs, for a variety of reasons:
A process may be pinned to a particular CPU.
The scheduler may decide to schedule a process on a particular CPU to make better utilization of cache, or for NUMA optimization reasons.
The scheduler may intentionally idle a CPU to allow more time to a competing, higher priority process on another CPU that shares the same execution core (a hyperthreading optimization).
Hardware interrupts may be processable only on particular CPUs for a variety of hardware and software reasons.
Moreover, the number of runnable processes is only sampled at an instant in time. In many cases this number may fluctuate rapidly, and the contention may be occurring between the times the metric is being sampled.
These things mean the number of runnable processes minus the number of CPUs is not a reliable indicator of CPU contention.
uptime will give you the recent load average, which is approximately the average number of active processes. uptime reports the load average over the last 1, 5, and 15 minutes. It's a per-system measurement, not per-CPU.
Not sure what the processor queue length in Windows is; hopefully this is close enough?