I know that the ARM PMU is partially implemented, thanks to the gem5 source code and some publications.
I have a binary which uses perf_event to access the PMU on a Linux-based OS running on an ARM processor. Can it use perf_event inside a gem5 full-system simulation with a Linux kernel, under the ARM ISA?
So far, I haven't found the right way to do it. If someone knows, I will be very grateful!
As of September 2020, gem5 needs to be patched in order to use the ARM PMU.
Edit: As of November 2020, gem5 is now patched and it will be included in the next release. Thanks to the developers!
How to patch gem5
This is not a clean patch (it is deliberately straightforward); it is mainly intended to show how the mechanism works. Nonetheless, here is the patch, to be applied with git apply from the gem5 source repository:
diff --git i/src/arch/arm/ArmISA.py w/src/arch/arm/ArmISA.py
index 2641ec3fb..3d85c1b75 100644
--- i/src/arch/arm/ArmISA.py
+++ w/src/arch/arm/ArmISA.py
@@ -36,6 +36,7 @@
 from m5.params import *
 from m5.proxy import *
+from m5.SimObject import SimObject
 from m5.objects.ArmPMU import ArmPMU
 from m5.objects.ArmSystem import SveVectorLength
 from m5.objects.BaseISA import BaseISA
@@ -49,6 +50,8 @@ class ArmISA(BaseISA):
     cxx_class = 'ArmISA::ISA'
     cxx_header = "arch/arm/isa.hh"
+    generateDeviceTree = SimObject.recurseDeviceTree
+
     system = Param.System(Parent.any, "System this ISA object belongs to")
     pmu = Param.ArmPMU(NULL, "Performance Monitoring Unit")
diff --git i/src/arch/arm/ArmPMU.py w/src/arch/arm/ArmPMU.py
index 047e908b3..58553fbf9 100644
--- i/src/arch/arm/ArmPMU.py
+++ w/src/arch/arm/ArmPMU.py
@@ -40,6 +40,7 @@ from m5.params import *
 from m5.params import isNullPointer
 from m5.proxy import *
 from m5.objects.Gic import ArmInterruptPin
+from m5.util.fdthelper import *
 class ProbeEvent(object):
     def __init__(self, pmu, _eventId, obj, *listOfNames):
@@ -76,6 +77,17 @@ class ArmPMU(SimObject):
     _events = None
+    def generateDeviceTree(self, state):
+        node = FdtNode("pmu")
+        node.appendCompatible("arm,armv8-pmuv3")
+        # gem5 uses GIC controller interrupt notation, where PPI interrupts
+        # start to 16. However, the Linux kernel start from 0, and used a tag
+        # (set to 1) to indicate the PPI interrupt type.
+        node.append(FdtPropertyWords("interrupts", [
+            1, int(self.interrupt.num) - 16, 0xf04
+        ]))
+        yield node
+
     def addEvent(self, newObject):
         if not (isinstance(newObject, ProbeEvent)
             or isinstance(newObject, SoftwareIncrement)):
diff --git i/src/cpu/BaseCPU.py w/src/cpu/BaseCPU.py
index ab70d1d7f..66a49a038 100644
--- i/src/cpu/BaseCPU.py
+++ w/src/cpu/BaseCPU.py
@@ -302,6 +302,11 @@ class BaseCPU(ClockedObject):
             node.appendPhandle(phandle_key)
             cpus_node.append(node)
+        # Generate nodes from the BaseCPU children (and don't add them as
+        # subnode). Please note: this is mainly needed for the ISA class.
+        for child_node in self.recurseDeviceTree(state):
+            yield child_node
+
         yield cpus_node
     def __init__(self, **kwargs):
What the patch resolves
The Linux kernel uses a Device Tree Blob (DTB), a regular file, to describe the hardware on which it is running. This makes the kernel portable across different hardware without recompiling it for each hardware change. The DTB follows the Device Tree Reference and is compiled from a Device Tree Source (DTS) file, a regular text file. You can learn more here and here.
The PMU is supposed to be declared to the Linux kernel via the DTB. You can learn more here and here. In a simulated system, because the hardware is specified by the user, gem5 has to generate the DTB itself and pass it to the kernel so that the kernel can recognize the simulated hardware. The problem was that gem5 did not generate a DTB entry for the PMU.
What the patch does
The patch adds an entry to the ISA and CPU files so that DTB generation recurses down to the PMU. The hierarchy is: CPU => ISA => PMU. It then adds a generation function to the PMU that emits a single DTB entry declaring the PMU, with the interrupt written in the notation the kernel expects.
After running a simulation with the patch, we can inspect the DTS decompiled from the DTB like this:
cd m5out
# Decompile the DTB to get the DTS.
dtc -I dtb -O dts system.dtb > system.dts
# Find the PMU entry.
head system.dts
dtc is the Device Tree Compiler, installed with sudo apt-get install device-tree-compiler. We end up with this pmu DTB entry, under the root node (/):
/dts-v1/;
/ {
    #address-cells = <0x02>;
    #size-cells = <0x02>;
    interrupt-parent = <0x05>;
    compatible = "arm,vexpress";
    model = "V2P-CA15";
    arm,hbi = <0x00>;
    arm,vexpress,site = <0x0f>;
    memory@80000000 {
        device_type = "memory";
        reg = <0x00 0x80000000 0x01 0x00>;
    };
    pmu {
        compatible = "arm,armv8-pmuv3";
        interrupts = <0x01 0x04 0xf04>;
    };
    cpus {
        #address-cells = <0x01>;
        #size-cells = <0x00>;
        cpu@0 {
            device_type = "cpu";
            compatible = "gem5,arm-cpu";
[...]
In the line interrupts = <0x01 0x04 0xf04>;, 0x01 indicates that 0x04 is the number of a PPI interrupt (the one declared with number 20 in gem5; the offset of 16 is explained in the patch code). 0xf04 combines a flag (0x4) indicating an "active high level-sensitive" interrupt with a bit mask (0xf) indicating that the interrupt should be wired to all PEs attached to the GIC. You can learn more here.
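To make the arithmetic behind those three cells explicit, here is a minimal C sketch. The helper name ppi_to_dt_cells and the two macros are made up for illustration (they are not gem5 or kernel identifiers), and the layout of the third cell (CPU mask in bits 15:8, trigger flag in the low bits) is taken from the explanation above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names only: device-tree type cell for a PPI. */
#define DT_IRQ_TYPE_PPI   1u
/* Trigger flag: active-high, level-sensitive. */
#define DT_IRQ_LEVEL_HIGH 0x4u

/*
 * Fill the three "interrupts" cells for a PPI.
 * gic_id:   interrupt ID as gem5 numbers it (PPIs are 16..31).
 * cpu_mask: one bit per PE that receives the PPI (0xf in the dump above).
 */
static void ppi_to_dt_cells(unsigned gic_id, unsigned cpu_mask,
                            uint32_t cells[3])
{
    assert(gic_id >= 16 && gic_id <= 31);  /* PPI range only */
    cells[0] = DT_IRQ_TYPE_PPI;
    cells[1] = gic_id - 16;                /* the device tree counts PPIs from 0 */
    cells[2] = ((cpu_mask & 0xffu) << 8) | DT_IRQ_LEVEL_HIGH;
}
```

Calling ppi_to_dt_cells(20, 0xf, cells) yields 0x01, 0x04 and 0xf04, which matches the pmu node in the decompiled DTS.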
If the patch works and your ArmPMU is declared properly, you should see this message at boot time:
[ 0.239967] hw perfevents: enabled with armv8_pmuv3 PMU driver, 32 counters available
Context
I was not able to use the Performance Monitoring Unit (PMU) because of an unimplemented feature in gem5. The reference on the mailing list can be found here. After a personal patch, the PMU is accessible through perf_event. Fortunately, a similar patch will soon be included in an official gem5 release; it can be seen here. The patch is described in another answer, because of the limit on the number of links in a single message.
How to use the PMU
C source code
This is a minimal working example of C source code using perf_event to count the branches mispredicted by the branch predictor unit during a specific task:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(int argc, char **argv) {
    /* File descriptor used to read mispredicted branches counter. */
    static int perf_fd_branch_miss;
    /* Initialize our perf_event_attr, representing one counter to be read.
       Being static, the structure starts out zero-initialized. */
    static struct perf_event_attr attr_branch_miss;
    attr_branch_miss.size = sizeof(attr_branch_miss);
    attr_branch_miss.exclude_kernel = 1;
    attr_branch_miss.exclude_hv = 1;
    attr_branch_miss.exclude_callchain_kernel = 1;
    /* On a real system, you can do like this: */
    attr_branch_miss.type = PERF_TYPE_HARDWARE;
    attr_branch_miss.config = PERF_COUNT_HW_BRANCH_MISSES;
    /* On a gem5 system, you have to do like this (these two assignments
       override the two above): */
    attr_branch_miss.type = PERF_TYPE_RAW;
    attr_branch_miss.config = 0x10;

    /* Open the file descriptor corresponding to this counter. The counter
       should start at this moment. */
    if ((perf_fd_branch_miss = syscall(__NR_perf_event_open, &attr_branch_miss, 0, -1, -1, 0)) == -1) {
        fprintf(stderr, "perf_event_open fail %d %d: %s\n", perf_fd_branch_miss, errno, strerror(errno));
        return 1;
    }

    /* Workload here, that means our specific task to profile. */

    /* Get and close the performance counters. */
    uint64_t counter_branch_miss = 0;
    if (read(perf_fd_branch_miss, &counter_branch_miss, sizeof(counter_branch_miss)) != sizeof(counter_branch_miss))
        fprintf(stderr, "read fail: %s\n", strerror(errno));
    close(perf_fd_branch_miss);

    /* Display the result (PRIu64 is the portable format for uint64_t). */
    printf("Number of mispredicted branches: %" PRIu64 "\n", counter_branch_miss);
    return 0;
}
I will not go into the details of how to use perf_event; good resources are available here, here, here, here. However, here are a few notes about the code above:
On real hardware, when using perf_event with common events (events that are available on many architectures), it is recommended to use the perf_event macro PERF_TYPE_HARDWARE as the type, together with macros such as PERF_COUNT_HW_BRANCH_MISSES for the number of mispredicted branches, PERF_COUNT_HW_CACHE_MISSES for the number of cache misses, and so on (see the manual page for a list). This is best practice for portable code.
On a gem5 simulated system, currently (v20.0), C source code has to use the PERF_TYPE_RAW type and an architectural event ID to identify an event. Here, 0x10 is the ID of the event 0x0010, BR_MIS_PRED, "Mispredicted or not predicted branch", described in the ARMv8-A Reference Manual (here). The manual describes all events available in real hardware; however, not all of them are implemented in gem5. To see the list of events implemented in gem5, refer to the src/arch/arm/ArmPMU.py file. There, the line self.addEvent(ProbeEvent(self,0x10, bpred, "Misses")) corresponds to the declaration of the counter described in the manual. This is not the expected behavior, so gem5 should eventually be patched to allow PERF_TYPE_HARDWARE.
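The two notes above can be summed up in a small sketch that builds either kind of counter description. The helper names (init_user_only, raw_event_attr, generic_branch_miss_attr) are hypothetical, and the 16-bit bound on the event ID is an assumption made here for a sanity check, not something stated in the answer.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero the structure and count user space only. */
static void init_user_only(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));   /* unused fields must be zero */
    attr->size = sizeof(*attr);
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}

/* Raw architectural event ID, e.g. 0x10 (BR_MIS_PRED): the gem5 v20.0 way. */
static struct perf_event_attr raw_event_attr(uint64_t event_id)
{
    struct perf_event_attr attr;

    assert(event_id <= 0xffff);       /* assumption: IDs fit in 16 bits */
    init_user_only(&attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = event_id;
    return attr;
}

/* Generic hardware event: the portable choice on real hardware. */
static struct perf_event_attr generic_branch_miss_attr(void)
{
    struct perf_event_attr attr;

    init_user_only(&attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    return attr;
}
```

Either structure can then be passed to the perf_event_open syscall exactly as in the example above; only type and config differ between the two variants.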
gem5 simulation script
This is not an entire MWE script (it would be too long!), only the portion that must be added to a full-system script to use the PMU. We use an ArmSystem as the system, with the RealView platform.
For each ISA (we use an ARM ISA here) of each CPU (e.g., a DerivO3CPU) in our cluster (which is a SubSystem class), we add a PMU with a unique interrupt number and the already implemented architectural events. An example of this function can be found in configs/example/arm/devices.py.
To choose an interrupt number, pick a free PPI interrupt in the platform interrupt mapping. Here, we choose PPI n°20, according to the RealView interrupt map (src/dev/arm/RealView.py). Since PPI interrupts are local to each Processing Element (PE, which corresponds to a core in our context), the interrupt number can be the same for all PEs without any conflict. To know more about PPI interrupts, see the GIC guide from ARM here.
Here, we can see that the interrupt n°20 is not used by the system (from RealView.py):
Interrupts:
0- 15: Software generated interrupts (SGIs)
16- 31: On-chip private peripherals (PPIs)
25 : vgic
26 : generic_timer (hyp)
27 : generic_timer (virt)
28 : Reserved (Legacy FIQ)
We pass our system components (dtb, itb, etc.) to addArchEvents to link the PMU with them, so that the PMU uses the internal counters (called probes) of these components as the counters exposed to the system.
for cpu in system.cpu_cluster.cpus:
    for isa in cpu.isa:
        isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
        # Add the implemented architectural events of gem5. We can
        # discover which events is implemented by looking at the file
        # "ArmPMU.py".
        isa.pmu.addArchEvents(
            cpu=cpu, dtb=cpu.dtb, itb=cpu.itb,
            icache=getattr(cpu, "icache", None),
            dcache=getattr(cpu, "dcache", None),
            l2cache=getattr(system.cpu_cluster, "l2", None))
Two quick additions to Pierre's awesome answers:
For fs.py, as of gem5 937241101fae2cd0755c43c33bab2537b47596a2, all that is missing is to apply the following change to fs.py, as shown at: https://gem5-review.googlesource.com/c/public/gem5/+/37978/1/configs/example/fs.py
for cpu in test_sys.cpu:
    if buildEnv['TARGET_ISA'] in "arm":
        for isa in cpu.isa:
            isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
            isa.pmu.addArchEvents(
                cpu=cpu, dtb=cpu.mmu.dtb, itb=cpu.mmu.itb,
                icache=getattr(cpu, "icache", None),
                dcache=getattr(cpu, "dcache", None),
                l2cache=getattr(test_sys, "l2", None))
A C example can also be found in man perf_event_open.
Where can I check whether the Linux kernel is running in non-secure (EL2) or secure (EL3) mode?
How can I change this mode?
I am running on 64-bit ARMv8.
Thanks in advance
Where can I check whether the Linux kernel is running in non-secure (EL2) or secure (EL3) mode?
I'm going to give a cheeky answer.
Adapt the code from "What is the current execution mode/exception level, etc?" and hack it into your region of interest:
diff --git a/init/main.c b/init/main.c
index 18f8f0140fa0..840f886d17b3 100644
--- a/init/main.c
+++ b/init/main.c
@@ -533,6 +533,10 @@ asmlinkage __visible void __init start_kernel(void)
 	char *command_line;
 	char *after_dashes;
+	register u64 x0 __asm__ ("x0");
+	__asm__ ("mrs x0, CurrentEL;" : : : "%x0");
+	pr_info("EL = %llu\n", (unsigned long long)(x0 >> 2));
+
 	set_task_stack_end_magic(&init_task);
 	smp_setup_processor_id();
 	debug_objects_early_init();
Output at the start of boot.
EL = 1
The default Linux kernel v4.19 boot logs also tell us that by default:
CPU: All CPU(s) started at EL1
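The x0 >> 2 in the patch works because CurrentEL keeps the exception level in bits [3:2]. As a small, hedged illustration, the decoding step alone can be written as plain C that runs anywhere; the function name current_el_from_reg is made up for this sketch, and it takes the raw register value as an argument rather than reading the register itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extract the exception level from a raw CurrentEL value.
 * The EL is held in bits [3:2]; the other bits are reserved and read as 0.
 */
static unsigned current_el_from_reg(uint64_t currentel)
{
    assert((currentel & ~UINT64_C(0xC)) == 0);  /* reserved bits must be clear */
    return (unsigned)((currentel >> 2) & 0x3);
}
```

A raw value of 0x4 therefore decodes to EL1, 0x8 to EL2 and 0xC to EL3.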
How can I change this mode?
Traditionally the kernel ran at EL1 only; EL2 was left for a hypervisor like Xen, and EL3 for a bootloader like ARM Trusted Firmware.
However, with the introduction of the ARMv8.1 VHE extension, running the kernel at EL2 became efficient and support was added to mainline Linux, see: What are Ring 0 and Ring 3 in the context of operating systems? Not sure which config turns it on though, have a look.
For EL3, I don't think there's anything mainline. There are some people trying to adapt kernel code to be the bootloader as well, have a look e.g. at: https://github.com/kexecboot/kexecboot
I'm working with a kernel with SMP support: Snapgear 2.6.21.
I have created 4 threads in my C application, and I am trying to set thread 1 to run on CPU 1, thread 2 on CPU 2, etc.
However, the compiler sparc-linux-gcc does not recognize these functions:
CPU_SET (int cpu, cpu_set_t * set);
CPU_ZERO (cpu_set_t * set);
and this type: cpu_set_t
It always gives me these errors:
implicit declaration of function 'CPU_ZERO'
implicit declaration of function 'CPU_SET'
'cpu_set_t' undeclared (first use in this function)
Here is my code to bind active thread to processor 0:
cpu_set_t mask;
CPU_ZERO (& mask);
CPU_SET (0, & mask) // bind processor 0
sched_setaffinity (0, sizeof(mask), & mask);
I have added this define and include at the top:
#define _GNU_SOURCE
#include <sched.h>
But I always get the same errors. Can you help me, please?
You should read sched_setaffinity(2) carefully and test its result (and display errno on failure, e.g. with perror).
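As a minimal sketch of that advice, assuming a C library whose sched.h provides the CPU_* macros once _GNU_SOURCE is defined (which may not hold for the old toolchain in the question), the call can be wrapped like this. The helper names pin_self_to_cpu and first_allowed_cpu are made up for illustration.

```c
#define _GNU_SOURCE   /* must come before any #include to expose the CPU_* macros */
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/*
 * Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (the errno text is printed).
 */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t mask;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* A pid of 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

/* Return the lowest CPU the calling thread may run on, or -1 on error. */
static int first_allowed_cpu(void)
{
    cpu_set_t mask;

    if (sched_getaffinity(0, sizeof(mask), &mask) == -1)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    return -1;
}
```

Each thread would call pin_self_to_cpu with its own CPU number at the start of its thread function and check the return value before continuing.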
Actually, I believe you should use pthread_setaffinity_np(3) instead (and of course test its failure, etc...)
Even more, I believe that you should not bother to explicitly set the affinity. Recent Linux kernels are often quite good at dispatching running threads on different CPUs.
So simply use pthreads and don't bother about affinity, unless you see actual issues when benchmarking.
BTW, passing the -H flag to your GCC (cross-)compiler could be helpful. It shows you the included files. Perhaps also look into the preprocessed form obtained with gcc -C -E ; it looks like some header files are missing or not found (maybe some missing -I include-directory at compilation time, or some missing headers on your development system)
BTW, your kernel version looks ancient. Can't you upgrade your kernel to something newer (3.15.x or some 3.y)?
I was asked to run a Java virtual machine on a Broadcom MIPS board and was very glad to find the OJEC cvm binary for MIPS from Oracle. Unfortunately, it seems that the binary wasn't built for my board since it could not be executed properly.
/mnt/nfs/Oracle_JavaME_Embedded_Client/1.0/binaries/bin # ./cvm
-sh: ./cvm: not found
Does anyone know whether I can get the OJEC source code somewhere, so that I can rebuild the client with the toolchain we're using for the board? If yes, is there a guide for building the client?
While searching Google, I found CDC source code from the "Phoneme" project and could build the cvm with our MIPS toolchain. It works fine! I could run its tests and some hello world samples. However, looking at the Phoneme svn log, I realized that the project has not been active recently; the last change is about a year old. Could someone tell me the project's status and how it differs from the OJEC?
I'm also confused about OpenJDK's HotSpot. Is it different from the OJEC, or are they both based on the CDC?
Here is the cpu info I got from my box's /proc/
cat /proc/cpuinfo
system type : BCM7413B1 STB platform
processor : 0
cpu model : Broadcom BMIPS4380 V4.4 FPU V0.1
BogoMIPS : 404.48
wait instruction : yes
microsecond timers : yes
tlb_entries : 32
extra interrupt vector : yes
hardware watchpoint : no
ASEs implemented : mips16
shadow register sets : 1
core : 0
VCED exceptions : not available
VCEI exceptions : not available
You just forgot to chmod the file to +x
The binary is for mips32r2, your platform is mips32r1.
Not much help, I know, but HotSpot is the OpenJDK (also the Oracle JVM) for its SE implementation - it isn't an ME implementation.
By the way, the command line you posted just looks like the cvm binary is missing in that directory.