Why does strace ignore some syscalls (randomly) depending on environment/kernel? - linux

If I compile the following program:
$ cat main.cpp && g++ main.cpp
#include <time.h>

int main() {
    struct timespec ts;
    return clock_gettime(CLOCK_MONOTONIC, &ts);
}
and then run it under strace in "standard" Kubuntu, I get this:
strace -tt --trace=clock_gettime ./a.out
17:58:40.395200 +++ exited with 0 +++
As you can see, there is no clock_gettime (full strace output is here).
On the other hand, if I run the same app in my custom built linux kernel under qemu, I get the following output:
strace -tt --trace=clock_gettime ./a.out
18:00:53.082115 clock_gettime(CLOCK_MONOTONIC, {tv_sec=101481, tv_nsec=107976517}) = 0
18:00:53.082331 +++ exited with 0 +++
This is more like what I expected: clock_gettime is there.
So, my questions are:
Why does strace ignore/omit clock_gettime when I run it on Kubuntu?
Why does strace's behaviour differ depending on the environment/kernel?

Answer to the first question
From the vdso(7) man page:
strace(1), seccomp(2), and the vDSO
When tracing system calls with strace(1), symbols (system calls) that are exported by the vDSO will not appear in the trace output. Those system calls will likewise not be visible to seccomp(2) filters.
Answer to the second question:
In the vDSO, clock_gettime() and related functions rely on specific clock modes; see __arch_get_hw_counter.
If the clock mode is VCLOCK_TSC, the time is read without a syscall, using RDTSC; if it’s VCLOCK_PVCLOCK or VCLOCK_HVCLOCK, it’s read from a specific page to retrieve the information from the hypervisor. HPET doesn’t declare a clock mode, so it ends up with the default VCLOCK_NONE, and the vDSO issues a system call to retrieve the time.
And indeed:
In the default kernel (from Kubuntu):
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Custom built kernel:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
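You can test this hypothesis directly: the current clocksource is writable at runtime (as root), so switching the Kubuntu kernel to hpet should make clock_gettime reappear in the trace, since the vDSO then falls back to the real system call. A quick check, assuming hpet is listed among the available clocksources:
$ echo hpet | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
$ strace -tt --trace=clock_gettime ./a.out
$ echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource   # switch back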
More info about various clock sources. In particular:
The documentation of Red Hat MRG version 2 states that TSC is the preferred clock source due to its much lower overhead, but it uses HPET as a fallback. A benchmark in that environment for 10 million event counts found that TSC took about 0.6 seconds, HPET took slightly over 12 seconds, and ACPI Power Management Timer took around 24 seconds.

It may be due to the fact that clock_gettime() is among the syscalls optimized via the vDSO mechanism described in this answer.
Considering clock_gettime(), on some architectures (e.g. 32-bit ARMv7 Linux), only a subset of the available clock identifiers is supported in the vDSO implementation; for the others, there is a fallback to the actual system call. Here is the source code of the vDSO implementation of clock_gettime() in the Linux kernel for ARMv7 (file arch/arm/vdso/vgettimeofday.c in the source tree):
notrace int __vdso_clock_gettime(clockid_t clkid, struct timespec *ts)
{
    struct vdso_data *vdata;
    int ret = -1;

    vdata = __get_datapage();

    switch (clkid) {
    case CLOCK_REALTIME_COARSE:
        ret = do_realtime_coarse(ts, vdata);
        break;
    case CLOCK_MONOTONIC_COARSE:
        ret = do_monotonic_coarse(ts, vdata);
        break;
    case CLOCK_REALTIME:
        ret = do_realtime(ts, vdata);
        break;
    case CLOCK_MONOTONIC:
        ret = do_monotonic(ts, vdata);
        break;
    default:
        break;
    }

    if (ret)
        ret = clock_gettime_fallback(clkid, ts);
    return ret;
}
The above source code shows a switch/case over the supported clock identifiers; in the default case, there is a fallback to the actual system call.
On such an architecture, when tracing software like systemd, which uses clock_gettime() with both CLOCK_MONOTONIC and CLOCK_BOOTTIME, strace only shows the calls with the latter identifier, as it is not among the cases supported in vDSO mode.
Cf. this link for reference

Related

How does one do a "zero-syscall clock_gettime" without dynamic linking?

I ran the code below with strace. I can see it doesn't use a system call to get the time. After write only clock_nanosleep and exit_group are called. It correctly gives me 3 every run (I was expecting to occasionally get 2).
Musl .80 release says "zero-syscall clock_gettime support (dynamic-linked x86_64 only)"
Is there a way to do this without depending on being dynamically linked? Perhaps with cpuid? Compiling with -nostdlib says it depends on clock_gettime. Is there a way to implement CLOCK_REALTIME or CLOCK_MONOTONIC? I'd find using __rdtsc(p) acceptable if there's a way to get the CPU frequency so I can estimate how much time has passed. I don't need nanosecond precision.
#include <time.h>
#include <unistd.h> /* for write() and sleep() */

int main() {
    struct timespec time, time2;
    write(1, "Test\n", 5);
    clock_gettime(CLOCK_REALTIME, &time);
    sleep(3);
    clock_gettime(CLOCK_REALTIME, &time2);
    return time2.tv_sec - time.tv_sec;
}
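For the TSC idea specifically, here is a minimal sketch (not the musl implementation): it assumes an x86-64 CPU with an invariant TSC, calibrates the TSC frequency once at startup using clock_gettime, and from then on reads time with RDTSC alone, i.e. with zero syscalls in the steady state. calibrate_tsc_hz is a made-up helper name, and accuracy is limited by the calibration window:
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

/* Estimate the TSC frequency once; clock_gettime is used only here. */
static double calibrate_tsc_hz(void) {
    struct timespec a, b, wait = {0, 100 * 1000 * 1000}; /* 100 ms */
    clock_gettime(CLOCK_MONOTONIC, &a);
    uint64_t t0 = __rdtsc();
    nanosleep(&wait, NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    uint64_t t1 = __rdtsc();
    double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    return (t1 - t0) / secs;
}

int main(void) {
    double hz = calibrate_tsc_hz(); /* one-time cost, uses syscalls */
    uint64_t start = __rdtsc();
    /* ... workload to be timed, with zero syscalls ... */
    uint64_t end = __rdtsc();
    printf("elapsed: %.6f s (TSC at %.0f Hz)\n", (end - start) / hz, hz);
    return 0;
}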

perf stat time elapsed for sys and user [duplicate]

$ time foo
real 0m0.003s
user 0m0.000s
sys 0m0.004s
$
What do real, user and sys mean in the output of time? Which one is meaningful when benchmarking my app?
Real, User and Sys process time statistics
One of these things is not like the other. Real refers to actual elapsed time; User and Sys refer to CPU time used only by the process.
Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).
User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.
Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.
User+Sys will tell you how much actual CPU time your process used. Note that this is across all CPUs, so if the process has multiple threads (and this process is running on a computer with more than one processor) it could potentially exceed the wall clock time reported by Real (which usually occurs). Note that in the output these figures include the User and Sys time of all child processes (and their descendants) as well when they could have been collected, e.g. by wait(2) or waitpid(2), although the underlying system calls return the statistics for the process and its children separately.
Origins of the statistics reported by time(1)
The statistics reported by time are gathered from various system calls. 'User' and 'Sys' come from wait(2) (POSIX) or times(2) (POSIX), depending on the particular system. 'Real' is calculated from a start and end time gathered from the gettimeofday(2) call. Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by time.
On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel. Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors, as the example given by the original poster shows.
A brief primer on Kernel vs. User mode
On Unix, or any protected-memory operating system, 'Kernel' or 'Supervisor' mode refers to a privileged mode that the CPU can operate in. Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code. An example of such an action might be manipulation of the MMU to gain access to the address space of another process. Normally, user-mode code cannot do this (with good reason), although it can request shared memory from the kernel, which could be read or written by more than one process. In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.
The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode. In order to switch to kernel mode you have to issue a specific instruction (often called a trap) that switches the CPU to running in kernel mode and runs code from a specific location held in a jump table. For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode. You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.
The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program. Behind the scenes, they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode. It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call.
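As a concrete (and deliberately non-portable) illustration of issuing such a trap yourself, the snippet below performs write(2) on x86-64 Linux via the syscall instruction, bypassing the C library entirely; the syscall number and register conventions are specific to that ABI:
int main(void) {
    static const char msg[] = "hello from a raw syscall\n";
    long ret;
    /* x86-64 Linux ABI: rax = syscall number (1 = write), arguments in
       rdi, rsi, rdx; the syscall instruction clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(1L),                          /* __NR_write */
                        "D"(1L),                          /* fd = stdout */
                        "S"(msg),                         /* buffer */
                        "d"((long)(sizeof(msg) - 1))      /* length */
                      : "rcx", "r11", "memory");
    return 0;
}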
More about 'sys'
There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.). These are under the supervision of the kernel, and it alone can do them. Some operations like malloc or fread/fwrite will invoke these kernel functions, and that then counts as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to malloc will do some processing of its own (still counted in 'user' time) and then somewhere along the way it may call the function in the kernel (counted in 'sys' time). After returning from the kernel call, there will be some more time in 'user' and then malloc will return to your code. As for when the switch happens, and how much of it is spent in kernel mode... you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might use malloc and the like in the background, which will again spend some time in 'sys'.
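To see this split for yourself, here is a hedged little experiment (the iteration counts are arbitrary): run it under time once with no argument (pure computation, the time lands in 'user') and once with any argument (a tight loop of raw getpid syscalls, the time lands mostly in 'sys'):
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char **argv) {
    volatile unsigned long x = 0;
    if (argc > 1) {
        /* syscall-heavy: every iteration traps into the kernel */
        for (long i = 0; i < 5000000; ++i)
            syscall(SYS_getpid); /* raw syscall, bypasses any libc caching */
    } else {
        /* user-heavy: pure computation, the kernel is never entered */
        for (long i = 0; i < 500000000; ++i)
            x += i;
    }
    return 0;
}
For example: time ./a.out versus time ./a.out sys.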
To expand on the accepted answer, I just wanted to provide another reason why real ≠ user + sys.
Keep in mind that real represents actual elapsed time, while user and sys values represent CPU execution time. As a result, on a multicore system, the user and/or sys time (as well as their sum) can actually exceed the real time. For example, on a Java app I'm running for class I get this set of values:
real 1m47.363s
user 2m41.318s
sys 0m4.013s
• real: The actual time spent in running the process from start to finish, as if it was measured by a human with a stopwatch
• user: The cumulative time spent by all the CPUs during the computation
• sys: The cumulative time spent by all the CPUs during system-related tasks such as memory allocation.
Notice that sometimes user + sys might be greater than real, as
multiple processors may work in parallel.
Minimal runnable POSIX C examples
To make things more concrete, I want to exemplify a few extreme cases of time with some minimal C test programs.
All programs can be compiled and run with:
gcc -ggdb3 -o main.out -pthread -std=c99 -pedantic-errors -Wall -Wextra main.c
time ./main.out
and have been tested in Ubuntu 18.10, GCC 8.2.0, glibc 2.28, Linux kernel 4.18, ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB).
sleep syscall
Non-busy sleep as done by the sleep syscall only counts in real, but not for user or sys.
For example, a program that sleeps for a second:
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    sleep(1);
    return EXIT_SUCCESS;
}
GitHub upstream.
outputs something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
The same holds for programs blocked on IO becoming available.
For example, the following program waits for the user to enter a character and press enter:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("%c\n", getchar());
    return EXIT_SUCCESS;
}
GitHub upstream.
And if you wait about one second before pressing enter, it outputs, just like the sleep example, something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
For this reason time can help you distinguish between CPU and IO bound programs: What do the terms "CPU bound" and "I/O bound" mean?
Multiple threads
The following example does niters iterations of useless purely CPU-bound work on nthreads threads:
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

uint64_t niters;

void* my_thread(void *arg) {
    uint64_t *argument, i, result;
    argument = (uint64_t *)arg;
    result = *argument;
    for (i = 0; i < niters; ++i) {
        result = (result * result) - (3 * result) + 1;
    }
    *argument = result;
    return NULL;
}

int main(int argc, char **argv) {
    size_t nthreads;
    pthread_t *threads;
    uint64_t rc, i, *thread_args;

    /* CLI args. */
    if (argc > 1) {
        niters = strtoll(argv[1], NULL, 0);
    } else {
        niters = 1000000000;
    }
    if (argc > 2) {
        nthreads = strtoll(argv[2], NULL, 0);
    } else {
        nthreads = 1;
    }
    threads = malloc(nthreads * sizeof(*threads));
    thread_args = malloc(nthreads * sizeof(*thread_args));

    /* Create all threads */
    for (i = 0; i < nthreads; ++i) {
        thread_args[i] = i;
        rc = pthread_create(
            &threads[i],
            NULL,
            my_thread,
            (void*)&thread_args[i]
        );
        assert(rc == 0);
    }

    /* Wait for all threads to complete */
    for (i = 0; i < nthreads; ++i) {
        rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
        printf("%" PRIu64 " %" PRIu64 "\n", i, thread_args[i]);
    }

    free(threads);
    free(thread_args);
    return EXIT_SUCCESS;
}
GitHub upstream + plot code.
Then we plot wall, user and sys as a function of the number of threads for a fixed 10^10 iterations on my 8 hyperthread CPU:
[Plot of wall, user and sys vs. number of threads: https://i.stack.imgur.com/qAfEe.png]
Plot data.
From the graph, we see that:
for a CPU-intensive single-core application, wall and user are about the same
for 2 cores, user is about 2x wall, which means the user time is counted across all threads: user basically doubled while wall stayed the same
this continues up to 8 threads, which matches the number of hyperthreads on my computer
after 8, wall starts to increase as well, because we don't have any extra CPUs to put more work into in a given amount of time! The ratio plateaus at this point.
Note that this graph is only so clear and simple because the work is purely CPU-bound: if it were memory bound, then we would get a fall in performance much earlier with less cores because the memory accesses would be a bottleneck as shown at What do the terms "CPU bound" and "I/O bound" mean?
Quickly checking that wall < user is a simple way to determine that a program is multithreaded, and the closer that ratio is to the number of cores, the more effective the parallelization is, e.g.:
multithreaded linkers: Can gcc use multiple cores when linking?
C++ parallel sort: Are C++17 Parallel Algorithms implemented already?
Sys heavy work with sendfile
The heaviest sys workload I could come up with was to use sendfile, which does a file copy operation in kernel space: Copy a file in a sane, safe and efficient way
So I imagined that this in-kernel memcpy would be a CPU-intensive operation.
First I initialize a large 10GiB random file with:
dd if=/dev/urandom of=sendfile.in.tmp bs=1K count=10M
Then run the code:
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv) {
    char *source_path, *dest_path;
    int source, dest;
    struct stat stat_source;

    if (argc > 1) {
        source_path = argv[1];
    } else {
        source_path = "sendfile.in.tmp";
    }
    if (argc > 2) {
        dest_path = argv[2];
    } else {
        dest_path = "sendfile.out.tmp";
    }
    source = open(source_path, O_RDONLY);
    assert(source != -1);
    dest = open(dest_path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    assert(dest != -1);
    assert(fstat(source, &stat_source) != -1);
    assert(sendfile(dest, source, 0, stat_source.st_size) != -1);
    assert(close(source) != -1);
    assert(close(dest) != -1);
    return EXIT_SUCCESS;
}
GitHub upstream.
which gives mostly system time, as expected:
real 0m2.175s
user 0m0.001s
sys 0m1.476s
I was also curious to see if time would distinguish between syscalls of different processes, so I tried:
time ./sendfile.out sendfile.in1.tmp sendfile.out1.tmp &
time ./sendfile.out sendfile.in2.tmp sendfile.out2.tmp &
And the result was:
real 0m3.651s
user 0m0.000s
sys 0m1.516s
real 0m4.948s
user 0m0.000s
sys 0m1.562s
The sys time is about the same for both as for a single process, but the wall time is larger, likely because the processes are competing for disk read access.
So it seems that the accounting does in fact attribute kernel work to the process that started it.
Bash source code
When you just run time <cmd> on Ubuntu, it uses the Bash keyword, as can be seen from:
type time
which outputs:
time is a shell keyword
So we grep the Bash 4.19 source code for the output string:
git grep '"user\b'
which leads us to execute_cmd.c function time_command, which uses:
gettimeofday() and getrusage() if both are available
times() otherwise
all of which are Linux system calls and POSIX functions.
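To make that concrete, here is a minimal sketch of how such a tool could assemble real, user and sys itself: a hypothetical mini-time that forks, execs the command, and combines gettimeofday with getrusage(RUSAGE_CHILDREN), roughly the gettimeofday/getrusage path described above:
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    struct timeval start, end;
    gettimeofday(&start, NULL);
    pid_t pid = fork();
    if (pid == 0) { /* child: run the command */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0); /* reaping the child makes its times visible */
    gettimeofday(&end, NULL);
    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru); /* user/sys of reaped children */
    double real = (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec) / 1e6;
    printf("real %.3fs user %.3fs sys %.3fs\n", real,
           ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6,
           ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6);
    return 0;
}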
GNU Coreutils source code
If we call it as:
/usr/bin/time
then it uses the GNU Coreutils implementation.
This one is a bit more complex, but the relevant source seems to be at resuse.c and it does:
a non-POSIX BSD wait3 call if that is available
times and gettimeofday otherwise
Real shows the total turnaround time of a process, while User shows the execution time of user-mode instructions and Sys shows the time spent executing system calls!
Real time also includes waiting time (waiting for I/O, etc.).
In very simple terms, I like to think about it like this:
real is the actual amount of time it took to run the command (as if you had timed it with a stopwatch)
user and sys are how much 'work' the CPU had to do to execute the command. This 'work' is expressed in units of time.
Generally speaking:
user is how much work the CPU did to run the command's code
sys is how much work the CPU had to do to handle 'system overhead' type tasks (such as allocating memory, file I/O, etc.) in order to support the running command
Since these last two times are counting 'work' done, they don't include time a thread might have spent waiting (such as waiting on another process or for disk I/O to finish).
real, however, is a measure of actual runtime and not 'work', so it does include any time spent waiting (which is why sometimes real > usr+sys).
And finally, sometimes the reverse is true (usr+sys > real) for multi-threaded applications. This also arises because we are comparing 'work-time' to actual time. For example, if 3 processors each run continuously for 10 minutes to execute a command, you will get real = 10m but usr = 30m.
I want to mention another scenario, where real time is much, much bigger than user + sys. I've created a simple server which responds after a long time:
real 4.784
user 0.01s
sys 0.01s
The issue is that in this scenario the process waits for the response, which counts neither as user nor as system time.
Something similar happens when you run the find command. In that case, the time is spent mostly on requesting and waiting for responses from the SSD.
I must mention that, at least on my AMD Ryzen CPU, user is always larger than real in multi-threaded programs (or single-threaded programs compiled with -O3).
For example:
real 0m5.815s
user 0m8.213s
sys 0m0.473s

Trouble single-stepping on an ARM machine [duplicate]

OK, this is a simple question. Does Android support PTRACE_SINGLESTEP in the ptrace system call? When I try to ptrace an Android APK program, I find that I can't do single-step tracing, but the situation changes when I use PTRACE_SYSCALL: that works perfectly. Does Android remove this function, or does ARM lack hardware support for it? Any help will be appreciated! Thanks.
this is my core program:
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <android/log.h>

/* TAG, target_pid and status were assumed by the original snippet;
   they are declared here so the example compiles. */
#define TAG "ptrace-monitor"

int main(int argc, char *argv[])
{
    pid_t target_pid;
    int status;

    if (argc != 2) {
        __android_log_print(ANDROID_LOG_DEBUG, TAG, "please input the pid!");
        return -1;
    }
    target_pid = atoi(argv[1]); /* parse the target pid, as the usage message implies */
    if (0 != ptrace(PTRACE_ATTACH, target_pid, NULL, NULL)) {
        __android_log_print(ANDROID_LOG_DEBUG, TAG, "ptrace attach error");
        return -1;
    }
    __android_log_print(ANDROID_LOG_DEBUG, TAG, "start monitor process :%d", target_pid);
    while (1) {
        wait(&status);
        if (WIFEXITED(status)) {
            break;
        }
        if (ptrace(PTRACE_SINGLESTEP, target_pid, 0, 0) != 0)
            __android_log_print(ANDROID_LOG_DEBUG, TAG, "PTRACE_SINGLESTEP attach error");
    }
    ptrace(PTRACE_DETACH, target_pid, NULL, NULL);
    __android_log_print(ANDROID_LOG_DEBUG, TAG, "monitor finished");
    return 0;
}
I run this program from the shell, and I can get root privileges.
If I change the request to PTRACE_SYSCALL, the program runs normally.
But if the request is PTRACE_SINGLESTEP, the program gets an error!
PTRACE_SINGLESTEP has been removed on ARM Linux since 2011, by this commit.
The hardware has no support for single-stepping; previous kernel support involved decoding the current instruction to figure out which one is next (for branches) and temporarily replacing it with a debug-break software breakpoint.
Quoting a mailing list message about the same commit, describing the old situation: http://lists.infradead.org/pipermail/linux-arm-kernel/2011-February/041324.html
PTRACE_SINGLESTEP is a ptrace request designed to offer single-stepping
support to userspace when the underlying architecture has hardware
support for this operation.
On ARM, we set arch_has_single_step() to 1 and attempt to emulate
hardware single-stepping by disassembling the current instruction to
determine the next pc and placing a software breakpoint on that
location.
Unfortunately this has the following problems:
Only a subset of ARMv7 instructions are supported
Thumb-2 is unsupported
The code is not SMP safe
We could try to fix this code, but it turns out that because of the
above issues it is rarely used in practice. GDB, for example, uses
PTRACE_POKETEXT and PTRACE_PEEKTEXT to manage breakpoints itself and
does not require any kernel assistance.
This patch removes the single-step emulation code from ptrace meaning
that the PTRACE_SINGLESTEP request will return -EIO on ARM. Portable
code must check the return value from a ptrace call and handle the
failure gracefully.
Signed-off-by: Will Deacon <will.deacon at arm.com>
---
The comments I received about v1 suggest that:
• If emulation is required, it is plausible to do it from userspace
• ltrace uses the SINGLESTEP call (conditionally at compile time, since other architectures, such as MIPS, do not support this request) but does not check the return value from ptrace. This is a bug in ltrace.
• strace does not use SINGLESTEP
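Following the "handle the failure gracefully" advice from the commit message, a tracer loop like the one in the question could probe for single-step support and degrade to PTRACE_SYSCALL. A hedged helper sketch (step_or_syscall is a made-up name, intended as a drop-in for the ptrace call in the loop above):
#include <errno.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>

static long step_or_syscall(pid_t pid) {
    errno = 0;
    long rc = ptrace(PTRACE_SINGLESTEP, pid, 0, 0);
    if (rc == -1 && errno == EIO) {
        /* EIO: single-stepping unsupported here (e.g. ARM since 2011) */
        fprintf(stderr, "PTRACE_SINGLESTEP unsupported, falling back to PTRACE_SYSCALL\n");
        rc = ptrace(PTRACE_SYSCALL, pid, 0, 0);
    }
    return rc;
}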

Using perf_event with the ARM PMU inside gem5

I know that the ARM PMU is partially implemented, thanks to the gem5 source code and some publications.
I have a binary which uses perf_event to access the PMU on a Linux-based OS, under an ARM processor. Could it use perf_event inside a gem5 full-system simulation with a Linux kernel, under the ARM ISA?
So far, I haven't found the right way to do it. If someone knows, I will be very grateful!
As of September 2020, gem5 needs to be patched in order to use the ARM PMU.
Edit: As of November 2020, gem5 is now patched and it will be included in the next release. Thanks to the developers!
How to patch gem5
This is not a clean patch, but it is straightforward and intended more to show how the mechanism works. Nonetheless, this is the patch to apply with git apply from the gem5 source repository:
diff --git i/src/arch/arm/ArmISA.py w/src/arch/arm/ArmISA.py
index 2641ec3fb..3d85c1b75 100644
--- i/src/arch/arm/ArmISA.py
+++ w/src/arch/arm/ArmISA.py
@@ -36,6 +36,7 @@
from m5.params import *
from m5.proxy import *
+from m5.SimObject import SimObject
from m5.objects.ArmPMU import ArmPMU
from m5.objects.ArmSystem import SveVectorLength
from m5.objects.BaseISA import BaseISA
@@ -49,6 +50,8 @@ class ArmISA(BaseISA):
cxx_class = 'ArmISA::ISA'
cxx_header = "arch/arm/isa.hh"
+ generateDeviceTree = SimObject.recurseDeviceTree
+
system = Param.System(Parent.any, "System this ISA object belongs to")
pmu = Param.ArmPMU(NULL, "Performance Monitoring Unit")
diff --git i/src/arch/arm/ArmPMU.py w/src/arch/arm/ArmPMU.py
index 047e908b3..58553fbf9 100644
--- i/src/arch/arm/ArmPMU.py
+++ w/src/arch/arm/ArmPMU.py
@@ -40,6 +40,7 @@ from m5.params import *
from m5.params import isNullPointer
from m5.proxy import *
from m5.objects.Gic import ArmInterruptPin
+from m5.util.fdthelper import *
class ProbeEvent(object):
def __init__(self, pmu, _eventId, obj, *listOfNames):
@@ -76,6 +77,17 @@ class ArmPMU(SimObject):
_events = None
+ def generateDeviceTree(self, state):
+ node = FdtNode("pmu")
+ node.appendCompatible("arm,armv8-pmuv3")
+ # gem5 uses GIC controller interrupt notation, where PPI interrupts
+ # start to 16. However, the Linux kernel start from 0, and used a tag
+ # (set to 1) to indicate the PPI interrupt type.
+ node.append(FdtPropertyWords("interrupts", [
+ 1, int(self.interrupt.num) - 16, 0xf04
+ ]))
+ yield node
+
def addEvent(self, newObject):
if not (isinstance(newObject, ProbeEvent)
or isinstance(newObject, SoftwareIncrement)):
diff --git i/src/cpu/BaseCPU.py w/src/cpu/BaseCPU.py
index ab70d1d7f..66a49a038 100644
--- i/src/cpu/BaseCPU.py
+++ w/src/cpu/BaseCPU.py
@@ -302,6 +302,11 @@ class BaseCPU(ClockedObject):
node.appendPhandle(phandle_key)
cpus_node.append(node)
+ # Generate nodes from the BaseCPU children (and don't add them as
+ # subnode). Please note: this is mainly needed for the ISA class.
+ for child_node in self.recurseDeviceTree(state):
+ yield child_node
+
yield cpus_node
def __init__(self, **kwargs):
What the patch resolves
The Linux kernel uses a Device Tree Blob (DTB), which is a regular file, to declare the hardware on which the kernel is running. This makes the kernel portable across different hardware without recompiling it for each hardware change. The DTB follows the Device Tree specification and is compiled from a Device Tree Source (DTS) file, a regular text file. You can learn more here and here.
The PMU is supposed to be declared to the Linux kernel via the DTB. You can learn more here and here. In a simulated system, because the system is specified by the user, gem5 has to generate a DTB itself to pass to the kernel, so the latter can recognize the simulated hardware. The problem is that gem5 did not generate the DTB entry for our PMU.
What the patch does
The patch adds entries to the ISA and CPU files so that DTB generation recurses down to the PMU. The hierarchy is: CPU => ISA => PMU. It then adds a generation function in the PMU that emits a DTB entry declaring the PMU, with the proper notation for the interrupt declaration in the kernel.
After running a simulation with our patch, we could see the DTS from the DTB like this:
cd m5out
# Decompile the DTB to get the DTS.
dtc -I dtb -O dts system.dtb > system.dts
# Find the PMU entry.
head system.dts
dtc is the Device Tree Compiler, installed with sudo apt-get install device-tree-compiler. We end up with this pmu DTB entry, under the root node (/):
/dts-v1/;
/ {
#address-cells = <0x02>;
#size-cells = <0x02>;
interrupt-parent = <0x05>;
compatible = "arm,vexpress";
model = "V2P-CA15";
arm,hbi = <0x00>;
arm,vexpress,site = <0x0f>;
memory@80000000 {
device_type = "memory";
reg = <0x00 0x80000000 0x01 0x00>;
};
pmu {
compatible = "arm,armv8-pmuv3";
interrupts = <0x01 0x04 0xf04>;
};
cpus {
#address-cells = <0x01>;
#size-cells = <0x00>;
cpu@0 {
device_type = "cpu";
compatible = "gem5,arm-cpu";
[...]
In the line interrupts = <0x01 0x04 0xf04>;, 0x01 indicates that the number 0x04 is the number of a PPI interrupt (the one declared with number 20 in gem5; the difference of 16 is explained in the patch code). The 0xf04 combines a flag (0x4) indicating an "active high level-sensitive" interrupt and a bit mask (0xf) indicating that the interrupt should be wired to all PEs attached to the GIC. You can learn more here.
If the patch works and your ArmPMU is declared properly, you should see this message at boot time:
[ 0.239967] hw perfevents: enabled with armv8_pmuv3 PMU driver, 32 counters available
Context
I was not able to use the Performance Monitoring Unit (PMU) because of an unimplemented feature in gem5. The reference on the mailing list can be found here. After a personal patch, the PMU is accessible through perf_event. Fortunately, a similar patch will be released in the official gem5 release soon, as can be seen here. The patch is described in another answer, due to the per-message link limit.
How to use the PMU
C source code
This is a minimal working example of a C source code using perf_event, used to count the number of mispredicted branches by the branch predictor unit during a specific task:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(int argc, char **argv) {
    /* File descriptor used to read the mispredicted branches counter. */
    static int perf_fd_branch_miss;
    /* Initialize our perf_event_attr, representing one counter to be read. */
    static struct perf_event_attr attr_branch_miss;
    attr_branch_miss.size = sizeof(attr_branch_miss);
    attr_branch_miss.exclude_kernel = 1;
    attr_branch_miss.exclude_hv = 1;
    attr_branch_miss.exclude_callchain_kernel = 1;
    /* On a real system, you can do it like this: */
    attr_branch_miss.type = PERF_TYPE_HARDWARE;
    attr_branch_miss.config = PERF_COUNT_HW_BRANCH_MISSES;
    /* On a gem5 system, you have to do it like this: */
    attr_branch_miss.type = PERF_TYPE_RAW;
    attr_branch_miss.config = 0x10;
    /* Open the file descriptor corresponding to this counter. The counter
       starts counting at this moment. */
    if ((perf_fd_branch_miss = syscall(__NR_perf_event_open, &attr_branch_miss, 0, -1, -1, 0)) == -1)
        fprintf(stderr, "perf_event_open fail %d %d: %s\n", perf_fd_branch_miss, errno, strerror(errno));

    /* Workload here, that means our specific task to profile. */

    /* Read and close the performance counter. */
    uint64_t counter_branch_miss = 0;
    read(perf_fd_branch_miss, &counter_branch_miss, sizeof(counter_branch_miss));
    close(perf_fd_branch_miss);
    /* Display the result. */
    printf("Number of mispredicted branches: %" PRIu64 "\n", counter_branch_miss);
    return 0;
}
I will not go into the details of how to use perf_event; good resources are available here, here, here, and here. However, a few notes about the code above:
On real hardware, when using perf_event with common events (events that are available on many architectures), it is recommended to use the perf_event macro PERF_TYPE_HARDWARE as the type, and macros like PERF_COUNT_HW_BRANCH_MISSES for the number of mispredicted branches, PERF_COUNT_HW_CACHE_MISSES for the number of cache misses, and so on (see the manual page for a list). This is a best practice for portable code.
On a gem5 simulated system, currently (v20.0), a C program has to use the PERF_TYPE_RAW type and an architectural event ID to identify an event. Here, 0x10 is the ID of the "0x0010, BR_MIS_PRED, Mispredicted or not predicted branch" event described in the ARMv8-A Reference Manual (here). The manual describes all events available on real hardware; however, they are not all implemented in gem5. To see the list of events implemented in gem5, refer to the src/arch/arm/ArmPMU.py file. In that file, the line self.addEvent(ProbeEvent(self,0x10, bpred, "Misses")) corresponds to the declaration of the counter described in the manual. This is not normal behavior, and gem5 should be patched to allow using PERF_TYPE_HARDWARE one day.
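Until such a patch exists, one hedged way to keep a single source portable across real hardware and a gem5 guest is a compile-time switch. GEM5_GUEST and select_branch_miss_event are made-up names; you would define the macro yourself (e.g. with -DGEM5_GUEST) when building the binary for the simulation:
#include <linux/perf_event.h>

static void select_branch_miss_event(struct perf_event_attr *attr) {
#ifdef GEM5_GUEST
    /* gem5 (v20.0): raw ARMv8 event ID, see src/arch/arm/ArmPMU.py */
    attr->type = PERF_TYPE_RAW;
    attr->config = 0x10; /* 0x0010, BR_MIS_PRED */
#else
    /* real hardware: the portable generic event */
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_BRANCH_MISSES;
#endif
}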
gem5 simulation script
This is not an entire MWE script (it would be too long!), only the portion needed inside a full-system script to use the PMU. We use an ArmSystem as the system, with the RealView platform.
For each ISA (we use an ARM ISA here) of each CPU (e.g., a DerivO3CPU) in our cluster (which is a SubSystem class), we add a PMU with a unique interrupt number and the already-implemented architectural events. An example of this function can be found in configs/example/arm/devices.py.
To choose an interrupt number, pick a free PPI interrupt in the platform interrupt mapping. Here, we choose PPI n°20, according to the RealView interrupt map (src/dev/arm/RealView.py). Since PPI interrupts are local to each Processing Element (PE, which corresponds to a core in our context), the interrupt number can be the same for all PEs without any conflict. To learn more about PPI interrupts, see the GIC guide from ARM here.
Here, we can see that the interrupt n°20 is not used by the system (from RealView.py):
Interrupts:
0- 15: Software generated interrupts (SGIs)
16- 31: On-chip private peripherals (PPIs)
25 : vgic
26 : generic_timer (hyp)
27 : generic_timer (virt)
28 : Reserved (Legacy FIQ)
We pass our system components (dtb, itb, etc.) to addArchEvents to link the PMU with them; the PMU will then expose the internal counters (called probes) of these components as counters to the system.
for cpu in system.cpu_cluster.cpus:
    for isa in cpu.isa:
        isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
        # Add the implemented architectural events of gem5. We can
        # discover which events are implemented by looking at the file
        # "ArmPMU.py".
        isa.pmu.addArchEvents(
            cpu=cpu, dtb=cpu.dtb, itb=cpu.itb,
            icache=getattr(cpu, "icache", None),
            dcache=getattr(cpu, "dcache", None),
            l2cache=getattr(system.cpu_cluster, "l2", None))
Two quick additions to Pierre's awesome answers:
for fs.py as of gem5 937241101fae2cd0755c43c33bab2537b47596a2, all that is missing is to apply to fs.py as shown at: https://gem5-review.googlesource.com/c/public/gem5/+/37978/1/configs/example/fs.py
for cpu in test_sys.cpu:
    if buildEnv['TARGET_ISA'] in "arm":
        for isa in cpu.isa:
            isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
            isa.pmu.addArchEvents(
                cpu=cpu, dtb=cpu.mmu.dtb, itb=cpu.mmu.itb,
                icache=getattr(cpu, "icache", None),
                dcache=getattr(cpu, "dcache", None),
                l2cache=getattr(test_sys, "l2", None))
a C example can also be found in man perf_event_open

Pthreads & Multicore compiler

I'm working with an SMP-enabled kernel: SnapGear 2.6.21.
I have created 4 threads in my C application, and I am trying to set thread 1 to run on CPU 1, thread 2 on CPU 2, etc.
However, the compiler sparc-linux-gcc does not recognize these functions:
CPU_SET (int cpu, cpu_set_t * set);
CPU_ZERO (cpu_set_t * set);
and this type: cpu_set_t
It always gives me these errors:
implicit declaration of function 'CPU_ZERO'
implicit declaration of function 'CPU_SET'
'cpu_set_t' undeclared (first use in this function)
Here is my code to bind active thread to processor 0:
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask); /* bind to processor 0 */
sched_setaffinity(0, sizeof(mask), &mask);
I have included and defined at the top:
#define _GNU_SOURCE
#include <sched.h>
But I always get the same errors. Can you help me, please?
You should read sched_setaffinity(2) carefully and test its result (and display errno on failure, e.g. with perror).
Actually, I believe you should use pthread_setaffinity_np(3) instead (and of course test its failure, etc...)
Even more, I believe that you should not bother to explicitly set the affinity. Recent Linux kernels are often quite good at dispatching running threads on different CPUs.
So simply use pthreads and don't bother about affinity, unless you see actual issues when benchmarking.
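If you do determine that pinning helps, here is a minimal sketch with pthread_setaffinity_np(3) (a GNU extension, hence the _np suffix; it assumes glibc and requires _GNU_SOURCE before the headers). Each thread pins itself to the CPU number passed as its argument:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    int cpu = *(int *)arg;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask); /* pin the calling thread to one CPU */
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np: %d\n", rc); /* returns an errno value, does not set errno */
    printf("thread asked for CPU %d, running on CPU %d\n", cpu, sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    int cpus[4] = {0, 1, 2, 3};
    for (int i = 0; i < 4; ++i)
        pthread_create(&threads[i], NULL, worker, &cpus[i]);
    for (int i = 0; i < 4; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
Compile with gcc -pthread; always check the return values, as the answer advises, since affinity calls can fail (e.g. for CPU numbers that do not exist on the machine).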
BTW, passing the -H flag to your GCC (cross-)compiler could be helpful. It shows you the included files. Perhaps also look into the preprocessed form obtained with gcc -C -E ; it looks like some header files are missing or not found (maybe some missing -I include-directory at compilation time, or some missing headers on your development system)
BTW, your kernel version looks ancient. Can't you upgrade your kernel to something newer (3.15.x or some 3.y)?
