Why is the P-state status MSR on Ryzen not changing? - linux

I am trying to detect the current P-state of my CPU. I have noticed that the P-state status MSR (C001_0063) always returns 2 on my Ryzen 1700X system, even if the core is clearly not in that state. I think it used to work with the initial BIOS (v0403) that my motherboard came with, but that is not available for download anymore [1].
My CPU is overclocked [2] to 3.8GHz. I used cpufreq-set to fix the speed and cpufreq-info to verify (an example invocation is sketched after the output below):
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 4294.55 ms.
hardware limits: 2.20 GHz - 3.80 GHz
available frequency steps: 3.80 GHz, 2.20 GHz
available cpufreq governors: ondemand, conservative, performance, schedutil
current policy: frequency should be within 3.80 GHz and 3.80 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 3.80 GHz (asserted by call to hardware).
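For reference, a cpufrequtils invocation along these lines pins one core; this is a sketch using the values from the output above, not necessarily the exact command that was used:
# Pin core 0 to the 3.8 GHz step with the performance governor (repeat per core).
cpufreq-set -c 0 -d 3.8GHz -u 3.8GHz -g performance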
Following is a little test program that shows the value of the register for core #0, along with the effective speed relative to the P0 state. It needs root privileges. For me, it constantly prints pstate: 2, speed: 99% under load.
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char** argv)
{
    uint64_t aperf_old = 0;
    uint64_t mperf_old = 0;
    int fd;

    /* Requires the msr module and root privileges. */
    fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd == -1) {
        perror("open /dev/cpu/0/msr");
        return 1;
    }

    /* P-state current limit register. */
    uint64_t pstate_limits;
    pread(fd, &pstate_limits, sizeof(pstate_limits), 0xC0010061);
    printf("pstate ranges: %d to %d\n", (int)(pstate_limits & 0x07), (int)((pstate_limits >> 4) & 0x07));

    for(;;)
    {
        uint64_t pstate;
        uint64_t pstate_req;
        uint64_t aperf;
        uint64_t mperf;

        pread(fd, &pstate_req, sizeof(pstate_req), 0xC0010062); /* P-state control (requested) */
        pread(fd, &pstate, sizeof(pstate), 0xC0010063);         /* P-state status (current) */
        pread(fd, &aperf, sizeof(aperf), 0x000000E8);           /* APERF */
        pread(fd, &mperf, sizeof(mperf), 0x000000E7);           /* MPERF */

        printf("pstate: %d, requested: %d", (int)(pstate & 0x07), (int)(pstate_req & 0x07));
        if (mperf_old != 0 && mperf_old != mperf)
        {
            /* The APERF/MPERF delta ratio gives the effective speed relative to P0. */
            printf(", speed: %d%%", (int)(100 * (aperf - aperf_old) / (mperf - mperf_old)));
        }
        putchar('\n');
        mperf_old = mperf;
        aperf_old = aperf;
        sleep(1);
    }
}
A similar approach used to work on my FX-8350. What am I doing wrong? Test results also welcome.
System information:
CPU: Ryzen 1700X, P0 & P1 are 3.8GHz [3], P2 is 2.2GHz
Motherboard: Asus Prime X370-A, BIOS 3401
Operating system: Debian 7.1, kernel 4.9.0
Update: I have changed the code to print the requested P-state, and that register is changing as expected. The actual CPU speed is changing too, as confirmed by various benchmarks.
[1] For some obscure reason, the BIOS backup function is disabled, so I couldn't make a copy before updating.
[2] I will run a test at defaults when I get a chance.
[3] No idea why it's duplicated.

Per-core (in reality: per CCD) P-States are in the undocumented MSR c0010293, bits 22:24, I think.
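For anyone who wants to poke at it, here is a minimal sketch of reading that register through the msr driver for core 0. Both the register address and the 24:22 bit position come from the comment above and are undocumented, so treat them as assumptions:
/* Sketch: read the undocumented MSR 0xC0010293 and extract bits 24:22
 * as suggested above. Layout and bit position are assumptions. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd == -1) {
        perror("open /dev/cpu/0/msr (needs root and the msr module)");
        return 1;
    }
    uint64_t value;
    if (pread(fd, &value, sizeof(value), 0xC0010293) != sizeof(value)) {
        perror("pread");
        return 1;
    }
    printf("raw: %#018llx, bits 24:22: %d\n",
           (unsigned long long)value, (int)((value >> 22) & 0x7));
    close(fd);
    return 0;
}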

It may not be related, but I have heard that some people have successfully had their Ryzen 7 chips replaced by AMD because of P-state related stability issues on Unix and Unix-like systems. Commentary on the matter, especially on the ASRock forums, points towards a driver/firmware issue, although I have read elsewhere that AMD may be taking some responsibility for early batches of chips. That said, from what I know of P- and C-states, some states are requested by the operating system or the host virtualisation environment, so perhaps a patch could be done from within?

Related

perf stat time elapsed for sys and user [duplicate]

$ time foo
real 0m0.003s
user 0m0.000s
sys 0m0.004s
$
What do real, user and sys mean in the output of time? Which one is meaningful when benchmarking my app?
Real, User and Sys process time statistics
One of these things is not like the other. Real refers to actual elapsed time; User and Sys refer to CPU time used only by the process.
Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).
User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.
Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.
User+Sys will tell you how much actual CPU time your process used. Note that this is across all CPUs, so if the process has multiple threads (and this process is running on a computer with more than one processor) it could potentially exceed the wall clock time reported by Real (which usually occurs). Note that in the output these figures include the User and Sys time of all child processes (and their descendants) as well when they could have been collected, e.g. by wait(2) or waitpid(2), although the underlying system calls return the statistics for the process and its children separately.
Origins of the statistics reported by time(1)
The statistics reported by time are gathered from various system calls. 'User' and 'Sys' come from wait(2) (POSIX) or times(2) (POSIX), depending on the particular system. 'Real' is calculated from a start and end time gathered from the gettimeofday(2) call. Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by time.
On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel. Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors, as the example given by the original poster shows.
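To make the origins concrete, here is a rough sketch of how a time-like wrapper can gather the three numbers with gettimeofday(2) and times(2). The measured command is hard-coded to /bin/sleep 1 purely for illustration, so it should report roughly 1s of real time and almost no user or sys time:
/* Rough sketch of time(1): run a child command, report real from
 * gettimeofday() and user/sys for the waited-for child from times(). */
#include <stdio.h>
#include <sys/types.h>
#include <sys/times.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    struct timeval start, end;
    struct tms t;
    long ticks = sysconf(_SC_CLK_TCK);

    gettimeofday(&start, NULL);
    pid_t pid = fork();
    if (pid == 0) {
        execl("/bin/sleep", "sleep", "1", (char *)NULL); /* example command */
        _exit(127);
    }
    waitpid(pid, NULL, 0);
    gettimeofday(&end, NULL);
    times(&t); /* tms_cutime/tms_cstime cover waited-for children */

    double real = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("real %.3fs user %.3fs sys %.3fs\n",
           real,
           (double)t.tms_cutime / ticks,
           (double)t.tms_cstime / ticks);
    return 0;
}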
A brief primer on Kernel vs. User mode
On Unix, or any protected-memory operating system, 'Kernel' or 'Supervisor' mode refers to a privileged mode that the CPU can operate in. Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code. An example of such an action might be manipulation of the MMU to gain access to the address space of another process. Normally, user-mode code cannot do this (with good reason), although it can request shared memory from the kernel, which could be read or written by more than one process. In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.
The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode. In order to switch to kernel mode you have to issue a specific instruction (often called a trap) that switches the CPU to running in kernel mode and runs code from a specific location held in a jump table. For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode. You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.
The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program. Behind the scenes, they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode. It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call.
More about 'sys'
There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.). These are under the supervision of the kernel, and it alone can do them. Some operations like malloc or fread/fwrite will invoke these kernel functions and that then will count as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to malloc will do some processing of its own (still counted in 'user' time) and then somewhere along the way it may call the function in the kernel (counted in 'sys' time). After returning from the kernel call, there will be some more time in 'user' and then malloc will return to your code. As for when the switch happens, and how much of it is spent in kernel mode... you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might also use malloc and the like in the background, which will again have some time in 'sys' then.
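If you want to observe the user/sys split from inside a process rather than through time, getrusage(2) exposes the same counters. A small sketch, where the write() loop to /dev/null is just an arbitrary way to accumulate some kernel time:
/* Print this process's own user and sys time via getrusage(). The
 * writes to /dev/null only exist to accumulate some 'sys' time and the
 * empty loop to accumulate some 'user' time. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void) {
    int fd = open("/dev/null", O_WRONLY);
    char buf[4096] = {0};
    for (int i = 0; i < 100000; ++i)
        write(fd, buf, sizeof(buf));        /* mostly 'sys' */
    for (volatile long i = 0; i < 100000000; ++i)
        ;                                   /* mostly 'user' */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("user %ld.%06lds sys %ld.%06lds\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    close(fd);
    return 0;
}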
To expand on the accepted answer, I just wanted to provide another reason why real ≠ user + sys.
Keep in mind that real represents actual elapsed time, while user and sys values represent CPU execution time. As a result, on a multicore system, the user and/or sys time (as well as their sum) can actually exceed the real time. For example, on a Java app I'm running for class I get this set of values:
real 1m47.363s
user 2m41.318s
sys 0m4.013s
• real: The actual time spent in running the process from start to finish, as if it was measured by a human with a stopwatch
• user: The cumulative time spent by all the CPUs during the computation
• sys: The cumulative time spent by all the CPUs during system-related tasks such as memory allocation.
Notice that sometimes user + sys might be greater than real, as
multiple processors may work in parallel.
Minimal runnable POSIX C examples
To make things more concrete, I want to exemplify a few extreme cases of time with some minimal C test programs.
All programs can be compiled and run with:
gcc -ggdb3 -o main.out -pthread -std=c99 -pedantic-errors -Wall -Wextra main.c
time ./main.out
and have been tested in Ubuntu 18.10, GCC 8.2.0, glibc 2.28, Linux kernel 4.18, ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB).
sleep syscall
Non-busy sleep, as done by the sleep() call, only counts towards real, not user or sys.
For example, a program that sleeps for a second:
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>
int main(void) {
    sleep(1);
    return EXIT_SUCCESS;
}
GitHub upstream.
outputs something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
The same holds for programs blocked on IO becoming available.
For example, the following program waits for the user to enter a character and press enter:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    printf("%c\n", getchar());
    return EXIT_SUCCESS;
}
GitHub upstream.
And if you wait for about one second before pressing enter, it outputs, just like the sleep example, something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
For this reason time can help you distinguish between CPU and IO bound programs: What do the terms "CPU bound" and "I/O bound" mean?
Multiple threads
The following example does niters iterations of useless purely CPU-bound work on nthreads threads:
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
uint64_t niters;
void* my_thread(void *arg) {
    uint64_t *argument, i, result;
    argument = (uint64_t *)arg;
    result = *argument;
    for (i = 0; i < niters; ++i) {
        result = (result * result) - (3 * result) + 1;
    }
    *argument = result;
    return NULL;
}

int main(int argc, char **argv) {
    size_t nthreads;
    pthread_t *threads;
    uint64_t rc, i, *thread_args;

    /* CLI args. */
    if (argc > 1) {
        niters = strtoll(argv[1], NULL, 0);
    } else {
        niters = 1000000000;
    }
    if (argc > 2) {
        nthreads = strtoll(argv[2], NULL, 0);
    } else {
        nthreads = 1;
    }
    threads = malloc(nthreads * sizeof(*threads));
    thread_args = malloc(nthreads * sizeof(*thread_args));

    /* Create all threads */
    for (i = 0; i < nthreads; ++i) {
        thread_args[i] = i;
        rc = pthread_create(
            &threads[i],
            NULL,
            my_thread,
            (void*)&thread_args[i]
        );
        assert(rc == 0);
    }

    /* Wait for all threads to complete */
    for (i = 0; i < nthreads; ++i) {
        rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
        printf("%" PRIu64 " %" PRIu64 "\n", i, thread_args[i]);
    }
    free(threads);
    free(thread_args);
    return EXIT_SUCCESS;
}
GitHub upstream + plot code.
Then we plot wall, user and sys as a function of the number of threads for a fixed 10^10 iterations on my 8-hyperthread CPU:
Plot: https://i.stack.imgur.com/qAfEe.png (plot data linked upstream).
From the graph, we see that:
• for a CPU-intensive single-core application, wall and user are about the same
• for 2 cores, user is about 2x wall, which means that the user time is counted across all threads: user basically doubles while wall stays the same
• this continues up to 8 threads, which matches the number of hyperthreads on my computer
• after 8, wall starts to increase as well, because we don't have any extra CPUs to put more work into in a given amount of time; the ratio plateaus at this point
Note that this graph is only so clear and simple because the work is purely CPU-bound: if it were memory-bound, we would see a fall in performance much earlier, with fewer cores, because the memory accesses would be a bottleneck, as shown at What do the terms "CPU bound" and "I/O bound" mean?
Quickly checking that wall < user is a simple way to determine that a program is multithreaded, and the closer that ratio is to the number of cores, the more effective the parallelization is, e.g.:
multithreaded linkers: Can gcc use multiple cores when linking?
C++ parallel sort: Are C++17 Parallel Algorithms implemented already?
Sys heavy work with sendfile
The heaviest sys workload I could come up with was to use sendfile, which does the file copy in kernel space: Copy a file in a sane, safe and efficient way
So I imagined that this in-kernel memcpy would be a CPU-intensive operation.
First I initialize a large 10GiB random file with:
dd if=/dev/urandom of=sendfile.in.tmp bs=1K count=10M
Then run the code:
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
int main(int argc, char **argv) {
    char *source_path, *dest_path;
    int source, dest;
    struct stat stat_source;
    if (argc > 1) {
        source_path = argv[1];
    } else {
        source_path = "sendfile.in.tmp";
    }
    if (argc > 2) {
        dest_path = argv[2];
    } else {
        dest_path = "sendfile.out.tmp";
    }
    source = open(source_path, O_RDONLY);
    assert(source != -1);
    dest = open(dest_path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    assert(dest != -1);
    assert(fstat(source, &stat_source) != -1);
    assert(sendfile(dest, source, 0, stat_source.st_size) != -1);
    assert(close(source) != -1);
    assert(close(dest) != -1);
    return EXIT_SUCCESS;
}
GitHub upstream.
which gives mostly system time, as expected:
real 0m2.175s
user 0m0.001s
sys 0m1.476s
I was also curious to see if time would distinguish between syscalls of different processes, so I tried:
time ./sendfile.out sendfile.in1.tmp sendfile.out1.tmp &
time ./sendfile.out sendfile.in2.tmp sendfile.out2.tmp &
And the result was:
real 0m3.651s
user 0m0.000s
sys 0m1.516s
real 0m4.948s
user 0m0.000s
sys 0m1.562s
The sys time is about the same for each as for a single process, but the wall time is larger, likely because the processes are competing for disk read access.
So it seems that the accounting does attribute kernel work to the process that started it.
Bash source code
When you run just time <cmd> on Ubuntu, it uses the Bash time keyword, as can be seen from:
type time
which outputs:
time is a shell keyword
So we grep the Bash 4.19 source code for the output string:
git grep '"user\b'
which leads us to execute_cmd.c function time_command, which uses:
• gettimeofday() and getrusage() if both are available
• times() otherwise
all of which are Linux system calls and POSIX functions.
GNU Coreutils source code
If we call it as:
/usr/bin/time
then it uses the GNU Coreutils implementation.
This one is a bit more complex, but the relevant source seems to be at resuse.c and it does:
• a non-POSIX BSD wait3 call if that is available
• times and gettimeofday otherwise
Real shows the total turnaround time for a process, while User shows the execution time of user-mode instructions and Sys is the time spent executing system calls. Real time also includes waiting time (waiting for I/O, etc.).
In very simple terms, I like to think about it like this:
real is the actual amount of time it took to run the command (as if you had timed it with a stopwatch)
user and sys are how much 'work' the CPU had to do to execute the command. This 'work' is expressed in units of time.
Generally speaking:
user is how much work the CPU did to run the command's code
sys is how much work the CPU had to do to handle 'system overhead' type tasks (such as allocating memory, file I/O, etc.) in order to support the running command
Since these last two times are counting 'work' done, they don't include time a thread might have spent waiting (such as waiting on another process or for disk I/O to finish).
real, however, is a measure of actual runtime and not 'work', so it does include any time spent waiting (which is why sometimes real > usr+sys).
And finally, sometimes the reverse is true (usr+sys > real) for multi-threaded applications. This also arises because we are comparing 'work-time' to actual time. For example, if 3 processors each run continuously for 10 minutes to execute a command, you will get real = 10m but usr = 30m.
I want to mention another scenario, where the real time is much bigger than user + sys. I've created a simple server which responds after a long time:
real 4.784
user 0.01s
sys 0.01s
The issue is that in this scenario the process waits for the response, which is neither on the user side nor in the system.
Something similar happens when you run the find command. In that case, the time is mostly spent requesting and getting a response from the SSD.
I must mention that, at least on my AMD Ryzen CPU, user is always larger than real in multi-threaded programs (or single-threaded programs compiled with -O3), e.g.:
real 0m5.815s
user 0m8.213s
sys 0m0.473s

Using perf_event with the ARM PMU inside gem5

I know that the ARM PMU is partially implemented, thanks to the gem5 source code and some publications.
I have a binary which uses perf_event to access the PMU on a Linux-based OS, under an ARM processor. Could it use perf_event inside a gem5 full-system simulation with a Linux kernel, under the ARM ISA?
So far, I haven't found the right way to do it. If someone knows, I will be very grateful!
As of September 2020, gem5 needs to be patched in order to use the ARM PMU.
Edit: As of November 2020, gem5 is now patched and it will be included in the next release. Thanks to the developers!
How to patch gem5
This is not a clean patch (it is very straightforward), and it is intended more as a way to understand how things work. Nonetheless, this is the patch to apply with git apply from the gem5 source repository:
diff --git i/src/arch/arm/ArmISA.py w/src/arch/arm/ArmISA.py
index 2641ec3fb..3d85c1b75 100644
--- i/src/arch/arm/ArmISA.py
+++ w/src/arch/arm/ArmISA.py
@@ -36,6 +36,7 @@
from m5.params import *
from m5.proxy import *
+from m5.SimObject import SimObject
from m5.objects.ArmPMU import ArmPMU
from m5.objects.ArmSystem import SveVectorLength
from m5.objects.BaseISA import BaseISA
@@ -49,6 +50,8 @@ class ArmISA(BaseISA):
cxx_class = 'ArmISA::ISA'
cxx_header = "arch/arm/isa.hh"
+ generateDeviceTree = SimObject.recurseDeviceTree
+
system = Param.System(Parent.any, "System this ISA object belongs to")
pmu = Param.ArmPMU(NULL, "Performance Monitoring Unit")
diff --git i/src/arch/arm/ArmPMU.py w/src/arch/arm/ArmPMU.py
index 047e908b3..58553fbf9 100644
--- i/src/arch/arm/ArmPMU.py
+++ w/src/arch/arm/ArmPMU.py
@@ -40,6 +40,7 @@ from m5.params import *
from m5.params import isNullPointer
from m5.proxy import *
from m5.objects.Gic import ArmInterruptPin
+from m5.util.fdthelper import *
class ProbeEvent(object):
def __init__(self, pmu, _eventId, obj, *listOfNames):
@@ -76,6 +77,17 @@ class ArmPMU(SimObject):
_events = None
+ def generateDeviceTree(self, state):
+ node = FdtNode("pmu")
+ node.appendCompatible("arm,armv8-pmuv3")
+ # gem5 uses GIC controller interrupt notation, where PPI interrupts
+ # start to 16. However, the Linux kernel start from 0, and used a tag
+ # (set to 1) to indicate the PPI interrupt type.
+ node.append(FdtPropertyWords("interrupts", [
+ 1, int(self.interrupt.num) - 16, 0xf04
+ ]))
+ yield node
+
def addEvent(self, newObject):
if not (isinstance(newObject, ProbeEvent)
or isinstance(newObject, SoftwareIncrement)):
diff --git i/src/cpu/BaseCPU.py w/src/cpu/BaseCPU.py
index ab70d1d7f..66a49a038 100644
--- i/src/cpu/BaseCPU.py
+++ w/src/cpu/BaseCPU.py
@@ -302,6 +302,11 @@ class BaseCPU(ClockedObject):
node.appendPhandle(phandle_key)
cpus_node.append(node)
+ # Generate nodes from the BaseCPU children (and don't add them as
+ # subnode). Please note: this is mainly needed for the ISA class.
+ for child_node in self.recurseDeviceTree(state):
+ yield child_node
+
yield cpus_node
def __init__(self, **kwargs):
What the patch resolves
The Linux kernel uses a Device Tree Blob (DTB), which is a regular file, to declare the hardware on which the kernel is running. This is used to make the kernel portable between different architectures without recompiling it for each hardware change. The DTB follows the Device Tree reference, and is compiled from a Device Tree Source (DTS) file, a regular text file. You can learn more here and here.
The problem was that the PMU is supposed to be declared to the Linux kernel via the DTB. You can learn more here and here. In a simulated system, because the system is specified by the user, gem5 has to generate a DTB itself to pass to the kernel, so the latter can recognize the simulated hardware. However, gem5 did not generate the DTB entry for our PMU.
What the patch does
The patch adds an entry to the ISA and CPU files so that the DTB generation recurses down to the PMU; the hierarchy is CPU => ISA => PMU. It then adds a generation function to the PMU that emits a DTB entry declaring the PMU, with the proper notation for the interrupt declaration expected by the kernel.
After running a simulation with our patch, we could see the DTS from the DTB like this:
cd m5out
# Decompile the DTB to get the DTS.
dtc -I dtb -O dts system.dtb > system.dts
# Find the PMU entry.
head system.dts
dtc is the Device Tree Compiler, installed with sudo apt-get install device-tree-compiler. We end up with this pmu DTB entry, under the root node (/):
/dts-v1/;

/ {
    #address-cells = <0x02>;
    #size-cells = <0x02>;
    interrupt-parent = <0x05>;
    compatible = "arm,vexpress";
    model = "V2P-CA15";
    arm,hbi = <0x00>;
    arm,vexpress,site = <0x0f>;

    memory@80000000 {
        device_type = "memory";
        reg = <0x00 0x80000000 0x01 0x00>;
    };

    pmu {
        compatible = "arm,armv8-pmuv3";
        interrupts = <0x01 0x04 0xf04>;
    };

    cpus {
        #address-cells = <0x01>;
        #size-cells = <0x00>;

        cpu@0 {
            device_type = "cpu";
            compatible = "gem5,arm-cpu";
            [...]
In the line interrupts = <0x01 0x04 0xf04>;, 0x01 indicates that the number 0x04 is a PPI interrupt number (the one declared with number 20 in gem5; the difference of 16 is explained inside the patch code). The 0xf04 corresponds to a flag (0x4) indicating that it is an "active high level-sensitive" interrupt and a bit mask (0xf) indicating that the interrupt should be wired to all PEs attached to the GIC. You can learn more here.
If the patch works and your ArmPMU is declared properly, you should see this message at boot time:
[ 0.239967] hw perfevents: enabled with armv8_pmuv3 PMU driver, 32 counters available
Context
I was not able to use the Performance Monitoring Unit (PMU) because of an unimplemented feature in gem5. The reference on the mailing list can be found here. After a personal patch, the PMU is accessible through perf_event. Fortunately, a similar patch will land in the next official gem5 release, as can be seen here. The patch is described in another answer, due to the link limit in a single message.
How to use the PMU
C source code
This is a minimal working example of a C source code using perf_event, used to count the number of mispredicted branches by the branch predictor unit during a specific task:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(int argc, char **argv) {
    /* File descriptor used to read the mispredicted-branches counter. */
    static int perf_fd_branch_miss;
    /* Initialize our perf_event_attr, representing one counter to be read. */
    static struct perf_event_attr attr_branch_miss;
    attr_branch_miss.size = sizeof(attr_branch_miss);
    attr_branch_miss.exclude_kernel = 1;
    attr_branch_miss.exclude_hv = 1;
    attr_branch_miss.exclude_callchain_kernel = 1;
    /* On a real system, you can do it like this: */
    attr_branch_miss.type = PERF_TYPE_HARDWARE;
    attr_branch_miss.config = PERF_COUNT_HW_BRANCH_MISSES;
    /* On a gem5 system, you have to do it like this: */
    attr_branch_miss.type = PERF_TYPE_RAW;
    attr_branch_miss.config = 0x10;
    /* Open the file descriptor corresponding to this counter. The counter
       should start at this moment. */
    if ((perf_fd_branch_miss = syscall(__NR_perf_event_open, &attr_branch_miss, 0, -1, -1, 0)) == -1)
        fprintf(stderr, "perf_event_open fail %d %d: %s\n", perf_fd_branch_miss, errno, strerror(errno));

    /* Workload here, that means our specific task to profile. */

    /* Get and close the performance counters. */
    uint64_t counter_branch_miss = 0;
    read(perf_fd_branch_miss, &counter_branch_miss, sizeof(counter_branch_miss));
    close(perf_fd_branch_miss);
    /* Display the result. */
    printf("Number of mispredicted branches: %" PRIu64 "\n", counter_branch_miss);
}
I will not go into the details of how to use perf_event; good resources are available here, here, here, and here. However, a few notes about the code above:
• On real hardware, when using perf_event with common events (events that are available on many architectures), it is recommended to use the perf_event macro PERF_TYPE_HARDWARE as the type, together with macros like PERF_COUNT_HW_BRANCH_MISSES for the number of mispredicted branches, PERF_COUNT_HW_CACHE_MISSES for the number of cache misses, and so on (see the manual page for a list). This is a best practice to keep the code portable.
• On a gem5 simulated system, currently (v20.0), C source code has to use the PERF_TYPE_RAW type and an architectural event ID to identify an event. Here, 0x10 is the ID of the "0x0010, BR_MIS_PRED, Mispredicted or not predicted branch" event described in the ARMv8-A Reference Manual (here). The manual describes all events available on real hardware; however, they are not all implemented in gem5. To see the list of events implemented in gem5, refer to the src/arch/arm/ArmPMU.py file. In that file, the line self.addEvent(ProbeEvent(self,0x10, bpred, "Misses")) corresponds to the declaration of the counter described in the manual. This is not normal behavior, so gem5 should one day be patched to allow using PERF_TYPE_HARDWARE.
gem5 simulation script
This is not an entire MWE script (it would be too long!), only the portion that needs to be added to a full-system script in order to use the PMU. We use an ArmSystem as the system, with the RealView platform.
For each ISA (we use an ARM ISA here) of each CPU (e.g., a DerivO3CPU) in our cluster (which is a SubSystem class), we add a PMU with a unique interrupt number and the already-implemented architectural events. An example of this can be found in configs/example/arm/devices.py.
To choose an interrupt number, pick a free PPI interrupt in the platform interrupt mapping. Here, we choose PPI n°20, according to the RealView interrupt map (src/dev/arm/RealView.py). Since PPI interrupts are local to each Processing Element (PE, which corresponds to a core in our context), the interrupt number can be the same for all PEs without any conflict. To know more about PPI interrupts, see the GIC guide from ARM here.
Here, we can see that the interrupt n°20 is not used by the system (from RealView.py):
Interrupts:
0- 15: Software generated interrupts (SGIs)
16- 31: On-chip private peripherals (PPIs)
25 : vgic
26 : generic_timer (hyp)
27 : generic_timer (virt)
28 : Reserved (Legacy FIQ)
We pass our system components (dtb, itb, etc.) to addArchEvents to link the PMU with them, so that the PMU exposes the internal counters (called probes) of these components as counters visible to the system.
for cpu in system.cpu_cluster.cpus:
    for isa in cpu.isa:
        isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
        # Add the implemented architectural events of gem5. We can
        # discover which events are implemented by looking at the file
        # "ArmPMU.py".
        isa.pmu.addArchEvents(
            cpu=cpu, dtb=cpu.dtb, itb=cpu.itb,
            icache=getattr(cpu, "icache", None),
            dcache=getattr(cpu, "dcache", None),
            l2cache=getattr(system.cpu_cluster, "l2", None))
Two quick additions to Pierre's awesome answers:
for fs.py as of gem5 937241101fae2cd0755c43c33bab2537b47596a2, all that is missing is to apply to fs.py as shown at: https://gem5-review.googlesource.com/c/public/gem5/+/37978/1/configs/example/fs.py
for cpu in test_sys.cpu:
    if buildEnv['TARGET_ISA'] in "arm":
        for isa in cpu.isa:
            isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
            isa.pmu.addArchEvents(
                cpu=cpu, dtb=cpu.mmu.dtb, itb=cpu.mmu.itb,
                icache=getattr(cpu, "icache", None),
                dcache=getattr(cpu, "dcache", None),
                l2cache=getattr(test_sys, "l2", None))
A C example can also be found in man perf_event_open.
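For convenience, here is a condensed variant of that man-page example, counting retired instructions around a printf; on gem5 you would swap in the PERF_TYPE_RAW setup shown in the answer above:
/* Condensed from the perf_event_open(2) man page example: count retired
 * instructions executed between the enable/disable ioctls. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    int fd = syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);
    if (fd == -1) {
        perror("perf_event_open");
        return 1;
    }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("Measuring instruction count for this printf\n");

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("Used %llu instructions\n", (unsigned long long)count);
    close(fd);
    return 0;
}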

Number of cores rocket-chip

I am running Linux on top of Spike and the rocket-chip. In order to evaluate a program, I am trying to get the number of cores configured in Spike and the rocket-chip. I already tried to get the information through /proc/cpuinfo with no success. I also wrote a little program:
#include <stdio.h>
#include <unistd.h>
int main()
{
    int numofcores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Core(s) : %d\n", numofcores);
    return 0;
}
The problem with this program is that it returns 1, which cannot be the correct value, because I configured 2 cores. Is there another way to get the number of cores?
Are you sure Linux can see both cores? You can check this with something like cat /proc/cpuinfo. To support multicore, you will need to turn on SMP support when building riscv-linux.
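If /proc/cpuinfo does list both cores, another cross-check from user space is sched_getaffinity(), which reports the CPUs the process may run on (normally all online CPUs unless an affinity mask was set). A minimal sketch:
/* Count the CPUs in this process's affinity mask as a cross-check
 * against sysconf(_SC_NPROCESSORS_ONLN). */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("CPUs in affinity mask: %d\n", CPU_COUNT(&set));
    return 0;
}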

High CPU usage - simple packet receiver on Linux

I'm writing a simple application under Linux that gathers all packets from the network. I'm using blocking receives by calling the recvfrom() function. When I generate a big network load with hping3 (~100k raw frames per second, 130 bytes each), the "top" tool shows high CPU usage for my process - about 37-38%. That is a big value for me. When I decrease the number of packets, usage is lower - for example, top shows 3% at 4k frames per second.
I've checked DC++ while it downloads at ~10MB/s, and its process uses 5% of the CPU, not 38%. Is there any programmatic way in C to reduce CPU usage and still receive a lot of frames?
My CPU:
Intel i5-2400 CPU @ 3.10GHz
My system:
Ubuntu 11.04 kernel 3.6.6 with PREEMPT-RT patch
And here is my code:
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <linux/if_arp.h>
#include <arpa/inet.h>
/* Socket descriptor. */
int mainSocket;
/* Buffer for frame. */
unsigned char* buffer;
int main(int argc, char* argv[])
{
    /** Create socket. **/
    mainSocket = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (mainSocket == -1) {
        printf("Error: cannot create socket!\n");
        return 1;
    }
    /** Create buffer for frame. **/
    buffer = malloc(ETH_FRAME_LEN);
    printf("Listening...\n");
    while (1) {
        // Length of received packet
        int length = recvfrom(mainSocket, buffer, ETH_FRAME_LEN, 0, NULL, NULL);
        if (length > 0)
        {
            // ... do something ...
        }
    }
}
I don't know if this will help, but looking on Google I see that:
• Raw socket, Packet socket and Zero copy networking in Linux, as well as http://lxr.linux.no/linux+v2.6.36/Documentation/networking/packet_mmap.txt, talk about using PACKET_MMAP and mmap() to improve the performance of raw sockets (a minimal ring sketch follows after this list)
• The Overview of Packet Reception suggests setting your process's affinity to match the CPU to which you bind the NIC, using RPS.
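For what it's worth, here is a minimal, untested sketch of the PACKET_RX_RING approach from packet_mmap.txt (TPACKET_V1; block and frame sizes are arbitrary example values), which lets the kernel hand you frames through a shared ring instead of one recvfrom() call and copy per frame:
/* Sketch of a PACKET_RX_RING receive loop (TPACKET_V1). */
#include <stdio.h>
#include <string.h>
#include <poll.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

int main(void)
{
    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock == -1) { perror("socket"); return 1; }

    /* Ring geometry: example values (block size must be a multiple of the
       page size, frame size a multiple of 16 and large enough for a frame). */
    struct tpacket_req req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = 4096;
    req.tp_frame_size = 2048;
    req.tp_block_nr   = 64;
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    if (setsockopt(sock, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) == -1) {
        perror("setsockopt(PACKET_RX_RING)");
        return 1;
    }

    unsigned char *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                               PROT_READ | PROT_WRITE, MAP_SHARED, sock, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned int frame = 0;
    struct pollfd pfd = { .fd = sock, .events = POLLIN };
    for (;;) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + (size_t)frame * req.tp_frame_size);
        if (!(hdr->tp_status & TP_STATUS_USER)) {
            poll(&pfd, 1, -1);               /* block until the kernel fills a frame */
            continue;
        }
        unsigned char *data = (unsigned char *)hdr + hdr->tp_mac;
        (void)data;                          /* ... process hdr->tp_snaplen bytes at data ... */
        hdr->tp_status = TP_STATUS_KERNEL;   /* hand the frame back to the kernel */
        frame = (frame + 1) % req.tp_frame_nr;
    }
}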
Does DC++ do a promiscuous receive? I wouldn't have guessed so. So instead of comparing your performance to DC++, perhaps you should compare your performance to the performance of a utility like libpcap.
Maybe it is because the TCP/IP work for DC++ is handled for it (it gets a stream of data via the kernel's TCP/IP stack, with NIC offloads helping), so your processor is not doing that work per packet in your process. In your case, you are trying to get the data directly from the NIC as raw frames, so it is processed by your process, and since you have an infinite loop fetching data, you are doing a lot of processing... hence the CPU usage spike.

Linux shared memory allocation on x86_64

I have 64-bit RHEL Linux: Linux ipms-sol1 2.6.32-71.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
RAM size = ~38GB
I changed the default shared memory limits as follows in /etc/sysctl.conf and loaded the changed file with sysctl -p:
kernel.shmmni=81474836
kernel.shmmax=32212254720
kernel.shmall=7864320
Just on an experimental basis, I changed shmmax to 32GB and tried allocating 10GB in code using shmget() as given below, but it fails to get 10GB of shared memory in a single shot. When I reduce my request for shared space to 8GB, it succeeds. Any clue as to where I am possibly going wrong?
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#define SHMSZ 10737418240
int main()
{
    char c;
    int shmspaceid;
    key_t key;
    char *shm, *s;
    struct shmid_ds shmid;

    key = 5678;
    fprintf(stderr, "Changed code\n");

    if ((shmspaceid = shmget(key, SHMSZ, IPC_CREAT | 0666)) < 0) {
        fprintf(stderr, "ERROR memory allocation failed\n");
        return 1;
    }
    shmctl(shmspaceid, IPC_RMID, &shmid);
    return 0;
}
Regards
Himanshu
I'm not sure that this solution is applicable to shared memory as well, but I know this phenomenon from normal malloc() calls.
It's pretty common that you cannot allocate very large blocks of memory the way you try it here. What the function call means is "allocate me a block of contiguous memory of 10737418240 bytes". Often, even if the total system memory could theoretically satisfy this need, the implied "block of contiguous memory" forces the limit on allocatable memory to be much lower.
The in-memory program structure and the number of programs loaded can all contribute to blocking certain areas of memory, leaving no 10 contiguous gigabytes that can be allocated.
I have often found that a reboot will change that (as programs get loaded to different positions). You can probe your maximum allocatable block size with something like this:
size_t i = 1024;
int error = 0;
while (!error) {
    char *a = (char*)malloc(i);
    error = (a == NULL);
    if (!error)
        printf("Successfully allocated %zu.\n", i);
    free(a); /* release each probe so earlier blocks don't eat into the next attempt */
    i *= 2;
}
Hope this helps or is applicable here. I found this out while checking why I could not allocate close to maximum system memory to a JVM.
Shot in the dark: you don't have enough swap space. Shared memory, by default, requires reserving space in swap. You can disable this behavior using SHM_NORESERVE:
http://linux.die.net/man/2/shmget
SHM_NORESERVE (since Linux 2.6.15) This flag serves the same purpose
as the mmap(2) MAP_NORESERVE flag. Do not reserve swap space for this
segment. When swap space is reserved, one has the guarantee that it is
possible to modify the segment. When swap space is not reserved one
might get SIGSEGV upon a write if no physical memory is available. See
also the discussion of the file /proc/sys/vm/overcommit_memory in
proc(5).
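If swap reservation is indeed the limit, the change on the caller's side is just the extra flag; a sketch with the same key and size as in the question:
/* Same request as in the question, but with SHM_NORESERVE so no swap is
 * reserved for the segment (writes may later SIGSEGV if memory cannot be
 * backed, per the man page quoted above). */
#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHMSZ 10737418240UL

int main(void)
{
    int id = shmget(5678, SHMSZ, IPC_CREAT | SHM_NORESERVE | 0666);
    if (id < 0) {
        fprintf(stderr, "shmget: %s\n", strerror(errno));
        return 1;
    }
    shmctl(id, IPC_RMID, NULL);
    return 0;
}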
I was just looking at this and I recommend printing out the exact errno value and description for the problem, rather than just noting that it failed. For example:
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
//#define SHMSZ 10737418240
#define SHMSZ 8589934592
int main()
{
    int shmspaceid;
    key_t key = 5678;
    struct shmid_ds shmid;

    if ((shmspaceid = shmget(key, SHMSZ, IPC_CREAT | 0666)) < 0) {
        fprintf(stderr, "ERROR with shmget (%d: %s)\n", (int)(errno), strerror(errno));
        return 1;
    }
    shmctl(shmspaceid, IPC_RMID, &shmid);
    return 0;
}
I tried to reproduce your problem with an 8 GB block and 8 GB shmmax and shmall on my 16 GB system, but I could not: it worked fine. I do recommend using ipcs -m to look for other shared blocks that might prevent your 10 GB allocation from being honored. And definitely look closely at the exact error code that shmget() is returning through errno.
