Different number of cycles when running a benchmark more than once on C++ emulator - riscv

When running a benchmark, e.g. Dhrystone, on the C++ emulator with the command:
make output/dhrystone.riscv.out
as described at http://riscv.org/download.html#tab_rocket, I get the following output.
The first run gives:
Microseconds for one run through Dhrystone: 1064
Dhrystones per Second: 939
cycle = 533718
instret = 148672
and the second run gives:
Microseconds for one run through Dhrystone: 1064
Dhrystones per Second: 939
cycle = 533715
instret = 148672
Why do the cycle counts differ? Shouldn't they be exactly the same? I have tried this with other benchmarks too and saw even larger deviations. If this is normal, where do the deviations come from?

There are small amounts of nondeterminism from randomly initialized registers (e.g., the clock that is recovered by the HTIF is initialized to a random phase). It doesn't seem like these minor deviations would impact any performance benchmarking.
If you need identical results each time (e.g., for verification?), you could modify the emulator code to initialize registers to some known value each time.

Related

What timeframe does the "perf sched record" use?

I've been trying to analyze the output of perf sched record, but I don't understand what frame of reference to use for the "20624.983302 secs" values. It certainly isn't Unix time, so what is it? How would I go about converting it into Unix time?
*A0 20624.983302 secs A0 => migration/0:12
*. 20624.983311 secs . => swapper:0
*B0 20624.983318 secs B0 => IPC I/O Child:33924
*. 20624.983355 secs
*C0 20624.983485 secs C0 => WRScene~lder#15:39974
*. 20624.983581 secs
*D0 20624.983972 secs D0 => IPC I/O Parent:33780
These timestamps are captured using the kernel scheduler clock, which counts in nanoseconds since boot. The exact details depend on the compile-time parameters chosen to build a particular Linux distribution and the target architecture.
In general, the timestamp of a sample is captured around the same time it's recorded. Timestamps on the same core are guaranteed to be monotonically increasing as long as the core remains in an active state. The samples you've shown were all captured on the same core, and the core remained active from the first sample to the last, so the timestamps are guaranteed to be monotonic in this case irrespective of the platform and distribution. When profiling on multiple cores, there is no guarantee that the clocks on all cores are in sync.
All perf tools use the same clock to capture timestamps, but they may differ in the way timestamps are printed and it may happen that two tools print timestamps from the same sample file differently. This depends on the kernel version.
It's possible to specify a clock source when calling perf_event_open() by setting use_clockid to 1 and setting clockid to one of the clock sources defined in linux/time.h, such as CLOCK_MONOTONIC. perf record provides the -k or --clockid option to specify the clock source for capturing timestamps.
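For illustration, here is a minimal sketch (my own example, not from the original answer; the event choice and the omission of error handling are assumptions) of selecting a clock source through perf_event_open():

#include <linux/perf_event.h>   /* struct perf_event_attr, PERF_* constants */
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <time.h>               /* CLOCK_MONOTONIC */

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open a cycles counter for the calling process whose sample timestamps
   are taken from CLOCK_MONOTONIC instead of the default scheduler clock. */
static int open_cycles_counter_monotonic(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_type = PERF_SAMPLE_TIME;   /* include a timestamp in each sample */
    attr.use_clockid = 1;                  /* honour the clockid field below */
    attr.clockid = CLOCK_MONOTONIC;
    return perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
}

With perf record, the equivalent is the -k/--clockid option mentioned above.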
Modern distributions on x86 typically use TSC as the source for the scheduler clock (check /sys/devices/system/clocksource/clocksource0/current_clocksource). So if you're on an x86 processor, most probably the TSC of the profiled core was used to capture the current value of TSC cycles, which internally gets converted into nanoseconds. When a timestamp is printed, it may get converted to a different unit. In this case, timestamps are printed in the format "seconds.microseconds". A summary of the behavior of TSC on Intel processors can be found at: Can constant non-invariant tsc change frequency across cpu states?.

Communication frequency vs Simulation Time for FMU

Let's say we have an FMU which gets its inputs from Python and simulates with a step size of 0.001 s. Does the FMI/FMU standard allow us to run the FMU multiple times for the same input (so Python provides the input at a 0.01 s interval and the FMU steps 10 times per provided input)? Would that be faster, since we have cut the number of communication calls to a tenth?
(For CS FMUs:) Updating the inputs only every 10th step can be seen as a special co-simulation algorithm and is OK. Input variables keep their values until they are newly set.
This will only lead to a benefit in simulation speed if the internal calculation time (of a doStep) is small compared to the communication runtime.
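As a rough sketch of this pattern using the FMI 2.0 co-simulation C API (rather than the Python side mentioned in the question), assuming an already instantiated and initialized FMU c with a single real input whose value reference is vr_in; the names and the helper function are placeholders:

#include <stddef.h>
#include "fmi2Functions.h"   /* FMI 2.0 co-simulation API */

/* Step the FMU with an internal step size of 1 ms, but refresh the input
   only every 10th step, i.e. every 10 ms of simulated time. Between
   refreshes the input keeps its previously set value. */
void run_with_coarse_inputs(fmi2Component c, fmi2ValueReference vr_in,
                            const fmi2Real *inputs, size_t n_inputs)
{
    const fmi2Real h = 0.001;   /* FMU step size */
    fmi2Real t = 0.0;

    for (size_t i = 0; i < n_inputs; ++i) {
        fmi2SetReal(c, &vr_in, 1, &inputs[i]);   /* new input every 10 ms */
        for (int k = 0; k < 10; ++k) {
            fmi2DoStep(c, t, h, fmi2True);       /* input held constant */
            t += h;
        }
    }
}

Whether this pays off depends, as noted above, on how cheap each doStep is relative to the per-call communication overhead.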

Unexpected periodic behaviour of an ultra low latency hard real time multi-threaded x86 code

I am running code in a loop for multiple iterations on a dedicated CPU with RT priority and want to observe its behaviour over a long time. I found a very strange periodic behaviour of the code.
Briefly, this is what the code does:
Arraythread
{
    while(1)
    {
        if(flag)
            Multiply matrix
        record time;
        reset flag;
    }
}

mainthread
{
    for(30 mins)
    {
        set flag;
        record time;
        busy while(500 μs)
    }
}
Here are the details about the machine I am using:
CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10 GHz
L1 cache: 32K d and 32K i
L2 cache: 1024K
L3 cache: 28160K
Kernel: 3.10.0-693.2.2.rt56.623.el7.x86_64 #1 SMP PREEMPT RT
OS: CentOS
Current active profile: latency-performance
I modified the global limit of Linux real time scheduling (sched_rt_runtime_us) from 95% to 100%
Both of the above threads are bound to a single NUMA node, each with priority 99
More details about the code:
mainthread sets a flag every 500 μs. I used CLOCK_MONOTONIC_RAW with the clock_gettime function to read the time (let's say T0); a sketch of this timestamp capture follows this list.
I put all the variables in a structure to reduce the cache misses.
Arraythread runs a busy while loop and waits for the flag to be set.
Once the flag is set, it multiplies two big arrays.
Once the multiplication is done, it resets the flag and records the time (let's say T1).
I run this experiment for 30 mins (= 3600000 iterations)
I measure the time difference T1-T0 once the experiment is over.
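For reference, here is a minimal sketch of the timestamp capture described above (my own simplification, not the code from the linked repository; the actual program keeps everything in a shared structure and busy-waits on the flag):

#include <stdatomic.h>
#include <time.h>

static atomic_int flag = 0;

/* Elapsed time in microseconds between two CLOCK_MONOTONIC_RAW timestamps. */
static double elapsed_us(const struct timespec *t0, const struct timespec *t1)
{
    return (double)(t1->tv_sec - t0->tv_sec) * 1e6
         + (double)(t1->tv_nsec - t0->tv_nsec) / 1e3;
}

/* mainthread side: record T0 and raise the flag. */
static void signal_worker(struct timespec *t0)
{
    clock_gettime(CLOCK_MONOTONIC_RAW, t0);
    atomic_store(&flag, 1);
}

/* Arraythread side: after the matrix multiplication, record T1 and clear the flag. */
static void work_done(struct timespec *t1)
{
    clock_gettime(CLOCK_MONOTONIC_RAW, t1);
    atomic_store(&flag, 0);
}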
Here is the clock:
The average time of the clock is ~500.5 microseconds. There are fluctuations, which are expected.
Here is the time taken by the array multiplication:
This is the full 30 minute view of the result.
There are four peaks in the results. The first peak is expected, since on the very first iteration the data comes from main memory and the CPU was waking from an idle state.
Apart from the first peak, there are three more peaks; the time difference between peak_3 and peak_2 is 11.99364 mins, whereas the time difference between peak_4 and peak_3 is 11.99358 mins. (I assumed the clock period to be 500 μs.)
If I zoom it further:
This image shows what happened over 5 minutes.
If I zoom it further:
This image shows what happened over ~1.25 mins.
Notice that the average time of the multiplication is around 113 μs and there are peaks throughout.
If I zoom it further:
This image shows what happened over 20 seconds.
If I zoom it further:
This image shows what happened over 3.5 seconds.
The time differences between the starting points of these peaks are: 910 ms, 910 ms, 902 ms (assuming two consecutive points are 500 μs apart).
If I zoom it further:
This image shows what happened over 500 ms.
~112.6 μs is the average time here, and the complete data lies within a 1 μs range.
Here are my questions:
Given that the L3 cache is large enough to hold the complete executable, there is no file read/write, nothing else is running on the machine, and no context switches are happening, why do some of the executions take almost double (or sometimes more than double) the time? [see the peaks in the first result image]
If we set aside those four peaks from the first image, how do I explain the periodic peaks in the results with an almost constant time difference? What is the CPU doing? These periodic peaks last a few milliseconds.
I expect the results to be nearly constant, as in the last image. Are there OS/CPU settings I can apply so that the code keeps behaving like the last image indefinitely?
Here is the complete code:
https://github.com/sghoslya/kite/blob/main/multiThreadProfCheckArray.c

PERF STAT does not count memory-loads but counts memory-stores

Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu : 17.04
I have been trying to collect stats of memory accesses using perf stat. I am able to collect stats for memory-stores, but the count for memory-loads returns a 0 value.
Below are the details for memory-stores:
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a 0 count, as can be seen below:
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event to get proper data?
The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is
used that can keep track of a data load from issue to completion.
This is more complicated than simply counting instances of an event
(as with normal event-based sampling), and so only some loads are
tracked. Loads are randomly chosen, the latency determined for each,
and the correct event(s) incremented (latency >4, >8, >16, etc). Due
to the nature of the sampling for this event, only a small percentage
of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads, and it is not designed for that purpose at all.
If you want to determine which instructions in your code are issuing load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, and it achieves its purpose by using this event.
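For example, using the binary name from the question, load latencies can be sampled and inspected with the standard perf-mem subcommands:
perf mem record ./libquantum_base.arnab 100
perf mem report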
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, only more portable because they also work on AMD processors.
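For example, counting retired loads and stores for the program from the question (same binary name assumed) would look like:
perf stat -e L1-dcache-loads,L1-dcache-stores ./libquantum_base.arnab 100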
I used a Broadwell (E5-2620) server machine to collect all of the events below.
To collect memory-load events, I had to use a numeric event value. I basically ran the command below:
./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
Here r81d0 represents the raw event for counting "memory loads amongst all instructions retired", and the u modifier restricts counting to user space.
The below command, on the other hand,
./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
has "r82d0:u" as a raw event representing "memory stores amongst all instructions retired in userspace".

Measuring time: differences among gettimeofday, TSC and clock ticks

I am doing some performance profiling for part of my program, and I try to measure the execution time with the following four methods. Interestingly, they show different results and I don't fully understand their differences. My CPU is an Intel(R) Core(TM) i7-4770. The system is Ubuntu 14.04. Thanks in advance for any explanation.
Method 1:
Use the gettimeofday() function; the result is in seconds
Method 2:
Use the rdtsc instruction similar to https://stackoverflow.com/a/14019158/3721062
Methods 3 and 4 use Intel's Performance Counter Monitor (PCM) API
Method 3:
Use PCM's
uint64 getCycles(const CounterStateType & before, const CounterStateType &after)
Its description (which I don't quite understand):
Computes the number of core clock cycles when a specific core is running (not halted)
Returns the number of used cycles (halted cycles are not counted). The counter does not advance in the following conditions:
an ACPI C-state is other than C0 for normal operation
HLT
STPCLK+ pin is asserted
being throttled by TM1
during the frequency switching phase of a performance state transition
The performance counter for this event counts across performance state transitions using different core clock frequencies
Method 4:
Use PCM's
uint64 getInvariantTSC (const CounterStateType & before, const CounterStateType & after)
Its description:
Computes number of invariant time stamp counter ticks.
This counter counts irrespectively of C-, P- or T-states
Two sample runs generate results as follows:
(Method 1 is in seconds. Methods 2-4 are divided by the same number to show a per-item cost.)
0.016489 0.533603 0.588103 4.15136
0.020374 0.659265 0.730308 5.15672
Some observations:
The ratio of Method 1 over Method 2 is very consistent, while the others are not, i.e., 0.016489/0.533603 = 0.020374/0.659265. Assuming gettimeofday() is sufficiently accurate, the rdtsc method exhibits the "invariant" property. (I have read that current generations of Intel CPUs have this invariant-TSC feature.)
Method 3 reports higher values than Method 2. I guess it is somehow different from the TSC, but what is it?
Method 4 is the most confusing one. It reports numbers an order of magnitude larger than Methods 2 and 3. Shouldn't it also be some kind of cycle count, especially since it carries the "Invariant" name?
gettimeofday() is not designed for measuring time intervals. Don't use it for that purpose.
If you need wall time intervals, use the POSIX monotonic clock. If you need CPU time spent by a particular process or thread, use the POSIX process time or thread time clocks. See man clock_gettime.
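As a small sketch (my own example, not from the original answer) of measuring a wall-clock interval with the POSIX monotonic clock:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... code under measurement ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double seconds = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.9f s\n", seconds);
    return 0;
}

Replacing CLOCK_MONOTONIC with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID gives the process or thread CPU-time clocks mentioned above.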
The PCM API is great for fine-tuned performance measurement when you know exactly what you are doing, which is generally obtaining a variety of separate memory, core, cache, low-power, ... performance figures. Don't start messing with it if you are not sure what exact services you need from it that you can't get from clock_gettime.
