Measuring time: differences among gettimeofday, TSC and clock ticks - linux

I am doing some performance profiling for part of my program, and I tried to measure the execution with the following four methods. Interestingly, they show different results and I don't fully understand their differences. My CPU is an Intel(R) Core(TM) i7-4770 and the system is Ubuntu 14.04. Thanks in advance for any explanation.
Method 1:
Use the gettimeofday() function; the result is in seconds
Method 2:
Use the rdtsc instruction similar to https://stackoverflow.com/a/14019158/3721062
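A minimal sketch of this method (shown here with GCC/Clang's __rdtsc intrinsic rather than the inline assembly of the linked answer):

#include <x86intrin.h>  // __rdtsc on GCC/Clang
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t start = __rdtsc();
    // ... code under measurement ...
    uint64_t end = __rdtsc();
    printf("elapsed TSC ticks: %llu\n", (unsigned long long)(end - start));
    return 0;
}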
Methods 3 and 4 use Intel's Performance Counter Monitor (PCM) API
Method 3:
Use PCM's
uint64 getCycles(const CounterStateType & before, const CounterStateType &after)
Its description (which I don't quite understand):
Computes the number of core clock cycles while the signal on a specific core is running (not halted)
Returns the number of used cycles (halted cycles are not counted). The counter does not advance in the following conditions:
an ACPI C-state is other than C0 for normal operation
HLT
STPCLK+ pin is asserted
being throttled by TM1
during the frequency switching phase of a performance state transition
The performance counter for this event counts across performance state transitions using different core clock frequencies
Method 4:
Use PCM's
uint64 getInvariantTSC (const CounterStateType & before, const CounterStateType & after)
Its description:
Computes number of invariant time stamp counter ticks.
This counter counts irrespectively of C-, P- or T-states
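For context, a minimal sketch of how Methods 3 and 4 are collected (a sketch assuming the standard cpucounters.h API as used in PCM's sample programs, not the question's actual code):

#include "cpucounters.h"  // Intel PCM
#include <cstdio>

int main() {
    PCM *pcm = PCM::getInstance();
    if (pcm->program() != PCM::Success) return 1;  // program the counters
    SystemCounterState before = getSystemCounterState();
    // ... code under measurement ...
    SystemCounterState after = getSystemCounterState();
    printf("unhalted core cycles: %llu\n",
           (unsigned long long)getCycles(before, after));       // Method 3
    printf("invariant TSC ticks:  %llu\n",
           (unsigned long long)getInvariantTSC(before, after)); // Method 4
    pcm->cleanup();
    return 0;
}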
Two sample runs generate results as follows:
(Method 1 is in seconds. Methods 2-4 are each divided by the same number to show a per-item cost.)
0.016489 0.533603 0.588103 4.15136
0.020374 0.659265 0.730308 5.15672
Some observations:
The ratio of Method 1 to Method 2 is very consistent, while the others are not, i.e., 0.016489/0.533603 ≈ 0.020374/0.659265. Assuming gettimeofday() is sufficiently accurate, the rdtsc method exhibits the "invariant" property. (Yes, I read on the Internet that the current generation of Intel CPUs has this feature for rdtsc.)
Method 3 reports higher numbers than Method 2. I guess it's somehow different from the TSC. But what is it?
Method 4 is the most confusing one. It reports numbers an order of magnitude larger than Methods 2 and 3. Shouldn't it also be some kind of cycle count? Especially given that it carries the "invariant" name.

gettimeofday() is not designed for measuring time intervals. Don't use it for that purpose.
If you need wall time intervals, use the POSIX monotonic clock. If you need CPU time spent by a particular process or thread, use the POSIX process time or thread time clocks. See man clock_gettime.
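For instance, a minimal interval-measurement sketch with the monotonic clock (my example, not code from the question; -lrt is needed only on glibc older than 2.17):

#include <ctime>
#include <cstdio>

int main() {
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);  // immune to NTP/settimeofday jumps
    // ... code under measurement ...
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = double(t1.tv_sec - t0.tv_sec)
                + double(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("elapsed: %.9f s\n", secs);
    return 0;
}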
The PCM API is great for fine-tuned performance measurement when you know exactly what you are doing, which generally means obtaining a variety of separate memory, core, cache, and low-power performance figures. Don't start messing with it unless you are sure exactly what services you need from it that you can't get from clock_gettime.

Related

What timeframe does the "perf sched record" use?

I've been trying to analyze the output of perf sched record, but I don't understand what frame of reference to use for a timestamp like "20624.983302 secs". It isn't Unix time, for sure, so what is it? And how would I go about converting it into Unix time?
*A0 20624.983302 secs A0 => migration/0:12
*. 20624.983311 secs . => swapper:0
*B0 20624.983318 secs B0 => IPC I/O Child:33924
*. 20624.983355 secs
*C0 20624.983485 secs C0 => WRScene~lder#15:39974
*. 20624.983581 secs
*D0 20624.983972 secs D0 => IPC I/O Parent:33780
These timestamps are captured using the kernel scheduler clock, which counts in nanoseconds since boot. The exact details depend on the compile-time parameters chosen to build a particular Linux distribution and the target architecture.
In general, the timestamp of a sample is captured around the same time it's recorded. Timestamps on the same core are guaranteed to be monotonically increasing as long as the core remains in an active state. The samples you've shown were all captured on the same core, and the core remained active from the first sample to the last, so the timestamps are guaranteed to be monotonic in this case irrespective of the platform and distribution. When profiling on multiple cores, there is no guarantee that the clocks on all cores are in sync.
All perf tools use the same clock to capture timestamps, but they may differ in the way timestamps are printed and it may happen that two tools print timestamps from the same sample file differently. This depends on the kernel version.
It's possible to specify a clock source when calling perf_event_open() by setting use_clockid to 1 and setting clockid to one of the clock sources defined in linux/time.h, such as CLOCK_MONOTONIC. perf record provides the -k or --clockid option to specify the clock source for capturing timestamps.
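For illustration, a minimal sketch of selecting a clock source through perf_event_open() (my example, not from the original answer; it requires Linux 4.1 or later headers). The command-line equivalent would be something like perf record -k monotonic.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <time.h>
#include <cstring>
#include <cstdio>

int main() {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100000;        // sample every 100k cycles
    attr.sample_type = PERF_SAMPLE_TIME;
    attr.use_clockid = 1;               // honor the clockid field below
    attr.clockid = CLOCK_MONOTONIC;     // any clock source from linux/time.h
    int fd = (int)syscall(SYS_perf_event_open, &attr,
                          0 /*this process*/, -1 /*any CPU*/,
                          -1 /*no group*/, 0 /*no flags*/);
    if (fd < 0) { perror("perf_event_open"); return 1; }
    // ... mmap a ring buffer and consume timestamped samples here ...
    close(fd);
    return 0;
}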
Modern distributions on x86 typically use TSC as the source for the scheduler clock (check /sys/devices/system/clocksource/clocksource0/current_clocksource). So if you're on an x86 processor, most probably the TSC of the profiled core was used to capture the current value of TSC cycles, which internally gets converted into nanoseconds. When a timestamp is printed, it may get converted to a different unit. In this case, timestamps are printed in the format "seconds.microseconds". A summary of the behavior of TSC on Intel processors can be found at: Can constant non-invariant tsc change frequency across cpu states?.
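The cycles-to-nanoseconds conversion uses a multiply-and-shift scheme; the perf_event_open(2) man page documents it via the time_mult, time_shift and time_offset fields of perf_event_mmap_page. A sketch of that arithmetic:

#include <cstdint>

// ns = offset + (cycles * mult) >> shift, split into quotient and
// remainder so the 64-bit multiply cannot overflow.
static uint64_t tsc_to_ns(uint64_t cycles, uint32_t mult, uint16_t shift,
                          uint64_t offset) {
    uint64_t quot = cycles >> shift;
    uint64_t rem  = cycles & ((1ULL << shift) - 1);
    return offset + quot * mult + ((rem * mult) >> shift);
}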

How to know max event period of perf-record

Does anyone know how to get the maximum event period value (or the value that the kernel actually passes to the PMU) of a perf event?
I'm using perf to measure my program as follow:
perf record -d -e cpu/event=0xd0,umask=0x81/ppu,cpu/event=0xd0,umask=0x82/ppu -c 5
cpu/event=0xd0,umask=0x81/ppu means measuring all loads in the CPU, and cpu/event=0xd0,umask=0x82/ppu means all stores.
I tried to work out how the arguments are passed in perf by using strace, but found nothing.
If the PMU receives a value beyond its capability, will it still try to reach it? If so, where can I find the related code, and what is the maximum event period of those events?
Thanks everyone.
The perf record command accepts period values much larger than 255. Internally, the processor maintains a counter for recording all the memory loads and memory stores (or, for that matter, any other supported event). Once the counter overflows, the processor will record all the information about the memory load/store that you are trying to record (information about the architectural state, registers, etc.).
Also, once the counter overflows, it must be reset. Usually the counter is reset to a value less than 0; since it increments, it will overflow once it hits 0 again.
This counter reset value is the period value that you asked for. That is, if the period is specified with -c 1, the counter reset value will be set to -1, so the next memory load/store will increment the counter to 0 (leading to a counter overflow) and you will record the event.
Thus, if you set the period to 1, there will be a counter overflow on each memory load/store event and you will record all of them (this is only conceptual, however; the hardware usually cannot do this).
What this means is that the period value can go as large as the size of the hardware counter for these events. In modern microarchitectures like Broadwell, Haswell, and Skylake, these counters are 48 bits wide, so the period could go as large as 2^48 - 1; however, using such large values is not recommended.
Usually, the period value should be kept to a maximum of 2^32 - 1 on 32-bit systems, and this is usually the norm on other systems too.
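As a toy illustration of the reset arithmetic described above (a sketch assuming a 48-bit counter, as on Haswell/Broadwell/Skylake):

#include <cstdint>
#include <cstdio>

int main() {
    const int kCounterBits = 48;                 // typical general-purpose PMC width
    const uint64_t kWrap = 1ULL << kCounterBits; // counter wraps at 2^48
    uint64_t period = 5;                         // the -c value given to perf record
    // The counter is programmed to 2^48 - period so that it overflows
    // (and a sample is taken) after exactly `period` events.
    uint64_t reset_value = kWrap - period;
    printf("period %llu -> reset value 0x%012llx\n",
           (unsigned long long)period, (unsigned long long)reset_value);
    return 0;
}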
Sources:
Chapter 18 of this book
Please read the topic Sampling with perf record in this link too
If you want you can read the answer to this question too.

PERF STAT does not count memory-loads but counts memory-stores

Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu : 17.04
I have been trying to collect stats on memory accesses using perf stat. I am able to collect stats for memory-stores, but the count for memory-loads returns a 0 value.
Below are the details for memory-stores:
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a 0 count, as can be seen below:
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event to get proper data?
The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
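For example, using the modifier syntax from this answer together with the binary name from the question:

perf record -e mem-loads:p ./libquantum_base.arnab 100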
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is used that can keep track of a data load from issue to completion. This is more complicated than simply counting instances of an event (as with normal event-based sampling), and so only some loads are tracked. Loads are randomly chosen, the latency determined for each, and the correct event(s) incremented (latency >4, >8, >16, etc). Due to the nature of the sampling for this event, only a small percentage of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads, and it is not designed for that purpose at all.
If you want to determine which instructions in your code issue load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, and it achieves its purpose by using this event.
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, only more portable because they also work on AMD processors.
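For example (using the binary name from the question):

perf stat -e L1-dcache-loads,L1-dcache-stores ./libquantum_base.arnab 100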
I have used a Broadwell (CPU E5-2620) server machine to collect all of the below events.
To collect memory-load events, I had to use a numeric event value. I basically ran the below command:
./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
Here r81d0 represents the raw event for counting "memory loads amongst all instructions retired"; perf's raw event syntax on Intel packs the umask into the high byte and the event code into the low byte, so r81d0 encodes the same event=0xd0,umask=0x81 pair seen earlier. The "u" modifier, as can be understood, restricts counting to user space.
The command below, on the other hand,
./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
has "r82d0:u" as a raw event representing "memory stores amongst all instructions retired in userspace".

Different number of cycles when running a benchmark more than once on C++ emulator

When running a benchmark, e.g. Dhrystone, with the command:
make output/dhrystone.riscv.out
as described at: http://riscv.org/download.html#tab_rocket,
on the C++ emulator, I get the following output:
When running it for the first time:
Microseconds for one run through Dhrystone: 1064
Dhrystones per Second: 939
cycle = 533718
instret = 148672
and the second time:
Microseconds for one run through Dhrystone: 1064
Dhrystones per Second: 939
cycle = 533715
instret = 148672
Why do the cycles differ? Shouldn't they be exactly the same? I have tried this with other benchmarks too and saw even higher deviations. If this is normal, where do the deviations come from?
There are small amounts of nondeterminism from randomly initialized registers (e.g., the clock that is recovered by the HTIF is initialized to a random phase). It doesn't seem like these minor deviations would impact any performance benchmarking.
If you need identical results each time (e.g., for verification?), you could modify the emulator code to initialize registers to some known value each time.

tuning pid in systems with delay

I need to tune PI(D) gains in a system which has quite a large delay. It's a common temperature controller, but the temperature probe is far away from the heater. Some further info:
the response of the probe is delayed about 10 seconds from any change on the heater
the temperature is sampled at 1 Hz, with a resolution of 0.01 °C
the heater is controlled by PWM at 1 Hz, with 10-bit resolution
the goal is to maintain the oscillation below ±0.05 °C
Currently I'm using the controller as a PI. I can't avoid oscillations: the higher the gain, the smaller and faster the oscillations, but they are still too large (about ±0.15 °C).
Reducing the P and I gains leads to very long and deep oscillations.
I think this is due to the delay.
The settling time is not a problem, it may take all the time it needs.
I'm puzzling over how to get the system to work. Suppose I use only the I term: when the probe reaches the target value and the I output starts to decrease, the temperature will keep rising for some time. I cannot use the derivative term because the variations are too slow and the dError is very close to zero (if I set the dGain to a huge value, there is too much noise).
Any idea?
Try P-only. How fast are the proportional-only oscillations? If you can't tune Kp small enough to get no oscillations, then your heater is overpowered for your system.
If the dead time of the system is on the order of 10 s, the time constant (T_i) for the integral term should be 3.3 times the dead time, using the Ziegler-Nichols open-loop PI rule ( https://controls.engin.umich.edu/wiki/index.php/PIDTuningClassical#Ziegler-Nichols_Open-Loop_Tuning_Method_or_Process_Reaction_Method ), and the integral gain should then be Ki = Kp/T_i. So with a dead time of 10 s, Ki should be Kp/33 or smaller.
If you are getting integral-only oscillations, then the integral is winding up and down quicker than the process responds, and it should be even smaller.
Also -- think of the units of the different terms. It might not be the delay causing your problems so much as the resolution of the measurement and control systems. If you're driving a (for example) 100 W heater with a 1/1024-resolution PWM, you've got 0.1 W of resolution per PWM count that you are trying to adjust based on 0.01 °C temperature differences. Below Kp = 100 PWM counts/degree (or 10 W/degree) you don't have enough resolution in the PWM to make a change in response to a 0.01 °C error. At Kp = 10 counts/degree you might need a 0.10 °C change to produce any actual change in the PWM power. Can you use a higher-resolution PWM?
Thinking of it the other way: if you want to operate a system over a range of 30 °C with 0.01 °C control, I'd think you would want at least a 15-bit PWM, giving the controlled system 10 times the resolution of the measurement. With only 10 bits of PWM you get only about 1 °C of total range with control at 10x the resolution of the measurements.
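A toy calculation mirroring the arithmetic above (a sketch; the 100 W heater is the example figure from the previous paragraph, not a measured value):

#include <cstdio>

int main() {
    const double heater_watts = 100.0;   // example heater power from the text
    const int pwm_counts = 1 << 10;      // 10-bit PWM -> 1024 levels
    const double temp_lsb = 0.01;        // probe resolution, degrees C
    double watts_per_count = heater_watts / pwm_counts;  // ~0.098 W per count
    // Smallest Kp (PWM counts per degree) for which a one-LSB temperature
    // error moves the output by at least one PWM count:
    double min_kp = 1.0 / temp_lsb;      // 100 counts/degree
    printf("%.3f W per PWM count; need Kp >= %.0f counts/deg (%.1f W/deg)\n",
           watts_per_count, min_kp, min_kp * watts_per_count);
    return 0;
}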
Normally for large delays you have two options: lower the gains of the system or, if you have a model of the plant you are controlling, use a Smith Predictor.
I would start by modelling your system (using open-loop steps in the input) to quantify the delay and the time constant of your plant, then check if the sampling of the temperature and the PWM rate are OK.
Notice that if your PWM frequency is too low in comparison to the plant dynamics, you will have sustained oscillations because of the slow PWM. You can check this by applying just a constant input to your PWM (no controllers, open loop).
EDIT: Didn't see that the problem was already solved, but I'll leave this here for reference.
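To make the Smith Predictor option concrete, here is a toy discrete-time sketch (my illustration, not the answerer's code; it assumes a first-order plant model y[k+1] = a*y[k] + b*u[k] with a pure dead time of `delay` samples):

#include <deque>

struct SmithPI {
    double kp, ki;            // PI gains
    double a, b;              // model: y[k+1] = a*y[k] + b*u[k]
    int delay;                // dead time in samples (e.g. 10 at 1 Hz)
    double integ = 0, ym = 0; // integrator state, undelayed model output
    std::deque<double> hist;  // model outputs queued for `delay` samples

    SmithPI(double kp_, double ki_, double a_, double b_, int d)
        : kp(kp_), ki(ki_), a(a_), b(b_), delay(d), hist(d, 0.0) {}

    double step(double setpoint, double measured) {
        // Control on the *predicted* undelayed output, corrected by the
        // mismatch between the measurement and the delayed model output.
        double y_pred = ym + (measured - hist.front());
        double err = setpoint - y_pred;
        integ += ki * err;
        double u = kp * err + integ;
        if (u < 0) u = 0;
        if (u > 1023) u = 1023;   // clamp to the 10-bit PWM range
        ym = a * ym + b * u;      // advance the undelayed model
        hist.push_back(ym);       // queue it to emerge `delay` samples later
        hist.pop_front();
        return u;                 // PWM duty in counts
    }
};

The point of the structure is that the PI gains can then be tuned as if the dead time were not there; the accuracy of the a/b model determines how well that holds in practice.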
