Is gettimeofday() guaranteed to be of microsecond resolution? - linux

I am porting a game, that was originally written for the Win32 API, to Linux (well, porting the OS X port of the Win32 port to Linux).
I have implemented QueryPerformanceCounter by giving the uSeconds since the process start up:
BOOL QueryPerformanceCounter(LARGE_INTEGER* performanceCount)
{
    struct timeval currentTimeVal;   /* startTimeVal is a struct timeval captured once at process start-up */
    gettimeofday(&currentTimeVal, NULL);
    performanceCount->QuadPart  = (currentTimeVal.tv_sec - startTimeVal.tv_sec);
    performanceCount->QuadPart *= (1000 * 1000);
    performanceCount->QuadPart += (currentTimeVal.tv_usec - startTimeVal.tv_usec);
    return true;
}
This, coupled with QueryPerformanceFrequency() giving a constant 1000000 as the frequency, works well on my machine, giving me a 64-bit variable that contains uSeconds since the program's start-up.
So is this portable? I don't want to discover it works differently if the kernel was compiled in a certain way or anything like that. I am fine with it being non-portable to something other than Linux, however.

Maybe. But you have bigger problems. gettimeofday() can produce incorrect timings if there are processes on your system that adjust the clock (e.g. ntpd). On a "normal" Linux, though, I believe the resolution of gettimeofday() is 10 µs. The clock can jump forward and backward, and your timings with it, depending on the processes running on your system. This effectively makes the answer to your question no.
You should look into clock_gettime(CLOCK_MONOTONIC) for timing intervals. It suffers from fewer issues due to things like multi-core systems and external clock settings.
Also, look into the clock_getres() function.
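For interval timing, a minimal sketch along those lines (the section being timed is just a placeholder, and on older glibc you may need to link with -lrt):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec res, start, end;

        /* How fine-grained the monotonic clock claims to be. */
        clock_getres(CLOCK_MONOTONIC, &res);
        printf("CLOCK_MONOTONIC resolution: %ld ns\n", res.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... the code being timed goes here ... */
        clock_gettime(CLOCK_MONOTONIC, &end);

        long long elapsed_ns = (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
                             + (end.tv_nsec - start.tv_nsec);
        printf("elapsed: %lld ns\n", elapsed_ns);
        return 0;
    }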

High Resolution, Low Overhead Timing for Intel Processors
If you're on Intel hardware, here's how to read the CPU's time-stamp counter. It will tell you the number of CPU cycles executed since the processor was booted. This is probably the finest-grained counter you can get for performance measurement.
Note that this is the number of CPU cycles. On Linux you can get the CPU speed from /proc/cpuinfo and divide by it to get the number of seconds. Converting this to a double is quite handy.
When I run this on my box, I get
11867927879484732
11867927879692217
it took this long to call printf: 207485
Here's the Intel developer's guide that gives tons of detail.
#include <stdio.h>
#include <stdint.h>

/* Serialize with CPUID, then read the time-stamp counter. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "xorl %%eax, %%eax\n"
        "cpuid\n"                 /* serialize: wait for earlier instructions to retire */
        "rdtsc\n"                 /* EDX:EAX = time-stamp counter */
        : "=a" (lo), "=d" (hi)
        :
        : "%ebx", "%ecx");
    return (uint64_t)hi << 32 | lo;
}

int main(void)
{
    uint64_t x, y;
    x = rdtsc();
    printf("%llu\n", (unsigned long long)x);
    y = rdtsc();
    printf("%llu\n", (unsigned long long)y);
    printf("it took this long to call printf: %llu\n", (unsigned long long)(y - x));
    return 0;
}
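As a rough illustration of the cycles-to-seconds conversion mentioned above, here is a sketch; cpu_mhz and cycles_to_seconds are just illustrative helper names, and since the "cpu MHz" value in /proc/cpuinfo changes under frequency scaling, treat the result as approximate:

    #include <stdio.h>
    #include <stdint.h>

    /* Returns the first "cpu MHz" value from /proc/cpuinfo, or 0.0 on failure. */
    static double cpu_mhz(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        double mhz = 0.0;
        if (!f)
            return 0.0;
        while (fgets(line, sizeof line, f))
            if (sscanf(line, "cpu MHz : %lf", &mhz) == 1)
                break;
        fclose(f);
        return mhz;
    }

    /* cycles / (MHz * 1e6) gives seconds, e.g. for the difference of two rdtsc() readings. */
    static double cycles_to_seconds(uint64_t cycles)
    {
        double mhz = cpu_mhz();
        return mhz > 0.0 ? (double)cycles / (mhz * 1e6) : 0.0;
    }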

@Bernard:
I have to admit, most of your example went straight over my head. It does compile, and seems to work, though. Is this safe for SMP systems or SpeedStep?
That's a good question... I think the code's ok.
From a practical standpoint, we use it in my company every day,
and we run on a pretty wide array of boxes, everything from 2-8 cores.
Of course, YMMV, etc, but it seems to be a reliable and low-overhead
(because it doesn't make a context switch into system-space) method
of timing.
Generally how it works is:
declare the block of code to be assembler (and volatile, so the
optimizer will leave it alone).
execute the CPUID instruction. In addition to getting some CPU information
(which we don't do anything with) it synchronizes the CPU's execution buffer
so that the timings aren't affected by out-of-order execution.
execute the RDTSC (read time-stamp counter) instruction. This fetches the number of
machine cycles executed since the processor was reset. This is a 64-bit
value, so with current CPU speeds it will wrap around every 194 years or so.
Interestingly, in the original Pentium reference, they note it wraps around every
5800 years or so.
the last couple of lines store the values from the registers into
the variables hi and lo, and put that into the 64-bit return value.
Specific notes:
out-of-order execution can cause incorrect results, so we execute the
"cpuid" instruction which in addition to giving you some information
about the cpu also synchronizes any out-of-order instruction execution.
Most OS's synchronize the counters on the CPUs when they start, so
the answer is good to within a couple of nano-seconds.
The hibernating comment is probably true, but in practice you
probably don't care about timings across hibernation boundaries.
regarding speedstep: Newer Intel CPUs compensate for the speed
changes and return an adjusted count. I did a quick scan over
some of the boxes on our network and found only one box that
didn't have it: a Pentium 3 running some old database server.
(these are linux boxes, so I checked with: grep constant_tsc /proc/cpuinfo)
I'm not sure about the AMD CPUs, we're primarily an Intel shop,
although I know some of our low-level systems gurus did an
AMD evaluation.
Hope this satisfies your curiosity, it's an interesting and (IMHO)
under-studied area of programming. You know when Jeff and Joel were
talking about whether or not a programmer should know C? I was
shouting at them, "hey forget that high-level C stuff... assembler
is what you should learn if you want to know what the computer is
doing!"

You may be interested in the Linux FAQ for clock_gettime(CLOCK_REALTIME).

Wine actually uses gettimeofday() to implement QueryPerformanceCounter(), and it is known to make many Windows games work on Linux and Mac.
The call chain starts at http://source.winehq.org/source/dlls/kernel32/cpu.c#L312
and leads to http://source.winehq.org/source/dlls/ntdll/time.c#L448

So it says microseconds explicitly, but says the resolution of the system clock is unspecified. I suppose resolution in this context means the smallest amount by which it will ever be incremented?
The data structure is defined as having microseconds as a unit of measurement, but that doesn't mean that the clock or operating system is actually capable of measuring that finely.
Like other people have suggested, gettimeofday() is bad because setting the time can cause clock skew and throw off your calculation. clock_gettime(CLOCK_MONOTONIC) is what you want, and clock_getres() will tell you the precision of your clock.

The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate.
I obtained this answer from High Resolution Time Measurement and Timers, Part I

This answer mentions problems with the clock being adjusted. Both of your problems, guaranteeing the tick units and coping with the time being adjusted, are solved in C++11 with the <chrono> library.
The clock std::chrono::steady_clock is guaranteed not to be adjusted, and furthermore it will advance at a constant rate relative to real time, so technologies like SpeedStep must not affect it.
You can get typesafe units by converting to one of the std::chrono::duration specializations, such as std::chrono::microseconds. With this type there's no ambiguity about the units used by the tick value. However, keep in mind that the clock doesn't necessarily have this resolution. You can convert a duration to attoseconds without actually having a clock that accurate.
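A minimal sketch of that approach (the section being timed is just a placeholder):

    #include <chrono>
    #include <cstdio>

    int main()
    {
        auto start = std::chrono::steady_clock::now();
        /* ... the code being timed goes here ... */
        auto end = std::chrono::steady_clock::now();

        // The duration type makes the unit explicit; the underlying clock may
        // still be coarser than one microsecond.
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        std::printf("elapsed: %lld us\n", (long long)us.count());
        return 0;
    }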

From my experience, and from what I've read across the internet, the answer is "No," it is not guaranteed. It depends on CPU speed, operating system, flavor of Linux, etc.

Reading the TSC via RDTSC is not reliable on SMP systems, since each CPU maintains its own counter and each counter is not guaranteed to be synchronized with respect to another CPU.
I might suggest trying clock_gettime(CLOCK_REALTIME). The POSIX manual indicates that it should be implemented on all compliant systems. It can provide a nanosecond count, but you will probably want to check clock_getres(CLOCK_REALTIME) on your system to see what the actual resolution is.
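For example, a quick check of what your particular system reports (the printed values vary by kernel and hardware):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec res, now;
        clock_getres(CLOCK_REALTIME, &res);
        clock_gettime(CLOCK_REALTIME, &now);
        printf("CLOCK_REALTIME resolution: %ld ns\n", res.tv_nsec);
        printf("CLOCK_REALTIME now: %lld.%09ld\n", (long long)now.tv_sec, now.tv_nsec);
        return 0;
    }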

Related

What options do I have for running recurring events on a microsecond resolution from a kernel driver?

I want to create a simulation of an actual device on an x86 Linux kernel. Part of this will involve simulating timings as closely as possible. Based on some research it seems I will need at least microsecond-resolution timing. I understand that on a non-realtime system it won't be possible to get perfect timing, but I don't need perfect, just as close as I can get, perhaps with some hacking around with thread scheduling / preemption options.
What I actually want to do is perform an action every interval, i.e. run some code every X µs. I've been trying to research the best ways to do this from a kernel driver, as well as whether it's possible to do it reasonably accurately from user mode (keeping the above paragraph in mind).
One of the first things that caught my eye was the HPET timer, which is programmable to generate interrupts based on programmable comparators. Unfortunately, it seems that on many chipsets it has been rather buggy in the past, and there's not much information on using it for anything other than obtaining a timestamp or using it as the main clock source. The Linux kernel provides an HPET driver that in the past seemed to provide both kernel and user mode interfaces, but seems only to provide a barely documented usermode interface in more recent kernel versions. I've also read about various other kernel functions and interfaces such as the hrtimer interface and the various delay functions, though I'm having a bit of trouble understanding them and whether they are suited to my purpose.
Given my current use case, what are the best options I have running recurring events at a µs resolution from say a kernel driver? Obviously accuracy is probably my biggest criteria, but ease of use would be second.
Well, it's possible to achieve your accuracy in userspace -- clock_nanosleep is one ideal option, which has relative and absolute modes. Since clock_nanosleep is based on hrtimer in kernel mode, you may want to use hrtimer if you'd like to implement it in kernel space.
However, to make the timer work accurately, there are two important things worth mentioning.
You should set the timer slack of your process (either by writing a nonzero value in nanoseconds to /proc/self/timerslack_ns or via prctl(PR_SET_TIMERSLACK, ...)). This value is treated as the 'tolerance' of the timer.
CPU power management also matters here. The CPU has many different C-states, each of which has a different exit latency, so you need to configure your cpuidle module not to use C-states other than C0; e.g. for an Intel CPU you can simply write 1 to /sys/devices/system/cpu/cpu$c/cpuidle/state$s/disable to disable state $s of CPU $c. Or just add idle=poll to your kernel options to keep the CPU active (in C0) while the kernel is idle. NOTE that this significantly increases the power consumption of the machine and makes the cooling fans noisy.
You can get a timer with delays under 10 microseconds if the two things mentioned above are configured correctly. There is a trade-off between latency and power consumption that you should make.
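As a minimal userspace sketch of the clock_nanosleep approach (the period and iteration count are arbitrary, and the timer slack / cpuidle configuration above still has to be done separately):

    #include <stdio.h>
    #include <time.h>
    #include <sys/prctl.h>

    int main(void)
    {
        const long period_ns = 100000;              /* 100 µs period, purely illustrative */

        /* Ask for 1 ns of timer slack (the default of 50 µs would swamp a µs timer). */
        prctl(PR_SET_TIMERSLACK, 1UL, 0UL, 0UL, 0UL);

        struct timespec next;
        clock_gettime(CLOCK_MONOTONIC, &next);
        for (int i = 0; i < 1000; i++) {
            next.tv_nsec += period_ns;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            /* Absolute deadlines, so scheduling jitter does not accumulate. */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            /* ... do the periodic work here ... */
        }
        return 0;
    }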

How is the microsecond time of linux gettimeofday() obtained and what is its accuracy?

Wall clock time is usually provided by the system's RTC. This mostly only provides times down to the millisecond range and typically has a granularity of 10-20 milliseconds. However, the resolution/granularity of gettimeofday() is often reported to be in the few-microseconds range. I assume the microsecond granularity must be taken from a different source.
How is the microsecond resolution/granularity of gettimeofday() accomplished?
When the part down to the millisecond is taken from the RTC and the microseconds are taken from different hardware, a problem with the phasing of the two sources arises. The two sources have to be synchronized somehow.
How is the synchronization/phasing between these two sources accomplished?
Edit: From what I've read in the links provided by amdn, particularly the following Intel link, I would add a question here:
Does gettimeofday() provide resolution/granularity in the microsecond regime at all?
Edit 2: Summarizing amdn's answer and some further reading:
Linux uses the real-time clock (RTC) only at boot time
to synchronize with a higher resolution counter, e.g. the time-stamp counter (TSC). After boot, gettimeofday() returns a time which is entirely based on the TSC value and the frequency of that counter. The initial value of the TSC frequency is corrected/calibrated by comparing the system time to an external time source. The adjustment is done/configured by the adjtimex() function. The kernel operates a phase-locked loop to ensure that the time results are monotonic and consistent.
This way it can be stated that gettimeofday() has microsecond resolution. Taking into account that more modern time-stamp counters run in the GHz regime, the obtainable resolution could be in the nanosecond regime. Therefore this meaningful comment
/**
 * do_gettimeofday - Returns the time of day in a timeval
 * @tv: pointer to the timeval to be set
 *
 * NOTE: Users should be converted to using getnstimeofday()
 */
can be found in Linux/kernel/time/timekeeping.c. This suggests that there will possibly
be an even higher resolution function available at a later point in time. Right now getnstimeofday() is only available in kernel space.
However, looking through all the code involved to get this right shows quite a few comments about uncertainties. It may be possible to obtain microsecond resolution. The function gettimeofday() may even show granularity in the microsecond regime. But: there are serious doubts about its accuracy, because the drift of the TSC frequency cannot be corrected for exactly. Also, the complexity of the code dealing with this matter inside Linux is a hint that it is in fact too difficult to get right. This is particularly, but not solely, caused by the huge number of hardware platforms Linux is supposed to run on.
Result: gettimeofday() returns monotonic time with microsecond granularity, but the time it provides is almost never in phase to within one microsecond with any other time source.
How is the microsecond resolution/granularity of gettimeofday() accomplished?
Linux runs on many different hardware platforms, so the specifics differ. On a modern x86 platform Linux uses the Time Stamp Counter, also known as the TSC, which is driven by a multiple of a crystal oscillator running at 133.33 MHz. The crystal oscillator provides a reference clock to the processor, and the processor multiplies it by some multiple - for example on a 2.93 GHz processor the multiple is 22. The TSC historically was an unreliable source of time because implementations would stop the counter when the processor went to sleep, or because the multiple wasn't constant as the processor shifted multipliers to change performance states or throttled down when it got hot. Modern x86 processors provide a TSC that is constant, invariant, and non-stop. On these processors the TSC is an excellent high-resolution clock and the Linux kernel determines an initial approximate frequency at boot time. The TSC provides microsecond resolution for the gettimeofday() system call and nanosecond resolution for the clock_gettime() system call.
How is this synchronization accomplished?
Your first question was about how the Linux clock provides high resolution; this second question is about synchronization. This is the distinction between precision and accuracy. Most systems have a clock that is backed up by a battery to keep the time of day when the system is powered down. As you might expect, this clock doesn't have high accuracy or precision, but it will get the time of day "in the ballpark" when the system starts. To get accuracy, most systems use an optional component to get the time from an external source on the network. Two common ones are
Network Time Protocol
Precision Time Protocol
These protocols define a master clock on the network (or a tier of clocks sourced by an atomic clock) and then measure network latencies to estimate offsets from the master clock. Once the offset from the master is determined the system clock is disciplined to keep it accurate. This can be done by
Stepping the clock (a relatively large, abrupt, and infrequent time adjustment), or
Slewing the clock (slightly increasing or decreasing the clock frequency over a given time period so that the offset is corrected gradually)
The kernel provides the adjtimex system call to allow clock disciplining. For details on how modern Intel multi-core processors keep the TSC synchronized between cores see CPU TSC fetch operation especially in multicore-multi-processor environment.
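As an aside, the kernel's current disciplining state can be read back by calling adjtimex() with modes set to 0; the fields are documented in adjtimex(2). A minimal sketch:

    #include <stdio.h>
    #include <sys/timex.h>

    int main(void)
    {
        struct timex tx = {0};                /* modes = 0 means read-only query */
        int state = adjtimex(&tx);            /* returns the clock state, e.g. TIME_OK */
        if (state == -1)
            return 1;
        printf("state=%d freq=%ld esterror=%ld us maxerror=%ld us\n",
               state, tx.freq, tx.esterror, tx.maxerror);
        return 0;
    }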
The relevant kernel source files for clock adjustments are kernel/time.c and kernel/time/timekeeping.c.
When Linux starts, it initializes the software clock using the hardware clock. See the chapter How Linux Keeps Track of Time in the Clock HOWTO.

Implementation of clock()

I was going through Stack Overflow threads on various mechanisms for computing the CPU time of a process.
How is clock() internally implemented? Does it use rdtsc()? (If that's the case, then it is sensitive to migration between cores.)
Also, how is getrusage() implemented? Does it also depend on the TSC?
Thanks in advance
The kernel keeps track of CPU utilization for processes in units of ticks.
Both clock() and getrusage() are based on these.
Ticks are accumulated by processes by the kernel using a sampling method in which the kernel receives a hardware interrupt for the clock and executes the clock handler, which adds the tick to the currently running process. At least, this is how it worked last time I looked.
So, rdtsc does not come into play at all - which is a good thing since rdtsc does not measure accurately across CPUs.
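To see the tick-based accounting in action, you can compare clock(), getrusage(), and the tick rate reported by sysconf(_SC_CLK_TCK); a minimal sketch (the busy loop is only there to accumulate some CPU time):

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Burn a little CPU so the counters have something to accumulate. */
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 100000000UL; i++)
            x += i;

        printf("kernel ticks per second (_SC_CLK_TCK): %ld\n", sysconf(_SC_CLK_TCK));
        printf("clock(): %.6f s of CPU time\n", (double)clock() / CLOCKS_PER_SEC);

        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        printf("getrusage() user time: %ld.%06ld s\n",
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        return 0;
    }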
You could easily glance at some libc code. Here is the time/ directory of musl-libc
In several libraries, some low-level timing syscalls use the vDSO to avoid paying the cost of a real syscall (from user space to the kernel and back), and so may use RDTSC under the hood.
But I am surprised that you ask. If it is curiosity, just study the source code of a free-software implementation. Otherwise, trust the specifications and the implementations.
The gory details can be complex, since they are implementation and system specific. The real implementation could be dynamically tuned at run time (e.g. through vDSO set-up in the kernel).

Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?

I want to get the CPU cycles at a specific point. I use this function at that point:
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    // broken for 64-bit builds; don't copy this code
    return x;
}
(editor's note: "=A" is wrong for x86-64; it picks either RDX or RAX. Only in 32-bit mode will it pick the EDX:EAX output you want. See How to get the CPU cycle count in x86_64 from C++?.)
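(A 64-bit-safe sketch, not from the original post: on GCC/Clang you can use the __rdtsc() intrinsic from <x86intrin.h>, or inline asm that combines EDX:EAX explicitly; rdtsc64 and rdtsc64_asm are just illustrative names.)

    #include <stdint.h>
    #include <x86intrin.h>

    /* Full 64-bit TSC via the compiler intrinsic (GCC/Clang). */
    static inline uint64_t rdtsc64(void)
    {
        return __rdtsc();
    }

    /* Equivalent inline asm that is correct in both 32-bit and 64-bit builds. */
    static inline uint64_t rdtsc64_asm(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }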
The problem is that it always returns an increasing number (in every run). It's as if it is referring to absolute time.
Am I using the functions incorrectly?
As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after 292 years, so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.
The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU, and no guarantee that it starts at the same point in time on all CPUs on an elderly multi-CPU board.
Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.
(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)
I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it doesn't crash and returns an ever-increasing number, it seems so), your code is good.
If you want to measure the number of ticks a piece of code takes, you want a tick difference: just subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
Note that if very accurate measurements isolated from surrounding code are necessary, you need to serialize, that is stall the pipeline, prior to calling rdtsc (or use rdtscp, which is only supported on newer processors). The one serializing instruction that can be used at every privilege level is cpuid.
In reply to the further question in the comment:
The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).
Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore if you execute the instruction returning that counter now and any time later in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.
Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running after each other will get very similar or identical numbers, not ever growing ones.
clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on some older systems it is implemented with it). It returns a value that always increases. Nowadays, this is typically an HPET. However, this is really time, and not ticks. If your computer goes into a low-power state (e.g. running at 1/2 the normal frequency), it will still advance at the same pace.
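A small sketch that prints the two clocks side by side makes the difference visible: CLOCK_PROCESS_CPUTIME_ID starts near zero for each new process, while CLOCK_MONOTONIC_RAW reflects time since some fixed point in the past (this assumes a reasonably recent glibc that exposes CLOCK_MONOTONIC_RAW):

    #include <stdio.h>
    #include <time.h>

    static void show(const char *name, clockid_t id)
    {
        struct timespec ts;
        clock_gettime(id, &ts);
        printf("%-26s %lld.%09ld\n", name, (long long)ts.tv_sec, ts.tv_nsec);
    }

    int main(void)
    {
        show("CLOCK_PROCESS_CPUTIME_ID", CLOCK_PROCESS_CPUTIME_ID);  /* CPU time of this process, starts near 0 */
        show("CLOCK_MONOTONIC_RAW", CLOCK_MONOTONIC_RAW);            /* time since some fixed point in the past */
        return 0;
    }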
There's lots of confusing and/or wrong information about the TSC out there, so I thought I'd try to clear some of it up.
When Intel first introduced the TSC (in original Pentium CPUs) it was clearly documented to count cycles (and not time). However, back then CPUs mostly ran at a fixed frequency, so some people ignored the documented behaviour and used it to measure time instead (most notably, Linux kernel developers). Their code broke in later CPUs that don't run at a fixed frequency (due to power management, etc). Around that time other CPU manufacturers (AMD, Cyrix, Transmeta, etc) were confused and some implemented TSC to measure cycles and some implemented it so it measured time, and some made it configurable (via. an MSR).
Then "multi-chip" systems became more common for servers; and even later multi-core was introduced. This led to minor differences between TSC values on different cores (due to different startup times); but more importantly it also led to major differences between TSC values on different CPUs caused by CPUs running at different speeds (due to power management and/or other factors).
People that were trying to use it wrong from the start (people who used it to measure time and not cycles) complained a lot, and eventually convinced CPU manufacturers to standardise on making the TSC measure time and not cycles.
Of course this was a mess - e.g. it takes a lot of code just to determine what the TSC actually measures if you support all 80x86 CPUs; and different power management technologies (including things like SpeedStep, but also things like sleep states) may affect the TSC in different ways on different CPUs; so AMD introduced a "TSC invariant" flag in CPUID to tell the OS that the TSC can be used to measure time correctly.
All recent Intel and AMD CPUs have been like this for a while now - the TSC counts time and doesn't measure cycles at all. This means if you want to measure cycles you have to use (model-specific) performance monitoring counters. Unfortunately the performance monitoring counters are an even worse mess (due to their model-specific nature and convoluted configuration).
Good answers already, and Damon already mentioned this in a way in his answer, but I'll add this from the actual x86 manual (volume 2, 4-301) entry for RDTSC:
Loads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.)
The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior.

What is the lowest time resolution for somewhat accurate measurements of cpu usage?

Some of the things I want to measure are very short, and I can only repeat them so many times if I don't run any of the setup/dispose code in the middle.
Note: on Linux, reading /proc/stat
Not very portable and you'll have to take great care so it is reliable, but the Time Stamp Counter definitely has the highest resolution available (it increases on every CPU tick).
The time stamp counter has, until recently, been an excellent high-resolution, low-overhead way of getting CPU timing information. With the advent of multi-core/hyperthreaded CPUs, systems with multiple CPUs, and "hibernating" operating systems, the TSC cannot be relied on to provide accurate results - unless great care is taken to correct the possible flaws: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. In such cases, programmers can only get reliable results by locking their code to a single CPU. Even then, the CPU speed may change due to power-saving measures taken by the OS or BIOS, or the system may be hibernated and later resumed (resetting the time stamp counter). In those latter cases, to stay relevant, the counter must be recalibrated periodically (according to the time resolution your application requires).
There are some notes there about Linux-specific solutions on the page, too:
Under Linux, similar functionality is provided by reading the value of the CLOCK_MONOTONIC clock using the POSIX clock_gettime function.
