CLOCK_MONOTONIC_RAW alternative in Linux versions older than 2.6.28

clock_monotonic_raw is only supported as of Linux 2.6.28.
Is there another way I can get a monotonic time which isn't subject to NTP adjustments or the incremental adjustments performed by adjtime?
I can't use CLOCK_MONOTONIC since it's affected by NTP and adjtime.

Take a closer look at CLOCK_MONOTONIC instead of just CLOCK_MONOTONIC_RAW. I wanted to use CLOCK_MONOTONIC_RAW with condition waits, but found that it was not supported (Fedora 25/Linux 4.10.17).
The situation is vaguely infuriating, but the upshot on Linux, to the best of my current understanding, is:
CLOCK_MONOTONIC_RAW is the closest you are going to get to a square-wave accumulator running at constant frequency. However, it is well suited to fewer uses than you might think.
CLOCK_MONOTONIC is based on CLOCK_MONOTONIC_RAW with some gradual frequency corrections applied that allow it to eventually overtake or fall behind some other clock reference. NTP and adjtime can both make these corrections. However, to avoid doing things like breaking the hell out of software builds, the clock is still guaranteed to monotonically advance:
"The adjustment that adjtime() makes to the clock is carried out in such a manner that the clock is always monotonically increasing."
--adjtime man page
"You lying bastard." --me
Yup-- that was the plan, but bugs exist in kernel versions before 2.6.32.19; see discussion here: https://stackoverflow.com/a/3657433/3005946 which includes a link to a patch, if that affects you. It's hard for me to tell what the maximum error from that bug is (and I'd really, really like to know).
Even in the 4.x kernel, most POSIX synchronization objects don't seem to support CLOCK_MONOTONIC_RAW or CLOCK_MONOTONIC_COARSE. I have found this out the hard way. ALWAYS error-check your *_setclock calls.
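For example, here is a minimal sketch of that error check (exact behaviour depends on your glibc and kernel; compile with -pthread):
#define _GNU_SOURCE                  /* for CLOCK_MONOTONIC_RAW on some libcs */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    pthread_condattr_t attr;
    pthread_condattr_init(&attr);

    int rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC_RAW);
    if (rc != 0) {                   /* typically EINVAL */
        fprintf(stderr, "CLOCK_MONOTONIC_RAW rejected (%d); using CLOCK_MONOTONIC\n", rc);
        rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    }
    printf("setclock result: %d\n", rc);
    pthread_condattr_destroy(&attr);
    return 0;
}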
POSIX semaphores (sem_t) don't support ANY monotonic clocks at all, which is infuriating. If you need this, you will have to roll your own using condition-waits. (As a bonus, doing this will allow you to have semaphores with negative initial levels, which can be handy.)
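A rough sketch of that roll-your-own approach follows; struct csem and the function names are made up for illustration, error handling is trimmed, and the wait uses a CLOCK_MONOTONIC timeout (compile with -pthread):
#include <pthread.h>
#include <time.h>

struct csem {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    long            count;                       /* may start below zero */
};

static void csem_init(struct csem *s, long initial)
{
    pthread_condattr_t attr;
    pthread_mutex_init(&s->lock, NULL);
    pthread_condattr_init(&attr);
    pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);  /* check this in real code */
    pthread_cond_init(&s->cond, &attr);
    pthread_condattr_destroy(&attr);
    s->count = initial;
}

static void csem_post(struct csem *s)
{
    pthread_mutex_lock(&s->lock);
    if (++s->count > 0)
        pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/* Wait with a bailout measured on CLOCK_MONOTONIC; returns 0 or ETIMEDOUT. */
static int csem_timedwait(struct csem *s, time_t bailout_sec)
{
    struct timespec dl;
    int rc = 0;

    clock_gettime(CLOCK_MONOTONIC, &dl);
    dl.tv_sec += bailout_sec;

    pthread_mutex_lock(&s->lock);
    while (s->count <= 0 && rc == 0)
        rc = pthread_cond_timedwait(&s->cond, &s->lock, &dl);
    if (rc == 0)
        s->count--;
    pthread_mutex_unlock(&s->lock);
    return rc;
}

int main(void)
{
    struct csem s;
    csem_init(&s, -1);              /* negative initial level, as mentioned above */
    csem_post(&s);
    csem_post(&s);                  /* count is now 1                             */
    return csem_timedwait(&s, 3);   /* three-second bailout -> returns 0 here     */
}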
If you're just trying to keep a synchronization object's wait-function from deadlocking forever, and you've got something like a three-second bailout, you can just use CLOCK_MONOTONIC and call it a day -- the adjustments CLOCK_MONOTONIC is subject to are, effectively, jitter correction, well below your accuracy requirements. Even in the buggy implementations, CLOCK_MONOTONIC is not going to jump backwards an hour or something like that. Again, what happens instead is that things like adjtime tweak the frequency of the clock, so that it gradually overtakes, or falls behind, some other, parallel-running clock.
CLOCK_REALTIME is in fact CLOCK_MONOTONIC with some other correction factors applied. Or the other way around. Whatever, it's effectively the same thing. The important part is that, if there is any chance your application will change time zones (moving vehicle, aircraft, ship, cruise missile, rollerblading cyborg) or encounter administrative clock adjustments, you should definitely NOT use CLOCK_REALTIME, or anything that requires it. Same thing if the server uses daylight saving time instead of UTC. However, a stationary server using UTC may be able to get away with CLOCK_REALTIME for coarse deadlock-avoidance purposes, if necessary. Avoid this unless you are on a pre-2.6 kernel and have no choice, though.
CLOCK_MONOTONIC_RAW is NOT something you want to use for timestamping. It has not been jitter-corrected yet, etc. It may be appropriate for DACs and ADCs, etc., but is not what you want to use for logging events on human-discernible time scales. We have NTP for a reason.
Hope this is helpful, and I can certainly understand the frustration.

Related

Is there a standard constant *nix benchmark, and if not, how to make a `bogobench`?

Ordinary single-threaded *nix programs can be benchmarked with utils like time, e.g.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but the numbers returned are always system-dependent, so the result in a sense also measures the user's hardware.
Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system, (or at least a reasonably large subset of systems)? Just like grep -m1 bogo /proc/cpuinfo returns a roughly approximate but stable unit, such a benchmark should also return a somewhat similar unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPs):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image bootimage.iso, within a directory where bogo.sh would autorun and then promptly shut down the emulator, outputting some form of timing data along the way to parse into bogo-seconds.
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
Online we sometimes compare and contrast timing-based benchmarks made on machine X with one made on machine Y. Whereas I'd like both user X and Y to be able to do their benchmark on a virtual machine Z, with bonus points for emulating X or Y (like MAME) if need be, except with no consideration of X or Y's real run-time, (unlike MAME where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware, a time based benchmark can be unreliable, as often the user can't be sure some background process, (or bug, or hardware error like a bad sector, or virus), might not be degrading some aspect of performance. Whereas a more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because the hardware is extremely complex. Despite a lot being known about x86 microarchitecture internals (Agner Fog's material, Intel's and AMD's own optimization guides, and other material in the x86 tag wiki), enough of the behaviour is still a black box of CPU-design trade secrets that, at best, you can simulate something similar. (Branch prediction is definitely one of the most secret but highly important parts.)
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.
For [fastest-code] challenges over on the Programming puzzles and Code Golf site, #orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with stuff like print to stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code is taking the same code paths every time it is run, the instructions retired count should be the same on all hardware [1].
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
To patch the holes then you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache-misses in a hardware invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache miss and branch miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
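To make that formula concrete, here is a hedged sketch in C. The instruction and miss counts would come from perf stat or cachegrind output; every constant (the 1 GHz clock, the 1.0 CPI, and both miss penalties) is a nominal, made-up parameter, not a measurement:
#include <stdio.h>

/* All constants below are nominal, illustrative values, not measurements. */
static double bogoseconds(double instructions, double llc_misses, double branch_misses)
{
    const double nominal_hz  = 1e9;   /* pretend 1 GHz reference clock     */
    const double nominal_cpi = 1.0;   /* pretend 1.0 cycles per instruction */
    const double llc_penalty = 200.0; /* pretend cycles per LLC miss        */
    const double br_penalty  = 15.0;  /* pretend cycles per branch miss     */

    double cycles = instructions * nominal_cpi
                  + llc_misses    * llc_penalty
                  + branch_misses * br_penalty;
    return cycles / nominal_hz;
}

int main(void)
{
    /* counts would come from `perf stat -e instructions` or cachegrind */
    printf("%.2f bogo-seconds\n", bogoseconds(15e9, 0.0, 0.0)); /* prints 15.00 */
    return 0;
}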
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart any of the existing x86 [2] emulators and add your simple machine model right in there. You don't need to be cycle-accurate, just pick a nominal width and model it. Probably whatever underlying emulation cachegrind is doing might be a good match, and you get the cache and branch prediction modeling already for free.
[1] Of course, this doesn't rule out bugs or inaccuracies in the instruction counting mechanism.
[2] You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.

Is clock_nanosleep affected by adjtime and NTP?

Usually CLOCK_MONOTONIC_RAW is used for obtaining a clock that is not affected by NTP or adjtime(). However clock_nanosleep() doesn't support CLOCK_MONOTONIC_RAW and trying to use it anyway will result in return code 95 Operation not supported (Kernel 4.6.0).
Does clock_nanosleep() somehow take these clock adjustments into account or will the sleep time be affected by it?
What are the alternatives if a sleeping time is required which should not be affected by clock adjustments?
CLOCK_MONOTONIC_RAW never had support for clock_nanosleep() since it was introduced in Linux 2.6.28. It was also explicitly fixed to not have this support in 2.6.32 because of oopses. The code had been refactored several times after that, but still there is no support for CLOCK_MONOTONIC_RAW in clock_nanosleep() and I wasn't able to find any comments on why is that.
At the very minimum, the fact that there was a patch that explicitly disabled this functionality and it passed all reviews tells us that it doesn't look like a big problem for kernel developers. So, at the moment (4.7) the only things CLOCK_MONOTONIC_RAW supports are clock_getres() and clock_gettime().
Speaking of adjustments, as already noted by Rich CLOCK_MONOTONIC is subject to rate adjustments just by the nature of this clock. This happens because hrtimer_interrupt() runs its queues with adjusted monotonic time value (ktime_get_update_offsets_now()->timekeeping_get_ns()->timekeeping_delta_to_ns() and that operates with xtime_nsec which is subject to adjustment). Actually, looking at this code I'm probably no longer surprised that CLOCK_MONOTONIC_RAW has no support for clock_nanosleep() (and probably won't have it in future) — adjusted monotonic clock usage seems to be the basis for hrtimers.
As for alternatives, I think there are none. nanosleep() uses the same CLOCK_MONOTONIC, setitimer() has its own set of timers, alarm() uses ITIMER_REAL (same as setitimer()), that (with some indirection) is also our good old friend CLOCK_MONOTONIC. What else do we have? I guess nothing.
As an unrelated side note, there is an interesting observation in that if you call clock_nanosleep() for relative interval (that is not TIMER_ABSTIME) then CLOCK_REALTIME actually becomes a synonym for CLOCK_MONOTONIC.
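For completeness, here is a minimal sketch of an absolute sleep on CLOCK_MONOTONIC, which is as close to an adjustment-free clock_nanosleep() as the kernel currently offers. Note that clock_nanosleep() returns the error number directly rather than setting errno (link with -lrt on older glibc):
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += 2;            /* sleep until "now + 2 s" on the monotonic clock */

    int rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    if (rc != 0)                     /* CLOCK_MONOTONIC_RAW here would give EOPNOTSUPP (95) */
        fprintf(stderr, "clock_nanosleep: %s\n", strerror(rc));
    return 0;
}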

Unique time stamp in Linux

Is there a clock in Linux with nanosecond precision that is strictly increasing and maintained through a power cycle? I am attempting to store time series data in a database where each row has a unique time stamp. Do I need to use an external time source such as a GPS receiver to do this? I would like the time stamp to be in or convertible to GPS time.
This is not a duplicate of How to create a high resolution timer in Linux to measure program performance?. I am attempting to store absolute times, not calculate relative time differences. The clock must persist over a power cycle.
Most computers now have software that periodically corrects the system time using the internet. This means that the system clock can go up or down some milliseconds every so often. Remember that the computer clock also drifts on its own. If you don't want problems with leap seconds, use a timescale without leap-second corrections, such as GPS time or TAI. NTP will be off in the microsecond or millisecond range because of differences in latency over the internet. Clocks that would actually meet your requirements cost fifty thousand dollars and up.
Based on the question, the other answers, and discussion in comments...
You can get "nanosecond precision that is strictly increasing and maintained through a power cycle" by combining the results of clock_gettime() using the CLOCK_REALTIME and from using CLOCK_MONOTONIC - with some caveats.
First, turn off NTP. Run NTP once at each system restart to sync your time with the world, but do not update the time while the system is up. This will avoid rolling the time backwards. Without doing this you could get a sequence such as
20160617173556 1001
20160617173556 1009
20160617173556 1013
20160617173555 1020 (notice the second went backward)
20160617173556 1024
(For this example, I'm just using YYYYMMDDhhmmss followed by some fictional monotonic clock value.)
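A minimal sketch of sampling both clocks back to back with clock_gettime(), which is the raw material for the scheme above (link with -lrt on older glibc):
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec rt, mono;
    clock_gettime(CLOCK_REALTIME,  &rt);        /* world time, can be stepped */
    clock_gettime(CLOCK_MONOTONIC, &mono);      /* never goes backwards       */
    printf("realtime  %lld.%09ld\nmonotonic %lld.%09ld\n",
           (long long)rt.tv_sec,   rt.tv_nsec,
           (long long)mono.tv_sec, mono.tv_nsec);
    return 0;
}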
Then you face business decisions.
• How important is matching the world's time, compared to strictly increasing uniqueness? (Hardware clock drift could throw off accuracy with world time.)
• Given that decision, is it worth the investment in specialized hardware, rather than a standard (or even high-end) PC?
• If two events actually happen during the same nanosecond, is it acceptable to have duplicate entries?
• etc.
There are many tradeoffs possible based on the true requirements that lead to developing this application.
In no particular order:
To be sure your time is free of daylight saving time changes, use date's -u argument to use UTC time. That is always increasing, barring time corrections from system admins.
The trouble with %N is that the precision you actually get depends on the hardware and can be much less than %N allows. Run a few experiments to find out. Warnings on this tend to be everywhere but it is still overlooked.
If you are writing C-ish code, see the time()/clock_gettime() functions, and use gmtime(), not localtime()-type functions, to convert to text. Look at the strftime() function to format the integer part of the time. You will find that the strftime() format fields magically match those of the date command formats because, basically, date calls strftime(). The truly paranoid, willing to write additional code, can use CLOCK_MONOTONIC to be sure their time is increasing.
If you truly require increasing times you may need to write your own command or function that remembers the last time. If called during the same time, add an offset of 1. Keep incrementing the offset as often as required to assure unique times until you get a hardware time greater than your adjusted time.
Linux tends to favor NTP to obtain network time. The previously mentioned function to assure increasing time will also help with backwards jumps, as the jumps are usually not large.
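Here is a rough, single-threaded sketch of that "remember the last value" idea; unique_ns() is a hypothetical name, and a real implementation would need locking if called from several threads:
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t unique_ns(void)
{
    static uint64_t last;                    /* last timestamp handed out */
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);

    uint64_t now = (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    if (now <= last)
        now = last + 1;                      /* force a strict increase   */
    last = now;
    return now;
}

int main(void)
{
    printf("%llu\n%llu\n",
           (unsigned long long)unique_ns(),
           (unsigned long long)unique_ns());
    return 0;
}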
If nanosecond precision is really sufficient for you:
date +%s%N

How can I prove __udelay() is working correctly on my ARM embedded system?

We have an ARM9 using the 3.2 kernel -- everything seems to work fine. Recently I was asked to add some code to add a 50ms pulse on some GPIO lines at startup. The pulse code is fine; I can see the lines go down and up, as expected. What does not work the way I expected is the udelay() function. Reading the docs makes me think the units are in microseconds, but as measured in the logic analyzer it was way too short. So I finally added this code to get 50ms.
// wait 50ms to be sure PCIE reset takes
for (i = 0; i < 6100; i++)   // measured on logic analyzer - seems wrong to me!!
{
    __udelay(2000);          // 2000 is max
}
I don't like it, but it works fine. There are some odd constants and instructions in the udelay code. Can someone enlighten me as to how this is supposed to work? This code is called after all the clocks are initialized, so everything else seems ok.
According to Linus in this thread:
If it's about 1% off, it's all fine. If somebody picked a delay value
that is so sensitive to small errors in the delay that they notice
that - or even notice something like 5% - then they have picked too
short of a delay.
udelay() was never really meant to be some kind of precision
instrument. Especially with CPU's running at different frequencies,
we've historically had some rather wild fluctuation. The traditional
busy loop ends up being affected not just by interrupts, but also by
things like cache alignment (we used to inline it), and then later the
TSC-based one obviously depended on TSC's being stable (which they
weren't for a while).
So historically, we've seen udelay() being really off (ie 50% off
etc), I wouldn't worry about things in the 1% range.
Linus
So it's not going to be perfect. It's going to be off. By how much is dependent on a lot of factors. Instead of using a for loop, consider using mdelay instead. It might be a bit more accurate. From the O'Reilly Linux Device Drivers book:
The udelay call should be called only for short time lapses because
the precision of loops_per_second is only eight bits, and noticeable
errors accumulate when calculating long delays. Even though the
maximum allowable delay is nearly one second (since calculations
overflow for longer delays), the suggested maximum value for udelay is
1000 microseconds (one millisecond). The function mdelay helps in
cases where the delay must be longer than one millisecond.
It's also important to remember that udelay is a busy-waiting function
(and thus mdelay is too); other tasks can't be run during the time
lapse. You must therefore be very careful, especially with mdelay, and
avoid using it unless there's no other way to meet your goal.
Currently, support for delays longer than a few microseconds and
shorter than a timer tick is very inefficient. This is not usually an
issue, because delays need to be just long enough to be noticed by
humans or by the hardware. One hundredth of a second is a suitable
precision for human-related time intervals, while one millisecond is a
long enough delay for hardware activities.
Specifically the line "the suggested maximum value for udelay is 1000 microseconds (one millisecond)" sticks out at me since you state that 2000 is the max. From this document on inserting delays:
mdelay is macro wrapper around udelay, to account for possible
overflow when passing large arguments to udelay
So it's possible you're running into an overflow error. Though I wouldn't normally consider 2000 to be a "large argument".
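If mdelay() does behave better on this board, the pulse could be written roughly as below. This is only a sketch: pulse_pcie_reset() is a made-up helper, gpio_set_value() stands in for whatever the driver actually toggles, and mdelay() still ultimately rests on the same calibrated delay loop as __udelay():
#include <linux/delay.h>
#include <linux/gpio.h>

/* hypothetical helper -- kernel code, not userspace */
static void pulse_pcie_reset(unsigned int gpio)
{
    gpio_set_value(gpio, 0);   /* drive the reset line low           */
    mdelay(50);                /* one call for the whole 50 ms delay */
    /* msleep(50) would be kinder to the CPU if sleeping is allowed  */
    gpio_set_value(gpio, 1);   /* release reset                      */
}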
But if you need real accuracy in your timing, you'll need to deal with the offset like you have, roll your own or use a different kernel. For information on how to roll your own delay function using assembler or using hard real time kernels, see this article on High-resolution timing.
See also: Linux Kernel: udelay() returns too early?

Microsecond accurate (or better) process timing in Linux

I need a very accurate way to time parts of my program. I could use the regular high-resolution clock for this, but that will return wall-clock time, which is not what I need: I need the time spent running only my process.
I distinctly remember seeing a Linux kernel patch that would allow me to time my processes to nanosecond accuracy, except I forgot to bookmark it and I forgot the name of the patch as well :(.
I remember how it works though:
On every context switch, it will read out the value of a high-resolution clock, and add the delta of the last two values to the process time of the running process. This produces a high-resolution accurate view of the process' actual process time.
The regular process time is kept using the regular clock, which is I believe millisecond accurate (1000Hz), which is much too large for my purposes.
Does anyone know what kernel patch I'm talking about? I also remember it was like a word with a letter before or after it -- something like 'rtimer' or something, but I don't remember exactly.
(Other suggestions are welcome too)
The Completely Fair Scheduler suggested by Marko is not what I was looking for, but it looks promising. The problem I have with it is that the calls I can use to get process time are still not returning values that are granular enough.
times() is returning values 21, 22, in milliseconds.
clock() is returning values 21000, 22000, same granularity.
getrusage() is returning values like 210002, 22001 (and somesuch), they look to have a bit better accuracy but the values look conspicuously the same.
So now the problem I'm probably having is that the kernel has the information I need, I just don't know the system call that will return it.
If you are looking for this level of timing resolution, you are probably trying to do some micro-optimization. If that's the case, you should look at PAPI. Not only does it provide both wall-clock and virtual (process only) timing information, it also provides access to CPU event counters, which can be indispensable when you are trying to improve performance.
http://icl.cs.utk.edu/papi/
See this question for some more info.
Something I've used for such things is gettimeofday(). It provides a structure with seconds and microseconds. Call it before the code, and again after. Then just subtract the two structs using timersub, and you get the elapsed time in seconds from the tv_sec field and microseconds from the tv_usec field.
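A minimal sketch of that pattern (timersub() is a BSD-style macro from <sys/time.h>):
#define _DEFAULT_SOURCE              /* for the timersub() macro */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval start, end, elapsed;

    gettimeofday(&start, NULL);
    /* ... code under test goes here ... */
    gettimeofday(&end, NULL);

    timersub(&end, &start, &elapsed);        /* elapsed = end - start */
    printf("%ld.%06ld seconds\n",
           (long)elapsed.tv_sec, (long)elapsed.tv_usec);
    return 0;
}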
If you need very small time units for (I assume) testing the speed of your software, I would recommend just running the parts you want to time in a loop millions of times, taking the time before and after the loop, and calculating the average. A nice side-effect of doing this (apart from not needing to figure out how to use nanoseconds) is that you get more consistent results, because the random overhead caused by the OS scheduler is averaged out.
Of course, unless your program needs to be able to run millions of times in a second, it's probably fast enough if its running time is too short for you to even measure a millisecond.
I believe CFS (the Completely Fair Scheduler) is what you're looking for.
You can use the High Precision Event Timer (HPET) if you have a fairly recent 2.6 kernel. Check out Documentation/hpet.txt on how to use it. This solution is platform dependent though and I believe it is only available on newer x86 systems. HPET has at least a 10MHz timer so it should fit your requirements easily.
I believe several PowerPC implementations from Freescale support a cycle exact instruction counter as well. I used this a number of years ago to profile highly optimized code but I can't remember what it is called. I believe Freescale has a kernel patch you have to apply in order to access it from user space.
http://allmybrain.com/2008/06/10/timing-cc-code-on-linux/
might be of help to you (directly if you are doing it in C/C++, but I hope it will give you pointers even if you're not)... It claims to provide microsecond accuracy, which just passes your criterion. :)
I think I found the kernel patch I was looking for. Posting it here so I don't forget the link:
http://user.it.uu.se/~mikpe/linux/perfctr/
http://sourceforge.net/projects/perfctr/
Edit: It works for my purposes, though not very user-friendly.
Try the CPU's timestamp counter? Wikipedia seems to suggest using clock_gettime().
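Building on that hint, here is a minimal sketch using clock_gettime() with CLOCK_PROCESS_CPUTIME_ID, which reports the CPU time consumed by the calling process only, at nanosecond resolution; the loop is just an arbitrary workload to measure (link with -lrt on older glibc):
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);

    volatile double x = 0;                   /* some work to measure */
    for (long i = 0; i < 10000000; i++)
        x += i * 0.5;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    double ns = (double)(end.tv_sec - start.tv_sec) * 1e9
              + (double)(end.tv_nsec - start.tv_nsec);
    printf("process CPU time: %.0f ns\n", ns);
    return 0;
}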
