Microsecond accurate (or better) process timing in Linux

I need a very accurate way to time parts of my program. I could use the regular high-resolution clock for this, but that will return wallclock time, which is not what I need: I need the time spent running only my process.
I distinctly remember seeing a Linux kernel patch that would allow me to time my processes to nanosecond accuracy, except I forgot to bookmark it and I forgot the name of the patch as well :(.
I remember how it works though:
On every context switch, it will read out the value of a high-resolution clock, and add the delta of the last two values to the process time of the running process. This produces a high-resolution accurate view of the process' actual process time.
The regular process time is kept using the regular clock, which is I believe millisecond accurate (1000Hz), which is much too large for my purposes.
Does anyone know what kernel patch I'm talking about? I also remember it was like a word with a letter before or after it -- something like 'rtimer' or something, but I don't remember exactly.
(Other suggestions are welcome too)
The Completely Fair Scheduler suggested by Marko is not what I was looking for, but it looks promising. The problem I have with it is that the calls I can use to get process time are still not returning values that are granular enough.
times() is returning values 21, 22, in milliseconds.
clock() is returning values 21000, 22000, same granularity.
getrusage() is returning values like 210002, 22001 (and somesuch), they look to have a bit better accuracy but the values look conspicuously the same.
So now the problem I'm probably having is that the kernel has the information I need, I just don't know the system call that will return it.

If you are looking for this level of timing resolution, you are probably trying to do some micro-optimization. If that's the case, you should look at PAPI. Not only does it provide both wall-clock and virtual (process only) timing information, it also provides access to CPU event counters, which can be indispensable when you are trying to improve performance.
http://icl.cs.utk.edu/papi/

See this question for some more info.
Something I've used for such things is gettimeofday(). It provides a structure with seconds and microseconds. Call it before the code, and again after. Then subtract the two structs using timersub; the elapsed time is in the result's tv_sec (whole seconds) and tv_usec (remaining microseconds) fields.

If you need very small time units for (I assume) testing the speed of your software, I would recommend just running the parts you want to time in a loop millions of times, taking the time before and after the loop, and calculating the average. A nice side effect of doing this (apart from not needing to figure out how to use nanoseconds) is that you get more consistent results, because the random overhead caused by the OS scheduler is averaged out.
Of course, unless your program needs to run millions of times per second, it's probably fast enough if its running time is too short to measure in milliseconds.
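The loop-and-average idea above might look roughly like this. It is a sketch that assumes clock() is an acceptable CPU-time source; avg_cpu_ns_per_call and tiny_op are hypothetical names, and the loop overhead itself is included in the average.

```c
#include <time.h>

/* Average CPU time per call of fn(), in nanoseconds.
 * iterations must be greater than zero. */
static double avg_cpu_ns_per_call(void (*fn)(void), unsigned long iterations)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        fn();
    clock_t end = clock();

    double total_ns = (double)(end - start) * 1e9 / CLOCKS_PER_SEC;
    return total_ns / (double)iterations;
}

/* Hypothetical function under test; volatile keeps the compiler from
 * optimizing the work away. */
static volatile unsigned long sink;
static void tiny_op(void) { sink += 1; }
```

For example, avg_cpu_ns_per_call(tiny_op, 10000000UL) times ten million calls and divides the total by the call count, so the coarse granularity of clock() is spread over all iterations.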

I believe CFS (Completely Fair Scheduler) is what you're looking for.

You can use the High Precision Event Timer (HPET) if you have a fairly recent 2.6 kernel. Check out Documentation/hpet.txt on how to use it. This solution is platform dependent though and I believe it is only available on newer x86 systems. HPET has at least a 10MHz timer so it should fit your requirements easily.
I believe several PowerPC implementations from Freescale support a cycle exact instruction counter as well. I used this a number of years ago to profile highly optimized code but I can't remember what it is called. I believe Freescale has a kernel patch you have to apply in order to access it from user space.

http://allmybrain.com/2008/06/10/timing-cc-code-on-linux/
might be of help to you (directly if you are doing it in C/C++, but I hope it will give you pointers even if you're not)... It claims to provide microsecond accuracy, which just passes your criterion. :)

I think I found the kernel patch I was looking for. Posting it here so I don't forget the link:
http://user.it.uu.se/~mikpe/linux/perfctr/
http://sourceforge.net/projects/perfctr/
Edit: It works for my purposes, though not very user-friendly.

Try the CPU's timestamp counter? Wikipedia seems to suggest using clock_gettime().
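A minimal sketch of the clock_gettime() suggestion, assuming the POSIX clock ID CLOCK_PROCESS_CPUTIME_ID (CPU time consumed by the calling process) is the one that fits the question. The helper names are hypothetical. The resolution actually delivered depends on the kernel and hardware, so the second helper asks the system what it reports.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* CPU time consumed so far by the calling process, in nanoseconds.
 * Returns -1 if the clock is not available. */
static long long process_cpu_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Resolution reported for that clock, in nanoseconds; -1 on error. */
static long long process_cpu_resolution_ns(void)
{
    struct timespec res;
    if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

Calling process_cpu_ns() before and after the code of interest and subtracting the two values gives the CPU time charged to the process in between. On older glibc versions these functions may require linking with -lrt.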

Related

CLOCK_MONOTONIC_RAW alternative in Linux versions older than 2.6.28

CLOCK_MONOTONIC_RAW is only supported as of Linux 2.6.28.
Is there another way I can get a monotonic time which isn't subject to NTP adjustments or the incremental adjustments performed by adjtime?
I can't use CLOCK_MONOTONIC since it's affected by NTP and adjtime.
Take a closer look at CLOCK_MONOTONIC instead of just CLOCK_MONOTONIC_RAW. I wanted to use CLOCK_MONOTONIC_RAW with condition waits, but found that it was not supported (Fedora 25/Linux 4.10.17).
The situation is vaguely infuriating, but the upshot on Linux, to the best of my current understanding, is:
CLOCK_MONOTONIC_RAW is the closest you are going to get to a square-wave accumulator running at constant frequency. However, this is well-suited to less than you might think.
CLOCK_MONOTONIC is based on CLOCK_MONOTONIC_RAW with some gradual frequency corrections applied that allow it to eventually overtake or fall behind some other clock reference. NTP and adjtime can both make these corrections. However, to avoid doing things like breaking the hell out of software builds, the clock is still guaranteed to monotonically advance:
"The adjustment that adjtime() makes to the clock is carried out in such a manner that the clock is always monotonically increasing."
--adjtime man page
"You lying bastard." --me
Yup-- that was the plan, but bugs exist in kernel versions before 2.6.32.19; see discussion here: https://stackoverflow.com/a/3657433/3005946 which includes a link to a patch, if that affects you. It's hard for me to tell what the maximum error from that bug is (and I'd really, really like to know).
Even in the 4.x kernel, most POSIX synchronization objects don't seem to support CLOCK_MONOTONIC_RAW or CLOCK_MONOTONIC_COARSE. I have found this out the hard way. ALWAYS error-check your *_setclock calls.
POSIX semaphores (sem_t) don't support ANY monotonic clocks at all, which is infuriating. If you need this, you will have to roll your own using condition-waits. (As a bonus, doing this will allow you to have semaphores with negative initial levels, which can be handy.)
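A rough sketch of the "roll your own" semaphore described above, built on a condition variable whose clock is set to CLOCK_MONOTONIC, with the setclock call error-checked as recommended. The mono_sem type and its function names are hypothetical and not part of any library; destruction and most error paths are left out to keep the example short.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical counting semaphore whose timed wait uses CLOCK_MONOTONIC. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             level;      /* may start negative */
} mono_sem;

/* Returns 0 on success, or an errno-style code if any setup step fails. */
static int mono_sem_init(mono_sem *s, int initial_level)
{
    pthread_condattr_t attr;
    int rc = pthread_condattr_init(&attr);
    if (rc != 0)
        return rc;

    /* Always check this call: an unsupported clock is rejected here. */
    rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    if (rc == 0)
        rc = pthread_cond_init(&s->cond, &attr);
    pthread_condattr_destroy(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_mutex_init(&s->lock, NULL);
    if (rc != 0) {
        pthread_cond_destroy(&s->cond);
        return rc;
    }
    s->level = initial_level;
    return 0;
}

static void mono_sem_post(mono_sem *s)
{
    pthread_mutex_lock(&s->lock);
    s->level++;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/* Waits up to timeout_ms for the level to become positive.
 * Returns 0 on success, ETIMEDOUT if the deadline passes first,
 * or another error code reported by pthread_cond_timedwait(). */
static int mono_sem_timedwait(mono_sem *s, long timeout_ms)
{
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    int rc = 0;
    pthread_mutex_lock(&s->lock);
    while (s->level <= 0 && rc == 0)
        rc = pthread_cond_timedwait(&s->cond, &s->lock, &deadline);
    if (s->level > 0) {          /* take the token even if we woke up late */
        s->level--;
        rc = 0;
    }
    pthread_mutex_unlock(&s->lock);
    return rc;
}
```

With an initial level of -1, the first mono_sem_post() only brings the level to 0, so a waiter still times out; a second post lets one waiter through. Build with -pthread.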
If you're just trying to keep a synchronization object's wait-function from deadlocking forever, and you've got something like a three-second bailout, you can just use CLOCK_MONOTONIC and call it a day-- the adjustments CLOCK_MONOTONIC is subject to are, effectively, jitter correction, well-below your accuracy requirements. Even in the buggy implementations, CLOCK_MONOTONIC is not going to jump backwards an hour or something like that. Again, what happens instead is that things like adjtime tweak the frequency of the clock, so that it gradually overtakes, or falls behind, some other, parallel-running clock.
CLOCK_REALTIME is in fact CLOCK_MONOTONIC with some other correction factors applied. Or the other way around. Whatever, it's effectively the same thing. The important part is that, if there is any chance your application will change time zones (moving vehicle, aircraft, ship, cruise missile, rollerblading cyborg) or encounter administrative clock adjustments, you should definitely NOT use CLOCK_REALTIME, or anything that requires it. Same thing if the server uses daylight savings time instead of UTC. However, a stationary server using UTC may be able to get away with CLOCK_REALTIME for coarse deadlock-avoidance purposes, if necessary. Avoid this unless you are on a pre-2.6 kernel and have no choice, though.
CLOCK_MONOTONIC_RAW is NOT something you want to use for timestamping. It has not been jitter-corrected yet, etc. It may be appropriate for DACs and ADCs, etc., but is not what you want to use for logging events on human-discernible time scales. We have NTP for a reason.
Hope this is helpful, and I can certainly understand the frustration.

PBS walltime: how much was actually used?

How do I figure out how much walltime (mem? vmem?) a PBS job (PBS Pro) actually ended up using, if it's not presented in the stdout/stderr logs?
In Torque, this information is visible in the accounting log and in the qstat -f output for the job. In qstat -f, you want to look at the resources_used information.
This may have diverged somewhat in PBS Pro, but my guess is they have something similar.
Wall time is always measured outside of the system. That's why it refers to the "clock on the wall".
This is important because it often encompasses elements that some systems fail to measure, or measure poorly. To illustrate, before a system can capture the time, some code must run to allocate the memory to capture the time, and then some code must run to assign that memory. Everything before that happens is misreported to not have "cost" any time at all.
While I may have described the essence of wall time, do look to dbeer's excellent answer for capturing a time close to wall clock time (and hopefully solving your metric gathering problem).

How can I prove __udelay() is working correctly on my ARM embedded system?

We have an ARM9 using the 3.2 kernel -- everything seems to work fine. Recently I was asked to add some code to add a 50ms pulse on some GPIO lines at startup. The pulse code is fine; I can see the lines go down and up, as expected. What does not work the way I expected is the udelay() function. Reading the docs makes me think the units are in microseconds, but as measured in the logic analyzer it was way too short. So I finally added this code to get 50ms.
// wait 50ms to be sure PCIE reset takes
for (i = 0; i < 6100; i++)  // measured on logic analyzer - seems wrong to me!!
{
    __udelay(2000);  // 2000 is max
}
I don't like it, but it works fine. There are some odd constants and instructions in the udelay code. Can someone enlighten me as to how this is supposed to work? This code is called after all the clocks are initialized, so everything else seems ok.
According to Linus in this thread:
If it's about 1% off, it's all fine. If somebody picked a delay value
that is so sensitive to small errors in the delay that they notice
that - or even notice something like 5% - then they have picked too
short of a delay.
udelay() was never really meant to be some kind of precision
instrument. Especially with CPU's running at different frequencies,
we've historically had some rather wild fluctuation. The traditional
busy loop ends up being affected not just by interrupts, but also by
things like cache alignment (we used to inline it), and then later the
TSC-based one obviously depended on TSC's being stable (which they
weren't for a while).
So historically, we've seen udelay() being really off (ie 50% off
etc), I wouldn't worry about things in the 1% range.
Linus
So it's not going to be perfect. It's going to be off. By how much is dependent on a lot of factors. Instead of using a for loop, consider using mdelay instead. It might be a bit more accurate. From the O'Reilly Linux Device Drivers book:
The udelay call should be called only for short time lapses because
the precision of loops_per_second is only eight bits, and noticeable
errors accumulate when calculating long delays. Even though the
maximum allowable delay is nearly one second (since calculations
overflow for longer delays), the suggested maximum value for udelay is
1000 microseconds (one millisecond). The function mdelay helps in
cases where the delay must be longer than one millisecond.
It's also important to remember that udelay is a busy-waiting function
(and thus mdelay is too); other tasks can't be run during the time
lapse. You must therefore be very careful, especially with mdelay, and
avoid using it unless there's no other way to meet your goal.
Currently, support for delays longer than a few microseconds and
shorter than a timer tick is very inefficient. This is not usually an
issue, because delays need to be just long enough to be noticed by
humans or by the hardware. One hundredth of a second is a suitable
precision for human-related time intervals, while one millisecond is a
long enough delay for hardware activities.
Specifically the line "the suggested maximum value for udelay is 1000 microseconds (one millisecond)" sticks out at me since you state that 2000 is the max. From this document on inserting delays:
mdelay is macro wrapper around udelay, to account for possible
overflow when passing large arguments to udelay
So it's possible you're running into an overflow error. Though I wouldn't normally consider 2000 to be a "large argument".
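For scale: the loop in the question nominally requests 6100 × 2000 µs = 12.2 s but measures 50 ms, so the delay is coming out roughly 244 times too short, far beyond the few-percent error Linus describes. The wrapper idea from the quote can be sketched in user space as below. fake_udelay is a hypothetical stand-in that only records what was requested, and chunked_mdelay is an illustrative name, not the kernel's actual macro. In kernel code one would simply call mdelay(50), or msleep(50) where sleeping is allowed.

```c
/* Hypothetical stand-in for the kernel's udelay(): it only records what
 * was requested, so the chunking logic can be exercised in user space. */
static unsigned long requested_usecs_total;
static unsigned long largest_single_request;

static void fake_udelay(unsigned long usecs)
{
    requested_usecs_total += usecs;
    if (usecs > largest_single_request)
        largest_single_request = usecs;
}

/* Delay for `ms` milliseconds by issuing one 1000 us request per
 * millisecond, so no single request exceeds the suggested maximum. */
static void chunked_mdelay(unsigned long ms)
{
    while (ms--)
        fake_udelay(1000);
}
```

After chunked_mdelay(50), the recorded total is 50000 µs and the largest single request is 1000 µs, which keeps each call inside the range the book recommends for udelay.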
But if you need real accuracy in your timing, you'll need to deal with the offset like you have, roll your own or use a different kernel. For information on how to roll your own delay function using assembler or using hard real time kernels, see this article on High-resolution timing.
See also: Linux Kernel: udelay() returns too early?

Accurate way of measuring overhead in kernel space

I recently implemented a security mechanism for Linux which hooks into system calls. Now I have to measure the overhead caused by it. The project requires comparing the execution time of typical Linux apps with and without the mechanism. By typical Linux apps I mean, for example, gzipping a 1G file, doing 'find /', or grepping files. The main goal is to show the overhead in different types of tasks: CPU bound, I/O bound etc.
The question is: how do I organise the tests so that they are reliable? The first important thing is that my mechanism works only in kernel space, so it is relevant to compare systime. I can use the 'time' command for this, but is it the most accurate way of measuring systime? Another idea is to run those apps in long loops to minimize error. Should the loops then be inside or outside the time command? If they are outside I will get many results - should I choose min, max, median, or average?
Thanks for any suggestions.
I think you want more to measure a typical application payload (as Ninjajl's comment suggests, the compilation of the kernel could be a good payload). You probably don't want to measure the overhead inside each syscall itself, or even inside the kernel as a whole.
The reason for this is that most applications spend much more time and resources in user space than in kernel-land (i.e. syscalls), so overhead inside syscalls is a "second-order" effect and probably doesn't matter as much. Of course, there are probably exceptions.
Perhaps the Phoronix Test Suite might be relevant.
You might be interested in OProfile.
See also this answer and this question.
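For the systime part of the question, one option is to read the child's resource usage directly instead of parsing the output of 'time'. The sketch below assumes wait4() is available (Linux and BSD); child_systime_usec and median_usec are hypothetical helper names. Taking the median of repeated runs is my assumption of a reasonable, outlier-resistant summary, not something the answers above settle.

```c
#define _DEFAULT_SOURCE          /* exposes wait4() on glibc */
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs the command in cmd[] and stores the kernel-mode (system) CPU time
 * the child consumed, in microseconds. Returns 0 on success, -1 on error. */
static int child_systime_usec(char *const cmd[], long long *sys_usec)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(cmd[0], cmd);
        _exit(127);              /* only reached if exec failed */
    }

    int status;
    struct rusage ru;
    if (wait4(pid, &status, 0, &ru) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) == 127)
        return -1;

    *sys_usec = (long long)ru.ru_stime.tv_sec * 1000000LL + ru.ru_stime.tv_usec;
    return 0;
}

static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. n must be > 0. */
static long long median_usec(long long *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_ll);
    return (n % 2) ? samples[n / 2]
                   : (samples[n / 2 - 1] + samples[n / 2]) / 2;
}
```

A test driver would call child_systime_usec() many times for the same command, with and without the mechanism loaded, and compare the medians of the two sample sets.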

Question about cycle counting accuracy when emulating a CPU

I am planning on creating a Sega Master System emulator over the next few months, as a hobby project in Java (I know it isn't the best language for this but I find it very comfortable to work in, and as a frequent user of both Windows and Linux I thought a cross-platform application would be great). My question regards cycle counting;
I've looked over the source code for another Z80 emulator, and for other emulators as well, and in particular the execute loop intrigues me - when it is called, an int is passed as an argument (let's say 1000 as an example). Now I get that each opcode takes a different number of cycles to execute, and that as these are executed, the number of cycles is decremented from the overall figure. Once the number of cycles remaining is <= 0, the execute loop finishes.
My question is that many of these emulators don't take account of the fact that the last instruction to be executed can push the number of cycles to a negative value - meaning that between execution loops, one may end up with say, 1002 cycles being executed instead of 1000. Is this significant? Some emulators account for this by compensating on the next execute loop and some don't - which approach is best? Allow me to illustrate my question as I'm not particularly good at putting myself across:
public void execute(int numOfCycles)
{ // this is an execution loop method, called with 1000.
    while (numOfCycles > 0)
    {
        instruction = readInstruction();
        switch (instruction)
        {
            case 0x40:
                // do whatever the opcode requires, then charge its cycle cost
                numOfCycles -= 5;
                break;
            // let's say for argument's sake this case is executed when numOfCycles is 3.
        }
    }
}
After the end of this particular looping example, numOfCycles would be at -2. This will only ever be a small inaccuracy, but does it matter overall in people's experience? I'd appreciate anyone's insight on this one. I plan to interrupt the CPU after every frame as this seems appropriate, so I know 1000 cycles is low; this is just an example.
Many thanks,
Phil
1. Most emulators/simulators deal only with CPU clock ticks
That is fine for games etc. You have some timer, and you run the CPU simulation until it has simulated the duration of the timer interval. Then it sleeps until the next timer interval occurs. This is very easy to simulate. You can decrease the timing error with the approach you are asking about, but as said here, for games this is usually unnecessary.
This approach has one significant drawback: your code runs for just a fraction of real time. If the timer interval (timing granularity) is big enough, this can be noticeable even in games. For example, if you hit a keyboard key while the emulation sleeps, it is not detected (keys sometimes don't work). You can remedy this by using a smaller timing granularity, but that is very hard on some platforms. In that case the timing error can be more "visible" in software-generated sound (at least for people who can hear it and are not deaf-ish to such things like me).
2. If you need something more sophisticated
For example, if you want to connect real HW to your emulation/simulation, then you need to emulate/simulate buses. Also, things like a floating bus or system contention are very hard to add to approach #1 (it is doable, but with big pain).
If you port the timings and emulation to machine cycles, things get much, much easier, and suddenly things like contention, HW interrupts, and floating buses almost solve themselves. I ported my ZX Spectrum Z80 emulator to this kind of timing and saw the light. Many things became obvious (like errors in Z80 opcode documentation, timings, etc.). Contention also became very simple from there (just a few lines of code instead of horrible decoding tables with an entry for almost every instruction type). The HW emulation also got pretty easy: I added things like FDC controllers and AY chip emulation to the Z80 this way (no hacks, it really runs their original code ... even floppy formatting :)), so no more tape-loading hacks that fail with custom loaders like TURBO.
To make this work, I built my Z80 emulation/simulation so that it uses something like microcode for each instruction. As I very often corrected errors in the Z80 instruction set (there is no single 100% correct doc out there that I know of, even if some claim to be bug free and complete), I came up with a way to deal with this without painfully reprogramming the emulator.
Each instruction is represented by an entry in a table, with info about timing, operands, functionality... The whole instruction set is a table of all these entries for all instructions. I then created a MySQL database for my instruction set, with similar tables for each instruction set I found, and painfully compared all of them, selecting/repairing what was wrong and what was correct. The result is exported to a single text file which is loaded at emulation startup. It sounds horrible, but in reality it simplifies things a lot and even speeds up the emulation, as instruction decoding is now just accessing pointers. An example of the instruction set data file can be found here: What's the proper implementation for hardware emulation
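A table-driven decoder along these lines might look like the sketch below. This is not the answerer's actual format: the table is hard-coded rather than loaded from a text file, only three Z80 opcodes are listed, flags are not modeled, and the Cpu, InstrEntry, and step names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  b;              /* register B */
    uint16_t pc;             /* program counter */
    uint8_t  mem[65536];
    long     t_states;       /* elapsed T-states */
} Cpu;

/* One table entry per opcode: timing plus a pointer to the behaviour. */
typedef struct {
    const char *mnemonic;
    int         t_states;
    void      (*exec)(Cpu *cpu);
} InstrEntry;

static void op_nop(Cpu *cpu)    { (void)cpu; }
static void op_inc_b(Cpu *cpu)  { cpu->b++; }                      /* flags omitted */
static void op_ld_b_n(Cpu *cpu) { cpu->b = cpu->mem[cpu->pc++]; }  /* reads operand */

/* Opcodes that are not listed stay zero-initialized (exec == NULL). */
static const InstrEntry instr_table[256] = {
    [0x00] = { "NOP",    4, op_nop    },
    [0x04] = { "INC B",  4, op_inc_b  },
    [0x06] = { "LD B,n", 7, op_ld_b_n },
};

/* Fetches and executes one instruction. Returns its T-state cost,
 * or -1 if the opcode has no table entry. */
static int step(Cpu *cpu)
{
    const InstrEntry *e = &instr_table[cpu->mem[cpu->pc++]];
    if (e->exec == NULL)
        return -1;
    e->exec(cpu);
    cpu->t_states += e->t_states;
    return e->t_states;
}
```

Decoding is a single array index followed by a call through a pointer, so fixing a documented timing means editing one table entry rather than the emulator's control flow.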
A few years back I also published a paper on this (sadly the institution that held the conference no longer exists, so the servers hosting those old papers are down for good; luckily I still have a copy). So here is an image from it that describes the problem:
a) Full throttle has no synchronization, just raw speed
b) #1 has big gaps causing HW synchronization problems
c) #2 needs to sleep a lot with very small granularity (can be problematic and slow things down), but the instructions are executed very near their real time ...
The red line is the host CPU processing speed (obviously what is above it takes a bit more time, so it should be cut and inserted before the next instruction, but that would be hard to draw properly)
The magenta line is the emulated/simulated CPU processing speed
Alternating green/blue colors represent the next instruction
Both axes are time
[edit1] more precise image
The one above was hand painted... This one is generated by a VCL/C++ program:
generated by these parameters:
const int iset[]={4,6,7,8,10,15,21,23}; // possible timings [T]
const int n=128,m=sizeof(iset)/sizeof(iset[0]); // number of instructions to emulate, size of iset[]
const int Tps_host=25; // max possible simulation speed [T/s]
const int Tps_want=10; // wanted simulation speed [T/s]
const int T_timer=500; // simulation timer period [T]
So the host can simulate at 250% of the wanted speed, and the simulation granularity is 500T. Instructions were generated pseudo-randomly...
There was quite an interesting article on Ars Technica recently about console simulation, which also links to quite a few simulators that might make for good research:
Accuracy takes power: one man's 3GHz quest to build a perfect SNES emulator
The relevant bit is that the author mentions, and I am inclined to agree, that most games will appear to function pretty correctly even with timing deviations of +/-20%. The issue you mention looks likely to never really introduce more than a fraction of a percent timing error, which is probably imperceptible whilst playing the final game. The authors probably didn't consider it worth dealing with.
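For emulators that do compensate, the carry-over mentioned in the question can be sketched as follows. This is a minimal illustration, assuming a hypothetical run_one_instruction() that always costs 7 cycles; a real core would return the cost of whichever opcode it executed.

```c
typedef struct {
    long executed_cycles;    /* total cycles run so far */
    int  carry;              /* <= 0: overshoot left over from the last slice */
} CycleBudget;

/* Hypothetical: every instruction costs the same fixed number of cycles. */
static int run_one_instruction(void)
{
    return 7;
}

/* Runs one time slice of num_cycles, first paying back any overshoot
 * from the previous slice. */
static void execute(CycleBudget *cb, int num_cycles)
{
    int remaining = num_cycles + cb->carry;
    while (remaining > 0) {
        int cost = run_one_instruction();
        remaining -= cost;
        cb->executed_cycles += cost;
    }
    cb->carry = remaining;   /* zero or negative */
}
```

With 1000-cycle slices and 7-cycle instructions, the first slice runs 1001 cycles and leaves a carry of -1. After seven slices exactly 7000 cycles have run, so the error never grows beyond one instruction, whereas without the carry each slice would overshoot independently.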
I guess that depends on how accurate you want your emulator to be. I do not think that it has to be that accurate. Think of emulating the x86 platform: there are so many variants of processors, and each has different execution latencies and issue rates.
