Linux hrtimer - microsecond precision?

Is it possible to execute tasks on a Linux host with microsecond precision? That is, I'd like to execute a task at a specific instant in time. I know Linux is not a real-time system, but I'm looking for the best solution available on Linux.
So far, I've created a kernel module, set up an hrtimer, and measured the jitter when the callback function is entered (I don't care too much about the actual delay; it's the jitter that counts) - it's about 20-50 us. That's not significantly better than using timerfd in userspace (I also tried giving the process real-time priority, but that did not really change anything).
I'm running Linux 3.5.0 (just an example; I've tried different kernels from 2.6.35 to 3.7). /proc/timer_list shows hrtimer_interrupt, so I'm not running in the failsafe mode that disables hrtimer functionality. I've tried different CPUs (Intel Atom to Core i7).
My best idea so far would be using an hrtimer in combination with ndelay/udelay. Is this really the best way to do it? I can't believe it's not possible to trigger a task with microsecond precision. Running the code in kernel space as a module is acceptable; it would be great if the code were not interrupted by other tasks, though. I don't really care too much about the rest of the system - the task will be executed only a few times a second, so using mdelay/ndelay to burn the CPU for some microseconds each time the task should be executed would not really matter. Although I'd prefer a more elegant solution.
I hope the question is clear; I found a lot of topics concerning timer precision but no real answer to this problem.
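For reference, a minimal sketch of the kind of jitter-measuring hrtimer module described above might look like the following; the 100 us delay, the names, and the logging are illustrative assumptions, not the asker's actual code:

    #include <linux/module.h>
    #include <linux/hrtimer.h>
    #include <linux/ktime.h>

    static struct hrtimer test_timer;
    static ktime_t expected;

    static enum hrtimer_restart test_cb(struct hrtimer *t)
    {
        ktime_t now = ktime_get();

        /* jitter = actual callback time minus intended expiry time */
        pr_info("hrtimer jitter: %lld ns\n",
                ktime_to_ns(ktime_sub(now, expected)));
        return HRTIMER_NORESTART;
    }

    static int __init jitter_init(void)
    {
        ktime_t delay = ktime_set(0, 100 * 1000);   /* fire 100 us from now */

        hrtimer_init(&test_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        test_timer.function = test_cb;
        expected = ktime_add(ktime_get(), delay);
        hrtimer_start(&test_timer, delay, HRTIMER_MODE_REL);
        return 0;
    }

    static void __exit jitter_exit(void)
    {
        hrtimer_cancel(&test_timer);
    }

    module_init(jitter_init);
    module_exit(jitter_exit);
    MODULE_LICENSE("GPL");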

You can do what you want from user space:
use clock_gettime() with CLOCK_REALTIME to get the time-of-day with nano-second resolution
use nanosleep() to yield the CPU until you are close to the time you need to execute your task (the request is in nanoseconds, but its actual wakeup accuracy is on the order of a millisecond)
use a spin loop with clock_gettime() until you reach the desired time
execute your task
The clock_gettime() function is implemented via the vDSO on recent kernels and modern x86 processors - it takes 20-30 nanoseconds to get the time of day with nanosecond resolution, so you should be able to call clock_gettime() over 30 times per microsecond. Using this method, your task should dispatch within 1/30th of a microsecond of the intended time.
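A minimal sketch of this sleep-then-spin approach, assuming the target is an absolute CLOCK_REALTIME timestamp and using a hypothetical run_task_at() helper (the 1 ms cut-over value is likewise an assumption):

    #include <time.h>

    static long long ts_to_ns(const struct timespec *ts)
    {
        return (long long)ts->tv_sec * 1000000000LL + ts->tv_nsec;
    }

    void run_task_at(const struct timespec *target, void (*task)(void))
    {
        struct timespec now;

        /* 1. Sleep until roughly 1 ms before the target to avoid burning CPU. */
        clock_gettime(CLOCK_REALTIME, &now);
        long long remaining = ts_to_ns(target) - ts_to_ns(&now);
        if (remaining > 1000000LL) {            /* more than 1 ms left */
            struct timespec coarse = {
                .tv_sec  = (remaining - 1000000LL) / 1000000000LL,
                .tv_nsec = (remaining - 1000000LL) % 1000000000LL,
            };
            nanosleep(&coarse, NULL);
        }

        /* 2. Spin on clock_gettime() for the last stretch. */
        do {
            clock_gettime(CLOCK_REALTIME, &now);
        } while (ts_to_ns(&now) < ts_to_ns(target));

        /* 3. Execute the task as close to the target instant as possible. */
        task();
    }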

The default Linux kernel timer ticks every millisecond. Microsecond precision is way beyond anything current user hardware is capable of.
The jitter you see is due to a host of factors, such as interrupt handling and servicing of higher-priority tasks. You can cut it down somewhat by selecting hardware carefully and only enabling what is really needed. The real-time patch series for the kernel (see the HOWTO) might be an option to reduce it a bit further.
Always keep in mind that any gain has a definite cost in terms of interactiveness, stability, and (last, but far from least) your time in building, tuning, troubleshooting, and keeping the house of cards from falling apart.

Related

Thread Quantum: How to compute it

I have been reading a few posts and articles regarding thread quanta (here, here and here). Apparently Windows allocates a fixed number of CPU ticks for a thread quantum, depending on the Windows "mode" (server, or something else). However, from the last link we can read:
(A thread quantum) between 10-200 clock ticks (i.e. 10-200 ms) under Linux, though some granularity is introduced in the calculation
Is there any way to compute the quantum length on Linux?
Does it make any sense to compute it anyway? (Since, from my understanding, threads can still be preempted; nothing forces a thread to run for the full duration of the quantum.)
From a developer's perspective, I could see the interest in writing a program that could predict the running time of another program given its number of threads and what they do (possibly removing all the testing needed to find the optimal number of threads would be kind of neat, although I am not sure it is the right approach).
On Linux, the default realtime quantum length constant is declared as RR_TIMESLICE, at least in 4.x kernels; it is expressed in terms of HZ, which is fixed when the kernel is configured.
The interval between pausing the thread whose quantum has expired and resuming it may depend on a lot of things like, say, load average.
To be able to predict the running time with at least some degree of accuracy, give the target process realtime priority; realtime processes are scheduled with a round-robin algorithm, which is generally simpler and more predictable than the usual Linux scheduling algorithm.
To get the realtime quantum length, call sched_rr_get_interval().
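For illustration, a minimal sketch that prints the round-robin quantum for the calling process (the formatting is mine, not from the answer):

    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec ts;

        /* pid 0 means "the calling process" */
        if (sched_rr_get_interval(0, &ts) != 0) {
            perror("sched_rr_get_interval");
            return 1;
        }
        printf("SCHED_RR quantum: %ld.%06ld seconds\n",
               (long)ts.tv_sec, ts.tv_nsec / 1000);
        return 0;
    }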

Does a Tickless Linux Kernel Introduce Benchmark Timing Variations?

I'm running some benchmarks and I'm wondering whether using a "tickless" (a.k.a. CONFIG_NO_HZ_FULL_ALL) Linux kernel would be useful or detrimental to benchmarking.
The benchmarks I am running will be repeated many times using a new process each time. I want to control as many sources of variation as possible.
I did some reading on the internet:
https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
https://lwn.net/Articles/549580/
From these sources I have learned that:
In the default configuration (CONFIG_NO_HZ=y), only non-idle CPUs receive ticks. Therefore under this mode my benchmarks would always receive ticks.
In "tickless" mode (CONFIG_NO_HZ_FULL_ALL), all CPUs but one (the boot processor) are in "adaptive-tick" mode. When a CPU is in adaptive-tick mode, ticks are only received if there is more than a single job in the schedule queue for the CPU. The idea being that if there is a sole process in the queue, a context switch cannot happen, so sending ticks is not necessary.
On one hand, not having benchmarks receive ticks seems like a great idea, since we rule out the tick routine as a source of variation (we do not know how long the tick routines take). On the other hand, I think tickless mode could introduce variations in benchmark timings.
Consider my benchmarking scenario running on a tickless kernel. Suppose we repeat a benchmark twice.
Suppose the first run is lucky, and gets scheduled onto an adaptive-tick CPU which was previously idle. This benchmark will therefore not be interrupted by ticks.
When the benchmark is run a second time, perhaps it is not so lucky and gets put on a CPU which already has some processes scheduled. This run will be interrupted by ticks at regular intervals in order to decide whether one of the other processes should be switched in.
We know that ticks impose a performance hit (context switch plus the time taken to run the routine). Therefore the first benchmark run had an unfair advantage, and would appear to run faster.
Note also that a benchmark which initially has an adaptive-tick CPU to itself may find that mid-benchmark another process gets thrown on to the same CPU. In this case the benchmark is initially not receiving ticks, then later starts receiving them. This means benchmark performance can change over time.
So I think tickless mode (under my benchmarking scenario at least) introduces timing variations. Is my reasoning correct?
One solution would be to use an isolated adaptive-tick CPU for benchmarking (isolcpus + taskset), however we have already ruled out isolated CPUs since this introduces artificial slowdowns in our multi-threaded benchmarks.
Thanks
For your "unlucky" scenario above, there has to be an active job scheduled on the same processor. This is not likely to be the case on an otherwise generally idle system, assuming that you have multiple processors. Even if this happens on one or two occasions, that means your benchmark might see the effect of one or two ticks - which hardly seems problematic.
On the other hand if it happens on many more occasions, this would be a general indication of high processor load - not an ideal scenario for running benchmarks anyway.
I would suggest, though, that "ticks" are not likely to be a significant source of variation in your benchmark timings. The scheduler is supposed to be O(1). I doubt you will see much difference in variation between tickless and non-tickless mode.

What makes a kernel/OS real-time?

I was reading this article, but my question is on a more generic level; I was thinking along the following lines:
Can a kernel be called real-time just because it has a real-time scheduler? Or, in other words, say I have a Linux kernel and I change the default scheduler from O(1) or CFS to a real-time scheduler - will it become an RTOS?
Does it require any support from the hardware? Generally I have seen embedded devices running an RTOS (e.g. VxWorks, QNX); do these have any special provisions/hardware to support it? I know an RTOS's process running time is deterministic, but then one can use setjmp/longjmp to get the output in a determined time.
I'd really appreciate some input/insight on it, if I am wrong about something, please correct me.
After doing some research, talking to people (Jamie Hanrahan and Juha Aaltonen of the "Device Driver Experts" LinkedIn group), and of course the input from Jim Garrison, this is what I can conclude:
In Jamie Hanrahan's words:
What makes a kernel real time?
The sine qua non of a real time OS -
The ability to guarantee a maximum latency between an external interrupt and the start of the interrupt handler.
Note that the maximum latency need not be particularly short (e.g. microseconds), you could have a real time OS that guaranteed an absolute maximum latency of 137 milliseconds.
A real time scheduler is one that offers completely predictable (to the developer) behavior of thread scheduling - "which thread runs next".
This is generally separate from the issue of a guaranteed maximum latency to responding to an interrupt (since interrupt handlers are not necessarily scheduled like ordinary threads) but it is often necessary to implement a real-time application. Schedulers in real-time OSs generally implement a large number of priority levels. And they almost always implement priority inheritance, to avoid priority inversion situations.
So, if a guaranteed interrupt latency and predictable thread scheduling are good to have, why not make every OS real-time?
Because an OS suited for general purpose use (servers and/or desktops) needs to have characteristics that are generally at odds with real-time latency guarantees.
For example, a real-time scheduler should have completely predictable behavior. That means, among other things, that whatever priorities have been assigned to the various tasks by the developer should be left alone by the OS. This might mean that some low-priority tasks end up being starved for long periods of time. But the RT OS has to shrug and say "that's what the dev wanted." Note that to get the correct behavior, the RT system developer has to worry a lot about things like task priorities and CPU affinities.
A general-purpose OS is just the opposite. You want to be able to just throw apps and services on it, almost always things written by many different vendors (instead of being one tightly integrated system as in most R-T systems), and get good performance. Perhaps not the absolute best possible performance, but good.
Note that "good performance" is not just measured in interrupt latency. In particular, you want CPU and other resource allocations that are often described as "fair", without the user or admin or even the app developers having to worry much if at all about things like thread priorities and CPU affinities and NUMA nodes. One job might be more important than another, but in a general-purpose OS, that doesn't mean that the second job should get no resources at all.
So the general purpose OS will usually implement time-slicing among threads of equal priority, and it may adjust the priorities of threads according to their past behavior (e.g. a CPU hog might have its priority reduced; an I/O bound thread might have its priority increased, so it can keep the I/O devices working; a CPU-starved thread might have its priority boosted so it can get a little bit of CPU time now and then).
Can a kernel be called real time just because it has a real time scheduler?
No, an RT scheduler is a necessary component of an RT OS, but you also need predictable behavior in other parts of the OS.
Does it require any support from the hardware?
In general, the simpler the hardware the more predictable its behavior is. So PCI-E is less predictable than PCI, and PCI is less predictable than ISA, etc. There are specific I/O buses that were designed for (among other things) easy predictability of e.g. interrupt latency, but a lot of R-T requirements can be met these days with commodity hardware.
The specific description of real-time is that processes have guaranteed response-time bounds. This is often not sufficient for the application, and it is even less important than determinism, which is especially hard to achieve with a modern, feature-rich OS. Consider:
If I want to command some hardware or a machine at precise points in time, I need to be able to generate command signals at those specific moments, often with far sub-millisecond accuracy. Generally, if you compile, say, C code that runs a loop that waits for "half a millisecond" and then does something, the wait is not exactly half a millisecond; it is a little bit more, since the way common OSes handle this is to put the process aside at least until the requested time has passed, after which the scheduler might (at some point) pick it up again.
What is seriously problematic is not that the wait is not exactly half a millisecond, but that it cannot be known in advance by how much it will overshoot. This inaccuracy is neither constant nor deterministic.
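To make the overshoot concrete, here is a small sketch (mine, not from the answer) that requests a 500 us sleep and prints how far past the request the process actually woke up; repeated runs show a varying, non-deterministic excess:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec req = { .tv_sec = 0, .tv_nsec = 500000 };  /* 500 us */
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        nanosleep(&req, NULL);
        clock_gettime(CLOCK_MONOTONIC, &end);

        long elapsed_us = (end.tv_sec - start.tv_sec) * 1000000L
                        + (end.tv_nsec - start.tv_nsec) / 1000L;
        printf("requested 500 us, slept %ld us (overshoot %ld us)\n",
               elapsed_us, elapsed_us - 500);
        return 0;
    }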
This has surprising consequences when doing physical automation. For example, it is impossible to command a stepper motor accurately with any typical OS without using dedicated hardware, driven through kernel interfaces, that is told what step timing you actually want. Because of this, a single AVR module can command several motors accurately, but a Raspberry Pi (which absolutely stomps the AVR in terms of clock speed) cannot manage more than two with any typical OS.

sleep(0)? consistent time keeping in code?

Right now I am loading a file, then using gettimeofday and tracking the CPU time with tv_usec.
My results vary: I get values in the 250s to 280s, but sometimes 300s or 500s. I tried usleep, and sleep(0) and sleep(1), with no success; the time still varies vastly. I thought sleep(1) (seconds in Linux, not the Windows Sleep in ms) would have solved it. How can I keep track of time in a more consistent way for testing? Maybe I should wait until I have much larger test data and more complex code before starting measurements?
The currently recommended interface for high-resolution time on Linux (and POSIX in general) is clock_gettime. See the man page.
clock_gettime(CLOCK_REALTIME, &ts);            // for wall-clock time
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);  // for CPU time
But read the man page. Note that you need to link with -lrt, because POSIX says so, I guess. Maybe to avoid symbol conflicts in -lc, for old programs that defined their own clock_gettime? But dynamic libs use weak symbols...
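As a usage sketch (do_work() is a stand-in for whatever you are timing, e.g. the file load; on older glibc remember the -lrt link flag mentioned above):

    #include <stdio.h>
    #include <time.h>

    static void do_work(void)      /* placeholder for the code being timed */
    {
        volatile unsigned long sum = 0;
        for (unsigned long i = 0; i < 1000000; i++)
            sum += i;
    }

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
        do_work();
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

        long long ns = (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
                     + (end.tv_nsec - start.tv_nsec);
        printf("CPU time: %lld ns\n", ns);
        return 0;
    }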
The best sleep function is nanosleep. It doesn't mess around with signals or any crap like usleep. It is defined to just sleep, and not have any other side effects. And it tells you if you woke up early (e.g. from signals), so you don't necessarily have to call another time function.
Anyway, you're going to have a hard time testing one rep of something that short that involves a system call. There's a huge amount of opportunity for variation; e.g. the scheduler may decide that some other work needs doing (unlikely if your process just started, since you won't have used up your timeslice yet), and CPU cache misses (L2 and TLB) are also easily possible.
If you have a multi-core machine and a single-threaded benchmark for the code you're optimizing, you can give it realtime priority pinned to one of your cores. Make sure you choose the core that isn't handling interrupts, or your keyboard (and everything else) will be locked out until it's done. Use taskset (for pinning to one CPU) and chrt (for setting realtime prio).
See this mail I sent to gmp-devel with this trick:
http://gmplib.org/list-archives/gmp-devel/2008-March/000789.html
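A rough in-process equivalent of that taskset + chrt trick, as a sketch (CPU 3 and priority 1 are arbitrary assumptions, and setting a realtime policy normally requires root or CAP_SYS_NICE):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_and_boost(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(3, &set);                          /* pin to CPU 3 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }

        struct sched_param sp = { .sched_priority = 1 };
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");          /* needs privileges */
            return -1;
        }
        return 0;
    }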
Oh yeah, for the most precise timing you can use rdtsc yourself (on x86/amd64). If you don't have any other syscalls in what you're benchmarking, it's not a bad idea. Grab a benchmarking framework to put your function into; GMP has a pretty decent one, though it's maybe not set up well for benchmarking functions that aren't in GMP and named mpn_whatever. I don't remember; it's worth a look.
Are you trying to measure how long it takes to load a file? Usually if you're performance testing some bit of code that is already pretty fast (sub-second), then you will want to repeat the same code a number of times (say a thousand or a million), time the whole lot, then divide the total time by the number of iterations.
Having said that, I'm not quite sure what you're using sleep() for. Can you post an example of what you intend to do?
I would recommend putting that code in a for loop. Run it over 1000 or 10000 iterations. There are problems with this if you're only timing a few instructions, but it should help.
Larger data sets also help of course.
sleep is going to deschedule your thread from the CPU. It does not keep track of time with any precision.

How "Real-Time" is Linux 2.6?

I am looking at moving my product from an RTOS to embedded Linux. I don't have many real-time requirements, and the few RT requirements I do have are on the order of tens of milliseconds.
Can someone point me to a reference that will tell me how Real-Time the current version of Linux is?
Are there any other gotchas in moving from a commercial RTOS to Linux?
You can get most of your answers from the Real Time Linux wiki and FAQ
What are the real-time capabilities of the stock 2.6 Linux kernel?
Traditionally, the Linux kernel will only allow one process to preempt another under certain circumstances:
When the CPU is running user-mode code
When kernel code returns from a system call or an interrupt back to user space
When kernel code blocks on a mutex, or explicitly yields control to another process
If kernel code is executing when some event takes place that requires a high-priority thread to start executing, the high-priority thread cannot preempt the running kernel code until the kernel code explicitly yields control. In the worst case, the latency could potentially be hundreds of milliseconds or more.
The Linux 2.6 configuration option CONFIG_PREEMPT_VOLUNTARY introduces checks at the most common causes of long latencies, so that the kernel can voluntarily yield control to a higher-priority task waiting to execute. This can be helpful, but while it reduces the occurrences of long latencies (hundreds of milliseconds to potentially seconds or more), it does not eliminate them. However, unlike CONFIG_PREEMPT (discussed below), CONFIG_PREEMPT_VOLUNTARY has a much lower impact on the overall throughput of the system. (As always, there is a classical tradeoff between throughput --- the overall efficiency of the system --- and latency. With the faster CPUs of modern-day systems, it often makes sense to trade off throughput for lower latencies, but server-class systems that do not need minimum latency guarantees may very well choose to use either CONFIG_PREEMPT_VOLUNTARY or to stick with the traditional non-preemptible kernel design.)
The 2.6 Linux kernel has an additional configuration option, CONFIG_PREEMPT, which causes all kernel code outside of spinlock-protected regions and interrupt handlers to be eligible for non-voluntary preemption by higher priority kernel threads. With this option, worst case latency drops to (around) single digit milliseconds, although some device drivers can have interrupt handlers that will introduce latency much worse than that. If a real-time Linux application requires latencies smaller than single-digit milliseconds, use of the CONFIG_PREEMPT_RT patch is highly recommended.
They also have a list of "gotchas", as you called them, in the FAQ.
What are important things to keep in mind while writing realtime applications?
Take care of the following during the initial startup phase (a minimal sketch of the memory-locking and stack-prefaulting steps follows the list):
Call mlockall() as soon as possible from main().
Create all threads at startup time of the application, and touch each page of the entire stack of each thread. Never start threads dynamically during RT show time; this will ruin RT behavior.
Never use system calls that are known to generate page faults, such as fopen(). (Opening a file performs the mmap() system call, which generates a page fault.)
If you use compile-time global variables and/or compile-time global arrays, then use mlockall() to prevent page faults when accessing them.
More information: HOWTO: Build an RT-application
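A minimal sketch of those first steps, loosely in the spirit of the RT wiki examples (the 64 KiB stack reserve and 4 KiB page size are assumptions):

    #include <stddef.h>
    #include <sys/mman.h>

    #define PREFAULT_STACK_BYTES (64 * 1024)   /* assumed worst-case stack use */
    #define PAGE_BYTES           4096          /* assumed page size */

    static void prefault_stack(void)
    {
        /* Touch one byte per page so the stack is resident before RT work starts. */
        volatile unsigned char dummy[PREFAULT_STACK_BYTES];

        for (size_t i = 0; i < sizeof(dummy); i += PAGE_BYTES)
            dummy[i] = 0;
    }

    void rt_startup(void)
    {
        /* Lock all current and future pages into RAM so no page faults
         * occur later, then prefault this thread's stack. */
        mlockall(MCL_CURRENT | MCL_FUTURE);
        prefault_stack();
    }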
They also have a large publications page you might want to check out.
Have you had a look at Xenomai? It will let you run "hard real time" processes above Linux, while still allowing you to access the regular Linux APIs for all the non-real-time needs.
There are two fundamentally different approaches to achieve real-time capabilities with Linux.
Patch the existing kernel with things like the rt-preempt patches. This will eventually lead to a fully preemptive kernel
Dual kernel approach (like xenomai, RTLinux, RTAI,...)
There are lots of gotchas in moving from an RTOS to Linux.
Maybe you don't really need real-time?
I'm talking about real-time Linux in my training sessions:
https://rlbl.me/elisa
https://rlbl.me/elisa-en-pdf
https://rlbl.me/intely
https://rlbl.me/intely-en-pdf
https://rlbl.me/entirety-en-all-pdf
The answer is probably "good enough".
If you're running an embedded system, you probably have control of all or most of the software on the box.
Stock Linux 2.6 has several features suitable for low-latency tasks - chiefly these are:
Scheduling policies
Memory locking
Assuming you're using a single-core machine, if you have just one task which has set its scheduling policy to SCHED_FIFO or SCHED_RR (it doesn't matter which if you have just one task), AND locked all its memory in with mlockall(), then it WILL get scheduled as soon as it is ready to run.
Then the only thing you'd have to worry about is some non-preemptible part of the kernel taking longer than your acceptable latency to complete, which is unlikely to happen in an embedded system unless something bad happens, such as extreme memory pressure or dodgy drivers.
I guess "try it and see" is a good answer, but that's probably rather complicated in your case (and might involve writing device drivers etc).
Look at the doc for sched_setscheduler for some good info.

Resources