How does NOHZ=ON affect do_timer() in the Linux kernel?

In a simple experiment I set NOHZ=OFF and used printk() to log how often the do_timer() function gets called. On my machine it is called every 10 ms.
With NOHZ=ON, however, there is a lot of jitter in when do_timer() gets called. Most of the time it is still called every 10 ms, but at times it misses its deadline completely.
I have read up on both do_timer() and NOHZ. do_timer() is the function responsible for updating the jiffies value and is also responsible for the round-robin scheduling of processes.
The NOHZ feature switches off the hi-res timers on the system.
What I am unable to understand is how hi-res timers can affect do_timer(). Even if the hi-res hardware is in a sleep state, the persistent clock is more than capable of executing do_timer() every 10 ms. Secondly, if do_timer() is not executing when it should, that means some processes are not getting their timeshare when they ideally should. A lot of googling shows that, for many people, applications start working much better with NOHZ=OFF.
To make a long story short, how does NOHZ=ON affect do_timer()?
Why does do_timer() miss its deadlines?

First, let's understand what a tickless kernel is (NOHZ=ON, i.e. CONFIG_NO_HZ set) and what motivated its introduction into the Linux kernel in 2.6.21.
From http://www.lesswatts.org/projects/tickless/index.php,
Traditionally, the Linux kernel used a periodic timer for each CPU.
This timer did a variety of things, such as process accounting,
scheduler load balancing, and maintaining per-CPU timer events. Older
Linux kernels used a timer with a frequency of 100Hz (100 timer events
per second or one event every 10ms), while newer kernels use 250Hz
(250 events per second or one event every 4ms) or 1000Hz (1000 events
per second or one event every 1ms).
This periodic timer event is often called "the timer tick". The timer
tick is simple in its design, but has a significant drawback: the
timer tick happens periodically, irrespective of the processor state,
whether it's idle or busy. If the processor is idle, it has to wake up
from its power saving sleep state every 1, 4, or 10 milliseconds. This
costs quite a bit of energy, consuming battery life in laptops and
causing unnecessary power consumption in servers.
With "tickless idle", the Linux kernel has eliminated this periodic
timer tick when the CPU is idle. This allows the CPU to remain in
power saving states for a longer period of time, reducing the overall
system power consumption.
So reducing power consumption was one of the main motivations for the tickless kernel. As is often the case, though, lower power consumption comes at some cost in performance. On desktop computers performance is usually the bigger concern, which is why NOHZ=OFF works well for many of them.
In Ingo Molnar's own words:
The tickless kernel feature (CONFIG_NO_HZ) enables 'on-demand' timer
interrupts: if there is no timer to be expired for say 1.5 seconds
when the system goes idle, then the system will stay totally idle for
1.5 seconds. This should bring cooler CPUs and power savings: on our
(x86) testboxes we have measured the effective IRQ rate to go from HZ
to 1-2 timer interrupts per second.
Now, let's try to answer your questions.
What I am unable to understand is how can hi-res timers affect the
do_timer ?
If a system supports high-resolution timers, timer interrupts can be programmed at a much finer granularity than the usual 10 ms tick found on most systems, e.g. 100 us apart, which makes the system more responsive. With NOHZ, the kernel no longer fires the timer at a fixed period while the CPU is idle; it programs the timer only for the next event that is actually due, hence the less frequent execution of do_timer().
Even if hi-res hardware is in sleep state the persistent clock is more
than capable to execute do_timer every 10ms
Yes, it is capable. But the intention of NOHZ is exactly the opposite: to prevent frequent timer interrupts!
Secondly if do_timer is not executing when it should that means some
processes are not getting their timeshare when they should ideally be
getting it
As caf noted in the comments, NOHZ does not cause processes to be scheduled less often, because it only kicks in when the CPU is idle, in other words when no processes are schedulable. Only the process accounting work is done at a delayed time.
Why does do_timer miss its deadlines ?
As elaborated above, this is the intended design of NOHZ.
I suggest you go through the tick-sched.c kernel source as a starting point. Search for CONFIG_NO_HZ and try to understand the functionality added for the NOHZ feature.
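The jitter described in the question can be reproduced with a toy model. The sketch below is not kernel code: tick_times() and gaps() are hypothetical helpers, and the model simply assumes that with NOHZ enabled any periodic tick that falls inside an idle span is skipped. In the real kernel the wake-up is driven by the next pending timer or interrupt rather than by the tick grid, so treat this only as an illustration of why the observed interval is sometimes much longer than 10 ms.

```python
def tick_times(duration_ms, period_ms, idle_spans, nohz):
    """Times (in ms) at which the tick handler runs in this toy model.

    idle_spans is a list of (start_ms, end_ms) half-open intervals during
    which the CPU is assumed to be idle. With nohz=True, ticks that fall
    inside an idle span are skipped; with nohz=False every tick fires.
    """
    def is_idle(t):
        return any(start <= t < end for start, end in idle_spans)

    times = []
    t = period_ms
    while t <= duration_ms:
        if not (nohz and is_idle(t)):
            times.append(t)
        t += period_ms
    return times


def gaps(times):
    """Intervals between consecutive tick-handler invocations."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    idle = [(25, 65)]  # assumed: CPU idle from t=25 ms to t=65 ms
    print("periodic:", gaps(tick_times(100, 10, idle, nohz=False)))
    print("nohz:    ", gaps(tick_times(100, 10, idle, nohz=True)))
```

With the periodic tick every gap is 10 ms. With the idle span skipped, one gap stretches to 50 ms, which is the kind of "missed deadline" a printk() in do_timer() would show.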
Here is one test performed to measure the Impact of a Tickless Kernel

Related

How to start two CPU cores to run instructions at the same time?

For example, on x86, 2 CPU cores are running different software threads.
At some moment, these 2 threads need to run on their CPU cores at the same time.
Is there a way to sync-up these 2 CPU cores/threads, or something like this to make them start to run at (almost) the same time (at instruction level)?
Use a shared variable to communicate an rdtsc-based deadline between the two threads, e.g. set a deadline of the current rdtsc value plus 10,000.
Then have both threads spin on rdtsc, waiting until the gap between the current rdtsc value and the deadline is less than a threshold value T (T = 100 should be fine). Finally, use the final gap value (that is, the deadline rdtsc value minus the last read rdtsc value) to jump into a sequence of dependent add instructions such that the number of add instructions executed is equal to the gap.
This final step compensates for the fact that each chip will generally not be "in phase" with respect to their rdtsc spin loop. E.g., assuming a 30-cycle back-to-back throughput for rdtsc readings, one chip may get readings of 890, 920, 950 etc, while the other may read 880, 910, 940 so there will be a 10 or 20 cycle error if rdtsc alone is used. Using the add slide compensation, if the deadline was 1,000, and with a threshold of 100, the first thread would trigger at rdtsc == 920 and execute 80 additions, while the second would trigger at rdtsc == 910 and execute 90 additions. In principle both cores are then approximately synced up.
Some notes:
The above assumes the CPU frequency is equal to the nominal rdtsc frequency. If that's not the case, you'll have to apply a compensation factor based on the nominal-to-true frequency ratio when calculating where to jump into the add slide.
Don't expect your CPUs to stay synced for long: anything like an interrupt, a variable-latency operation like a cache miss, or a lot of other things can make them get out of sync.
You want all your payload code and the addition slide to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
You want T to be large enough that the gap is always positive, so somewhat larger than the back-to-back rdtsc latency, but not so large as to increase the chance of events like interrupts during the add slide.
You can check the effectiveness of the "sync" by issuing a rdtsc or rdtscp at various points in the "payload" code following the sync up and seeing how close the recorded values are across threads.
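The arithmetic of the spin loop and the add slide can be checked with a small simulation. This is only a model of the numbers in the example above, not a way to issue rdtsc: spin_until_close() and payload_start() are hypothetical names, the 30-cycle stride between readings is the assumption used in the text, and each dependent add is assumed to cost one cycle.

```python
def spin_until_close(first_reading, stride, deadline, threshold):
    """Model the rdtsc spin loop for one thread.

    first_reading: the first simulated rdtsc value this thread observes.
    stride: assumed cycles between back-to-back rdtsc readings.
    Returns (last_reading, gap) where gap = deadline - last_reading.
    The threshold must exceed the stride, otherwise the gap can go negative.
    """
    reading = first_reading
    while deadline - reading >= threshold:
        reading += stride
    return reading, deadline - reading


def payload_start(first_reading, stride, deadline, threshold, cycles_per_add=1):
    """Cycle at which the payload starts once the add slide has run."""
    reading, gap = spin_until_close(first_reading, stride, deadline, threshold)
    # Jumping into the slide executes `gap` dependent adds, each assumed
    # to take cycles_per_add cycles.
    return reading + gap * cycles_per_add


if __name__ == "__main__":
    for first in (890, 880):
        reading, gap = spin_until_close(first, 30, 1000, 100)
        print(first, "-> triggers at", reading, "and executes", gap, "adds;",
              "payload starts at", payload_start(first, 30, 1000, 100))
```

Both simulated threads leave the spin loop at different readings (920 and 910) but reach the payload at cycle 1,000, which is the compensation the add slide is meant to provide.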
A totally different option would be to use Intel TSX (transactional extensions). Arrange for the two threads that want to coordinate to both read a shared line inside a transactional region and then spin, and have a third thread write to the shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation, and hence the subsequent TSX abort, at nearly the same time. Call the code you want to run "in sync" from the abort handler.
Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with Execute (see footnote 1), because that's when an instruction like rdtsc samples the timestamp counter. It's the one you can actually record the timing of and then compare later.
Footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms and the early Xeon Phi, Knights Corner. Today's x86-64 CPUs are all out-of-order, and outside of the low-power Silvermont family they are aggressively so, with a large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm monitor / mwait to have both CPUs monitor and wait for a write to a given memory location could work. I don't know how synchronized the wakeup is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), a user-space-usable version of these is available: umonitor / umwait. But in the kernel you can probably just use monitor / mwait.)
If umonitor/umwait are available, that means you have the WAITPKG CPU feature which also includes tpause: like pause but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
Otherwise you could spin-wait on a rdtsc deadline and probably get within ~25 cycles at worst on Skylake.
rdtsc has a throughput of one per 25 cycles on Skylake (https://agner.org/optimize/), so you'd expect each thread to leave the spin-wait loop 12.5 cycles late on average, ±12.5. I'm assuming the branch-mispredict cost is the same for both threads. These are core clock cycles, not the reference cycles that rdtsc counts. RDTSC typically ticks close to the max non-turbo clock. See "How to get the CPU cycle count in x86_64 from C++?" for more about RDTSC from C.
See "How much delay is generated by this assembly code in linux" for an asm function that spins on rdtsc waiting for a deadline. You could write this in C easily enough.
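The "12.5 cycles late on average" estimate can be sanity-checked with a short model. This is a simulation rather than a measurement: exit_lateness() and lateness_stats() are hypothetical helpers, the 25-cycle interval between readings is the Skylake figure quoted above, and each thread is assumed to start its readings at an arbitrary integer phase.

```python
def exit_lateness(phase, read_interval, deadline):
    """Cycles past the deadline when a plain spin loop first notices it.

    phase: cycle of the thread's first simulated rdtsc reading.
    read_interval: assumed cycles between back-to-back readings.
    """
    reading = phase
    while reading < deadline:
        reading += read_interval
    return reading - deadline


def lateness_stats(read_interval, deadline):
    """Mean and maximum lateness over every integer starting phase."""
    values = [exit_lateness(phase, read_interval, deadline)
              for phase in range(read_interval)]
    return sum(values) / len(values), max(values)


if __name__ == "__main__":
    mean, worst = lateness_stats(25, 1000)
    print("mean lateness:", mean, "cycles; worst case:", worst, "cycles")
```

In this model the lateness is spread evenly between zero and just under one read interval, so two threads with different phases can differ by almost a full interval, which matches the "within ~25 cycles at worst" estimate for a plain spin-wait.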
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. An interrupt can even lead to the scheduler doing a context switch to another thread, or to a CPU migration for the task, obviously costing a lot of cycles.

Determining latency in threads that use sleep_for using SCHED_FIFO

I have a Linux embedded system built with PREEMPT_RT (real time patch) that creates multiple SCHED_FIFO threads, with a priority of 90 each. The goal is that they execute without being preempted, but with the same priority.
Each thread does a little bit of work, then goes to sleep using std::this_thread::sleep_for() for a few milliseconds, then gets scheduled back and executes the same amount of work.
Most of the time, each thread's latency is impeccable, but once every minute or so (not at an exactly regular interval) all threads stall at the same time for one second or more (instead of being called every few milliseconds as usual).
I have made sure power management is disabled in the kernel Kconfig and I have called mlockall() to avoid memory getting paged out, to no avail.
I have tried to use ftrace with wakeup_rt as the tracer, but the highest latency recorded was around 5ms, not nearly enough time to be the cause of the issue.
I am not sure what tool would be best to identify where the latency is coming from. Does anyone have ideas please?

What is hrtick_clear(rq) in the Linux scheduler?

While going through the Linux kernel code inside the __schedule() function, I saw hrtick_clear(rq).
Can anyone explain what it is and why it is used?
It seems to be related to a timer, but I am unable to get any further.
Classic OS design involves a system timer: an entity that ticks at fixed intervals. On each tick the scheduler is called to check whether the current process/thread should be switched out. But the system timer frequency is fairly low (e.g. 1000 Hz, meaning one tick per 1 ms), so if a process has only 100 us of its timeslice left, it will get extra time (under certain circumstances) while other processes starve.
However, modern systems provide higher-precision hardware timers, such as the HPET on Intel platforms, which are exposed through the hrtimers subsystem. They can be enabled for use in the scheduler with the CONFIG_SCHED_HRTICK option.
But if __schedule() has already been called (e.g. on a system call path), there is no need to call it a second time from the hrtimer, because you are already scheduling. So before doing so, hrtick_clear() disables that hrtimer.
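The "extra time" problem that the high-resolution tick addresses is easy to put in numbers. The following is a simplified model, not scheduler code: time_to_preempt_us() is a hypothetical helper, and it assumes that without the hrtick a task can only be preempted on a periodic tick, while with it a one-shot timer is armed for exactly the remaining slice.

```python
import math


def time_to_preempt_us(slice_left_us, us_to_next_tick, tick_period_us, hrtick):
    """Microseconds until the running task is preempted in this toy model."""
    if hrtick:
        # One-shot high-resolution timer armed for exactly the remaining slice.
        return slice_left_us
    if slice_left_us <= us_to_next_tick:
        # The slice expires before the next tick, but nothing notices until then.
        return us_to_next_tick
    # Otherwise wait for the first periodic tick at or after slice expiry.
    extra_ticks = math.ceil((slice_left_us - us_to_next_tick) / tick_period_us)
    return us_to_next_tick + extra_ticks * tick_period_us


if __name__ == "__main__":
    # Assumed scenario from the text: 100 us of slice left, 1000 Hz tick,
    # and the next tick a full period away.
    for hrtick in (False, True):
        delay = time_to_preempt_us(100, 1000, 1000, hrtick)
        print("hrtick" if hrtick else "tick only", "-> preempted after",
              delay, "us, overshoot", delay - 100, "us")
```

In that scenario the tick-only case lets the task run 900 us past its slice, while the one-shot timer preempts it on time.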

Multiple hardware timers in Linux

Problem: there is an intermittent clock drift (of 2 seconds) on my Linux system, so once in a while the kernel timer threads get executed 2 seconds later than their timeout.
Question: there are multiple hardware clocksources (TSC, HPET, ACPI_PM). Is it possible to create kernel timer threads that forcibly use a secondary clocksource as a fallback if the primary clocksource drifts?
What you describe doesn't sound like clock drift (a systematic error) but rather like lost timer interrupts. If you have another piece of hardware that can generate timed interrupts (HPET, RTC, but not TSC), you can run your time-sensitive processing from either the timer or the interrupt handler (or handlers), whichever happens first; you just need to design some kind of synchronization between them.
If you do experience genuine clock drift, where your clock runs slower than real time, you can try to estimate it and compensate when timers are scheduled. But lost interrupts are a sign of a more serious problem, and it makes sense to address the root cause, which may affect your secondary interrupt source as well.
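The "whichever happens first" arrangement needs a guard so that each deadline is processed only once. Below is a minimal user-space sketch of that idea, assuming each deadline carries an identifier that both sources agree on. FirstWins and on_event() are hypothetical names; kernel code would use its own primitives (for example an atomic flag or a spinlock) rather than Python threading.

```python
import threading


class FirstWins:
    """Lets exactly one of several event sources handle each deadline."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handled = set()  # grows without bound; fine for a sketch

    def try_claim(self, deadline_id):
        """Return True for the first caller per deadline_id, False afterwards."""
        with self._lock:
            if deadline_id in self._handled:
                return False
            self._handled.add(deadline_id)
            return True


def on_event(guard, deadline_id, source, log):
    """Called from either source; only the first arrival does the work."""
    if guard.try_claim(deadline_id):
        log.append((deadline_id, source))  # stand-in for the real processing


if __name__ == "__main__":
    guard, log = FirstWins(), []
    on_event(guard, 1, "timer", log)      # primary timer fires first
    on_event(guard, 1, "hpet_irq", log)   # backup arrives later, ignored
    on_event(guard, 2, "hpet_irq", log)   # primary interrupt lost this time
    print(log)
```

The second deadline shows the point of the fallback: when the primary timer interrupt is lost, the secondary source still triggers the processing.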

Incrementing Clocks

When a process is set to run with an initial time slice of, for example, 10, something in the hardware should know this initial timeslice and decrement it, and when the time slice reaches 0 an interrupt should be fired!
In the FreeBSD kernel, I understand that hardclock and softclock do this accounting. But my question is: does this decrementing of the clock happen in parallel with the execution of the process?
I'll use the PIT as an example here, because it's the simplest timing mechanism (and has been around for quite a while).
Also, this answer is fairly x86-specific, but OS-agnostic. I don't know enough about the internals of FreeBSD and Linux to answer for them specifically. Someone else might be more capable of that.
Essentially, the timeslice is "decremented" parallel to the execution of the process as the timer creates an IRQ for each "tick" (note that timers such as the HPET can do 'one-shot' mode, which fires an IRQ after a specific delay, which can be used for scheduling as well). Once the timeslice decrements to zero, the scheduler is notified and a task switch occurs. All this happens "at the same time" as your process: the IRQ jumps in, runs some code, then lets your process keep going until the timeslice runs out.
It should be noted that, generally speaking, you don't see a process running to the end of its timeslice, as task switches can occur as the direct result of a system call (for example, a read from disk that blocks, or even writing to a terminal).
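The tick-driven accounting described above can be sketched as a tiny round-robin simulation. It is a deliberately simplified model with hypothetical names (run_ticks() is not taken from any kernel): every timer IRQ charges one tick to the running task, and when its slice reaches zero the next task in the queue is switched in. Blocking system calls, priorities and everything else a real scheduler does are left out.

```python
from collections import deque


def run_ticks(slices, n_ticks):
    """Return the name of the task that was running during each tick.

    slices maps task name -> timeslice length in ticks (at least 1);
    insertion order is the initial run-queue order.
    """
    queue = deque(slices)           # run queue of task names
    current = queue.popleft()
    remaining = slices[current]
    history = []
    for _ in range(n_ticks):
        history.append(current)     # this task ran up to the timer IRQ
        remaining -= 1              # IRQ handler charges one tick
        if remaining == 0:          # slice used up: switch tasks
            queue.append(current)
            current = queue.popleft()
            remaining = slices[current]
    return history


if __name__ == "__main__":
    print(run_ticks({"A": 3, "B": 2}, 7))
```

Task A runs for its three ticks, B for its two, and then A is switched back in, with the decrement happening only inside the periodic interrupt.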
This was simpler in the misty past: a clock chip -- a discrete device on the motherboard -- would be configured to fire interrupts periodically at a rate of X Hz. Every time this "timer interrupt" went off, execution of the current program would be suspended (just like any other interrupt) and the kernel's scheduler code would decrement its timeslice. When the timeslice got all the way to zero, the kernel would take the CPU away from the program and give it to another one. The clock chip, being separate from the CPU, obviously runs in parallel with the execution of the program, but the kernel's bookkeeping work has to interrupt the program (this is the misty past we're talking about, so there is only one CPU, so kernel code and user code cannot run simultaneously).
Nowadays, the clock is not a discrete device, it's part of the CPU, and it can be programmed to do all sorts of clever things. Most importantly it can be programmed to fire one interrupt after N microseconds, where N can be quite large; this allows the kernel to idle the CPU for a very long time (in computer terms; maybe, like, a whole second) if there's nothing constructive for it to do, saving power. Meanwhile, it's hard to find a single-core CPU anymore, kernels do all sorts of clever tricks to push their bookkeeping work off to CPUs that don't have anything better to do, and timeslice accounting has gotten a whole lot more complicated. Linux currently uses the "Completely Fair Scheduler" which doesn't even really have a concept of "time slices". I don't know what FreeBSD's got, but I would be surprised if it was simple.
So the short answer to your question is "mostly in parallel, more so now than in the past, but it's not remotely as simple as a countdown timer anymore".
