Multiple hardware timers in Linux

Problem - There is an intermittent clock drift (of 2 seconds) on my Linux system, so once in a while the kernel timer threads get executed 2 seconds later than their timeout.
Question - There are multiple hardware clocksources (TSC, HPET, ACPI_PM); is it possible to create kernel timer threads that forcibly use a secondary clocksource as a fallback if the primary clocksource drifts?

What you describe doesn't sound like clock drift (a systematic error) but rather like lost timer interrupts. If you have another piece of hardware that can generate timed interrupts (HPET or the RTC, but not the TSC), you can drive your time-sensitive processing from either the timer or the interrupt handler (or handlers), whichever fires first; you just need to design some kind of synchronization between them.
If you experience genuine clock drift, where your clock runs slower than real time, you can try to estimate the drift and compensate for it when timers are scheduled. But lost interrupts are a sign of a more serious problem, and it makes sense to address the root cause, which may affect your secondary interrupt source as well.
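For diagnosis, note that the kernel picks one clocksource system-wide rather than per timer, and you can inspect (or, as root, override) that choice through sysfs. A minimal userspace sketch, assuming the standard sysfs layout:

    #include <stdio.h>

    /* Print the contents of one sysfs attribute file. */
    static void dump(const char *path)
    {
        char buf[128];
        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("%s: %s", path, buf);
        fclose(f);
    }

    int main(void)
    {
        dump("/sys/devices/system/clocksource/clocksource0/available_clocksource");
        dump("/sys/devices/system/clocksource/clocksource0/current_clocksource");
        return 0;
    }

As root, echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource forces the kernel onto HPET system-wide, which is the closest thing available to the per-timer fallback asked about.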

Related

What consequences are there to disabling interrupts/preemption for a long period?

In the Linux kernel there are a lot of functions, for example on_each_cpu_mask, whose documentation warns against passing in callbacks that run for long periods of time, because interrupts and/or preemption will be disabled for the duration of the callback. It's unclear, though, whether the callback "must" be short because a long one merely causes terrible system performance, or because a long-running callback will actually break the correctness of the system somehow.
Obviously while interrupts and preemption are disabled the busy core can't do any other work, and you can concoct situations where you could force deadlock by having two CPUs waiting for one another. But for the sake of argument say the callback just does a bunch of pure computation that takes a significant amount of time and then returns. Can this somehow break the kernel? If so how long is too long? Or does performance just suffer until the computation is done?
Disabling interrupts on one CPU for any period of time will eventually result in all other CPUs hanging, as the kernel frequently needs to perform short operations on all CPUs. Leaving interrupts off on any CPU will prevent it from completing these operations. (ref)
Disabling interrupts on all CPUs, either intentionally or unintentionally, will make the system completely unresponsive. This includes the user interface (including TTY switching at the console), as well as tasks which the kernel normally performs automatically, like responding to network activity (like ARP responses and TCP acknowledgements) or performing thermal management tasks (like adjusting system fan speeds). This is bad. Don't do it.
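To make the rule concrete, the usual discipline in kernel code is to do any long computation with interrupts enabled and disable them only around the short update of shared state. A minimal sketch (local_irq_save()/local_irq_restore() are real kernel APIs; the slow computation is a hypothetical stand-in):

    #include <linux/irqflags.h>

    static int shared_state;

    /* hypothetical stand-in for a long pure computation */
    static int compute_slowly(void)
    {
        int i, acc = 0;
        for (i = 0; i < 1000000; i++)
            acc += i;
        return acc;
    }

    static void update_shared_state(void)
    {
        unsigned long flags;
        int result = compute_slowly(); /* long part runs with IRQs enabled */

        local_irq_save(flags);         /* IRQs off only for the short write */
        shared_state = result;
        local_irq_restore(flags);
    }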

Why is softirq used for highly threaded and high-frequency uses?

What makes the softirq so special that we use it for high-frequency work, like in network drivers and block drivers?
SoftIrqs are typically used to complete queued work from a processed interrupt because they fit that need very well - they run with second-highest priority, but still run with hardware interrupts enabled.
Processing hardware interrupts is the utmost priority: if they are not processed quickly, either too much latency is introduced and user experience suffers, or the hardware buffer fills before the interrupt services the device and data is lost. Don't service a network adapter fast enough? It overwrites data in its FIFO and you lose packets. Don't service a hard disk fast enough? The drive stalls queued read requests because it has nowhere to put the results.
SoftIrqs allow the critical part of servicing hardware interrupts to be as short as possible; instead of having to process the entire hardware interrupt on the spot, the important data is read off the device into RAM (or wherever), and then a SoftIrq is scheduled to finish the work. This keeps hardware interrupts disabled for the shortest possible time while still completing the work at high priority.
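As an illustration, here is a sketch of that split using the pre-5.10 tasklet API (tasklets run in softirq context). The device specifics are hypothetical; request_irq(), DECLARE_TASKLET() and tasklet_schedule() are real kernel interfaces:

    #include <linux/interrupt.h>

    static void mydev_bottom_half(unsigned long data)
    {
        /* Softirq context: hardware interrupts are enabled again here,
         * so lengthy protocol work doesn't delay new IRQs. */
        /* ... parse/process the buffered data ... */
    }

    static DECLARE_TASKLET(mydev_tasklet, mydev_bottom_half, 0);

    static irqreturn_t mydev_irq(int irq, void *dev_id)
    {
        /* Top half: runs with this IRQ line masked; just drain the device
         * FIFO into RAM so it can't overflow, then defer the rest. */
        /* ... copy data from the device FIFO into a driver buffer ... */
        tasklet_schedule(&mydev_tasklet);
        return IRQ_HANDLED;
    }

    /* registered at probe time with something like:
     *   request_irq(irq, mydev_irq, 0, "mydev", dev);
     */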
This article is a decent reference on the matter:
https://lwn.net/Articles/520076/
Edits for questions:
SoftIrqs are re-entrant - they can be processed on any CPU. From the article I linked:
There are two places where software interrupts can "fire" and preempt
the current thread. One of them is at the end of the processing for a
hardware interrupt; it is common for interrupt handlers to raise
softirqs, so it makes sense (for latency and optimal cache use) to
process them as soon as hardware interrupts can be re-enabled
Emphasis added. They can be processed inline - I believe this means they can be processed without causing a context switch, which means as soon as hardware interrupts are enabled, we can jump straight to the SoftIrq right where we are with as little CPU cache abuse as possible. All of this contributes to SoftIrqs being lightweight but flexible, which makes them ideal for high-frequency processing.
They can be pushed to another CPU if needed, which improves throughput.
They can be processed immediately after hardware interrupts are re-enabled, right in the current context, preserving processor state as much as possible and improving latency.
They allow hardware interrupts to keep being serviced, since those are our highest priority.
They can be rescheduled to the ksoftirqd process if load is too high and we need to take time from normal user processes.

How does the OS scheduler regain control of CPU?

I recently started to learn how the CPU and the operating system works, and I am a bit confused about the operation of a single-CPU machine with an operating system that provides multitasking.
Supposing my machine has a single CPU, this would mean that, at any given time, only one process could be running.
Now, I can only assume that the scheduler used by the operating system to control the access to the precious CPU time is also a process.
Thus, in this machine, either the user process or the scheduling system process is running at any given point in time, but not both.
So here's a question:
Once the scheduler gives up control of the CPU to another process, how can it regain CPU time to run itself again to do its scheduling work? I mean, if any given process currently running does not yield the CPU, how could the scheduler itself ever run again and ensure proper multitasking?
So far, I had been thinking, well, if the user process requests an I/O operation through a system call, then the system call could ensure the scheduler is allocated some CPU time again. But I am not even sure whether it works this way.
On the other hand, if the user process in question were inherently CPU-bound, then, from this point of view, it could run forever, never letting other processes, not even the scheduler run again.
Supposing time-sliced scheduling, I have no idea how the scheduler could slice the time for the execution of another process when it is not even running.
I would really appreciate any insight or references that you can provide in this regard.
The OS sets up a hardware timer (the programmable interval timer, or PIT) that generates an interrupt every N milliseconds. That interrupt is delivered to the kernel, and user code is interrupted.
It works like any other hardware interrupt. For example, your disk will force a switch into the kernel when it has completed an I/O.
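You can watch the same mechanism from userspace: a periodic timer signal "preempts" a CPU-bound loop just as the tick interrupts user code. A sketch using the standard setitimer()/SIGALRM interfaces:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    static volatile sig_atomic_t ticks;

    static void on_tick(int sig)
    {
        (void)sig;
        ticks++;                  /* control is taken from the loop here */
    }

    int main(void)
    {
        struct itimerval tv = {
            .it_interval = { .tv_sec = 0, .tv_usec = 10000 }, /* 10 ms */
            .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
        };

        signal(SIGALRM, on_tick);
        setitimer(ITIMER_REAL, &tv, NULL);

        while (ticks < 100)       /* CPU-bound loop that never yields... */
            ;                     /* ...yet the handler still runs */
        printf("interrupted %d times\n", (int)ticks);
        return 0;
    }

The loop never calls into the kernel, yet the handler runs anyway; this is exactly why a CPU-bound process cannot lock out the scheduler.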
Google "interrupts". Interrupts are at the centre of multithreading, preemptive kernels like Linux/Windows. With no interrupts, the OS will never do anything.
While investigating/learning, try to ignore any explanations that mention "timer interrupt", "round-robin" and "time-slice", or "quantum" in the first paragraph – they are dangerously misleading, if not actually wrong.
Interrupts, in OS terms, come in two flavours:
Hardware interrupts – those initiated by an actual hardware signal from a peripheral device. These can happen at (nearly) any time and switch execution from whatever thread might be running to code in a driver.
Software interrupts – those initiated by OS calls from currently running threads.
Either interrupt may request the scheduler to make threads that were waiting ready/running or cause threads that were waiting/running to be preempted.
The most important interrupts are the hardware interrupts from peripheral drivers - those that make ready the threads that were waiting on I/O from disks, NIC cards, mice, keyboards, USB, etc. The overriding reason for using preemptive kernels, with all the attendant problems of locking, synchronization, signaling, etc., is that such systems have very good I/O performance: hardware peripherals can rapidly make ready/running the threads that were waiting for data from that hardware, without any latency caused by threads that do not yield or by waiting for a periodic timer reschedule.
The hardware timer interrupt that causes periodic scheduling runs is important because many system calls have timeouts in case, say, a response from a peripheral takes longer than it should.
On multicore systems the OS has an interprocessor driver that can cause a hardware interrupt on other cores, allowing the OS to interrupt/schedule/dispatch threads onto multiple cores.
On seriously overloaded boxes, or those running CPU-intensive apps (a small minority), the OS can use the periodic timer interrupts, and the resulting scheduling, to cycle through a set of ready threads that is larger than the number of available cores, and allow each a share of available CPU resources. On most systems this happens rarely and is of little importance.
Every time I see "quantum", "give up the remainder of their time-slice", "round-robin" and similar, I just cringe...
To complement #usr's answer, quoting from Understanding the Linux Kernel:
The schedule( ) Function
schedule( ) implements the scheduler. Its objective is to find a
process in the runqueue list and then assign the CPU to it. It is
invoked, directly or in a lazy way, by several kernel routines.
[...]
Lazy invocation
The scheduler can also be invoked in a lazy way by setting the
need_resched field of current [process] to 1. Since a check on the value of this
field is always made before resuming the execution of a User Mode
process (see the section "Returning from Interrupts and Exceptions" in
Chapter 4), schedule( ) will definitely be invoked at some close
future time.
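A simplified sketch of that lazy invocation, in the style of the 2.4-era kernel the book describes (modern kernels set the TIF_NEED_RESCHED flag via set_tsk_need_resched() instead, but the shape is the same; this is illustrative pseudo-kernel code, not the actual source):

    /* In the timer interrupt: don't call schedule() directly, just flag it. */
    void timer_interrupt(void)
    {
        update_process_times();          /* account this tick to `current` */
        if (current_timeslice_expired()) /* hypothetical check */
            current->need_resched = 1;   /* request a reschedule lazily */
    }

    /* On the way back from any interrupt or exception to user mode: */
    void return_from_interrupt(void)
    {
        if (current->need_resched)
            schedule();                  /* the lazy invocation happens here */
        resume_user_mode();              /* hypothetical */
    }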

Disadvantage(s) of preempt_rt [closed]

the target hardware platform has limited horsepower, and/or you want
the real-time job to put the smallest possible overhead on your
system. This is where dual kernels are usually better than a native
preemption system.
From here: http://www.xenomai.org/index.php/Xenomai:Roadmap#Xenomai_3_FAQ
preempt_rt does preempt the whole Linux kernel. In what way does preempting Linux put load on the system?
The FAQ there compares preempt_rt with Xenomai.
CONFIG_PREEMPT_VOLUNTARY -
This option introduces checks at the most common sources of long latencies in kernel code, so that the kernel can voluntarily yield control to a higher-priority task waiting to execute. This option is said to reduce the occurrence of long latencies to a great degree, but it still doesn't eliminate them entirely.
CONFIG_PREEMPT_RT -
This option makes all kernel code outside of spinlock-protected regions (created by raw_spinlock_t) eligible for involuntary preemption by higher-priority kernel threads. Spinlocks created with spinlock_t and rwlock_t, as well as interrupt handlers, are also made preemptible when this option is enabled. With this option, worst-case latency drops to (around) single-digit milliseconds.
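For illustration, the two lock flavours look like this in driver code. A sketch (DEFINE_RAW_SPINLOCK(), DEFINE_SPINLOCK() and the lock/unlock calls are real kernel APIs):

    #include <linux/spinlock.h>

    static DEFINE_RAW_SPINLOCK(hw_lock); /* stays a true spinning lock under
                                            PREEMPT_RT; this section is never
                                            preempted */
    static DEFINE_SPINLOCK(data_lock);   /* under PREEMPT_RT this becomes a
                                            sleeping, priority-inheriting
                                            lock, so the section is
                                            preemptible */

    static void touch_hardware(void)
    {
        unsigned long flags;

        raw_spin_lock_irqsave(&hw_lock, flags);
        /* short, truly atomic hardware access */
        raw_spin_unlock_irqrestore(&hw_lock, flags);
    }

    static void touch_data(void)
    {
        spin_lock(&data_lock);
        /* ordinary critical section; an RT task can preempt us here */
        spin_unlock(&data_lock);
    }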
Disadvantage -
The normal Linux kernel allows preemption of a task by a higher priority task only when the user space code is getting executed.
In order to reduce latency, the CONFIG_PREEMPT_RT patch forces the kernel to involuntarily preempt the task at hand on the arrival of a higher-priority kernel task. This is bound to reduce the overall throughput of the system, since there will be more context switches and the lower-priority tasks won't get much of a chance to get through.
Source:
https://rt.wiki.kernel.org/index.php/Frequently_Asked_Questions
Description of technical terms used:
What is latency?
The time elapsed between a demand issued on a computer system and the beginning of a response to that demand is called latency or response time.
Kinds of latencies:
Interrupt Latency:
The time elapsed between the generation of an interrupt and the start of the execution of the corresponding interrupt handler.
Example: When a hardware device performs a task, it generates an interrupt. This interrupt has the information about the task to be performed and about the interrupt handler to be executed. The interrupt handler then performs the particular task.
Scheduling Latency:
It is the time between a wakeup signaling that an event has occurred and the kernel scheduler getting an opportunity to schedule the thread that is waiting for the wakeup to occur (the response). Scheduling latency is also known as dispatch latency.
Worst-case Latency:
The maximum amount of time that can elapse between a demand issued on a computer system and the beginning of a response to that demand.
What is throughput?
Amount of work that a computer can do in a given period of time is called throughput.
What is Context switch?
Context switch is the switching of the CPU from one process/thread to another. Context switches can occur only in kernel mode. This is the process of saving the current execution state of the process (for resuming execution later on), and loading the saved state of the new process/thread for execution.
Adding to the top-voted answer's point that "lower priority tasks won't get much of a chance to get through":
That's sort of the whole point (though on a 4+ core system those low-priority tasks can still run, as long as they are forbidden from doing things that would interfere with critical tasks - this is where it's important to make sure all the connected peripherals play nice by not blocking when the app running the critical thread wants to access them). The critical bit (if, for example, you are developing a system that must process external input in a timely fashion, or testing the behaviour of data conversion with live data as opposed to a model) is to have a way to tell the kernel where the time-critical input arrives from.
The problem with current systems, e.g. Windows, is that you might be a "serious gamer or serious musician" who notices things like 150-microsecond jitter. If you have no way to specify that the keyboard, mouse, or other human-interface device should be treated at critical priority, then all sorts of "Windows updates" and the like can come along and do their thing, which might in turn trigger activity in the USB controller that has higher priority than the threads handling the input.
I read about a case where glitches in audio were resolved by adding a second USB controller with nothing on it except the input device. In a portable setting, you practically need Thunderbolt PCIe passthrough to add a dedicated hub (or FPGA) that, together with its drivers, can override everything else on the system. This is why there aren't many USB products on the market that provide good enough performance for musicians (2 ms round-trip latency with at most 150 microseconds of jitter over a full day, without dropouts).

How NOHZ=ON affects do_timer() in Linux kernel?

In a simple experiment I set NOHZ=OFF and used printk() to print how often the do_timer() function gets called. It gets called every 10 ms on my machine.
However, if NOHZ=ON, there is a lot of jitter in the way do_timer() gets called. Most of the time it does get called every 10 ms, but sometimes it completely misses the deadline.
I have researched about both do_timer() and NOHZ. do_timer() is the function responsible for updating jiffies value and is also responsible for the round robin scheduling of the processes.
The NOHZ feature switches off the hi-res timers on the system.
What I am unable to understand is how hi-res timers can affect do_timer(). Even if the hi-res hardware is in a sleep state, the persistent clock is more than capable of driving do_timer() every 10 ms. Secondly, if do_timer() is not executing when it should, that means some processes are not getting their timeshare when they ideally should. A lot of googling does show that for many people, many applications start working much better with NOHZ=OFF.
To make long story short, how does NOHZ=ON affect do_timer()?
Why does do_timer() miss its deadlines?
First, let's understand what a tickless kernel is (NOHZ=on, i.e. CONFIG_NO_HZ set) and what the motivation was for introducing it into the Linux kernel from 2.6.17 onwards.
From http://www.lesswatts.org/projects/tickless/index.php,
Traditionally, the Linux kernel used a periodic timer for each CPU.
This timer did a variety of things, such as process accounting,
scheduler load balancing, and maintaining per-CPU timer events. Older
Linux kernels used a timer with a frequency of 100Hz (100 timer events
per second or one event every 10ms), while newer kernels use 250Hz
(250 events per second or one event every 4ms) or 1000Hz (1000 events
per second or one event every 1ms).
This periodic timer event is often called "the timer tick". The timer
tick is simple in its design, but has a significant drawback: the
timer tick happens periodically, irrespective of the processor state,
whether it's idle or busy. If the processor is idle, it has to wake up
from its power saving sleep state every 1, 4, or 10 milliseconds. This
costs quite a bit of energy, consuming battery life in laptops and
causing unnecessary power consumption in servers.
With "tickless idle", the Linux kernel has eliminated this periodic
timer tick when the CPU is idle. This allows the CPU to remain in
power saving states for a longer period of time, reducing the overall
system power consumption.
So reducing power consumption was one of the main motivations for the tickless kernel. But, as usually happens, performance takes a hit alongside the decreased power consumption. For desktop computers, performance is of utmost concern, and hence you see that for most of them NOHZ=OFF works pretty well.
In Ingo Molnar's own words
The tickless kernel feature (CONFIG_NO_HZ) enables 'on-demand' timer
interrupts: if there is no timer to be expired for say 1.5 seconds
when the system goes idle, then the system will stay totally idle for
1.5 seconds. This should bring cooler CPUs and power savings: on our (x86) testboxes we have measured the effective IRQ rate to go from HZ
to 1-2 timer interrupts per second.
Now, let's try to answer your queries:
What I am unable to understand is how hi-res timers can affect
do_timer()?
If a system supports high-res timers, timer interrupts can occur more frequently than the usual 10 ms on most systems, i.e. these timers try to make the system more responsive by leveraging the hardware and firing timer interrupts even faster, say every 100 us. With the NOHZ option these timers are cooled down, hence the less frequent execution of do_timer().
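You can observe the sub-tick resolution that high-res timers enable from userspace, e.g. with a timerfd firing every 100 us, far below a 10 ms tick. A sketch using the standard timerfd_create()/timerfd_settime() calls (error handling omitted):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/timerfd.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        struct itimerspec ts = {
            .it_interval = { .tv_sec = 0, .tv_nsec = 100 * 1000 }, /* 100 us */
            .it_value    = { .tv_sec = 0, .tv_nsec = 100 * 1000 },
        };
        int fd = timerfd_create(CLOCK_MONOTONIC, 0);
        uint64_t expirations, total = 0;

        timerfd_settime(fd, 0, &ts, NULL);
        while (total < 10000) {          /* roughly one second's worth */
            read(fd, &expirations, sizeof(expirations));
            total += expirations;        /* >1 means we overslept a period */
        }
        printf("saw %llu expirations\n", (unsigned long long)total);
        close(fd);
        return 0;
    }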
Even if the hi-res hardware is in a sleep state, the persistent clock is more
than capable of driving do_timer() every 10 ms
Yes, it is capable. But the intention of NOHZ is exactly the opposite: to prevent frequent timer interrupts!
Secondly, if do_timer() is not executing when it should, that means some
processes are not getting their timeshare when they ideally should be
getting it
As caf noted in the comments, NOHZ does not cause processes to get scheduled less often, because it only kicks in when the CPU is idle - in other words, when no processes are schedulable. Only the process accounting stuff will be done at a delayed time.
Why does do_timer() miss its deadlines?
As elaborated above, this is the intended design of NOHZ.
I suggest you go through the tick-sched.c kernel source as a starting point. Search for CONFIG_NO_HZ and try to understand the new functionality added for the NOHZ feature.
Here is one test performed to measure the Impact of a Tickless Kernel
