What makes the softirq so special that we use it for high-frequency work, like in network drivers and block drivers?
SoftIrqs are typically used to complete queued work from a processed interrupt because they fit that need very well - they run with second-highest priority, but still run with hardware interrupts enabled.
Processing hardware interrupts is the utmost priority: if they are not processed quickly, either too much latency is introduced and the user experience suffers, or the hardware buffer fills before the interrupt services the device and data is lost. Don't service a network adapter fast enough? It's going to overwrite data in the FIFO and you'll lose packets. Don't service a hard disk fast enough? The hard drive stalls queued read requests because it has nowhere to put the results.
SoftIrqs allow the critical part of servicing hardware interrupts to be as short as possible; instead of having to process the entire hardware interrupt on the spot, the important data is read off the device into RAM, and then a SoftIrq is raised to finish the work. This keeps hardware interrupts disabled for the shortest period of time, while still completing the work with high priority.
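To make the split concrete, here is a minimal, hypothetical driver sketch (the device, its register, and all names are made up, and it assumes the classic tasklet_init() API; tasklets run in softirq context). The top half only pulls the data off the device and defers the rest:

```c
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/module.h>

/* Hypothetical per-device state; names are illustrative only. */
struct mydev {
    void __iomem *regs;
    u32 latest_sample;
    struct tasklet_struct bh;
    /* at probe time: tasklet_init(&dev->bh, mydev_bh, (unsigned long)dev); */
};

/* Bottom half: runs later in softirq context, with hardware interrupts enabled. */
static void mydev_bh(unsigned long data)
{
    struct mydev *dev = (struct mydev *)data;

    /* Heavy lifting (parsing, queuing to upper layers) happens here. */
    pr_debug("mydev: processing sample %u\n", dev->latest_sample);
}

/* Top half: keep it as short as possible, then defer. */
static irqreturn_t mydev_irq(int irq, void *cookie)
{
    struct mydev *dev = cookie;

    dev->latest_sample = readl(dev->regs);  /* pull the data off the device */
    tasklet_schedule(&dev->bh);             /* finish the work in softirq context */
    return IRQ_HANDLED;
}
```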
This article is a decent reference on the matter:
https://lwn.net/Articles/520076/
Edits for questions:
SoftIrqs are re-entrant - they can be processed on any cpu. From the article I linked:
There are two places where software interrupts can "fire" and preempt
the current thread. One of them is at the end of the processing for a
hardware interrupt; it is common for interrupt handlers to raise
softirqs, so it makes sense (for latency and optimal cache use) to
process them as soon as hardware interrupts can be re-enabled.
Emphasis added. They can be processed inline - I believe this means they can be processed without causing a context switch, which means as soon as hardware interrupts are enabled, we can jump straight to the SoftIrq right where we are with as little CPU cache abuse as possible. All of this contributes to SoftIrqs being lightweight but flexible, which makes them ideal for high-frequency processing.
They can be pushed to another CPU if needed, which improves throughput.
They can be processed right in the current context as soon as hardware interrupts are re-enabled, preserving processor state as much as possible and improving latency.
They allow hardware interrupts to keep being serviced, since that is our most important goal.
They can be deferred to the ksoftirqd kernel threads if softirq load gets too high, so that softirq processing competes fairly with normal processes for CPU time.
Related
In the Linux kernel there are a lot of functions, for example on_each_cpu_mask, whose documentation warns against passing in callbacks that run for long periods of time, because interrupts and/or preemption will be disabled for the duration of the callback. It's unclear, though, whether the callback "must" be short because a long-running callback will merely cause terrible system performance, or because it will actually break the correctness of the system somehow.
Obviously while interrupts and preemption are disabled the busy core can't do any other work, and you can concoct situations where you could force deadlock by having two CPUs waiting for one another. But for the sake of argument say the callback just does a bunch of pure computation that takes a significant amount of time and then returns. Can this somehow break the kernel? If so how long is too long? Or does performance just suffer until the computation is done?
Disabling interrupts on one CPU for an extended period will eventually result in all other CPUs hanging, because the kernel frequently needs to perform short operations on every CPU, and leaving interrupts off on any one CPU prevents it from completing those operations.
Disabling interrupts on all CPUs, either intentionally or unintentionally, will make the system completely unresponsive. This includes the user interface (including TTY switching at the console), as well as tasks which the kernel normally performs automatically, like responding to network activity (like ARP responses and TCP acknowledgements) or performing thermal management tasks (like adjusting system fan speeds). This is bad. Don't do it.
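To make the constraint concrete, here is a hedged sketch of how on_each_cpu_mask() is typically used; the per-CPU counter and both functions are hypothetical. The point is that the callback runs with interrupts disabled on each targeted CPU, so it must stay tiny:

```c
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/types.h>

static DEFINE_PER_CPU(u64, my_counter);

/* Runs on each CPU in the mask with interrupts disabled: keep it tiny. */
static void reset_my_counter(void *info)
{
    this_cpu_write(my_counter, 0);
}

static void reset_all_counters(void)
{
    /* wait=true: the caller blocks until every CPU has run the callback,
     * so a long-running callback stalls this CPU and every target CPU. */
    on_each_cpu_mask(cpu_online_mask, reset_my_counter, NULL, true);
}
```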
I have been reading about performance tuning of Linux to get the fastest packet-processing times when receiving financial market data. I see that when the NIC receives a packet, it puts it in memory via DMA and then raises a HardIRQ, which in turn sets up some NAPI state and raises a SoftIRQ. The SoftIRQ then uses NAPI/the device driver to read data from the RX buffers via polling, but this only runs for a limited budget (net.core.netdev_budget, which defaults to 300 packets). These questions are in reference to a real server running Ubuntu with a Solarflare NIC. My questions are below:
If each HardIRQ raises a SoftIRQ, and the device driver reads multiple packets in one go (netdev_budget), what happens to the SoftIRQs raised by each of the packets that were drained from the RX buffer in that one go (each packet received will raise a hard and then a soft IRQ)? Are these queued?
Why does NAPI use polling to drain the RX buffer? The system has just generated a SoftIRQ and is reading the RX buffer, so why the polling?
Presumably, draining of the RX buffer via the softirq will only happen from one specific RX buffer and not across multiple RX buffers? If so, can increasing netdev_budget delay the processing/draining of other RX buffers? Or can this be mitigated by assigning different RX buffers to different cores?
There are settings to ensure that HardIRQs are immediately raised and handled. However, SoftIRQs may be processed at a later time. Are there settings/configs to ensure that SoftIRQs related to network RX are also handled at top priority and without delays?
I wrote a comprehensive blog post explaining the answers to your questions and everything else about tuning, optimizing, profiling, and understanding the entire Linux networking stack here.
Answers to your questions:
Softirqs raised by the driver while a softirq is already being processed do nothing. This is because the NAPI helper code first checks to see whether NAPI is already running before attempting to raise the softirq. Even if NAPI did not check, you can see from the softirq source that softirqs are implemented as a bit vector. This means a softirq can only be 1 (pending) or 0 (not pending); while it is set to 1, additional calls to set it to 1 have no effect.
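If it helps, here is a tiny user-space model (plain C, not kernel code) of that bit vector; the softirq number 3 mirrors NET_RX_SOFTIRQ, but the whole thing is illustrative only:

```c
#include <stdio.h>

/* Illustrative only: models the per-CPU softirq pending word. */
enum { MY_NET_RX = 3 };            /* stand-in softirq number */
static unsigned long pending;      /* one bit per softirq type */

static void raise_softirq_model(int nr)
{
    pending |= 1UL << nr;          /* setting an already-set bit changes nothing */
}

int main(void)
{
    raise_softirq_model(MY_NET_RX);
    raise_softirq_model(MY_NET_RX);       /* second raise is absorbed */
    printf("pending = %#lx\n", pending);  /* still just one bit set */
    return 0;
}
```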
The softirq is used to start the NAPI poll loop and to control the NAPI poll so it does not eat 100% of CPU usage. The NAPI poll loop is just a for loop, and the softirq code manages how much time it can spend and how much budget it has.
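For reference, a NAPI poll callback in a driver usually follows this shape; mydev_rx_one() is a made-up helper standing in for whatever pulls one frame off the RX ring, while napi_complete_done() is the real kernel API:

```c
#include <linux/netdevice.h>

/* Hypothetical helper that pulls one frame from the RX ring and
 * hands it to the stack; returns false when the ring is empty. */
static bool mydev_rx_one(struct napi_struct *napi);

static int mydev_poll(struct napi_struct *napi, int budget)
{
    int done = 0;

    while (done < budget && mydev_rx_one(napi))
        done++;

    if (done < budget) {
        /* Ring drained before the budget ran out: leave polling mode. */
        napi_complete_done(napi, done);
        /* driver-specific: re-enable the device's RX interrupt here */
    }

    return done;
}
```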
Each CPU processing packets can spend the full budget. So, if the budget is set to 300 and you have 2 CPUs, each CPU can process 300 packets, for a total of 600. This is only true if your NIC supports multiple RX queues and you've distributed the IRQs to separate CPUs for processing. If your NIC doesn't, you can use Receive Packet Steering (RPS) to help with this. See my blog post above for more information.
No, there are no settings for this. Note that softirqs run on the same CPU that raised them. So, if you set your hardirq handler for RX queue 1 to CPU 16, then the softirq will run on CPU 16. One thing you can do is set your hardirqs to specific CPUs and pin the application that will consume that data to those same CPUs. Pin all other applications (like cron jobs, background tasks, etc.) to other CPUs; this ensures that only the hardirq handler, the softirq, and the application that will process the data run there.
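The hardirq side of that pinning is done by writing a CPU mask to /proc/irq/<N>/smp_affinity; the application side can be done with taskset or programmatically. A minimal sketch, assuming CPU 16 as in the example above:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(16, &set);  /* same CPU that handles the RX hardirq/softirq */

    /* Pin the current process (pid 0 == self) to CPU 16. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* ... run the latency-sensitive receive loop here ... */
    return EXIT_SUCCESS;
}
```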
If you desire extremely low latency network packet reads, you should try using a new Linux kernel networking feature called busy polling. It can be used to help minimize and reduce networking packet processing latency. Using this option will increase CPU usage.
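As a sketch of what enabling busy polling per socket looks like (this assumes a kernel and libc recent enough to expose SO_BUSY_POLL):

```c
#include <stdio.h>
#include <sys/socket.h>

/* Enable busy polling on an already-created TCP/UDP socket.
 * 'usec' is how long a blocking receive may spin on the device queue. */
static int enable_busy_poll(int sockfd, int usec)
{
    if (setsockopt(sockfd, SOL_SOCKET, SO_BUSY_POLL,
                   &usec, sizeof(usec)) != 0) {
        perror("setsockopt(SO_BUSY_POLL)");
        return -1;
    }
    return 0;
}
```

You would call, say, enable_busy_poll(fd, 50) for a 50-microsecond busy-poll window; the net.core.busy_poll and net.core.busy_read sysctls are the system-wide equivalents.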
I was looking at this picture and have two questions regarding it:
1. How much faster should a disk be in order for polling to be preferred over interrupts?
I thought that because of the ISR and the process switching (when using interrupts), polling would be better with a fast SSD, for example, where polling takes less time than the interrupt path (ISR + scheduler). Am I mistaken?
And the second question is: if my disk is slower than the SSD in my first question, but still fast, is there any reason to prefer polling?
I was wondering if the fact that I'll have lots of read I/O requests is a good enough reason to prefer polling.
thanks!
You could imagine a case where the overhead of running interrupt handlers (invalidating your caches, setting up the interrupt stack to run on) could be slower than actually doing the read or write, in which case I guess polling would be faster.
However, SSDs are fast compared to disk, but still much slower than memory. SSDs still take anywhere from tens of microseconds to milliseconds to complete each I/O, whereas doing the interrupt setup and teardown uses all in-memory operations and probably takes a maximum of, say, 100-1000 cycles (~100ns to 1us).
The main benefit of using interrupts instead of polling is that the cost when no I/O is pending is much lower, since you don't have to schedule your I/O threads to continuously poll for data while there is none available. It has the added benefit that I/O is handled immediately, so if a user types a key, there won't be a pause before the letter appears on the screen while the I/O thread is being scheduled. Combining these approaches is a mess - inserting arbitrary stalls into your I/O thread makes polling less resource-intensive at the expense of even slower response times. These may be the primary reasons nobody uses polling in kernel I/O designs.
From a user process' perspective (using software interrupt systems such as Unix signals), either way can make sense since polling usually means a blocking syscall such as read() or select() rather than (say) checking a boolean memory-mapped variable or issuing device instructions like the kernel version of polling might do. Having system calls to do this work for us in the OS can be beneficial for performance because the userland thread doesn't get its cache invalidated any more or less by using signals versus any other syscall. However this is pretty OS dependent, so profiling is your best bet for figuring out which codepath is faster.
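To make the user-space contrast concrete, here is a small sketch of the two styles against an ordinary file descriptor (assumed to already be open; the polling variant additionally assumes the descriptor was opened with O_NONBLOCK):

```c
#include <errno.h>
#include <unistd.h>

/* Interrupt-style: block in the kernel until data arrives. */
static ssize_t read_blocking(int fd, char *buf, size_t len)
{
    return read(fd, buf, len);           /* thread sleeps until woken */
}

/* Polling-style: fd must be O_NONBLOCK; spin until data shows up. */
static ssize_t read_polling(int fd, char *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0 || errno != EAGAIN)
            return n;                    /* data, EOF, or a real error */
        /* no data yet: loop again (optionally usleep() to back off) */
    }
}
```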
Problem - There is intermittent clock drift (of 2 seconds) on my Linux system, so once in a while the kernel timer threads get executed 2 seconds later than their timeout.
Question - There are multiple hardware clocksources (TSC, HPET, ACPI_PM); is it possible to create kernel timer threads that forcibly use a secondary clocksource as a fallback if the primary clocksource drifts?
What you describe doesn't sound like clock drift (a systematic error) but rather like lost timer interrupts. If you have another piece of hardware that can generate timed interrupts (HPET or the RTC, but not the TSC), you can drive your time-sensitive processing from either the timer or the interrupt handler (or handlers), whichever fires first; you just need to design some kind of synchronization between them.
If you experience genuine clock drift, where your clock runs slower than real time, you can try to estimate the drift and compensate for it when timers are scheduled. But lost interrupts are a sign of a more serious problem, and it makes sense to address the root cause, which may affect your secondary interrupt source as well.
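If you want to confirm from user space that ticks are being lost rather than the clock drifting, a timerfd is a simple probe: the read() returns the number of expirations since the last read, so any value greater than 1 means processing fell behind. A minimal sketch (the 1-second period is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <unistd.h>

int main(void)
{
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    struct itimerspec its = {
        .it_interval = { .tv_sec = 1 },   /* fire every second */
        .it_value    = { .tv_sec = 1 },
    };
    uint64_t expirations;

    if (fd < 0) {
        perror("timerfd_create");
        return 1;
    }
    timerfd_settime(fd, 0, &its, NULL);

    for (;;) {
        /* Blocks until at least one expiration; the value read is the
         * number of expirations since the last read. >1 means we (or
         * the timer) fell behind. */
        if (read(fd, &expirations, sizeof(expirations)) == sizeof(expirations)
            && expirations > 1)
            printf("missed %llu tick(s)\n",
                   (unsigned long long)(expirations - 1));
    }
}
```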
the target hardware platform has limited horsepower, and/or you want
the real-time job to put the smallest possible overhead on your
system. This is where dual kernels are usually better than a native
preemption system.
From here: http://www.xenomai.org/index.php/Xenomai:Roadmap#Xenomai_3_FAQ
preempt_rt does preempt the whole of Linux. In what way does preempting Linux put load on the system?
The FAQ there talks about the preempt_rt as compared to Xenomai.
CONFIG_PREEMPT_VOLUNTARY -
This option introduces checks at the most common sources of long latencies in kernel code, so that the kernel can voluntarily yield control to a higher-priority task waiting to execute. It is said to reduce the occurrences of long latencies to a great degree, but it doesn't eliminate them entirely.
CONFIG_PREEMPT_RT -
This option makes all kernel code outside of spinlock-protected regions (created by raw_spinlock_t) eligible for involuntary preemption by higher-priority kernel threads. Spinlocks created with spinlock_t and rwlock_t, and interrupt handlers, are also made preemptible with this option enabled. With this option, worst-case latency drops to (around) single-digit milliseconds.
Disadvantage -
The normal Linux kernel allows preemption of a task by a higher-priority task only while user-space code is executing.
In order to reduce latency, the CONFIG_PREEMPT_RT patch forces the kernel to involuntarily preempt the task at hand upon the arrival of a higher-priority kernel task. This is bound to cause a reduction in the overall throughput of the system, since there will be more context switches and the lower-priority tasks won't get much of a chance to get through.
Source:
https://rt.wiki.kernel.org/index.php/Frequently_Asked_Questions
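For completeness, the latency benefit of CONFIG_PREEMPT_RT is only seen by tasks that run under a real-time scheduling policy; a typical (simplified) user-space setup for such a task looks roughly like this (the priority value 80 is just an example, and it needs root or CAP_SYS_NICE):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 80 };  /* example priority */

    /* Lock all current and future pages so page faults cannot add latency. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    /* Make this process real-time (FIFO scheduling). */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... time-critical work loop here ... */
    return 0;
}
```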
Description of technical terms used:
What is latency?
The time elapsed between a demand issued on a computer system and the beginning of a response to the same demand is called latency or response time.
Kinds of latencies:
Interrupt Latency:
The time elapsed between the generation of an interrupt and the start of the execution of the corresponding interrupt handler.
Example: When a hardware device performs a task, it generates an interrupt. This interrupt has the information about the task to be performed and about the interrupt handler to be executed. The interrupt handler then performs the particular task.
Scheduling Latency:
It is the time between a wakeup signaling that an event has occurred and the kernel scheduler getting an opportunity to schedule the thread that is waiting for the wakeup to occur (the response). Scheduling latency is also known as dispatch latency.
Worst-case Latency:
The maximum amount of time that can elapse between a demand issued on a computer system and the beginning of a response to the same demand.
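A crude way to observe scheduling latency on a live system is to request a fixed sleep and measure how much later than requested the thread actually wakes; the overshoot is mostly wakeup/scheduling latency. A sketch, assuming CLOCK_MONOTONIC and a 1 ms sleep:

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec req = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */
    struct timespec before, after;

    clock_gettime(CLOCK_MONOTONIC, &before);
    nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &after);

    long elapsed_ns = (after.tv_sec - before.tv_sec) * 1000000000L
                    + (after.tv_nsec - before.tv_nsec);

    /* Anything beyond the requested 1 ms is wakeup/scheduling latency. */
    printf("overshoot: %ld ns\n", elapsed_ns - req.tv_nsec);
    return 0;
}
```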
What is throughput?
Amount of work that a computer can do in a given period of time is called throughput.
What is Context switch?
Context switch is the switching of the CPU from one process/thread to another. Context switches can occur only in kernel mode. This is the process of saving the current execution state of the process (for resuming execution later on), and loading the saved state of the new process/thread for execution.
Adding to the top-voted answer's "lower priority tasks won't be getting much of a chance to get through":
That's sort of the whole point (though on a 4+ core system those low-priority tasks could still run, as long as they are forbidden from doing things that would interfere with critical tasks - this is where it's important to make sure all the connected peripherals play nice by not blocking when the app running the critical thread wants to access them). The critical bit (if, for example, you're developing a system for processing external input in a timely manner, or testing data-conversion behaviour with live data as opposed to a model) is to have a way to tell the kernel where the time-critical input is arriving from.
The problem with current systems (e.g. Windows) is that you might be a "serious gamer or serious musician" who notices things like 150-microsecond jitter. If you have no way to specify that the keyboard, mouse or other human-interface device should be treated at critical priority, then all sorts of "Windows updates" and the like can come along and do their thing, which might in turn trigger activity in the USB controller that has higher priority than the threads handling the input.
I read about a case where glitches in audio were resolved by adding a second USB controller with nothing on it except the input device. In a portable setting, you practically need Thunderbolt PCIe passthrough to add a dedicated hub (or FPGA) that can, together with drivers, override everything else on the system. This is why there aren't many USB products on the market that provide good enough performance for musicians (2 ms round-trip latency with a maximum of 150 microseconds of jitter over a full day without dropouts).