I have been reading about performance tuning of Linux to get the fastest packet processing times when receiving financial market data. I see that when the NIC receives a packet, it puts it in memory via DMA and then raises a HardIRQ, which in turn sets some NAPI state and raises a SoftIRQ. The SoftIRQ then uses NAPI/the device driver to read data from the RX buffers via polling, but this only runs for a limited amount of work (net.core.netdev_budget, which defaults to 300 packets). These questions are in reference to a real server running Ubuntu with a Solarflare NIC. My questions are below:
If each HardIRQ raises a SoftIRQ, and the device driver reads multiple packets in one go (netdev_budget), what happens to the SoftIRQs raised by each of the packets that were drained from the RX buffer in that one go (each packet received will raise a hard and then a soft IRQ)? Are these queued?
Why does NAPI use polling to drain the RX buffer? The system has just generated a SoftIRQ and is reading the RX buffer, so why the polling?
Presumably, draining of the RX buffer via the SoftIRQ will only happen from one specific RX buffer and not across multiple RX buffers? If so, can increasing netdev_budget delay the processing/draining of other RX buffers? Or can this be mitigated by assigning different RX buffers to different cores?
There are settings to ensure that HardIRQs are immediately raised and handled. However, SoftIRQs may be processed at a later time. Are there settings/configs to ensure that SoftIRQs related to network RX are also handled at top priority and without delays?
I wrote a comprehensive blog post explaining the answers to your questions and everything else about tuning, optimizing, profiling, and understanding the entire Linux networking stack here.
Answers to your questions:
Softirqs raised by the driver while a softirq is already being processed do nothing. This is because the NAPI helper code first checks to see if NAPI is already running before attempting to raise the softirq. Even if NAPI did not check, you can see from the softirq source that softirqs are implemented as a bit vector. This means a softirq can only be 1 (pending) or 0 (not pending). While it is set to 1, additional calls to set it to 1 have no effect.
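As a minimal, purely illustrative sketch of that bit-vector behaviour (the names below are stand-ins, not the kernel's actual per-CPU state):

/* Illustrative only: a pending mask where "raising" an already-pending
 * softirq is a no-op, mirroring how the kernel's softirq pending bitmask behaves. */
#include <stdio.h>

#define NET_RX_SOFTIRQ 3               /* bit position, for illustration */

static unsigned long pending;          /* stands in for the per-CPU pending mask */

static void raise_softirq(int nr)
{
    pending |= 1UL << nr;              /* setting an already-set bit changes nothing */
}

int main(void)
{
    raise_softirq(NET_RX_SOFTIRQ);
    raise_softirq(NET_RX_SOFTIRQ);     /* the second raise is absorbed */
    printf("pending mask: %#lx\n", pending);   /* still only one bit set */
    return 0;
}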
The softirq is used to start the NAPI poll loop and to keep the NAPI poll from eating 100% of the CPU. The NAPI poll loop is just a for loop, and the softirq code manages how much time it can spend and how much budget it has.
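A heavily simplified, self-contained sketch of that budget/time accounting (loosely modelled on the kernel's net_rx_action; the poll callback and the limits used here are made up for illustration):

/* Simulated NET_RX softirq loop: keep calling a driver "poll" routine until
 * either the packet budget or the time allowance runs out, then stop and
 * leave the rest for a later softirq run. */
#include <stdio.h>
#include <time.h>

static int fake_poll(int weight)       /* stands in for the driver's NAPI poll */
{
    return weight;                     /* pretend we drained 'weight' packets */
}

int main(void)
{
    int budget = 300;                  /* like net.core.netdev_budget */
    int weight = 64;                   /* typical per-device NAPI weight */
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);

    while (budget > 0) {
        budget -= fake_poll(weight);   /* drain up to 'weight' packets per pass */

        clock_gettime(CLOCK_MONOTONIC, &now);
        double elapsed_us = (now.tv_sec - start.tv_sec) * 1e6 +
                            (now.tv_nsec - start.tv_nsec) / 1e3;
        if (elapsed_us > 2000.0)       /* roughly net.core.netdev_budget_usecs */
            break;                     /* out of time: defer the rest */
    }
    printf("stopped with %d budget left\n", budget);
    return 0;
}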
Each CPU processing packets can spend the full budget. So, if the budget is set to 300 and you have 2 CPUs, each CPU can process 300 packets, for a total of 600. This is only true if your NIC supports multiple RX queues and you've distributed the IRQs to separate CPUs for processing. If your NIC doesn't, you can use Receive Packet Steering (RPS) to help with this. See my blog post above for more information.
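If you do want to enable RPS on a single-queue NIC, the usual approach is to write a CPU bitmask into sysfs. A small sketch (the interface name eth0 and the mask value are assumptions for illustration; adjust for your NIC and topology):

/* Sketch: enable RPS for the first RX queue of eth0 by writing a CPU bitmask
 * (0xf = CPUs 0-3) to its rps_cpus file. Requires root. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    fputs("f\n", f);                   /* hex bitmask: steer packets to CPUs 0-3 */
    fclose(f);
    return 0;
}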
No, there are no settings for this. Note that softirqs run on the same CPU which raised them. So, if you set your hardirq handler for RX queue 1 to CPU 16, then the softirq will run on CPU 16. One thing you can do: set your hardirqs to specific CPUs and pin the application that will use that data to those same CPUs. Pin all other applications (like cron jobs, background tasks, etc.) to other CPUs; this ensures that only the hardirq handler, the softirq, and the application that will process the data run there.
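A minimal sketch of that pinning, assuming an IRQ number of 120 and CPU 16 purely for illustration (writing the IRQ affinity mask needs root; the consuming application can pin itself with sched_setaffinity):

/* Sketch: steer IRQ 120 to CPU 16 and pin the current (consumer) process to
 * the same CPU, so hardirq, softirq, and application share a core and cache. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* 1. IRQ affinity: CPU 16 => bitmask with bit 16 set => 0x10000. */
    FILE *f = fopen("/proc/irq/120/smp_affinity", "w");
    if (f) {
        fputs("10000\n", f);
        fclose(f);
    }

    /* 2. Pin this process to CPU 16 as well. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(16, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* ... consume market data here ... */
    return 0;
}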
If you desire extremely low-latency network packet reads, you should try using a newer Linux kernel networking feature called busy polling. It can be used to help reduce network packet processing latency, at the cost of increased CPU usage.
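Busy polling can be enabled system-wide via the net.core.busy_read and net.core.busy_poll sysctls, or per socket with SO_BUSY_POLL. A small per-socket sketch (the 50-microsecond value is just an example, and the NIC driver must support busy polling):

/* Sketch: ask the kernel to busy-poll the device queue for up to 50 us on
 * blocking reads of this socket. */
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46                /* from <asm-generic/socket.h>, in case headers are old */
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int busy_usecs = 50;               /* example value, in microseconds */

    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_usecs, sizeof(busy_usecs)) != 0)
        perror("setsockopt(SO_BUSY_POLL)");

    /* ... bind() and recv() market data as usual ... */
    return 0;
}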
Related
Assuming I set the process to the highest possible priority and there is no swap...
What's the longest time that a thread, which is blocking on receiving from an RS232 serial port, can take to wake up?
I want to know whether the thread will be woken within microseconds of the UART interrupt hitting the kernel, or whether it will have to wait for the next 100ms timeslice on a CPU.
What's the longest time that a thread, which is blocking on receiving from an RS232 serial port, can take to wake up?
Depending on the mode (e.g. canonical) a process could wait forever (e.g. for the EOL character).
I want to know whether the thread will be woken within microseconds of the UART interrupt hitting the kernel, or
The end of frame (i.e. the stop bit) on the wire is a better (i.e. consistent) reference point.
"UART interrupt hitting the kernel" is a poor reference point considering interrupt generation and processing can be deferred.
A UART FIFO may not generate an interrupt for every character/byte.
The interrupt controller prioritizes pending interrupts, and UARTs are rarely assigned high priorities.
Software can disable interrupts for critical regions.
whether it will have to wait for the next 100ms timeslice on a CPU.
The highest-priority runnable process gets control after a syscall completes.
Reference: Linux Kernel Development: Preemption and Context Switching:
Consequently, whenever the kernel is preparing to return to user-space, either on return from an interrupt or after a system call, the value of need_resched is checked. If it is set, the scheduler is invoked to select a new (more fit) process to execute.
I'm looking to minimise Linux serial latency between the received stop bit and the start bit of the reply from a high-priority userspace thread.
I suspected that is what you are really seeking.
Configuration of the serial terminal is crucial for minimizing such latency, e.g. research the ASYNC_LOW_LATENCY serial flag.
However, configuration of the Linux kernel can further improve/minimize such latency; e.g. this developer reports an order-of-magnitude reduction, from milliseconds to only ~100 microseconds.
I'm only familiar with serial interfaces on ATMEGA and STM32 microcontrollers ...
Then be sure to review Linux serial drivers.
I have a real-time process sending occasional communication over RS232 to a high speed camera. I have several other real-time processes occupying a lot of CPU time, doing image processing on several GPU boards using CUDA. Normally the serial communication is very fast, with a message and response taking about 50 ms every time. However, when the background processes are busy doing image processing, the serial communication slows way down, often taking multiple seconds (sometimes more than 10 seconds).
In summary, during serial communication, Process A is delayed if Process B, C, etc., are very busy, even though process A has the highest priority:
Process A (real-time, highest priority): occasional serial communication
Processes B, C, D, etc. (real-time, lower priority): heavy CPU and GPU processing
When I change the background processes to be SCHED_OTHER (non-real-time) processes, the serial communication is fast; however, this isn't a solution for me, because the background processes need to be real-time processes (when they are not, the GPU processing doesn't keep up adequately with the high speed camera).
Apparently the serial communication is relying on some non-real-time process in the system, which is being pre-empted by my real-time background processes. I think if I knew which process was being used for serial communication, I could increase its priority and solve the problem. Does anyone know whether serial communication relies on any particular process running on the system?
I'm running RHEL 6.5, with the standard kernel (not PREEMPT_RT). It has dual 6-core CPUs.
At Erki A's suggestion, I captured an strace. Apparently it is a select() system call which is slow (the "set roi2" is the command to the camera, and the "Ok!" at the end is the response from the camera):
write(9, "set roi2"..., 26) = 26 <0.001106>
ioctl(9, TCSBRK, 0x1) = 0 <0.000263>
select(10, [9], NULL, NULL, {2, 0}) = 1 (in [9], left {0, 0}) <2.252840>
read(9, "Ok!\r\n", 4096) = 5 <0.000092>
The slow select() makes it seem like the camera itself is slow to respond. However, I know that isn't true, because of how the speed is impacted by changing the background process priorities. Is select() in this case dependent on a certain other process running?
If I skip the select() and just do the read(), the read() system call is the slow one.
Depending on your serial device/driver, the serial communications are most likely relying on a kernel worker thread (kworker) to shift the incoming serial data from the interrupt service routine buffers to the line discipline buffers. You could increase the priority of that kernel worker thread; however, worker threads process the shared work queue, so increasing the priority of the worker thread will raise the priority of the serial processing along with a whole bunch of other stuff that possibly doesn't need the boost.
You could modify the serial driver to use a dedicated high-priority work queue rather than a shared one (a rough sketch follows). Another option would be to use a tasklet; however, both of these require driver-level modifications.
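For what such a driver-level modification might look like, here is a kernel-flavoured sketch (illustrative only, not a drop-in patch; the function and variable names are made up) of pushing the RX flush work onto a dedicated high-priority workqueue instead of the shared one:

/* Sketch: allocate a dedicated WQ_HIGHPRI workqueue in the driver and queue
 * the "push data to the line discipline" work onto it, so it is serviced by
 * high-priority kworkers instead of the shared system workqueue. */
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *serial_rx_wq;
static struct work_struct serial_rx_work;

static void serial_rx_work_fn(struct work_struct *work)
{
    /* ... move bytes from the ISR buffer into the tty line discipline ... */
}

static int serial_rx_setup(void)
{
    serial_rx_wq = alloc_workqueue("serial_rx", WQ_HIGHPRI, 0);
    if (!serial_rx_wq)
        return -ENOMEM;
    INIT_WORK(&serial_rx_work, serial_rx_work_fn);
    return 0;
}

/* Called from the interrupt path instead of schedule_work(): */
static void serial_rx_kick(void)
{
    queue_work(serial_rx_wq, &serial_rx_work);
}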
I suspect the most straightforward solution would be to set the com port to low_latency mode, either from the command line via the setserial command:
setserial /dev/ttySxx low_latency
or programmatically:
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/serial.h>

struct serial_struct serinfo;
int fd = open("/dev/ttySxx", O_RDWR);   /* substitute your actual device node */
ioctl(fd, TIOCGSERIAL, &serinfo);
serinfo.flags |= ASYNC_LOW_LATENCY;
ioctl(fd, TIOCSSERIAL, &serinfo);
close(fd);
This will cause the serial port interrupt handler to transfer the incoming data to the line discipline immediately rather than deferring the transfer by adding it to a work queue. In this mode, when you call read() from your application, you will avoid the possibility of the read() call sleeping, which it would otherwise do if there is work in the work queue to flush. This sleep is probably the cause of your intermittent delays.
You can use strace to see where it locks up. If it is more than 10 seconds, it should be easy to see.
In my application there is a Linux thread that needs to be active every 10 ms,
thus I use usleep(10*1000). Result: the thread never wakes up after 10 ms but always after 20 ms. OK, it is related to the scheduler timeslice, CONFIG_HZ, etc.
I was trying to use usleep(1*1000) (that is, 1 ms) but the result was the same: the thread always wakes up after 20 ms.
But in the same application another thread handles network events (UDP packets) that come in every 10 ms. There is a blocking 'recvfrom' (or 'select'), and it wakes up every 10 ms when there is an incoming packet.
Why is this so? Does select have to put the thread to 'sleep' when there are no packets? Why does it behave differently, and how can I cause my thread to be active every 10 ms (more or less) without external network events?
Thanks,
Rafi
You seem to be under the common impression that these modern preemptive multitaskers are all about timeslices and quantums.
They are not.
They are all about software and hardware interrupts, and the timer hardware interrupt is only one of many that can set a thread ready and change the set of running threads. The hardware interrupt from a NIC that causes a network driver to run is an example of another one.
If a thread is blocked waiting for UDP datagrams, and a datagram becomes available because of a NIC interrupt running a driver, the blocked thread will be made ready as soon as the NIC driver has run, because the driver will signal the thread and request an immediate reschedule on exit. If your box is not overloaded with higher-priority ready threads, it will be set running 'immediately' to handle the datagram that is now available. This mechanism provides high-performance I/O and has nothing to do with any timers.
The timer interrupt runs periodically to support sleep() and other system-call timeouts. It runs at a fairly low frequency / long interval (e.g. every 10 ms), because it is another overhead that should be minimised. Running such an interrupt at a higher frequency would give finer timer granularity at the expense of increased interrupt-state and rescheduling overhead that is not justified in most desktop installations.
Summary: your timer operations are subject to 10ms granularity but your datagram I/O responds quickly.
Also, why does your thread need to be active every 10 ms? What are you polling for?
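If a roughly periodic wakeup really is needed, one common pattern is to sleep until an absolute deadline rather than for a relative interval, so errors don't accumulate. A small sketch (actual wakeup granularity still depends on the kernel's timer configuration, e.g. CONFIG_HZ and high-resolution timers):

/* Sketch: wake roughly every 10 ms by sleeping until an absolute deadline
 * with clock_nanosleep(), instead of usleep()'s relative delay. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (int i = 0; i < 100; i++) {
        next.tv_nsec += 10 * 1000 * 1000;          /* +10 ms */
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        /* ... do the periodic work here ... */
    }
    return 0;
}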
What makes the softirq so special that we use it for high-frequency work, like in network drivers and block drivers?
SoftIrqs are typically used to complete queued work from a processed interrupt because they fit that need very well - they run with second-highest priority, but still run with hardware interrupts enabled.
Processing hw interrupts is the utmost priority, since if they are not processed quickly, then either too much latency is introduced and the user experience suffers, or the hardware buffer fills before the interrupt services the device, and data is lost. Don't service a network adapter fast enough? It's going to overwrite data in the FIFO and you'll lose packets. Don't service a hard disk fast enough? The hard drive stalls queued read requests because it has nowhere to put the results.
SoftIrqs allow the critical part of servicing hardware interrupts to be as short as possible; instead of having to process the entire hw interrupt on the spot, the important data is read off the device into RAM or otherwise, and then a SoftIrq is started to finish the work. This keeps the hardware interrupts disabled for the shortest period of time, while still completing the work with high priority.
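As a rough, kernel-flavoured sketch of that split (illustrative only; a real driver has far more bookkeeping, and my_nic and the nic_* helpers are made-up names), the hard IRQ handler does the bare minimum and hands the rest to NAPI/softirq context:

/* Sketch: top half acknowledges the device and schedules NAPI; the heavy
 * lifting (walking the RX ring, building skbs) happens later in softirq
 * context via the poll callback, with hardware interrupts re-enabled. */
static irqreturn_t nic_hardirq(int irq, void *dev_id)
{
    struct my_nic *nic = dev_id;          /* 'my_nic' is an illustrative type */

    nic_ack_and_mask_irq(nic);            /* illustrative: ack + mask device IRQs */
    napi_schedule(&nic->napi);            /* raise the NET_RX softirq for this device */
    return IRQ_HANDLED;
}

static int nic_poll(struct napi_struct *napi, int budget)
{
    struct my_nic *nic = container_of(napi, struct my_nic, napi);
    int done = nic_process_rx_ring(nic, budget);   /* illustrative: drain up to 'budget' */

    if (done < budget) {
        napi_complete_done(napi, done);   /* no more work: leave polled mode */
        nic_unmask_irq(nic);              /* illustrative: re-enable device IRQs */
    }
    return done;
}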
This article is a decent reference on the matter:
https://lwn.net/Articles/520076/
Edits for questions:
SoftIrqs are re-entrant - they can be processed on any CPU. From the article I linked:
There are two places where software interrupts can "fire" and preempt the current thread. One of them is at the end of the processing for a hardware interrupt; it is common for interrupt handlers to raise softirqs, so it makes sense (for latency and optimal cache use) to process them as soon as hardware interrupts can be re-enabled.
Emphasis added. They can be processed inline - I believe this means they can be processed without causing a context switch, which means as soon as hardware interrupts are enabled, we can jump straight to the SoftIrq right where we are with as little CPU cache abuse as possible. All of this contributes to SoftIrqs being lightweight but flexible, which makes them ideal for high-frequency processing.
They can be pushed to another CPU if needed, which improves throughput.
They can be processed immediately after hwints are enabled right in the current context, preserving processor state as much as possible, improving latency
They allow hardware interrupts to keep processing, since those are our most important goal
They can be rescheduled to the ksoftirqd process if load is too high and we need to take time from normal user processes.
I have a multiqueue NIC card on a 4-core Intel machine, and
I bind every queue of the NIC card to a CPU core (by setting /proc/irq/xxx/smp_affinity).
Let's say queue0 on core0, queue1 on core1 and so on.
It is said that the softirq will run on the same core where the hardware interrupt occurred.
Why can't ksoftirqd run in parallel on my machine?
Just one kernel thread (like ksoftirqd/2) uses 100% of a core while the others sit at 0%.
when I use
cat /proc/interrupts | grep eth1
I can see that all packets are evenly distributed across all the NIC queues.
Update:
Here is a solution for the 100% softirq problem if you can read Chinese (please see #7):
http://hi.baidu.com/higkoo/item/42ba6c353bc8aed76d15e9c3
If not: in short, that blog says adding another NIC card solves the problem.
ksoftirqd doesn't need to run in parallel because it doesn't normally run the softirqs. All softirqs are normally run on the CPU where they were requested, right after the interrupt that requested them.
Softirqs will be run on ksoftirqd only in case of a softirq flood: after the kernel executes an interrupt, it checks whether it needs to run any softirqs, and if so it runs them. While these run, interrupts are enabled, so it is possible that while you were running the softirqs an interrupt occurred that marked them for running again. This is why the kernel checks again for marked softirqs after running them.
It should be obvious that, given a flood of interrupts, this can turn into a livelock very fast - where all we do is run softirqs and interrupts and never any user code. This is why the kernel has a "damper" mechanism: if, after 10 rounds of running softirqs and re-checking, some are still marked, the kernel will not run softirqs at the end of the interrupt and will instead wake the special kernel thread ksoftirqd to run them until the flood is over.
This is a watchdog mechanism to handle IRQ floods, and most of the time it is dormant, so having a multi-threaded ksoftirqd will not really help you in the normal case.
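The restart/damper logic looks roughly like this (a simplified, kernel-flavoured sketch of __do_softirq, not the exact source; newer kernels also add a time limit and a need_resched check):

/* Sketch: run the pending softirqs; if interrupts keep re-marking them, retry
 * up to MAX_SOFTIRQ_RESTART times, then hand the flood off to ksoftirqd. */
#define MAX_SOFTIRQ_RESTART 10

static void do_softirq_sketch(void)
{
    int max_restart = MAX_SOFTIRQ_RESTART;
    unsigned long pending = local_softirq_pending();

restart:
    /* ... clear the mask and run the handler for every bit set in 'pending' ... */

    pending = local_softirq_pending();      /* interrupts may have set new bits */
    if (pending) {
        if (--max_restart)
            goto restart;                   /* flood not over yet: try again */
        wakeup_softirqd();                  /* give up: let ksoftirqd take over */
    }
}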
ksoftirqd is the base of all polling routines in the kernel, including the polling of the network queues of your card.
As such, the triggering of ksoftirqd will affect how well it threads.
The fact is that it doesn't thread at all. This is because the timer triggering ksoftirqd is always delivered to the same core.
But, you ask this question with a goal in mind. It might make sense to talk about that goal first, not about this detail of the implementation towards that goal.