How to test the Linux NAPI feature?

I am trying to test the NAPI functionality in an embedded Linux environment. I used pktgen to generate a large number of packets and checked the interrupt count of my network interface in /proc/interrupts.
I found that the interrupt count is considerably lower than the number of packets generated.
I also tried tuning the netdev_budget value from 1 to 1000 (the default is 300), expecting to observe a reduction in the interrupt count as netdev_budget is increased.
However, increasing netdev_budget doesn't seem to help; the interrupt count is similar to what I observed with netdev_budget set to 300.
So here are my queries:
What is the effect of 'netdev_budget' on NAPI?
What other parameters can/should I tune to observe changes in the interrupt count?
Is there any other way to test the NAPI functionality on Linux (apart from looking directly at the network driver code)?
Any help is much appreciated.
Thanks in advance.

I wrote a comprehensive blog post about Linux network tuning which explains everything about monitoring, tuning, and optimizing the Linux network stack (including the NAPI weight). Take a look.
Keep in mind: some drivers do not disable IRQs from the NIC when NAPI starts. They are supposed to, but some simply do not. You can verify this by examining the hard IRQ handler in the driver to see if hard IRQs are being disabled.
Note that hard IRQs are re-enabled in some cases as mentioned in the blog post and below.
As for your questions:
Increasing netdev_budget increases the number of packets that the NET_RX softirq can process. The number of packets that can be processed is also limited by a time limit, which is not tunable. This is to prevent the NET_RX softirq from consuming 100% of the CPU. If the device does not receive enough packets to process during its time allocation, hard IRQs are re-enabled and NAPI is disabled.
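For reference, here is a minimal sketch of what a driver's NAPI poll callback typically looks like; the struct and helper names are made up for illustration, but the napi_* calls are the real kernel API. The budget argument caps the work done per call, while netdev_budget and the time limit cap the whole NET_RX softirq run.

#include <linux/netdevice.h>

struct example_priv {
    struct napi_struct napi;
    /* ... device state, RX ring pointers, etc. ... */
};

/* Hypothetical helpers a real driver would implement against its hardware. */
static bool example_rx_ring_has_packets(struct example_priv *priv);
static void example_process_one_packet(struct example_priv *priv);
static void example_enable_rx_irq(struct example_priv *priv);

static int example_napi_poll(struct napi_struct *napi, int budget)
{
    struct example_priv *priv = container_of(napi, struct example_priv, napi);
    int work_done = 0;

    /* Process at most 'budget' packets in this call. */
    while (work_done < budget && example_rx_ring_has_packets(priv)) {
        example_process_one_packet(priv);
        work_done++;
    }

    if (work_done < budget) {
        /* Ring drained: leave polling mode and re-enable the RX
         * hard IRQ so the next packet raises an interrupt again. */
        napi_complete_done(napi, work_done);
        example_enable_rx_irq(priv);
    }

    /* Returning 'budget' keeps the device on the poll list; the softirq
     * itself stops early if its overall packet or time allowance runs out. */
    return work_done;
}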
You can also try modifying your IRQ coalescing settings for the NIC, if it is supported. See the blog post above for more information on how to do this and what this means, exactly.
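If you want to check these settings programmatically instead of with the ethtool binary, here is a small sketch using the ETHTOOL_GCOALESCE ioctl; the interface name "eth0" is an assumption, substitute your own.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* assumed interface name */
    ifr.ifr_data = (char *)&ec;

    if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_GCOALESCE");  /* fails if the driver has no coalescing support */
        return 1;
    }
    printf("rx-usecs=%u rx-frames=%u\n",
           ec.rx_coalesce_usecs, ec.rx_max_coalesced_frames);
    close(fd);
    return 0;
}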
You should add monitoring to your /proc/net/softnet_stat file. The fields in this file can help you figure out how many packets are being processed, whether you are running out of time, etc.
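As a starting point, here is a small sketch that prints the first three per-CPU counters from /proc/net/softnet_stat; on mainline kernels these are the processed, dropped, and time_squeeze counts (check your kernel's softnet_seq_show() if in doubt).

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/softnet_stat", "r");
    unsigned int processed, dropped, time_squeeze;
    char line[512];
    int cpu = 0;

    if (!f) {
        perror("fopen");
        return 1;
    }
    /* One line per CPU; the fields are hex counters. */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%x %x %x", &processed, &dropped, &time_squeeze) == 3)
            printf("cpu%-2d processed=%u dropped=%u time_squeeze=%u\n",
                   cpu, processed, dropped, time_squeeze);
        cpu++;
    }
    fclose(f);
    return 0;
}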
A question for you to consider, if I may:
Why does your hardirq rate matter? It probably doesn't matter, directly. The hardirq handler in your NIC driver should do as little work as possible, so the fact that it executes frequently is probably not a problem for your system. If it is, you should carefully measure it, as that seems very unlikely. Nevertheless, you can adjust IRQ coalescing settings and IRQ CPU affinity to alter the number of hardirqs generated by the NIC and which CPU processes them, respectively.
You should consider whether you are more interested in packet processing throughput or packet processing latency. Depending on which is your concern, you can tune your network stack appropriately.
Remember: to completely tune and optimize your Linux networking stack, you have to monitor and tune each component. They are all intertwined and it is difficult (and often insufficient) to monitor and tune just a single aspect of the stack.

Related

What options do I have for running recurring events at microsecond resolution from a kernel driver?

I want to create a simulation of an actual device on an x86 Linux kernel. Part of this will involve simulating timings as close to the real thing as I can get. Based on some research, it seems I will need at least microsecond-resolution timing. I understand that on a non-realtime system it won't be possible to get perfect timing, but I don't need perfect, just as close as I can get, perhaps with some hacking around with thread scheduling / preemption options.
What I actually want to do is perform an action at a fixed interval, i.e. run some code every X µs. I've been researching the best ways to do this from a kernel driver, as well as whether it's possible to do it reasonably accurately from user mode (keeping the above paragraph in mind). One of the first things that caught my eye was the HPET timer, which can be programmed to generate interrupts based on programmable comparators. Unfortunately, it seems to have been rather buggy on many chipsets in the past, and there's not much information on using it for anything other than obtaining a timestamp or using it as the main clock source. The Linux kernel provides an HPET driver that in the past seemed to offer both kernel and user mode interfaces, but in more recent kernel versions seems to provide only a barely documented usermode interface. I've also read about various other kernel functions and interfaces such as the hrtimer interface and the various delay functions, though I'm having a bit of trouble understanding them and whether they are suited to my purpose.
Given my current use case, what are the best options I have for running recurring events at µs resolution from, say, a kernel driver? Obviously accuracy is probably my biggest criterion, but ease of use would be second.
Well, it's possible to achieve this accuracy in userspace: clock_nanosleep is one good option, and it has both relative and absolute modes. Since clock_nanosleep is based on hrtimers in kernel mode, you may want to use an hrtimer directly if you'd like to implement this in kernel space.
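For the kernel-space route, a minimal hrtimer sketch might look like the following (module boilerplate trimmed to the essentials; the 100 µs period and the names are arbitrary).

#include <linux/module.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer example_timer;
static ktime_t example_period;

static enum hrtimer_restart example_timer_fn(struct hrtimer *timer)
{
    /* Do the (short!) periodic action here; this runs in hard IRQ context. */
    hrtimer_forward_now(timer, example_period);
    return HRTIMER_RESTART;
}

static int __init example_init(void)
{
    example_period = ktime_set(0, 100000);   /* 100 us */
    hrtimer_init(&example_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    example_timer.function = example_timer_fn;
    hrtimer_start(&example_timer, example_period, HRTIMER_MODE_REL);
    return 0;
}

static void __exit example_exit(void)
{
    hrtimer_cancel(&example_timer);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");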
However, to make the timer work accurately, there are two IMPORTANT things worth mentioning.
You should set the timer slack of your process to a small value (either by writing a nonzero value in nanoseconds to /proc/self/timerslack_ns or via prctl(PR_SET_TIMERSLACK, ...)). This value is treated as the 'tolerance' of the timer, and the default is 50 µs.
CPU power management also matters here. The CPU has many different C-states, each of which has a different exit latency, so you need to configure the cpuidle subsystem not to use C-states other than C0. For example, on an Intel CPU you could simply write 1 to /sys/devices/system/cpu/cpu$c/cpuidle/state$s/disable to disable state $s of CPU $c. Alternatively, add idle=poll to your kernel command line to keep the CPU active (in C0) while the kernel is idle. NOTE that this significantly increases the machine's power consumption and will make the cooling fans noisy.
You can get a timer with delays under 10 microseconds if the two things mentioned above are configured correctly. There is a trade-off between latency and power consumption that you have to make.
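Putting the userspace pieces together, here is a minimal sketch; the 100 µs period and iteration count are arbitrary, and do_work() is a placeholder for whatever you need to run each period.

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/prctl.h>

#define PERIOD_NS 100000L   /* 100 us period, adjust as needed */

static void do_work(void) { /* placeholder for the periodic action */ }

static void add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec++;
    }
}

int main(void)
{
    struct timespec next;

    /* Ask the kernel for minimal timer slack (the default is 50 us). */
    if (prctl(PR_SET_TIMERSLACK, 1UL, 0, 0, 0) != 0)
        perror("PR_SET_TIMERSLACK");

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < 1000; i++) {
        add_ns(&next, PERIOD_NS);
        /* Sleep until the absolute deadline, so do_work() time does not cause drift. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        do_work();
    }
    return 0;
}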

SMP affinity routing doesn't work with GICv2 on ARM

There are 4 CPU cores and one Ethernet card on my Raspberry Pi.
I need interrupts from NIC to be routed to all the 4 CPU cores.
I set /proc/irq/24/smp_affinity to 0xF (1111), but that doesn't help.
In the sixth column of /proc/interrupts I don't see IO-APIC (which definitely supports* affinity routing) but GICv2 instead. I still can't find any useful info about GICv2 and smp_affinity.
Does GICv2 support SMP affinity routing?
*UPD:
from that post:
The only reason to look at this value is that SMP affinity will only work for IO-APIC enabled device drivers.
TL;DR - The existence of /proc/irq/24/smp_affinity indicates that your Linux SMP system supports affinity. The text IO-APIC is the type of interrupt controller (typical PC) and it does NOT indicate that the system can handle affinities. On ARM systems a GIC is usually the interrupt controller, although some interrupts can be routed to a 'sub-controller'.
At least mainline supports some affinity handling, per Kconfig. However, I am not sure what you are trying to do. The interrupt can only run on one CPU, as only one CPU can take the data off the NIC. If a particular CPU is running network code and the rest are used for other purposes, the affinity makes sense.
The data on that core will probably not be in cache, as the NIC buffers are probably DMA and not cacheable. So I am not really sure what you would achieve, or how you would expect the interrupts to run on all four CPUs. If you have four NIC interfaces, you can peg each to a CPU. This may be good for power consumption issues.
Specifically, for your case of four CPUs, the affinity mask of 0xf allows all of them and is the default, so it effectively disables any affinity restriction. You can cat /proc/irq/24/smp_affinity to see that the affinity is set; the existence of this file is what indicates that your Linux SMP system supports affinity, while the text IO-APIC is merely the type of interrupt controller (typical of a PC) and does NOT indicate that the system can handle affinities.
See also:
zero copy vs kernel by-pass
University of Waterloo doc
IRQ-affinity.txt
NOTE: This part is speculative and is NOT how any card I know of works.
The major part of what you want is not generally possible. The NIC registers are a single resource. There are multiple registers, and there are defined sequences of reads and writes to perform an operation. If two CPUs were writing (or even reading) the registers at the same time, it would severely confuse the NIC. Often the CPU is not that heavily involved in an interrupt; frequently a DMA engine only needs to be told about the next buffer.
In order for what you want to be useful, you would need a NIC with several register 'banks' that can be used independently. For instance, separate READ/WRITE packet banks are easy to comprehend. However, there may be several banks for writing different packets, and then the card would have to manage how to serialize them. Also, the card could do some packet inspection and interrupt different CPUs based on fixed packet values, i.e. a port and IP address. This packet matching would generate different interrupt sources, and different CPUs could handle different matches.
This would allow you to route different socket traffic to a particular CPU using a single NIC.
The problem is that making such a card in hardware would be incredibly complex compared to existing cards. It would be more expensive and would take more power to operate.
If it is standard NIC hardware, there is no gain in rotating CPUs if the original CPU is not busy. If there is non-network activity, it is better to leave the other CPUs alone so their caches can be used for a different workload (code/data). So in most cases it is best just to keep the interrupt on a fixed CPU unless that CPU is busy, and then it may ping-pong between a few CPUs. It would almost never be beneficial to run the interrupt on all CPUs.
I do not believe that GICv2 supports IRQ balancing. Interrupts will always be handled by the same CPU. At least this was the case when I last looked at this, for 5.1 kernels. The discussion at the time was that this would not be supported because it was not a good idea.
You will see that interrupts are always handled by CPU 0. Use something like ftrace or LTTng to observe which CPU is doing what.
I think via the affinity setting you could prevent the interrupt from running on a CPU, by setting that bit to zero. But this does not balance the IRQ over all CPUs on which it is allowed. It will still always go to the same CPU. But you could make this CPU 1 instead of 0.
So what you can do, is to put certain interrupts on different CPUs. This would allow something like SDIO and network to not vie for CPU time from the CPU 0 in their interrupt handlers. It's also possible to set the affinity of a userspace process such that it will not run on the same CPU which will handle interrupts and thereby reduce the time that the userspace process can be interrupted.
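For the userspace side of that, here is a minimal sketch using sched_setaffinity() to keep the current process off CPU 0, assuming a four-core system like the one discussed here.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    /* Allow this process on CPUs 1-3 only, leaving CPU 0 to the IRQs. */
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    CPU_SET(2, &set);
    CPU_SET(3, &set);

    /* pid 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run latency-sensitive work here, undisturbed by IRQs on CPU 0 ... */
    return 0;
}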
So why don't we do IRQ balancing? It ends up not being useful.
Keep in mind that the interrupt handler here is only the "hard" IRQ handler. This usually does not do very much work. It acknowledges the interrupt with the hardware and then triggers a back-end handler, like a workqueue, IRQ thread, softirq, or tasklet. These don't run in hard IRQ context and can and will be scheduled to a different CPU or CPUs based on the current workload.
So even if the network interrupt is always routed to the same CPU, the network stack is multi-threaded and runs on all CPUs. Its main work is not done in the hard IRQ handler that runs on one CPU. Again, use ftrace or LTTng to see this.
If the hard IRQ does very little, what is most important is to reduce latency, which is best done by running on the same CPU to improve cache effectiveness. Spreading it out is likely worse for latency and also for the total cost of handling the IRQs.
The hard IRQ handler can only run one instance at a time. So even if it were balanced, it could use just one CPU at any one time. If this were not the case, the handler would be virtually impossible to write without race conditions. If you want to use multiple CPUs at the same time, then don't do the work in a hard IRQ handler; do it in a construct like a workqueue. This is how the network stack works, and the block device layer too.
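A bare-bones sketch of that pattern; the device-specific parts (acknowledging the interrupt, the IRQ number) are placeholders, but request_irq(), schedule_work() and friends are the real kernel API.

#include <linux/interrupt.h>
#include <linux/workqueue.h>

static struct work_struct example_work;

static void example_work_fn(struct work_struct *work)
{
    /* The heavy lifting happens here, in process context, on whichever
     * CPU the scheduler picks. */
}

static irqreturn_t example_hardirq(int irq, void *dev_id)
{
    /* Acknowledge the device (placeholder) and get out quickly. */
    /* example_ack_irq(dev_id); */
    schedule_work(&example_work);
    return IRQ_HANDLED;
}

/* In the driver's probe/init path:
 *     INIT_WORK(&example_work, example_work_fn);
 *     request_irq(irq, example_hardirq, 0, "example", dev);
 */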
IRQs aren't balanced, because it's not usually the answer. The answer is to not do the work in IRQ context.

How to measure latency between hardware interrupt and related system call?

I have a Linux machine with two PCIe RS-485 cards (XR17V354 & XR17V352). I have one port on one of the cards hardwired to one port on the other card. These cards are driven by the generic serial driver (serial8250).
I am running a test and measuring latency. I have one Linux process sending two bytes out the port and then listening for two incoming bytes. The other process receives the two bytes and immediately sends two bytes back.
I'm measuring this round-trip latency to be around 1500 microseconds with a standard deviation of about 40 microseconds. I am trying to understand the source of this latency. Specifically, I'd like to understand the difference between the time at which the hard IRQ fires to signal that data is ready to read and the time at which the bytes are made available to the user-space process.
I am aware of the ftrace feature, but I am not sure how best to utilize it, or if there are other, more suitable tools. Thanks.
What kind of driver is this? I assume it's a driver in kernel space and not UIO.
Independent of your issue, you could start by looking at how long it takes to get from the hardware interrupt to the kernel driver and from there to user space.
Here [1] is an ancient test case which can be hacked a bit so you can compare interrupt latencies on "standard" Linux, preempt-rt-patched Linux, and maybe something like Xenomai as well (although the Xenomai solution would require you to rewrite your driver).
You might want to have a look at [2], cyclictest and friends, and maybe try to drill into your system with perf to see more details system-wide.
Last but not least, have a look at LTTng [3], which enables you to instrument code and already ships with many instrumentation points.
[1] http://www.denx.de/wiki/DULG/AN2008_03_Xenomai_gpioirqbench
[2] http://cgit.openembedded.org/openembedded-core/tree/meta/recipes-rt/rt-tests/
[3] http://lttng.org/

Critical Timing in an ARM Linux Kernel Driver

I am running Linux on an MX28 (ARMv5), and am using a GPIO line to talk to a device. Unfortunately, the device has some special timing requirements: a low on the GPIO line cannot last longer than 7 µs; highs have no special timing requirements. The code is implemented as a kernel device driver and toggles the GPIO with direct register writes rather than going through the kernel GPIO API. For testing, I am just generating 3 pulses. The process is as follows, all in one function so it should fit in the instruction cache:
set gpio high
Save Flags & Disable Interrupts
gpio low
pause
gpio high
repeat 2x more
Restore Flags/Re-enable Interrupts
Here's the output of a logic analyzer tied to the GPIO.
Most of the time it works just great, and the pulses last just under 1us. However, about 10% of the lows last for many, many microseconds. Even though interrupts are disabled, something is causing the flow of the code to be interrupted.
I am at a loss. RT Linux would likely not help here, because the problem is not latency; it appears to be something happening during the low, even though nothing should interrupt it with IRQs disabled. Any suggestions would be greatly, greatly appreciated.
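For context, a kernel-space sketch of the sequence described above might look like the following; the register pointers, the bit mask, and the 500 ns delay are hypothetical placeholders; a real MX28 driver would use the GPIO set/clear register addresses from the datasheet.

#include <linux/irqflags.h>
#include <linux/io.h>
#include <linux/delay.h>

static void pulse_gpio(void __iomem *set_reg, void __iomem *clr_reg, u32 bit)
{
    unsigned long flags;
    int i;

    writel(bit, set_reg);         /* start with the line high */
    local_irq_save(flags);        /* save flags & disable IRQs on this CPU */

    for (i = 0; i < 3; i++) {
        writel(bit, clr_reg);     /* gpio low */
        ndelay(500);              /* pause, well under the 7 us budget */
        writel(bit, set_reg);     /* gpio high */
        ndelay(500);
    }

    local_irq_restore(flags);     /* restore flags / re-enable IRQs */
}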
The ARM cache on an IMX25 (ARM926) is 16K code, 16K data L1, with a 32-byte line length, or eight instructions. With the DDR-SDRAM controller running at 133 MHz and a 16-bit bus, the transfer rate is about 300 MB/s. A cache fill should only take about 100 ns, not 9 µs; this is about 100 times too long.
However, you have four other issues with Linux.
TLB misses and a page table walk.
Data aborts.
DMA masters stealing.
FIQ interrupts.
It is unlikely that the LCD master is stealing enough bandwidth, unless you have a huge display. Is your display larger than 1/4 VGA? If not, it uses only about 10% of the memory bandwidth, and this will pipeline with the processor. Do you have either Ethernet or USB active? These peripherals have higher data rates and could cause this type of contention with the SDRAM.
All of these issues may be avoided by writing your toggler PC-relative and copying it to IRAM. See iram_alloc.c; this file should be portable to older versions of Linux. The XBAR switch allows fetches from SDRAM and IRAM simultaneously. The IRAM can still be a target of other DMA masters. If you are really pressed, move the code to the ETB buffers, which no other master in the system can access.
The TLB-miss penalty can actually be quite steep, as it may require several single-beat SDRAM cycles; still, this should be under 1 µs. You have not posted code, so it is possible that a variable access or something else is causing a data fault, which is not maskable.
If you have any drivers using the FIQ, they may still be running even though you have masked the normal IRQ interrupts. For instance, the ALSA driver for this system normally uses the FIQ.
Both the ETB and the IRAM are 32-bit data paths and low wait state. Either one will probably give better response than the DDR-SDRAM.
We have achieved sub-microsecond response by using an FIQ and IRAM to toggle GPIOs on an IMX258 with another bit-banged protocol.
One possible workaround to the problem Chris mentioned (in addition to problems with paging of kernel module code) is to use a PWM peripheral where the duration of the pulse is pre-programmed and the timing is implemented in hardware.
Fancy processors with caches are not suitable for hard realtime work. Execution time varies if cache misses are non-deterministic (and designs where cache misses are completely deterministic aren't complicated enough to justify a fancy processor).
You can try to avoid memory-controller latency during critical sections by aligning the critical section so that it doesn't straddle cache lines, or by prefetching the code you will need. But this is going to be very non-portable and will create a nightmare for future maintenance. And it still doesn't protect the access to memory-mapped GPIO from bus contention.

Low latency interrupt handling (what is the expected average time to return from kernel to user space?)

I have a Fibre Optic link, with a proprietary Device Driver.
The link goes into a PCIe card. I am running RHEL 5.2 (2.6.18-128~).
I have mmap'ed the interface on the card for setup, FIFO access, etc., and these reads/writes take a few µs to complete, so all good there.
But of course I cannot use this for interrupts, so I have to use the kernel module provided, with its user-space library interface.
WaitForInterrupt(); // API lib interface to kernel module
// Interrupt occurs and am returned to my code in user space
time = CurrentTime() - LatchedTime(); // time to get to here
It takes around 70 µs to return from WaitForInterrupt(). (The time at which the interrupt is raised is latched in the firmware; I read this value, which as I say above takes ~2 µs, and compare it against the current time in the firmware.)
What is the expected time between an interrupt occurring and the user-space API wait call returning?
How long do network or other high-speed interfaces take?
500 ms is many orders of magnitude larger than what a simple switch between userspace and the kernel takes, but as someone mentioned in the comments, Linux is not a real-time OS, so there's no guarantee that 500 ms "hiccups" won't show up now and then.
It's quite impossible to tell what the culprit is; the device driver could simply be trying to bundle up data to be more efficient.
That said, we've had endless trouble with some custom cards and their interactions with both APIC and ACPI, requiring a delicate balance of BIOS settings, which card goes into which PCI slot, and whether a particular video card screws everything up - likely caused by a dubious driver interacting with a more or less buggy BIOS/video card.
If you're able to reliably exceed 500us on a system that's not heavily loaded, I think you're looking at a bad driver implementation (or its userspace wrapper/counterpart).
In my experience the latency to wake a user thread on interrupt should be less than 10us, though (as others have said) Linux provides no latency guarantees.
If you have a recent kernel, you can use the perf sched tool to measure the latency, and see where the time is being used. (500us does sound a tad on the high side, depending on your processor, how many tasks are running, ...)
