I'm trying to understand the NAPI implementation in the Linux kernel. These are my basic doubts:
1) NAPI disables further interrupts and handles the skbs using polling.
Who disables it?
Should the interrupt handler disable it?
If yes, isn't the time gap between disabling the interrupt and handling the NET_RX softirq (net_rx_action, where the polling is actually done) way too long?
2) Do all NAPI-enabled drivers, by default, disable the interrupt on receiving a single frame and handle the remaining frames using polling in the bottom half?
Or is there logic where only receiving more than 32 frames continuously in the IRQ handler triggers a switch to poll mode?
3) Now coming to SHARED IRQs:
What happens to the other devices' interrupts? The other devices' bottom halves might not run, since those devices are not in the poll_list.
I wrote a comprehensive guide to understanding, tuning, and optimizing the Linux network stack which explains everything about network drivers, NAPI, and more, so check it out.
As far as your questions:
Device IRQs are supposed to be disabled by the driver's IRQ handler after NAPI is enabled. Yes, there is a time gap, but it should be quite small. That is part of the tradeoff decision you must make: do you care more about throughput or latency? Depending on which, you can optimize your network stack appropriately. In any case, most NICs allow the user to increase (or decrease) the size of the ring buffer that tracks incoming network data. So, a pause is fine because packets will just be queued for processing later.
It depends on the driver, but in general most drivers will enable NAPI poll mode in the IRQ handler as soon as it fires, usually with a call to napi_schedule. You can find a walkthrough of how NAPI is enabled for the Intel igb driver here. Note that IRQ handlers are not necessarily fired for every single packet. You can adjust the rate at which IRQ handlers fire on most cards by using a feature called interrupt coalescing. Some NICs may not support this option.
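To make that concrete, here is a minimal sketch of the pattern (for a hypothetical device; struct my_dev and the my_* helpers are assumptions, not code from any real driver):

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>

    /* Hypothetical per-device state and helpers; not a real driver. */
    struct my_dev {
        struct napi_struct napi;
        void __iomem *ioaddr;
    };
    static void my_disable_rx_irqs(struct my_dev *dev);          /* hypothetical */
    static void my_enable_rx_irqs(struct my_dev *dev);           /* hypothetical */
    static int my_clean_rx_ring(struct my_dev *dev, int budget); /* hypothetical */

    static irqreturn_t my_nic_irq_handler(int irq, void *data)
    {
        struct my_dev *dev = data;

        /* Ask the hardware to stop generating RX interrupts ... */
        my_disable_rx_irqs(dev);

        /* ... and defer the rest to the NAPI poll loop. This only
         * raises NET_RX_SOFTIRQ; it returns immediately. */
        napi_schedule(&dev->napi);

        return IRQ_HANDLED;
    }

    static int my_nic_poll(struct napi_struct *napi, int budget)
    {
        struct my_dev *dev = container_of(napi, struct my_dev, napi);
        int work_done = my_clean_rx_ring(dev, budget);

        /* Fewer packets than the budget: leave poll mode and let
         * the hardware interrupt us again. */
        if (work_done < budget) {
            napi_complete(napi);
            my_enable_rx_irqs(dev);
        }
        return work_done;
    }

Note how small the window between masking the device and the softirq running really is: the handler does nothing but mask the device and raise the softirq.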
The IRQ handlers for other devices will be executed when their IRQ is fired, because IRQ handlers have very high priority on the CPU. The NAPI poll loop (which runs in a softirq) will run on whichever CPU the device IRQ was handled on. Thus, if you have multiple NICs and multiple CPUs, you can tune the IRQ affinity of the IRQs for each NIC to prevent starving a particular NIC.
As for the example you asked about in the comments:
Say NIC 1 and NIC 2 share an IRQ line. Let's assume NIC 1 is under low load and NIC 2 under high load, and NIC 1 receives an interrupt. The driver of NIC 1 would disable the interrupt until its softirq is handled; call that time gap t1. So for time t1, NIC 2's interrupts are disabled too, right?
This depends on the driver, but in the normal case NIC 1 only disables interrupts while the IRQ handler is being executed. The call to napi_schedule tells the softirq code that it should start running if it hasn't started yet. The softirq code runs asynchronously, so no, NIC 1 does not wait for the softirq to be handled.
Now, as far as shared IRQs go: again, it depends on the device and the driver. The driver should be written in such a way that it can handle shared IRQs. If the driver disables an IRQ that is being shared, all devices sharing that IRQ will stop receiving interrupts. This would be bad. One way that some devices solve this is by allowing the driver to read/write a specific register, causing that specific device to stop generating interrupts. This is the preferred solution, as it does not block interrupts for the other devices sharing the same IRQ line.
When IRQs are disabled for NAPI, what is meant is that the driver asks the NIC hardware to stop sending IRQs. Thus, other IRQs on the same line (for other devices) will still continue to be processed. Here's an example of how the Intel igb driver turns off IRQs for that device by writing to registers.
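Schematically, that register-based masking looks like this (a sketch only, reusing the hypothetical struct my_dev from the sketch above; the register offset is made up, not igb's actual layout):

    #include <linux/io.h>

    #define MY_REG_IRQ_MASK_SET 0x00d8 /* hypothetical register offset */

    static void my_disable_device_irqs(struct my_dev *dev)
    {
        /* Tell this one device to stop asserting its interrupt line;
         * other devices sharing the IRQ line are unaffected. */
        writel(0xffffffff, dev->ioaddr + MY_REG_IRQ_MASK_SET);
        readl(dev->ioaddr + MY_REG_IRQ_MASK_SET); /* flush the posted write */
    }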
Related
This blog post talks about the difficulties in bringing PCI passthrough support to ARM devices: https://www.linaro.org/blog/kvm-pciemsi-passthrough-armarm64/
It cites GICv2/GICv3, which are ARM's interrupt controllers. You can write to them via MMIO and make them deliver interrupts to CPUs.
However, why are interrupts needed? Shouldn't the PCIe driver talk with the PCIe device through MMIO, that is, by writing to/reading from memory?
They are necessary because otherwise the operating system doesn't have any way of knowing that an event happened. Operating systems do not poll memory constantly; they still need to know that an event happened, and when. That's where interrupts come in.
Imagine you have a hard-disk PCIe controller. How does the operating system know when the disk is done writing its data to RAM?
As we know, we can map the IRQs of some devices to specific CPU cores by using IRQ affinity on Linux:
echo <8-bit-core-mask> > /proc/irq/[irq-num]/smp_affinity
http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
https://community.mellanox.com/docs/DOC-2123
Also
We also know that we can map IRQs (hardware interrupts) to specific CPU nodes (processors on the motherboard) on NUMA systems, by using: https://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf
echo <8-bit-node-mask> > /proc/irq/[irq-num]/node
But if one PCIe device (Ethernet, GPU, ...) is connected to NUMA node 0 and another PCIe device is connected to NUMA node 1, then it would be optimal to take the interrupts on the NUMA node (CPU) to which each device is connected, to avoid high-latency communication between nodes: Is CPU access asymmetric to Network card
Does Linux automatically bind IRQs to the nodes to which the PCIe devices are connected, or does it have to be done manually?
And if we have to do this by hand, what is the best way to do it?
Particularly interested in Linux x86_64: Debian 8 (Kernel 3.16) and Red Hat Enterprise Linux 7 (Kernel 3.10), and others...
Motherboard chipsets: Intel C612 / Intel C610, and others...
Ethernet cards: Solarflare Flareon Ultra SFN7142Q Dual-Port 40GbE QSFP+ PCIe 3.0 Server I/O Adapter - Part ID: SFN7142Q
By architecture, all low IRQs are mapped to node 0.
Some of them CAN'T be remapped, like IRQ 0 (the timer).
Anyway, you need to review your system (blueprints).
If you have high network load and are doing routing, it makes sense to pin the NIC queues. It is most effective to pin the TX and RX queues to the "nearest" cores in terms of caches (see the sketch after the list below). But before suggesting that, it would be great to know your architecture.
Need to know:
1. Your system (dmidecode and lspci output, cat /proc/interrupts)
2. Your requirements (what is the purpose of the server). IOW, it would be great to understand what your server is for, so just explain the flows and the architecture.
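As an illustration of the pinning itself, a small userspace helper along these lines binds an IRQ to the CPUs local to its device (a sketch; the interface name eth0 and the IRQ number are assumptions):

    /* Pin an IRQ to the CPUs on the same NUMA node as the device. */
    #include <stdio.h>

    int main(void)
    {
        const int irq = 90; /* hypothetical NIC queue IRQ */
        char cpulist[256], path[64];
        FILE *f;

        /* CPUs local to the NIC's NUMA node (eth0 is assumed) */
        f = fopen("/sys/class/net/eth0/device/local_cpulist", "r");
        if (!f || !fgets(cpulist, sizeof(cpulist), f)) {
            perror("local_cpulist");
            return 1;
        }
        fclose(f);

        /* Allow the IRQ only on those CPUs */
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
        f = fopen(path, "w");
        if (!f || fputs(cpulist, f) == EOF) {
            perror(path);
            return 1;
        }
        fclose(f);
        return 0;
    }

The same loop, run over every RX/TX queue IRQ of the NIC (see /proc/interrupts), implements the "pin queues to the nearest cores" suggestion above.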
I am working on a PCIe Linux driver. I would like to register an ISR for the device. The IRQ number assigned to the device by the Linux system is 16, which is shared with another device (a USB host controller) as well (checked with lspci -v). It is a pin-based interrupt.
Searching online, I found that almost all PCI driver examples provide only IRQF_SHARED as the flag in the request_irq() API, and do not provide any other flags to specify behaviour like high/low level triggering.
My question is: how does the Linux kernel determine the behaviour of a shared interrupt (for a PCIe device), i.e. whether it is low level or high level?
PCIe uses MSI, so there is no high/low level to be concerned with. Traditional PCI cards use level-triggered interrupts, but most devices use active-low signaling, so this isn't something a driver writer has access to modify or tweak.
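For reference, registering a handler for such a shared pin-based interrupt looks roughly like this (a sketch, reusing the hypothetical struct my_dev from above; the status register and bit are assumptions):

    #include <linux/interrupt.h>
    #include <linux/io.h>
    #include <linux/pci.h>

    #define MY_REG_IRQ_STATUS 0x00c0 /* hypothetical offset */
    #define MY_IRQ_PENDING    0x1    /* hypothetical bit */

    static irqreturn_t my_isr(int irq, void *data)
    {
        struct my_dev *dev = data;

        /* On a shared line, first check whether it was really our
         * device that interrupted. */
        if (!(readl(dev->ioaddr + MY_REG_IRQ_STATUS) & MY_IRQ_PENDING))
            return IRQ_NONE; /* not ours; let the other handlers run */

        /* ... acknowledge and handle the event ... */
        return IRQ_HANDLED;
    }

    /* in probe(), with no trigger flags: for PCI INTx the level and
     * polarity are fixed by the bus, so IRQF_SHARED is all we pass */
    err = request_irq(pdev->irq, my_isr, IRQF_SHARED, "my_dev", dev);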
We're accessing an FPGA device via the Linux UIO device infrastructure. Under this model, we receive interrupts from the FPGA by poll(2)ing the device node /dev/uio0.
We'd like to make sure that we don't miss any interrupts. Hence we need a way to notify clients of the class encapsulating the device file descriptor once our polling thread is actually waiting in the poll(2) system call, so that we can be sure we only tell the FPGA to start generating interrupts when we are really waiting for them.
Do you know of any way to achieve that?
Thanks,
Damian
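For context, the standard UIO interrupt loop looks roughly like this (a sketch; only the 4-byte read/poll protocol on /dev/uio0 is UIO-defined, the rest is assumed):

    #include <fcntl.h>
    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        for (;;) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            if (poll(&pfd, 1, -1) < 0) { perror("poll"); return 1; }

            /* Each read returns the total interrupt count so far;
             * a jump of more than one indicates a missed wakeup. */
            uint32_t count;
            if (read(fd, &count, sizeof(count)) != sizeof(count))
                break;
            printf("interrupt #%u\n", count);
            /* handle the FPGA event, then re-arm it here */
        }
        close(fd);
        return 0;
    }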
I am working on a project in embedded Linux with a BeagleBone to transfer 300 bytes of data as one block, in one write cycle, to a slave (an Atmel uC). After reading the SPI documentation, i.e. Documentation/spi, I have found that DMA gets enabled when the data transfer exceeds the 160-byte threshold, as mentioned in drivers/spi/omap2_mcspi.c.
I would like to implement flow control based on an exchange of constant 4-byte values between my BeagleBone and the Atmel uC. Once I have sent a command, say CMD_DATA, the slave responds with RC_RDY. I would like to make a kernel module that services interrupts and calls an interrupt handler every time data is received from the slave, so that I can check for this ack.
How do I enable interrupts and register an interrupt handler for SPI? Any sample code or tutorials would be helpful. I have looked extensively online, and all I found was how to set up interrupts for GPIOs.
Thanks!
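For what it's worth, the usual pattern here is to have the slave raise a GPIO line and register a threaded handler for that GPIO's IRQ, doing the SPI read in thread context since spi_read() can sleep (a sketch; the GPIO number and the RC_RDY handling are assumptions):

    #include <linux/gpio.h>
    #include <linux/interrupt.h>
    #include <linux/spi/spi.h>

    #define ACK_GPIO 60 /* hypothetical BeagleBone pin wired to the uC */

    static irqreturn_t ack_irq_thread(int irq, void *data)
    {
        struct spi_device *spi = data;
        u8 ack[4];

        /* Read the 4-byte response and check it against RC_RDY */
        if (spi_read(spi, ack, sizeof(ack)) == 0) {
            /* ... compare ack[] with RC_RDY, continue the transfer ... */
        }
        return IRQ_HANDLED;
    }

    /* in probe(): */
    ret = request_threaded_irq(gpio_to_irq(ACK_GPIO), NULL, ack_irq_thread,
                               IRQF_TRIGGER_FALLING | IRQF_ONESHOT,
                               "atmel-ack", spi);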