We have a custom driver for an FPGA that implements multiple devices. The network devices use NAPI, and in the NAPI poll routine I have to read some registers from the FPGA.
We notice that we are spending a large amount of CPU time in softirq (sirq), and access to the other devices is delayed.
My question: since a read from the FPGA is a non-posted read (requiring a wait for the returned data), is this violating the no-blocking rule of softirq context? Maybe the packet processing should be done in a tasklet?
I found that if I moved one of the devices into its own driver, and that device only writes to the FPGA (posted writes), the performance of that device improves. I am being asked for an explanation of that result.
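For illustration, the asymmetry looks roughly like this in a poll routine (a minimal sketch; struct fpga_priv, FPGA_IRQ_STATUS and FPGA_DOORBELL are hypothetical names, not from the actual driver):

#include <linux/netdevice.h>
#include <linux/io.h>

struct fpga_priv {
    struct napi_struct napi;
    void __iomem *regs;      /* FPGA BAR mapped earlier with ioremap() */
};

#define FPGA_IRQ_STATUS 0x00 /* hypothetical register offsets */
#define FPGA_DOORBELL   0x04

static int fpga_napi_poll(struct napi_struct *napi, int budget)
{
    struct fpga_priv *fpga = container_of(napi, struct fpga_priv, napi);
    u32 status;

    /* Non-posted read: the CPU stalls until the completion comes back
     * from the FPGA. It never sleeps, so it is legal in softirq
     * context, but the entire round trip is burned as softirq CPU time. */
    status = ioread32(fpga->regs + FPGA_IRQ_STATUS);

    /* Posted write: fire-and-forget; the CPU moves on while the write
     * is still in flight, which is consistent with the write-only
     * device performing much better. */
    iowrite32(1, fpga->regs + FPGA_DOORBELL);

    /* ... process up to 'budget' packets here ... */
    return 0;
}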
I use TCP/IP over Ethernet 10 Gbit/s on Linux x86_64.
But what happens when an interrupt occurs on one of the CPU cores?
Is it true that the following happens:
the interrupt code calculates the checksum of the IP packet
the interrupt code copies data from the kernel-space buffer to the required socket buffer
the interrupt code copies data from the buffer on the Ethernet card to a buffer in kernel space (or does this occur before the interrupt is generated, using the DMA controller on the Ethernet card, initiated by the card itself?)
Your question is a mix of hardware, protocol stack and user-space.
the interrupt code calculates the checksum of the IP packet
This part is in the protocols; I think it is somewhere around net/ipv4/ip_input.c.
the interrupt code copies data from the kernel-space buffer to the required socket buffer
This is a mix of protocol and user-space code, for example in net/ipv4/tcp_input.c.
the interrupt code copies data from the buffer on the Ethernet card to a buffer in kernel space (or does this occur before the interrupt is generated, using the DMA controller on the Ethernet card, initiated by the card itself?)
This is hardware, see for example drivers/net/8139cp.c.
Next, I think you are misunderstanding the term "interrupt": there are hardware interrupts and software interrupts.
The only hardware interrupts here are the RX/TX interrupts from the Ethernet controller.
Not a full answer to your question:
First of all, networking can be divided into two parts: the actual protocols (the net/ipv4 directory)
and the code that implements the various network hardware (drivers/net).
Not all hardware drivers are interrupt-driven; some drivers for high-bandwidth adapters use a polling technique (the NAPI interface, which I describe briefly below).
Packets are first received by the card. When the interface raises its "data arrived" interrupt, the driver disables further interrupts and tells the kernel to start polling the interface.
Then, when a packet is available, the interrupt handler leaves it in the interface and calls the netif_rx_schedule method, which causes the interface driver's poll method to be called some time later.
The packet then goes up the network layer and finally (though not as directly as I have described) reaches user space, where the user is notified about a data-available-for-read event, which I would not call an interrupt.
I recommend reading the following article:
Linux Networking Kernel (http://www.ecsl.cs.sunysb.edu/elibrary/linux/network/LinuxKernel.pdf)
I think this is how DMA-capable (bus-mastering) network interfaces work with a NAPI driver:
When packets arrive, socket buffers have already been allocated and mapped to the DMA memory buffers, and the DMA engine is armed.
A packet is transferred from the NIC to a socket buffer through DMA.
The NIC raises a hardware interrupt when the DMA transfer is done.
The hardware interrupt handler schedules the packet-receive software interrupt (SOFTIRQ).
The SOFTIRQ calls the NAPI poll() for further processing.
The NAPI poll() processes the packets in the DMA buffer queue, passes them to the upper layers as sk_buffs, and initializes new DMA buffers. If all pending packets are processed within the quota, the IRQ is re-enabled and NAPI is told to stop polling; a sketch follows below.
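In code, the skeleton of such a poll() looks something like this (a sketch; my_priv, my_rx_one() and my_enable_rx_irq() are hypothetical helpers, while the napi_* calls are the real kernel API):

#include <linux/netdevice.h>

struct my_priv {
    struct napi_struct napi;
    /* ... DMA ring state ... */
};

static bool my_rx_one(struct my_priv *priv);        /* hypothetical helper */
static void my_enable_rx_irq(struct my_priv *priv); /* hypothetical helper */

static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_priv *priv = container_of(napi, struct my_priv, napi);
    int done = 0;

    /* Pull completed packets out of the DMA ring, hand each one to the
     * stack as an sk_buff (netif_receive_skb), and refill the ring. */
    while (done < budget && my_rx_one(priv))
        done++;

    /* Ring drained before the budget ran out: stop polling and
     * re-enable the RX interrupt. */
    if (done < budget) {
        napi_complete(napi);
        my_enable_rx_irq(priv);
    }
    /* Returning the full budget keeps the device on the poll list. */
    return done;
}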
I've built a Linux driver for an SPI device.
The SPI device sends an IRQ to the processor when new data is ready to be read.
The IRQ fires about every 3 ms, and the driver then goes to read 2 bytes over SPI.
The problem I have is that sometimes more than 6 ms elapse between the IRQ firing and the moment the SPI transfer starts, which means I lose 2 bytes from the SPI device.
In addition, there is an unpredictable delay between the 2 bytes: sometimes it is close to 0, sometimes it is up to 300 µs.
So my question is: how can I reduce the latency between the IRQ and the SPI reads?
And how can I avoid the latency between the 2 bytes?
I've tried compiling the kernel with the preemption option; it does not change things much.
As for the hardware, I'm using a mini2440 board running at 400 MHz, with a hardware SPI port (not bit-banged, I/O-simulated SPI).
Thanks for your help.
BR,
Vincent.
According to the brochure for the Samsung S3C2440A CPU, the SPI interface hardware supports both interrupt-driven and DMA-based operation. A look at the actual datasheet reveals that the hardware also supports a polling mode.
If you want to achieve high data rates reliably, the DMA-based approach is what you need. Once a DMA operation is configured, the hardware will move the data to RAM on its own, without the need for low-latency interrupt handling.
That said, I do not know the state of the Linux SPI drivers for your CPU. It could be a matter of missing support for DMA, of specific system settings or even of how you are using the driver from your own code. The details w.r.t. SPI are often highly dependent on the particular implementation...
I had a similar problem: I basically got an IRQ and needed to drain a queue via SPI in less than 10 ms or the chip would start to drop data. With high system load (ssh login was actually enough) sometimes the delay between the IRQ handler enqueueing the next SPI transfer with spi_async and the SPI transfer actually happening exceeded 11 ms.
The solution I found was the rt flag in struct spi_device. Enabling it sets the thread that controls the SPI to real-time priority, which made the timing of all SPI transfers extremely reliable. And by the way, that change also removes the delay before the complete callback.
Just as a heads-up, I think this was not available in earlier kernel versions.
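Roughly, it is set from the peripheral driver's probe(), along these lines (a sketch; mychip_probe is a hypothetical function name, and the rt field only exists in reasonably recent kernels):

#include <linux/spi/spi.h>

static int mychip_probe(struct spi_device *spi)
{
    /* Ask the SPI core to run this device's message-pump thread at
     * real-time priority (assumes struct spi_device has the rt flag). */
    spi->rt = true;
    return spi_setup(spi);   /* spi_setup() applies the flag */
}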
The thing is, the Linux SPI stack uses queues for transmitting messages.
This means there is no guarantee about the delay between the moment you ask to send an SPI message and the moment it is actually sent.
Finally, to fulfill my 3 ms requirement between SPI messages, I had to stop using the Linux SPI stack and write directly to the CPU's SPI registers inside my own IRQ handler; a rough sketch follows.
That is quite dirty, but it was the only way to make it work with small delays.
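The idea looks roughly like this (for illustration only; SPI_STATUS, SPI_RXDATA and the data-ready bit are made-up placeholders, not the real S3C2440 register map, and a real master would also have to write a dummy TX byte to generate the clock):

#include <linux/io.h>
#include <linux/interrupt.h>

#define SPI_STATUS 0x00 /* hypothetical offset: bit 0 = data ready */
#define SPI_RXDATA 0x04 /* hypothetical offset: receive data */

static void __iomem *spi_base; /* mapped earlier with ioremap() */

static irqreturn_t my_spi_irq(int irq, void *dev_id)
{
    u8 hi, lo;

    /* Bypass the SPI queue entirely: busy-wait on the status register
     * and pull the two bytes straight out of the hardware. */
    while (!(readl(spi_base + SPI_STATUS) & 0x1))
        cpu_relax();
    hi = readl(spi_base + SPI_RXDATA) & 0xff;

    while (!(readl(spi_base + SPI_STATUS) & 0x1))
        cpu_relax();
    lo = readl(spi_base + SPI_RXDATA) & 0xff;

    /* ... hand (hi << 8) | lo to the consumer ... */
    return IRQ_HANDLED;
}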
This is more of a general question. Consider an external device. From time to time this device writes data via its device driver to a specific memory address. I want to write a small C program which reads out this data. Is there a better way than just polling this address to check whether the value has changed? I want to keep the CPU load low.
I have done some further research.
Is "memory-mapped I/O" an option? My naive idea is to let the external device write a flag to a memory-mapped I/O address, which triggers a kernel device driver. The driver then "informs" the program, which processes the value. Can this work? How can a driver inform the program?
The answer may depend on what processor you intend to use, what the device is and possibly whether you are using an operating system or RTOS.
Memory-mapped I/O per se is not a solution; that simply refers to I/O device registers that can be directly addressed via normal memory-access instructions. Most devices will generate an interrupt when certain registers are updated or contain new valid data.
In general, if using an RTOS, you can arrange for the device driver to signal, via a suitable IPC mechanism, any client thread(s) that need to handle the data. If you are not using an RTOS, you could simply register a callback with the device driver, which it would call whenever the data is updated. What the client does in the callback is its business, including reading the new data.
If the device in question generates interrupts, then the handling can be done on interrupt; if the device is capable of DMA, it can handle blocks of data autonomously before the DMA controller generates a DMA interrupt to a handler.
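On Linux specifically, the usual way a driver "informs" a program is to expose a character device that supports blocking read()/poll(); the program then sleeps at zero CPU cost until the driver wakes it. A user-space sketch (the /dev/mydev node and the 4-byte value format are hypothetical):

#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mydev", O_RDONLY); /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    for (;;) {
        /* Blocks in the kernel until the driver signals new data,
         * e.g. by waking its wait queue from the interrupt handler. */
        if (poll(&pfd, 1, -1) < 0) { perror("poll"); return 1; }
        if (pfd.revents & POLLIN) {
            uint32_t value;
            if (read(fd, &value, sizeof value) == sizeof value)
                printf("new value: %u\n", value);
        }
    }
}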
Given the starting memory address and word count, a DMA controller transfers data while the CPU works on some other process.
An input/output processor likewise handles I/O operations, given the starting address and word count.
(Correct me if I'm in error.)
So what's the difference in functionality between an IOP and a DMA controller?
In the case of memory-specific I/O operations (a simple example being an instruction like lw $r1, 16($r2) on a MIPS processor), the CPU needs to get the data from memory to carry out the I/O operation. Without DMA, the CPU has to pause any other work and monitor the memory read/write operation until it completes; in other words, the CPU is fully occupied for as long as the read/write is in progress. If the processor were free during this time, it could have executed other instructions.
Direct Memory Access (DMA):
DMA provides the capability to carry out memory-specific operations with minimal CPU intervention. When an I/O device needs memory access, it sends a DMA request (a bus request) to the CPU. The CPU initiates the transfer by providing the appropriate grant signals for the data bus and passes control to the DMA controller, which manages the rest of the transfer and moves the data directly to or from the I/O device. During this time the CPU continues with other instructions. Once the read/write operation is completed (or an exception occurs), the DMA controller raises an interrupt and notifies the processor of the status of the operation.
In this way the read/write operation is carried out while the CPU executes other instructions. However, initializing the DMA transfer still requires CPU intervention; within that constraint, overall performance is maximized.
I/O processor
You can think of the I/O processor along the lines of the DMA approach.
The I/O processor, generally used in large computer systems, is a coprocessor capable of executing instructions in addition to transferring data. Note that the coprocessor's instruction set differs from that of the central processing unit.
The CPU can set up an I/O-specific program by initializing the basic operations, such as enabling the data path and configuring the I/O devices participating in the operation. It then hands the task over to the I/O processor, which carries out the rest of the work and notifies the processor upon completion. Meanwhile, the processor executes other important instructions.
The I/O processor is essentially a small, dedicated DMA-style processor that can execute a limited set of input/output instructions and can be shared by multiple peripherals.
The I/O processor solves two problems:
It relieves the CPU of the job of input and output. Although DMA does not require the CPU for the actual data exchange between peripherals and memory, it only reduces the CPU's burden, because the initialization of each DMA transfer is still done by the CPU.
It solves the problem of sharing DMA interfaces among the high-speed devices of a large computer system. A large system has so many peripherals that they must share a limited number of DMA interfaces (in small computer systems such as PCs, each high-speed device can be assigned its own DMA channel).
DMA is a hardware module able to transfer data between a peripheral (UART, SPI, DAC, ADC) and memory, or between two different memory addresses, without consuming CPU processing time. Generally, configuring a DMA module involves setting up a destination memory address and a source address; users can also configure options such as the buffer data size, automatic address increment, and circular buffering. Moreover, this kind of module emits an IRQ signal at the end of the data transfer.
There is a DMA configuration example below for the STM32F373 microcontroller, using the Standard Peripheral Library. The example shows a DMA configuration between the sigma-delta ADC (SDADC1) and a memory buffer.
DMA_InitTypeDef DMA_InitStructure;
/* Enable the clock of the DMA2 controller */
RCC_AHBPeriphClockCmd(RCC_AHBPeriph_DMA2, ENABLE);
DMA_DeInit(DMA2_Channel3);
/* Disable the DMA SDADC1 channel while reconfiguring it */
DMA_Cmd(DMA2_Channel3, DISABLE);
/* DMA channel SDADC1 configuration: peripheral-to-memory,
   half-word transfers, fixed peripheral address, incrementing
   memory address, circular buffer */
DMA_InitStructure.DMA_BufferSize = bufferSize;
DMA_InitStructure.DMA_PeripheralBaseAddr = (uint32_t)&SDADC1->JDATAR;
DMA_InitStructure.DMA_PeripheralInc = DMA_PeripheralInc_Disable;
DMA_InitStructure.DMA_PeripheralDataSize = DMA_PeripheralDataSize_HalfWord;
DMA_InitStructure.DMA_MemoryBaseAddr = (uint32_t)memoryAddress;
DMA_InitStructure.DMA_MemoryInc = DMA_MemoryInc_Enable;
DMA_InitStructure.DMA_MemoryDataSize = DMA_MemoryDataSize_HalfWord;
DMA_InitStructure.DMA_DIR = DMA_DIR_PeripheralSRC;
DMA_InitStructure.DMA_Priority = DMA_Priority_High;
DMA_InitStructure.DMA_Mode = DMA_Mode_Circular;
DMA_InitStructure.DMA_M2M = DMA_M2M_Disable;
DMA_Init(DMA2_Channel3, &DMA_InitStructure);
/* Clear any pending transfer-complete flag to avoid a stale
   interrupt when the channel is enabled */
DMA_ClearITPendingBit(DMA2_IT_TC3);
/* Enable the DMA2 channel transfer-complete interrupt */
DMA_ITConfig(DMA2_Channel3, DMA_IT_TC, ENABLE);
/* Enable the DMA channel */
DMA_Cmd(DMA2_Channel3, ENABLE);
Regarding the I/O processor, I did not fully understand what you meant. But I can say that GPIO hardware modules are able to map general digital input/output to a memory address, i.e. the I/O pin has a memory address, but the read and write operations are in fact performed on a peripheral register.
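For illustration (the address is a made-up placeholder, not any real part's register map), such memory-mapped GPIO access comes down to dereferencing a fixed address through a volatile pointer:

#include <stdint.h>

/* Hypothetical output-data register address; in real code it comes
   from the chip's reference manual. */
#define GPIO_ODR (*(volatile uint32_t *)0x48000014u)

void set_pin5_high(void)
{
    GPIO_ODR |= 1u << 5; /* the write lands in the GPIO peripheral, not RAM */
}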
I have a Fibre Optic link, with a proprietary Device Driver.
The link goes into a PCIe card. Running on a RHEL 5.2 (2.6.18-128~)
I have mmap'ed the interface on the card for setup and FIFO access etc, and these read/writes take a few µs to complete, so all good there.
But of course I cannot use this for interrupts, so I have to use the kernel module provided, with its user-space library interface.
WaitForInterrupt(); // API lib interface to kernel module
// Interrupt occurs and am returned to my code in user space
time = CurrentTime() - LatchedTime(); // time to get to here
It takes around 70 µs to return from WaitForInterrupt(). (The time at which the interrupt is raised is latched in the firmware; I read that value, which as I said above takes ~2 µs, and compare it against the current time in the firmware.)
What access times should be expected between an interrupt occurring and the user-space API wait method returning?
How long do network and other high-speed interfaces take?
500 ms is many orders of magnitude larger than what a simple switch between user space and the kernel takes, but as someone mentioned in the comments, Linux is not a real-time OS, so there's no guarantee 500 ms "hiccups" won't show up now and then.
It's quite impossible to tell what the culprit is; the device driver could simply be trying to bundle up data to be more efficient.
That said, we've had endless trouble with some custom cards and their interactions with both APIC and ACPI, requiring a delicate balance of BIOS settings, which card goes into which PCI slot, and whether a particular video card screws up everything - likely caused by a dubious driver interacting with more or less buggy BIOSes/video cards.
If you're able to reliably exceed 500 µs on a system that's not heavily loaded, I think you're looking at a bad driver implementation (or its user-space wrapper/counterpart).
In my experience the latency to wake a user thread on interrupt should be less than 10 µs, though (as others have said) Linux provides no latency guarantees.
If you have a recent kernel, you can use the perf sched tool to measure the latency and see where the time is being used. (500 µs does sound a tad on the high side, depending on your processor, how many tasks are running, ...)
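For example (option spellings can vary between perf versions):

perf sched record -- sleep 10   # trace scheduler events for 10 seconds
perf sched latency              # per-task wakeup-to-run latency summary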