How to measure latency between hardware interrupt and related system call? - linux

I have a Linux machine with two PCIe RS-485 cards (XR17V354 & XR17V352). I have one port on one of the cards hardwired to one port on the other card. These cards are driven by the generic serial driver (serial8250).
I am running a test and measuring latency. One Linux process sends two bytes out of the port and then listens for two incoming bytes. The other process receives two bytes and immediately sends two bytes back.
I'm measuring this round-trip latency to be around 1500 microseconds, with a standard deviation of about 40 microseconds. I am trying to understand the source of this latency. Specifically, I'd like to understand the time between the hard IRQ firing to signal that data is ready to read and the bytes being made available to the user-space process.
I am aware of the ftrace feature, but I am not sure how best to utilize it, or if there are other, more suitable tools. Thanks.

What kind of driver is this? I assume it's a driver in kernel space and not UIO.
Independent of your issue you could start looking at how long it takes from a hardware interrupt to the kernel driver and from there to user space.
Here[1] is some ancient test case which can be hacked a bit so you can compare interrupt latencies with "standard" Linux, preempt-rt patched Linux and maybe something like Xenomai as well (although the Xenomai solution would require that you rewrite your driver).
You might want to have a look at [2], cyclictest and friends and maybe try to drill with perf into your system to see more details system wide.
Last but not least have a look at LTTng[3] which enables you to instrument code and it already has many instrumentation points.
[1] http://www.denx.de/wiki/DULG/AN2008_03_Xenomai_gpioirqbench
[2] http://cgit.openembedded.org/openembedded-core/tree/meta/recipes-rt/rt-tests/
[3] http://lttng.org/

Related

What options do I have for running recurring events on a microsecond resolution from a kernel driver?

I want to create a simulation of an actual device on an x86 Linux kernel. Part of this will involve simulating timings as closely as I can. Based on some research, it seems I will need at least microsecond-resolution timing. I understand that on a non-realtime system it won't be possible to get perfect timing, but I don't need perfect timing, just as close as I can get, perhaps by hacking around with thread scheduling / preemption options.
What I actually want to do is perform an action every interval, i.e. run some code every X µs. I've been trying to research the best ways to do this from a kernel driver, and also whether it's possible to do this reasonably accurately from user mode (keeping the above paragraph in mind). One of the first things that caught my eye was the HPET timer, which can be programmed to generate interrupts based on programmable comparators. Unfortunately, it seems it has been rather buggy on many chipsets in the past, and there's not much information on using it for anything other than obtaining a timestamp or using it as the main clock source. The Linux kernel provides an HPET driver that in the past seemed to provide both kernel and user-mode interfaces, but seems to provide only a barely documented user-mode interface in more recent kernel versions. I've also read about various other kernel functions and interfaces, such as the hrtimer interface and the various delay functions, though I'm having a bit of trouble understanding them and whether they are suited to my purpose.
Given my current use case, what are the best options I have for running recurring events at µs resolution from, say, a kernel driver? Accuracy is my most important criterion, and ease of use is second.
Well, it's possible to achieve your accuracy in user space: clock_nanosleep is one ideal option, and it has both relative and absolute modes. Since clock_nanosleep is based on hrtimer in kernel mode, you may want to use hrtimer if you'd like to implement it in kernel space.
However, to make the timer work accurately, there are two important things worth mentioning.
You should set the timer slack of your process (either by writing a nonzero value in ns to /proc/self/timerslack_ns or via prctl(PR_SET_TIMERSLACK,...)). This value is treated as the 'tolerance' of the timer.
CPU power management also matters here. The CPU has many different C-states, each of which has a different exit latency. So you need to configure your cpuidle module not to use C-states other than C0, e.g. for an Intel CPU you could simply write 1 to /sys/devices/system/cpu/cpu$c/cpuidle/state$s/disable to disable state $s of CPU $c. Or just add idle=poll to your kernel options to keep the CPU active (in C0) while the kernel is idle. NOTE that this significantly increases the power consumption of the computer and makes the cooling fans noisy.
You can get a timer with delays under 10 microseconds if the two things mentioned above are configured correctly. There is a trade-off between latency and power consumption that you have to make.
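To picture the absolute mode mentioned above, here is a minimal user-space sketch in Python. It derives every deadline from the start time instead of from the previous wake-up, so jitter in one period does not accumulate as drift. The names `deadlines_ns` and `run_periodic` are hypothetical, and Python adds far more jitter than a C program calling clock_nanosleep in absolute mode would, so treat this as an illustration of the idea rather than a way to reach the 10 microsecond figure.

```python
import time


def deadlines_ns(start_ns, period_ns, count):
    """Absolute deadlines start + k*period for k = 1..count (no cumulative drift)."""
    return [start_ns + k * period_ns for k in range(1, count + 1)]


def run_periodic(action, period_ns, count):
    """Call action() once per period; return how late each call was, in ns."""
    start = time.monotonic_ns()
    lateness = []
    for deadline in deadlines_ns(start, period_ns, count):
        remaining = deadline - time.monotonic_ns()
        if remaining > 0:
            # Sleep until the absolute deadline, not for a fixed relative delay.
            time.sleep(remaining / 1e9)
        lateness.append(time.monotonic_ns() - deadline)
        action()
    return lateness


if __name__ == "__main__":
    late = run_periodic(lambda: None, period_ns=500_000, count=20)
    print("worst lateness (us):", max(late) / 1000)
```

Running it before and after changing the timer slack or C-state settings gives a quick, if coarse, view of how the worst-case lateness moves.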

How to test linux NAPI feature?

I am trying to test the NAPI functionalities on embedded linux environment. I used 'pktgen' to generate the large number of packets and tried to verify the interrupt count of my network interface at /proc/interrupts.
I found out that interrupt count is comparatively less than the packets generated.
I am also trying to tune the 'netdev_budget' value from 1 to 1000 (the default is 300) so that I can observe a reduction in the interrupt count when netdev_budget is increased.
However, increasing netdev_budget doesn't seem to help. The interrupt count is similar to the one observed with netdev_budget set to 300.
So here are my queries:
What is the effect of 'netdev_budget' on NAPI?
What other parameters can/should I tune to observe changes in the interrupt count?
Is there any other way to test the NAPI functionality on Linux (apart from directly looking at the network driver code)?
Any help is much appreciated.
Thanks in advance.
I wrote a comprehensive blog post about Linux network tuning which explains everything about monitoring, tuning, and optimizing the Linux network stack (including the NAPI weight). Take a look.
Keep in mind: some drivers do not disable IRQs from the NIC when NAPI starts. They are supposed to, but some simply do not. You can verify this by examining the hard IRQ handler in the driver to see if hard IRQs are being disabled.
Note that hard IRQs are re-enabled in some cases as mentioned in the blog post and below.
As far as your questions:
Increasing netdev_budget increases the number of packets that the NET_RX softirq can process. The number of packets that can be processed is also limited by a time limit, which is not tunable. This is to prevent the NET_RX softirq from consuming 100% of the CPU. If the device does not receive enough packets to process during its time allocation, hardirqs are re-enabled and NAPI is disabled.
You can also try modifying your IRQ coalescing settings for the NIC, if it is supported. See the blog post above for more information on how to do this and what this means, exactly.
You should add monitoring to your /proc/net/softnet_stat file. The fields in this file can help you figure out how many packets are being processed, whether you are running out of time, etc.
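As one possible starting point for that monitoring, the sketch below parses the text of /proc/net/softnet_stat. Each row holds hexadecimal counters, and the first three columns are packets processed, packets dropped, and time_squeeze (the number of times the softirq stopped with work still pending because the budget or time limit ran out). The function name `parse_softnet_stat` is hypothetical, the sample row is made up for illustration, and the meaning of the later columns varies between kernel versions, so they are ignored here.

```python
def parse_softnet_stat(text):
    """Return one dict per row with the first three counters decoded from hex."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        processed, dropped, time_squeeze = (int(f, 16) for f in fields[:3])
        rows.append({
            "row": len(rows),  # usually corresponds to a CPU
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return rows


if __name__ == "__main__":
    # Illustrative sample; on a real system read /proc/net/softnet_stat instead.
    sample = "0000272d 00000000 00000001 00000000 00000000\n"
    for row in parse_softnet_stat(sample):
        print(row)
```

Sampling these counters before and after a pktgen run, at different netdev_budget values, shows whether time_squeeze changes even when the interrupt count does not.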
A question for you to consider, if I may:
Why does your hardirq rate matter? It probably doesn't matter, directly. The hardirq handler in your NIC driver should do as little work as possible, so it executing a lot is probably not a problem for your system. If it is, you should measure that carefully, as it seems very unlikely. Nevertheless, you can adjust the IRQ coalescing settings to alter the number of hardirqs generated by the NIC, and the IRQ CPU affinity to control which CPU processes them.
You should consider whether you are more interested in packet processing throughput or packet processing latency. Depending on which is the concern, you can tune your network stack appropriately.
Remember: to completely tune and optimize your Linux networking stack, you have to monitor and tune each component. They are all intertwined and it is difficult (and often insufficient) to monitor and tune just a single aspect of the stack.

How to (almost) prevent FT232R (uart) receive data loss?

I need to transfer data from a bare metal microcontroller system to a linux PC with 2 MBaud.
The linux PC is currently running a 32 bit Kubuntu 14.04.
To achieve this, I tried to use an FT232R-based USB-UART adapter, but I sometimes observed lost data.
As long as the Linux PC is mainly idle, it seems to work most of the time; however, I still see rare data loss.
But when I put load on the CPU (e.g. by rebuilding my project), the data loss increases significantly.
After some research I read here that the FT232R has a receive buffer with a capacity of only 384 bytes. This means that the FT232R has to be read out (USB-polled) at least every 1.9 ms. FTDI recommends using flow control, but because of the microcontroller system in use, I cannot use any flow control.
I can live with the fact that there is no absolute guarantee against data loss. But the observed amount of data loss is far too heavy for my needs.
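The 1.9 ms figure can be reproduced with a short calculation. The sketch below assumes 8N1 framing, i.e. 10 bits on the wire per data byte (start bit, 8 data bits, stop bit); that framing is an assumption, since the question does not state it, and the function name `buffer_fill_time_ms` is hypothetical.

```python
def buffer_fill_time_ms(buffer_bytes, baud, bits_per_byte=10):
    """Time in ms for a continuous stream at `baud` to fill `buffer_bytes`."""
    bytes_per_second = baud / bits_per_byte
    return buffer_bytes / bytes_per_second * 1000.0


if __name__ == "__main__":
    # FT232R: 384-byte receive buffer at 2 MBaud
    print(round(buffer_fill_time_ms(384, 2_000_000), 2), "ms")
    # A 1 KiB buffer at the same rate
    print(round(buffer_fill_time_ms(1024, 2_000_000), 2), "ms")
```

Under the same assumption, a 1 KiB buffer only stretches the deadline to about 5 ms, which is still short compared with the scheduling delays a loaded desktop system can show.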
So I tried to find a way to increase the priority of the "FT232 driver" on my Linux machine, but I cannot find out how to do this. It is not described in the AN220 FTDI Drivers Installation Guide for Linux, and the document AN107 FTDI Advanced Driver Options has a chapter about "Changing the Driver Priority", but only for Windows.
So, does anybody know how to increase the FT232R driver priority in linux?
Any other ideas to solve this problem?
BTW: According to the FT232H datasheet, that chip comes with a 1 KiB RX buffer. I have ordered one and will check its behaviour. Edit: No significant improvement.
If you want reliable data transfer, there is absolutely no way to use any USB-to-serial bridge correctly without hardware flow control, and without dedicating at least all remaining RAM in your microcontroller as the serial buffer (or at least until you can store ~1s worth of data).
I've been using FTDI devices since FT232AM was a hot new thing, and here's how I implement them:
(At least) four lines go between the bridge and the MCU: RXD, TXD, RTS#, CTS#.
Flow control is enabled on the PC side of things.
Flow control is enabled on the MCU side of things.
MCU code is only sending communications when it can fit a complete reply packet into the buffer. Otherwise, it lets the PC side of it time out and retry the request. For requests that stream data back, the entire frame is dropped if it can't fit in the transmit buffer at the time the frame is ready.
If you wish the PC to be reliably notified of new data, say every number of complete samples/frames, you must use event characters to flush the FTDI buffers to the host, and encode your data. HDLC works great for that purpose and is documented in free standards (RFCs and ITU X and Q series - all free!).
The VCP driver, or the D2XX port bring-up, is configured with transfer sizes and latencies that suit the needs of the application.
The communication protocol is framed, with CRCs. I usually use a cut-down version of X.25/Q.921/HDLC, limited to SNRM(E) mode for simple "dumb" command-and-respond devices, and SABM(E) for devices that stream data.
The size of the FTDI buffers is immaterial; your MCU should have at least an order of magnitude more storage available to buffer things.
If you're running hard real-time code, such as signal processing, make sure that you account for the overhead of lots of transmit interrupts running "back-to-back". Once the FTDI device purges its buffers after a USB transfer, and indicates that it's ready to receive more data from your MCU, your code can potentially transmit a full FTDI buffer's worth of data at once.
If you're close to running out of cycles in your realtime code, you can use a timer as a source of transmit interrupts instead of the UART interrupt. You can then set the timer rate much lower than the UART speed. This allows you to pace the transmission slower without lowering the baudrate. If you're running in setup/preoperational mode or with lower real-time task load, you can then trivially raise the transmit rate without changing the baudrate. You can use a similar trick to pace the receives by flipping the RTS# output on the MCU under timer control. Of course this isn't a problem if you use DMA or a sufficiently fast MCU.
If you're out of timers, note that many other peripherals can also be repurposed as a source of timer interrupts.
This advice applies no matter what the USB host is.
Sidebar: Admittedly, Linux USB serial driver "architecture" is in the state of suspended animation as far as I can tell, so getting sensible results there may require a lot of work. It's not a matter of a simple kernel thread priority change, I'm afraid. Part of the reason is that funding for a lot of Linux work focuses on server/enterprise applications, and there the USB performance is a matter of secondary interest at best. It works well enough for USB storage, but USB serial is a mess nobody really cares enough to overhaul, and overhaul it needs. Just look at the amount of copy-pasta in that department...
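To make the framing advice above more concrete, here is a minimal Python sketch of HDLC-like framing in the style of RFC 1662: a 0x7E flag delimits frames, 0x7D escapes any flag or escape byte in the body, and a 16-bit CRC (CRC-16/X-25, sent low byte first) protects the payload. The function names are hypothetical, and this is not the answerer's protocol; it leaves out the address and control fields and the SNRM/SABM procedures entirely. Using the flag byte as the FTDI event character is likewise an assumption about how the two pieces would fit together.

```python
FLAG, ESC = 0x7E, 0x7D


def crc16_x25(data):
    """CRC-16/X-25: reflected polynomial 0x8408, init 0xFFFF, final XOR 0xFFFF."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x8408 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF


def encode_frame(payload):
    """Append the CRC, byte-stuff the result, and wrap it in flag bytes."""
    fcs = crc16_x25(payload)
    body = bytes(payload) + bytes([fcs & 0xFF, fcs >> 8])
    out = bytearray([FLAG])
    for b in body:
        if b in (FLAG, ESC):
            out += bytes([ESC, b ^ 0x20])
        else:
            out.append(b)
    out.append(FLAG)
    return bytes(out)


def decode_frame(frame):
    """Undo the stuffing, verify the CRC, and return the payload."""
    if len(frame) < 2 or frame[0] != FLAG or frame[-1] != FLAG:
        raise ValueError("missing flag byte")
    body = bytearray()
    it = iter(frame[1:-1])
    for b in it:
        if b == ESC:
            nxt = next(it, None)
            if nxt is None:
                raise ValueError("dangling escape byte")
            body.append(nxt ^ 0x20)
        else:
            body.append(b)
    if len(body) < 2:
        raise ValueError("frame too short")
    payload, fcs = bytes(body[:-2]), body[-2] | (body[-1] << 8)
    if crc16_x25(payload) != fcs:
        raise ValueError("CRC mismatch")
    return payload


if __name__ == "__main__":
    frame = encode_frame(b"\x01\x7e\x02")
    print(frame.hex(), decode_frame(frame))
```

With framing like this, a receiver that loses bytes can resynchronise at the next flag and discard the damaged frame instead of silently consuming corrupted data.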

Trouble setting up reliable DMA transfer between 2 TSI148 VMEbus controllers

I am seeking help, most importantly from VMEbus experts.
I am working on a project that aims to set up a communication channel from a real-time PowerPC controller (Emerson MVME4100), running vxWorks 6.8, to a Linux Intel computer (Xembedded XVME6300), running Debian 6 with kernel 2.6.32.
This channel runs over VME bus; both computers are in a VME enclosure and both use the Tundra Tsi148 chipset. The Intel computer is explicitly configured as the system controller, the real-time computer is explicitly not.
Setup:
For the Intel computer I wrote a custom driver that creates a 4MB kernel buffer, and shares it over the VME bus by means of a slave window;
For the real-time computer I set up a DMA transfer to repeatedly forward blocks of exactly 48640 bytes, filled with bytes of test data (zeros, ones, twos, etc.), in quick succession (once every 32 milliseconds, if possible)
For the Intel computer I read the kernel buffer from the driver, to see whether the data arrives correctly, with a hand-started Python program.
Expectation:
I am expecting to see the same data (zeros, ones etc) from the Python program.
I am expecting transmission times roughly corresponding to the chosen bus speed (typically 290 us or 145 us, depending on bus speed), plus a reasonable DMA setup overhead (up to 10us? I am willing to accept larger numbers, say hundreds of usecs, if that is what the bus normally needs)
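The expected times quoted above appear consistent with moving 48640 bytes at 160 and 320 MiB/s. The short sketch below reproduces that arithmetic; mapping the two figures to those rates is an inference from the numbers, not something the question states, and the function name `transfer_time_us` is hypothetical.

```python
def transfer_time_us(num_bytes, rate_mib_per_s):
    """Ideal transfer time in microseconds, ignoring setup and bus overhead."""
    return num_bytes / (rate_mib_per_s * 1024 * 1024) * 1e6


if __name__ == "__main__":
    for rate in (160, 320):
        print(rate, "MiB/s:", round(transfer_time_us(48640, rate)), "us")
```

Against these ideal figures, the observed ~2000 us and ~98000 us are roughly one and three orders of magnitude too slow, which points at something other than raw bus bandwidth.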
Result:
Sometimes data does not arrive at all, and "transmission" time is ~2000 us
Sometimes data arrives reliably, but transmission time is ~98270us, or 98470us, depending on the chosen bus speed.
Questions:
How could I make the transmission reliable and bring down these awful latencies?
What general direction should I search next?
(I would like to tag with VMEbus if I could)
Many thanks
My comments on the question describe how I got the bus working:
- ensure 2eSST320 on both sides of the bus
- ensure that the DMA transaction used a valid block size (the largest valid was 4096 bytes)
I achieved an effective speed of 150MBytes/s (the bus can achieve 320MBytes/s but the tsi148 chip is known for causing significant overhead). This is good enough for me.
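To illustrate the block-size constraint, the sketch below splits one 48640-byte payload into DMA requests of at most 4096 bytes and estimates the transfer time at the reported 150 MBytes/s (taken here as 150 * 10^6 bytes per second). The function names are hypothetical, and whether the Tsi148 accepts a shorter final block is not stated in the answer, so that part is an assumption.

```python
def split_into_blocks(total_bytes, max_block=4096):
    """Return (offset, length) pairs that cover total_bytes in order."""
    return [(off, min(max_block, total_bytes - off))
            for off in range(0, total_bytes, max_block)]


def effective_time_us(total_bytes, mbytes_per_s):
    """Transfer time in microseconds at an effective rate given in 10^6 bytes/s."""
    return total_bytes / (mbytes_per_s * 1e6) * 1e6


if __name__ == "__main__":
    blocks = split_into_blocks(48640)
    print(len(blocks), "blocks, last one", blocks[-1][1], "bytes")
    print(round(effective_time_us(48640, 150)), "us at 150 MBytes/s")
```

For this payload the split yields eleven full blocks plus one shorter block, and the estimated time stays well inside the 32 ms period mentioned in the question.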

low latency Interrupt handling (expected avg time to return from kernel to user space is?)

I have a Fibre Optic link, with a proprietary Device Driver.
The link goes into a PCIe card. Running on a RHEL 5.2 (2.6.18-128~)
I have mmap'ed the interface on the card for setup and FIFO access etc, and these read/writes take a few µs to complete, so all good there.
But of course cannot use this for interrupts, so I have to use the kernel module provided, with its user-space lib interface.
WaitForInterrupt(); // API lib interface to kernel module
// Interrupt occurs and am returned to my code in user space
time = CurrentTime() - LatchedTime(); // time to get to here
It takes around 70µs to return from WaitForInterrupt(). (The time at which the interrupt is raised is latched in the firmware; I read this value, which as mentioned above takes ~2µs, and compare it against the current time in the firmware.)
What latency should I expect between an interrupt occurring and the user-space API call that waits for the interrupt returning?
What do network or other high-speed interfaces typically take?
500ms is many orders of magnitude larger than what a simple switch between user space and kernel takes, but as someone mentioned in the comments, Linux is not a real-time OS, so there's no guarantee that 500ms "hiccups" won't show up now and then.
It's quite impossible to tell what the culprit is; the device driver could simply be trying to bundle up data to be more efficient.
That said, we've had endless trouble with some custom cards and their interactions with both APIC and ACPI, requiring a delicate balance of BIOS settings, of which card goes into which PCI slot, and of whether a particular video card screws everything up. The likely cause was a dubious driver interacting with a more or less buggy BIOS or video card.
If you're able to reliably exceed 500us on a system that's not heavily loaded, I think you're looking at a bad driver implementation (or its userspace wrapper/counterpart).
In my experience the latency to wake a user thread on interrupt should be less than 10us, though (as others have said) Linux provides no latency guarantees.
If you have a recent kernel, you can use the perf sched tool to measure the latency, and see where the time is being used. (500us does sound a tad on the high side, depending on your processor, how many tasks are running, ...)

Resources