I need to transfer data from a bare metal microcontroller system to a linux PC with 2 MBaud.
The linux PC is currently running a 32 bit Kubuntu 14.04.
To archive this, I'd tried to use a FT232R based USB-UART adapter, but I sometimes observed lost data.
As long as the linux PC is mainly idle, it seems to work most time; however, I see rare data loss.
But when I force cpu load (e.g. rebuild my project), the data loss increases significantly.
After some research I read here, that the FT232R consist of a receive buffer with a capacity of only 384 Byte. This means, that the FT232R has to be read out (USB-polled) after at least every 1,9 ms. Well, FTDI recommends to use flow control, but because of the used microcontroller system, I'm fixed to cannot use any flow control.
I can live with the fact, that there is no absolutely guarantee for having no data loss. But the observed amount of data loss is quiet too heavy for my needs.
So I tried to find a way to increase the priority of the "FT232 driver" on my linux, but cannot find how to do this. It's not described in the
AN220 FTDI Drivers Installation Guide for Linux
and the document
AN107 FTDI Advanced Driver Options
has a capter about "Changing the Driver Priority" but only for windows.
So, does anybody know how to increase the FT232R driver priority in linux?
Any other ideas to solve this problem?
BTW: As I read the FT232H datasheet, it seems that this comes with 1 KiB RX buffer. I'd order one just now and check out its behaviour. Edit: No significant improvement.
If you want reliable data transfer, there is absolutely no way to use any USB-to-serial bridge correctly without hardware flow control, and without dedicating at least all remaining RAM in your microcontroller as the serial buffer (or at least until you can store ~1s worth of data).
I've been using FTDI devices since FT232AM was a hot new thing, and here's how I implement them:
(At least) four lines go between the bridge and the MCU: RXD, TXD, RTS#, CTS#.
Flow control is enabled on the PC side of things.
Flow control is enabled on the MCU side of things.
MCU code is only sending communications when it can fit a complete reply packet into the buffer. Otherwise, it lets the PC side of it time out and retry the request. For requests that stream data back, the entire frame is dropped if it can't fit in the transmit buffer at the time the frame is ready.
If you wish the PC to be reliably notified of new data, say every number of complete samples/frames, you must use event characters to flush the FTDI buffers to the hist, and encode your data. HDLC works great for that purpose and is documented in free standards (RFCs and ITU X and Q series - all free!).
The VCP driver, or the D2XX port bring-up is set up to have transfer sizes and latencies set for the needs of the application.
The communication protocol is framed, with CRCs. I usually use a cut-down version if X.25/Q.921/HDLC, limited to SNRM(E) mode for simple "dumb" command-and-respond devices, and SABM(E) for devices that stream data.
The size of FTDI buffers is immaterial, your MCU should have at least an order of magnitude more storage available to buffer things.
If you're running hard real-time code, such as signal processing, make sure that you account for the overhead of lots of transmit interrupts running "back-to-back". Once the FTDI device purges its buffers after a USB transfer, and indicates that it's ready to receive more data from your MCU, your code can potentially transmit a full FTDI buffer's worth of data at once.
If you're close to running out of cycles in your realtime code, you can use a timer as a source of transmit interrupts instead of the UART interrupt. You can then set the timer rate much lower than the UART speed. This allows you to pace the transmission slower without lowering the baudrate. If you're running in setup/preoperational mode or with lower real-time task load, you can then trivially raise the transmit rate without changing the baudrate. You can use a similar trick to pace the receives by flipping the RTS# output on the MCU under timer control. Of course this isn't a problem is you use DMA or a sufficiently fast MCU.
If you're out of timers, note that many other peripherals can also be repurposed as a source of timer interrupts.
This advice applies no matter what is the USB host.
Sidebar: Admittedly, Linux USB serial driver "architecture" is in the state of suspended animation as far as I can tell, so getting sensible results there may require a lot of work. It's not a matter of a simple kernel thread priority change, I'm afraid. Part of the reason is that funding for a lot of Linux work focuses on server/enterprise applications, and there the USB performance is a matter of secondary interest at best. It works well enough for USB storage, but USB serial is a mess nobody really cares enough to overhaul, and overhaul it needs. Just look at the amount of copy-pasta in that department...
Related
I have a Linux machine with two PCIe RS-485 cards (XR17V354 & XR17V352). I have one port on one of the cards hardwired to one port on the other card. These cards are driven by the generic serial driver (serial8250).
I am running a test and measuring latency. I have one Linux process sending two bytes out the port and then listens for two incoming bytes. The other process receives two bytes and immediately sends two bytes back.
I'm measuring this round trip latency to be around 1500 microseconds with a standard deviation of about 40 microseconds. I am trying to understand the source of this latency. Specifically, I'd like to understand the difference in time from which a hard IRQ fires to signal data is ready to read and the time that the bytes are made available to the user space process.
I am aware of the ftrace feature, but I am not sure how best to utilize it, or if there are other, more suitable tools. Thanks.
What kind of driver is this? I assume it's a driver in kernel space and not UIO.
Independent of your issue you could start looking at how long it takes from a hardware interrupt to the kernel driver and from there to user space.
Here[1] is some ancient test case which can be hacked a bit so you can compare interrupt latencies with "standard" Linux, preempt-rt patched Linux and maybe something like Xenomai as well (although the Xenomai solution would require that you rewrite your driver).
You might want to have a look at [2], cyclictest and friends and maybe try to drill with perf into your system to see more details system wide.
Last but not least have a look at LTTng[3] which enables you to instrument code and it already has many instrumentation points.
[1] http://www.denx.de/wiki/DULG/AN2008_03_Xenomai_gpioirqbench
[2] http://cgit.openembedded.org/openembedded-core/tree/meta/recipes-rt/rt-tests/
[3] http://lttng.org/
Suppose you have a PCIE device presenting a single BAR and one DMA area declared with pci_alloc_consistent(..). The BAR's flags indicate non-prefetchable, non-cacheable, memory region.
What are the principle causes for latency in reading the DMA area, and similarly, what are the causes of latency reading the BAR?
Thank you for answering this simple question :D!
This smells a bit like homework but I suspect the concepts are not well understood by many so I'll add an answer.
The best way to think through this is to consider what needs to happen in order for a read to complete. The CPU and the device are on separate sides of the PCIe link. It's helpful to view PCI-Express as a mini network. Each link is point-to-point (like your PC connected to another PC). There may also be intermediate switches (aka bridges in PCI). In that case, it's like your PC is connected to a switch that is in turn connected to the other PC.
So, if the CPU wants to read its own memory (the "DMA" region you allocated), it's relatively fast. It has a high speed bus that is designed to make that happen fast. Also, there are multiple layers of caching built in to keep frequently (or recently) used data "close" to the CPU.
But if the CPU wants to read from the BAR in the device, the CPU (actually the PCIe root complex integrated with the CPU) must compose a PCIe read request, send the request, and wait while the device decodes the request, accesses the BAR location and sends back the requested data. Tick tock. Your CPU is doing nothing else while it waits for this to complete.
This is pretty much analogous to asking for a web page from another computer. You formulate an HTTP request, send it and wait while the web server accesses the content, formulates a return packet and sends it to you.
If the device wishes to access memory residing "in" the CPU, it's pretty much the exact same thing in reverse. ("Direct memory access" just means that it doesn't need to interrupt the CPU to handle it, but something [the root complex here] is still responsible for decoding the request, fulfilling the read and sending back the resulting data.)
Also, if there are intermediate PCIe switches between CPU and device, those may add additional buffering/queuing delays (exactly as a switch or router might in a network). And any such delays are doubled since they're incurred in both directions.
Of course PCIe is very fast, so all of that happens in mere nanoseconds, but that's still orders of magnitude slower than a "local" read.
I've build a linux driver for an SPI device.
The SPI device sends an IRQ to the processor when a new data is ready to be read.
The IRQ fires about every 3 ms, then the driver goes to read 2 bytes with SPI.
The problem I have is that sometimes, there's more than 6 ms between the IRQ has been fired and the moment where SPI transfer starts, which means I lost 2 bytes of the SPI device.
In addition, there's a uncertain delay between the 2 bytes; sometime it's close to 0, sometime it's up to 300us..
Then my question is : how can I reduce the latency between IRQ and SPI readings ?
And how to avoid latency between the 2 bytes ?
I've tried compiling the kernel with premptive option, it does not change things that much.
As for the hardware, I'm using a mini2440 board running at 400 MHz, using a hardware SPI port (not i/o simulated SPI).
Thanks for help.
BR,
Vincent.
From the brochure of the Samsung S3C2440A CPU, the SPI interface hardware supports both interrupt and DMA-based operation. A look at the actual datasheet reveals that the hardware also supports a polling mode.
If you want to achieve high data rates reliably, the DMA-based approach is what you need. Once a DMA operation is configured, the hardware will move the data to RAM on its own, without the need for low-latency interrupt handling.
That said, I do not know the state of the Linux SPI drivers for your CPU. It could be a matter of missing support for DMA, of specific system settings or even of how you are using the driver from your own code. The details w.r.t. SPI are often highly dependent on the particular implementation...
I had a similar problem: I basically got an IRQ and needed to drain a queue via SPI in less than 10 ms or the chip would start to drop data. With high system load (ssh login was actually enough) sometimes the delay between the IRQ handler enqueueing the next SPI transfer with spi_async and the SPI transfer actually happening exceeded 11 ms.
The solution I found was the rt flag in struct spi_device (see here). Enabling that will set the thread that controls the SPI to real-time priority, which made the timing of all SPI transfers super reliable. And by the way that change also removes delay before the complete callback.
Just as a heads up, I think this was not available in earlier kernel versions.
The thing is Linux SPI stack uses queues for transmitting the messages.
This means that there is no guarantee about the delay between the moment you ask to send the SPI message, and the moment where it is effectively sent.
Finally, to fullfill my 3ms requirements between each SPI message, I had to stop using Linux SPI stack, and directly write into the CPU's register inside my own IRQ.
That's highly dirty, but it's the only way to make it work with small delays.
I am running linux on an MX28 (ARMv5), and am using a GPIO line to talk to a device. Unfortunately, the device has some special timing requirements. A low on the GPIO line cannot last longer than 7us, highs have no special timing requirements. The code is implemented as a kernel device driver, and toggles the GPIO with direct register writes rather than going through the kernel GPIO api. For testing, I am just generating 3 pulses. The process is as follows, all in one function so it should fit in the instruction cache:
set gpio high
Save Flags & Disable Interrupts
gpio low
pause
gpio high
repeat 2x more
Restore Flags/Reenable Interrups
Here's the output of a logic analyzer tied to the GPIO.
Most of the time it works just great, and the pulses last just under 1us. However, about 10% of the lows last for many, many microseconds. Even though interrupts are disabled, something is causing the flow of the code to be interrupted.
I am at a loss. RT Linux would likely not help here, because the problem is not latency, it appears to be something happening during the low, even though nothing should interrupt it with the IRQs disabled. Any suggestions would be greatly, greatly appreciated.
The ARM cache on an IMX25 (ARM926) is 16K Code, 16K Data L1 with a 32byte length or eight instructions. With the DDR-SDRAM controller running at 133Mhz and a 16bit bus the transfer rate is about 300MB/s. A cache fill should only take about 100nS, not 9uS; this is about 100 times too long.
However, you have four other issues with Linux.
TLB misses and a page table walk.
Data aborts.
DMA masters stealing.
FIQ interrupts.
It is unlikely that the LCD master is stealing enough bandwidth, unless you have a huge display. Is your display larger than 1/4VGA? If not, this is only 10% of the memory bandwidth and this will pipeline with the processor. Do you have either Ethernet or USB active? These peripherals are higher data rate and could cause this type of contention with SDRAM.
All of these issues maybe avoided by writing your toggler PC relative and copying it to the IRAM. See: iram_alloc.c; this file should be portable to older versions of Linux. The XBAR switch allows fetches from SDRAM and IRAM simultaneously. The IRAM can still be a target of other DMA masters. If you are really pressed, move the code to the ETB buffers which no other master in the system can access.
The TLB miss can actually be quite steep as it may need to run several single beat SDRAM cycles; still this should be under 1uS. You have not posted code, so it is possible that a variable and/or other is causing a data fault which is not maskable.
If you have any drivers using the FIQ, they may still be running even though you have masked the normal IRQ interrupts. For instance, the ALSA driver for this system normally uses the FIQ.
Both the ETB and the IRAM are 32-bit data paths and low wait state. Either one will probably give better response than the DDR-SDRAM.
We have achieved sub micro-second response by using a FIQ and IRAM to toggle GPIOs on an IMX258 with another protocol using bit banging.
One possible workaround to the problem Chris mentioned (in addition to problems with paging of kernel module code) is to use a PWM peripheral where the duration of the pulse is pre-programmed and the timing is implemented in hardware.
Fancy processors with caches are not suitable for hard realtime work. Execution time varies if cache misses are non-deterministic (and designs where cache misses are completely deterministic aren't complicated enough to justify a fancy processor).
You can try to avoid memory controller latency during critical sections by aligning the critical section so that it doesn't straddle cache lines. Or prefetch the code you will need. But this is going to be very non-portable and create a nightmare for future maintenance. And still doesn't protect the access to memory-mapped GPIO from bus contention.
I am seeking help, most importantly from VMEbus experts.
I am working on a project that aims to setup a communication channel from a real-time powerpc controller (Emerson MVME4100), running vxWorks 6.8, to a Linux Intel computer (Xembedded XVME6300), running Debian 6 with kernel 2.6.32.
This channel runs over VME bus; both computers are in a VME enclosure and both use the Tundra Tsi148 chipset. The Intel computer is explicitly configured as the system controller, the real-time computer is explicitly not.
Setup:
For the Intel computer I wrote a custom driver that creates a 4MB kernel buffer, and shares it over the VME bus by means of a slave window;
For the real-time computer I setup a DMA transfer to repeatedly forward blocks of exactly 48640 bytes; filled with bytes of test data (zeros, ones, twos, etc), in quick succession (once every 32 milliseconds, if possible)
For the Intel computer I read the kernel buffer from the driver, to see whether the data arrives correctly, with a hand-started Python program.
Expectation:
I am expecting to see the same data (zeros, ones etc) from the Python program.
I am expecting transmission times roughly corresponding to the chosen bus speed (typically 290 us or 145 us, depending on bus speed), plus a reasonable DMA setup overhead (up to 10us? I am willing to accept larger numbers, say hundreds of usecs, if that is what the bus normally needs)
Result:
Sometimes data does not arrive at all, and "transmission" time is ~2000 us
Sometimes data arrives reliably, but transmission time is ~98270us, or 98470us, depending on the chosen bus speed.
Questions:
How could I make the transmission reliable and bring down these aweful latencies?
What general direction should I search next?
(I would like to tag with VMEbus if I could)
Many thanks
My comments on the question describe how I got the bus working:
- ensure 2eSST320 on both sides of the bus
- ensure that the DMA transaction used a valid block size (the largest valid was 4096 bytes)
I achieved an effective speed of 150MBytes/s (the bus can achieve 320MBytes/s but the tsi148 chip is known for causing significant overhead). This is good enough for me.