Trouble setting up reliable DMA transfer between 2 TSI148 VMEbus controllers - linux

I am seeking help, most importantly from VMEbus experts.
I am working on a project that aims to set up a communication channel from a real-time PowerPC controller (Emerson MVME4100), running vxWorks 6.8, to a Linux Intel computer (Xembedded XVME6300), running Debian 6 with kernel 2.6.32.
This channel runs over VME bus; both computers are in a VME enclosure and both use the Tundra Tsi148 chipset. The Intel computer is explicitly configured as the system controller, the real-time computer is explicitly not.
Setup:
For the Intel computer I wrote a custom driver that creates a 4MB kernel buffer, and shares it over the VME bus by means of a slave window;
For the real-time computer I set up a DMA transfer to repeatedly forward blocks of exactly 48640 bytes, filled with bytes of test data (zeros, ones, twos, etc.), in quick succession (once every 32 milliseconds, if possible).
For the Intel computer I read the kernel buffer from the driver, to see whether the data arrives correctly, with a hand-started Python program.
Expectation:
I am expecting to see the same data (zeros, ones etc) from the Python program.
I am expecting transmission times roughly corresponding to the chosen bus speed (typically 290 us or 145 us, depending on bus speed), plus a reasonable DMA setup overhead (up to 10us? I am willing to accept larger numbers, say hundreds of usecs, if that is what the bus normally needs)
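As a sanity check on the expected figures, here is a minimal sketch of the arithmetic. It assumes the two "bus speed" choices correspond to payload rates of 160 and 320 MiB/s (that mapping is my assumption, chosen because it reproduces the 290 us and 145 us quoted above), and it ignores DMA setup and bus arbitration entirely.

```python
# Back-of-the-envelope timing for one 48640-byte block.
# Assumption: the bus speed settings mean 160 and 320 MiB/s of payload.
BLOCK_BYTES = 48640
PERIOD_S = 0.032              # one block every 32 ms
MIB = 1024 * 1024


def transfer_time_us(n_bytes, rate_mib_per_s):
    """Ideal wire time in microseconds, without setup or arbitration."""
    return n_bytes / (rate_mib_per_s * MIB) * 1e6


for rate in (160, 320):
    print(f"{rate} MiB/s -> {transfer_time_us(BLOCK_BYTES, rate):.0f} us")

# Sustained throughput that the 32 ms cadence actually requires.
required_mb_per_s = BLOCK_BYTES / PERIOD_S / 1e6
print(f"required average rate: {required_mb_per_s:.2f} MB/s")
```

The required average rate is tiny compared with either bus speed, so the bus has a lot of idle time between blocks.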
Result:
Sometimes data does not arrive at all, and "transmission" time is ~2000 us
Sometimes data arrives reliably, but transmission time is ~98270us, or 98470us, depending on the chosen bus speed.
Questions:
How could I make the transmission reliable and bring down these awful latencies?
What general direction should I search next?
(I would like to tag with VMEbus if I could)
Many thanks

My comments on the question describe how I got the bus working:
- ensure 2eSST320 on both sides of the bus
- ensure that the DMA transaction used a valid block size (the largest valid was 4096 bytes)
I achieved an effective speed of 150MBytes/s (the bus can achieve 320MBytes/s but the tsi148 chip is known for causing significant overhead). This is good enough for me.
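To illustrate the block size point, here is a rough sketch of how a 48640-byte buffer would be cut into pieces of at most 4096 bytes. It assumes that "block size" means the length of each individual DMA transaction; the function name is made up for the example and is not part of any vxWorks or Tsi148 API.

```python
BLOCK_BYTES = 48640       # size of one test block from the question
MAX_DMA_BYTES = 4096      # largest valid size reported above


def split_transfer(total, max_chunk):
    """Return (offset, length) pairs that cover `total` bytes in order."""
    return [(off, min(max_chunk, total - off))
            for off in range(0, total, max_chunk)]


chunks = split_transfer(BLOCK_BYTES, MAX_DMA_BYTES)
print(len(chunks), "transactions, last one:", chunks[-1])

# Time for one block at the effective 150 MB/s mentioned above.
effective_us = BLOCK_BYTES / 150e6 * 1e6
print(f"about {effective_us:.0f} us per block")
```

That gives 11 full transactions plus one of 3584 bytes, and roughly 324 us per block at the effective rate.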

Related

What is the benefit and micro-ops of ENQCMD instruction?

ENQCMD and MOVDIR64B are two instructions in Intel DSA.
MOVDIR64B reads 64-bytes from the source memory address and performs a 64-byte direct-store operation to the destination address. The ENQCMD instruction allows software to write commands to enqueue registers, which are special device registers accessed using memory-mapped I/O (MMIO).
My question is - what is the aim of designing those two instructions?
Based on my understanding, setting up the memory-mapped IO area (the register) requires OS support, i.e. the device driver. After setting up the MMIO area, we could access it using the write() system call, which is also implemented in the device driver. For general architectures, Linux supports iowrite64() to write 8-byte values at a time. Hence, if we want to write 64 bytes, we need to call iowrite64() 8 times.
With the help of MOVDIR64B, for Intel DSA, a new API is created - __iowrite512() which writes 64 bytes atomically.
I agree that the latter one is at least more efficient than the previous one, but I am confused about the time it requires to transfer data.
Consider the following case: if we are given a device (Intel DSA) that supports MOVDIR64B and ENQCMD, suppose we want to transfer 64 bytes of data from memory to MMIO register. There are two options: iowrite64() 8 times (using a loop); or __iowrite512() once. Will the latter one be 8 times faster than the previous one?
My thought is that an 8 times difference is unlikely, but that the latter one will be faster. How much faster would it be? Is it documented anywhere? I do not have Intel DSA, so I am not sure how to test it.
Besides, what other benefits does ENQCMD have? Will it be broken up into several micro-operations? If yes, what are the micro-operations that make up ENQCMD?
iowrite64 uses a UC access to MMIO space, so writes are serialized, not pipelined. That is, only one UC write can be in flight at a time from a single CPU thread, and the CPU doesn't continue execution until the MMIO write is complete.
MOVDIR64B has the potential to be faster than even a single iowrite64, because it uses the WC memory type instead of UC (even if the destination address is UC). After the write is issued by the CPU, it can continue execution. Multiple direct stores can be streamed to the device. That means that multiple direct stores can be in flight at one time from a single CPU thread. MOVDIRI also behaves this way.
As far as I know, the time to actually transfer the data to the destination is the same regardless of the size (between 1 and 64 bytes). Of course that is dependent on the width of the data path within the SoC, which could be different for different implementations.
The main advantage of MOVDIR64B is that the descriptor arrives at the device all at once instead of in pieces. The device doesn't have to worry about receiving a partial descriptor or receiving parts of two descriptors interleaved. In fact, Intel DSA ignores writes smaller than 64 bytes to a portal.
To realize the full benefit of streaming writes, the destination address for each MOVDIR64B from a single CPU thread should be different. Each Intel DSA portal is a 4096-byte page, so there are 64 unique addresses within each portal. Descriptor writes from a single CPU can be striped across the 64 addresses. (It doesn't matter whether writes from multiple CPUs use the same address or different addresses, but normally you would not expect multiple CPUs to be using the same dedicated WQ in DSA.)
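The striping described here comes down to a little address arithmetic. A minimal sketch, with a hypothetical helper name and using only the 4096-byte portal and 64-byte descriptor sizes given above:

```python
PORTAL_BYTES = 4096
DESCRIPTOR_BYTES = 64
SLOTS = PORTAL_BYTES // DESCRIPTOR_BYTES   # 64 distinct target addresses


def portal_offset(seq):
    """Byte offset inside the portal page for the seq-th descriptor write.

    Hypothetical helper: consecutive writes from one thread land on
    different 64-byte-aligned addresses and wrap around after 64 writes.
    """
    return (seq % SLOTS) * DESCRIPTOR_BYTES


print([portal_offset(i) for i in (0, 1, 2, 63, 64)])
```

In real code the offset would be added to the mapped portal base address before issuing MOVDIR64B; that part is not shown here.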
ENQCMD allows the device to respond to software whether it accepted the descriptor or not. This allows multiple applications to use the same shared WQ without risk of a descriptor being lost because the shared WQ is full. Applications can submit descriptors without any driver involvement (after setup), and without any lock or communication between the applications.
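To make the accept/retry behaviour concrete, here is a toy software model. It is not how the hardware is implemented; the class and function names are invented for illustration. The only idea taken from the text above is that a submission to a full shared WQ is reported back as "not accepted" instead of being silently lost, so the submitter can retry.

```python
class SharedWorkQueue:
    """Toy model of a shared WQ with a fixed number of entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def enqcmd(self, descriptor):
        """Model of the status ENQCMD reports: True means accepted."""
        if len(self.entries) >= self.capacity:
            return False          # queue full, caller must retry
        self.entries.append(descriptor)
        return True

    def complete_one(self):
        """Model of the device finishing its oldest descriptor."""
        return self.entries.pop(0) if self.entries else None


def submit_with_retry(wq, descriptor, max_tries, on_retry=lambda: None):
    """Retry until accepted or until max_tries attempts have failed."""
    for _ in range(max_tries):
        if wq.enqcmd(descriptor):
            return True
        on_retry()                # e.g. back off; here the device drains
    return False


wq = SharedWorkQueue(capacity=2)
print(submit_with_retry(wq, "a", 1), submit_with_retry(wq, "b", 1))
print(submit_with_retry(wq, "c", 1))                    # full, rejected
print(submit_with_retry(wq, "c", 2, wq.complete_one))   # accepted on retry
```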

How to measure latency between hardware interrupt and related system call?

I have a Linux machine with two PCIe RS-485 cards (XR17V354 & XR17V352). I have one port on one of the cards hardwired to one port on the other card. These cards are driven by the generic serial driver (serial8250).
I am running a test and measuring latency. I have one Linux process sending two bytes out the port and then listens for two incoming bytes. The other process receives two bytes and immediately sends two bytes back.
I'm measuring this round trip latency to be around 1500 microseconds with a standard deviation of about 40 microseconds. I am trying to understand the source of this latency. Specifically, I'd like to understand the difference in time from which a hard IRQ fires to signal data is ready to read and the time that the bytes are made available to the user space process.
I am aware of the ftrace feature, but I am not sure how best to utilize it, or if there are other, more suitable tools. Thanks.
What kind of driver is this? I assume it's a driver in kernel space and not UIO.
Independent of your issue you could start looking at how long it takes from a hardware interrupt to the kernel driver and from there to user space.
Here[1] is some ancient test case which can be hacked a bit so you can compare interrupt latencies with "standard" Linux, preempt-rt patched Linux and maybe something like Xenomai as well (although the Xenomai solution would require that you rewrite your driver).
You might want to have a look at [2], cyclictest and friends and maybe try to drill with perf into your system to see more details system wide.
Last but not least have a look at LTTng[3] which enables you to instrument code and it already has many instrumentation points.
[1] http://www.denx.de/wiki/DULG/AN2008_03_Xenomai_gpioirqbench
[2] http://cgit.openembedded.org/openembedded-core/tree/meta/recipes-rt/rt-tests/
[3] http://lttng.org/
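Before drilling down with tracers, it can help to know how much of the round trip is plain user-space wake-up and scheduling cost. The sketch below runs the same two-byte ping-pong over a local socket pair instead of the serial ports (that substitution is my assumption, so the numbers exclude the UART, the wire and the hard IRQ path) and reports mean and standard deviation in the same units as the question.

```python
import socket
import statistics
import threading
import time


def recv_exact(sock, n):
    """Read exactly n bytes; a stream socket may return fewer per call."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def echo_peer(sock, rounds):
    """The 'other process': receive two bytes, send them straight back."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, 2))


def measure_round_trips(rounds=200):
    """Return a list of round-trip times in microseconds."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=echo_peer, args=(b, rounds))
    peer.start()
    samples_us = []
    for _ in range(rounds):
        start = time.perf_counter_ns()
        a.sendall(b"\x55\xaa")
        recv_exact(a, 2)
        samples_us.append((time.perf_counter_ns() - start) / 1000)
    peer.join()
    a.close()
    b.close()
    return samples_us


samples = measure_round_trips()
print(f"mean {statistics.mean(samples):.1f} us, "
      f"stdev {statistics.stdev(samples):.1f} us")
```

Whatever this baseline shows, the remainder of the measured 1500 microseconds has to come from the driver, the interrupt path and the hardware itself.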

How to (almost) prevent FT232R (uart) receive data loss?

I need to transfer data from a bare-metal microcontroller system to a Linux PC at 2 MBaud.
The linux PC is currently running a 32 bit Kubuntu 14.04.
To achieve this, I tried to use an FT232R-based USB-UART adapter, but I sometimes observed lost data.
As long as the Linux PC is mainly idle, it seems to work most of the time; however, I still see rare data loss.
But when I force CPU load (e.g. by rebuilding my project), the data loss increases significantly.
After some research I read here that the FT232R has a receive buffer with a capacity of only 384 bytes. This means that the FT232R has to be read out (USB-polled) at least every 1.9 ms. FTDI recommends using flow control, but because of the microcontroller system in use, I cannot use any flow control.
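For reference, the 1.9 ms figure can be reproduced as follows. This is only a sketch; the one assumption is 8N1 framing, i.e. 10 bits on the wire per data byte.

```python
BAUD = 2_000_000
BITS_PER_BYTE = 10    # assumption: 8N1 (start bit + 8 data bits + stop bit)


def fill_time_ms(buffer_bytes, baud=BAUD, bits_per_byte=BITS_PER_BYTE):
    """Time until a receive buffer overflows if the host never reads it."""
    bytes_per_second = baud / bits_per_byte
    return buffer_bytes / bytes_per_second * 1000


print(f"384-byte buffer:  {fill_time_ms(384):.2f} ms")
print(f"1024-byte buffer: {fill_time_ms(1024):.2f} ms")
```

So a 384-byte buffer lasts about 1.92 ms at a continuous 2 MBaud, and a 1 KiB buffer only about 5.12 ms.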
I can live with the fact that there is no absolute guarantee against data loss. But the observed amount of data loss is far too heavy for my needs.
So I tried to find a way to increase the priority of the "FT232 driver" on my Linux system, but could not find out how to do this. It's not described in the
AN220 FTDI Drivers Installation Guide for Linux
and the document
AN107 FTDI Advanced Driver Options
has a chapter about "Changing the Driver Priority" but only for Windows.
So, does anybody know how to increase the FT232R driver priority in linux?
Any other ideas to solve this problem?
BTW: According to the FT232H datasheet, that chip comes with a 1 KiB RX buffer. I have just ordered one and will check its behaviour. Edit: No significant improvement.
If you want reliable data transfer, there is absolutely no way to use any USB-to-serial bridge correctly without hardware flow control, and without dedicating at least all remaining RAM in your microcontroller as the serial buffer (or at least until you can store ~1s worth of data).
I've been using FTDI devices since FT232AM was a hot new thing, and here's how I implement them:
(At least) four lines go between the bridge and the MCU: RXD, TXD, RTS#, CTS#.
Flow control is enabled on the PC side of things.
Flow control is enabled on the MCU side of things.
MCU code is only sending communications when it can fit a complete reply packet into the buffer. Otherwise, it lets the PC side of it time out and retry the request. For requests that stream data back, the entire frame is dropped if it can't fit in the transmit buffer at the time the frame is ready.
If you wish the PC to be reliably notified of new data, say every number of complete samples/frames, you must use event characters to flush the FTDI buffers to the host, and encode your data. HDLC works great for that purpose and is documented in free standards (RFCs and ITU X and Q series - all free!).
The VCP driver, or the D2XX port bring-up is set up to have transfer sizes and latencies set for the needs of the application.
The communication protocol is framed, with CRCs. I usually use a cut-down version of X.25/Q.921/HDLC, limited to SNRM(E) mode for simple "dumb" command-and-respond devices, and SABM(E) for devices that stream data.
The size of FTDI buffers is immaterial, your MCU should have at least an order of magnitude more storage available to buffer things.
If you're running hard real-time code, such as signal processing, make sure that you account for the overhead of lots of transmit interrupts running "back-to-back". Once the FTDI device purges its buffers after a USB transfer, and indicates that it's ready to receive more data from your MCU, your code can potentially transmit a full FTDI buffer's worth of data at once.
If you're close to running out of cycles in your realtime code, you can use a timer as a source of transmit interrupts instead of the UART interrupt. You can then set the timer rate much lower than the UART speed. This allows you to pace the transmission slower without lowering the baudrate. If you're running in setup/preoperational mode or with lower real-time task load, you can then trivially raise the transmit rate without changing the baudrate. You can use a similar trick to pace the receives by flipping the RTS# output on the MCU under timer control. Of course this isn't a problem if you use DMA or a sufficiently fast MCU.
If you're out of timers, note that many other peripherals can also be repurposed as a source of timer interrupts.
This advice applies no matter what the USB host is.
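Two of the points above, framing with CRCs and dropping a whole frame when it does not fit, can be sketched in a few lines. This is an illustration in Python rather than MCU code, and it assumes the byte-stuffed HDLC-like framing of RFC 1662 (flag 0x7E, escape 0x7D, 16-bit FCS sent low byte first). The `TxBuffer` class is a made-up stand-in for the MCU transmit buffer, and none of the SNRM/SABM link procedures are shown.

```python
FLAG, ESC, XOR = 0x7E, 0x7D, 0x20


def crc16_x25(data: bytes) -> int:
    """16-bit FCS as used by HDLC/X.25 (reflected polynomial 0x8408)."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x8408 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF


def encode_frame(payload: bytes) -> bytes:
    """Append the FCS, escape flag/escape bytes, wrap in flags."""
    fcs = crc16_x25(payload)
    body = payload + bytes((fcs & 0xFF, fcs >> 8))   # FCS low byte first
    out = bytearray([FLAG])
    for b in body:
        if b in (FLAG, ESC):
            out += bytes((ESC, b ^ XOR))
        else:
            out.append(b)
    out.append(FLAG)
    return bytes(out)


def decode_frame(frame: bytes):
    """Return the payload, or None if the frame is malformed or corrupt."""
    if len(frame) < 4 or frame[0] != FLAG or frame[-1] != FLAG:
        return None
    body = bytearray()
    it = iter(frame[1:-1])
    for b in it:
        if b == ESC:
            nxt = next(it, None)
            if nxt is None:
                return None
            body.append(nxt ^ XOR)
        else:
            body.append(b)
    if len(body) < 2:
        return None
    payload, fcs = bytes(body[:-2]), body[-2] | (body[-1] << 8)
    return payload if crc16_x25(payload) == fcs else None


class TxBuffer:
    """Accept a frame only if all of it fits, otherwise drop it whole."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = bytearray()

    def offer(self, frame: bytes) -> bool:
        if len(self.pending) + len(frame) > self.capacity:
            return False          # never queue a partial frame
        self.pending += frame
        return True


frame = encode_frame(b"\x7e sample \x7d")
print(frame.hex(), decode_frame(frame))
```

Because a frame is either queued completely or not at all, the receiver only ever sees whole frames, and the CRC catches anything the link damaged along the way.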
Sidebar: Admittedly, Linux USB serial driver "architecture" is in the state of suspended animation as far as I can tell, so getting sensible results there may require a lot of work. It's not a matter of a simple kernel thread priority change, I'm afraid. Part of the reason is that funding for a lot of Linux work focuses on server/enterprise applications, and there the USB performance is a matter of secondary interest at best. It works well enough for USB storage, but USB serial is a mess nobody really cares enough to overhaul, and overhaul it needs. Just look at the amount of copy-pasta in that department...

Critical Timing in an ARM Linux Kernel Driver

I am running linux on an MX28 (ARMv5), and am using a GPIO line to talk to a device. Unfortunately, the device has some special timing requirements. A low on the GPIO line cannot last longer than 7us, highs have no special timing requirements. The code is implemented as a kernel device driver, and toggles the GPIO with direct register writes rather than going through the kernel GPIO api. For testing, I am just generating 3 pulses. The process is as follows, all in one function so it should fit in the instruction cache:
set gpio high
Save Flags & Disable Interrupts
gpio low
pause
gpio high
repeat 2x more
Restore Flags/Reenable Interrupts
Here's the output of a logic analyzer tied to the GPIO.
Most of the time it works just great, and the pulses last just under 1us. However, about 10% of the lows last for many, many microseconds. Even though interrupts are disabled, something is causing the flow of the code to be interrupted.
I am at a loss. RT Linux would likely not help here, because the problem is not latency, it appears to be something happening during the low, even though nothing should interrupt it with the IRQs disabled. Any suggestions would be greatly, greatly appreciated.
The ARM cache on an IMX25 (ARM926) is 16K code, 16K data L1 with a 32-byte line length, or eight instructions. With the DDR-SDRAM controller running at 133 MHz and a 16-bit bus, the transfer rate is about 300 MB/s. A cache fill should only take about 100 ns, not 9 µs; this is about 100 times too long.
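The estimate works out as follows. This is a quick sketch that uses only the figures quoted above (32-byte line, roughly 300 MB/s, an observed stall of about 9 microseconds) and ignores SDRAM row activation and other controller overhead.

```python
LINE_BYTES = 32                # one cache line
BANDWIDTH_BYTES_PER_S = 300e6  # approximate SDRAM transfer rate
OBSERVED_STALL_NS = 9_000      # the ~9 microsecond low pulse


def line_fill_ns(line_bytes=LINE_BYTES, bandwidth=BANDWIDTH_BYTES_PER_S):
    """Ideal time to stream one cache line from SDRAM, in nanoseconds."""
    return line_bytes / bandwidth * 1e9


fill = line_fill_ns()
print(f"line fill: about {fill:.0f} ns")
print(f"observed stall is about {OBSERVED_STALL_NS / fill:.0f}x longer")
```

A single instruction cache miss therefore accounts for roughly 107 ns, nearly two orders of magnitude short of the observed stall.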
However, you have four other issues with Linux.
TLB misses and a page table walk.
Data aborts.
DMA masters stealing.
FIQ interrupts.
It is unlikely that the LCD master is stealing enough bandwidth, unless you have a huge display. Is your display larger than 1/4VGA? If not, this is only 10% of the memory bandwidth and this will pipeline with the processor. Do you have either Ethernet or USB active? These peripherals are higher data rate and could cause this type of contention with SDRAM.
All of these issues may be avoided by writing your toggler PC relative and copying it to the IRAM. See: iram_alloc.c; this file should be portable to older versions of Linux. The XBAR switch allows fetches from SDRAM and IRAM simultaneously. The IRAM can still be a target of other DMA masters. If you are really pressed, move the code to the ETB buffers which no other master in the system can access.
The TLB miss can actually be quite steep as it may need to run several single-beat SDRAM cycles; still, this should be under 1 µs. You have not posted code, so it is possible that a variable access or something similar is causing a data fault, which is not maskable.
If you have any drivers using the FIQ, they may still be running even though you have masked the normal IRQ interrupts. For instance, the ALSA driver for this system normally uses the FIQ.
Both the ETB and the IRAM are 32-bit data paths and low wait state. Either one will probably give better response than the DDR-SDRAM.
We have achieved sub micro-second response by using a FIQ and IRAM to toggle GPIOs on an IMX258 with another protocol using bit banging.
One possible workaround to the problem Chris mentioned (in addition to problems with paging of kernel module code) is to use a PWM peripheral where the duration of the pulse is pre-programmed and the timing is implemented in hardware.
Fancy processors with caches are not suitable for hard realtime work. Execution time varies if cache misses are non-deterministic (and designs where cache misses are completely deterministic aren't complicated enough to justify a fancy processor).
You can try to avoid memory controller latency during critical sections by aligning the critical section so that it doesn't straddle cache lines. Or prefetch the code you will need. But this is going to be very non-portable and create a nightmare for future maintenance. And still doesn't protect the access to memory-mapped GPIO from bus contention.

low latency Interrupt handling (expected avg time to return from kernel to user space is?)

I have a Fibre Optic link, with a proprietary Device Driver.
The link goes into a PCIe card. Running on a RHEL 5.2 (2.6.18-128~)
I have mmap'ed the interface on the card for setup and FIFO access etc, and these read/writes take a few µs to complete, so all good there.
But of course cannot use this for interrupts, so I have to use the kernel module provided, with its user-space lib interface.
WaitForInterrupt(); // API lib interface to kernel module
// Interrupt occurs and am returned to my code in user space
time = CurrentTime() - LatchedTime(); // time to get to here
It takes around 70µs to return from WaitForInterrupt(). (The time the interrupt is raised is latched in the firmware, I read this which as I say above takes ~2µs, and compare it against the current time in the firmware)
What are expected access times between an interrupt occurring and the User Space API interrupt call wait method returning?
What do network or other high-speed interfaces take?
500ms is many orders of magnitude larger than what a simple switch between userspace/kernel takes, but as someone mentioned in comments, Linux is not a real-time OS, so there's no guarantee 500ms "hiccups" won't show up now and then.
It's quite impossible to tell what the culprit is; the device driver could simply be trying to bundle up data to be more efficient.
That said, we've had endless troubles with some custom cards and interactions with both APIC and ACPI, requiring a delicate balance of BIOS settings, what card goes into which PCI slot and whether a particular video card screws up everything - likely caused by a dubious driver interacting with a more or less buggy BIOS or video card.
If you're able to reliably exceed 500us on a system that's not heavily loaded, I think you're looking at a bad driver implementation (or its userspace wrapper/counterpart).
In my experience the latency to wake a user thread on interrupt should be less than 10us, though (as others have said) Linux provides no latency guarantees.
If you have a recent kernel, you can use the perf sched tool to measure the latency, and see where the time is being used. (500us does sound a tad on the high side, depending on your processor, how many tasks are running, ...)
