Audio transmission via WebRTC crackles when the CPU is overloaded

I'm transmitting from an old computer (an AMD A6-5400K with 8 GB of RAM). The transmission is stable while the computer is otherwise idle, but when I do something else, for example open a Chrome tab, the graphs look like this:
[Graphs omitted: processing (CPU), bytes received, jitter]
As you can see, processing hit its maximum, and at that moment I hear a kind of crackling in the transmission.
Of course part of the problem is the computer itself, but what puzzles me is that with other transmission software (which also uses UDP) I don't hear this crackling: even with processing at its peak the audio stays stable, and I would like to reproduce the same behaviour with WebRTC.

Related

How to measure latency between hardware interrupt and related system call?

I have a Linux machine with two PCIe RS-485 cards (XR17V354 & XR17V352). I have one port on one of the cards hardwired to one port on the other card. These cards are driven by the generic serial driver (serial8250).
I am running a test and measuring latency. One Linux process sends two bytes out its port and then listens for two incoming bytes. The other process receives the two bytes and immediately sends two bytes back.
I'm measuring this round trip latency to be around 1500 microseconds with a standard deviation of about 40 microseconds. I am trying to understand the source of this latency. Specifically, I'd like to understand the difference in time from which a hard IRQ fires to signal data is ready to read and the time that the bytes are made available to the user space process.
I am aware of the ftrace feature, but I am not sure how best to utilize it, or if there are other, more suitable tools. Thanks.
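For reference, the round-trip measurement described above can be sketched roughly like this on the requester side, assuming the hard-wired port shows up as /dev/ttyS4 (a hypothetical name), an assumed baud rate, and a peer that simply echoes two bytes back:

    /* Requester side of the round-trip test: send 2 bytes, wait for 2 bytes,
     * timestamp with CLOCK_MONOTONIC. Port name and baud rate are assumptions. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <termios.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/ttyS4", O_RDWR | O_NOCTTY);   /* hypothetical port */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct termios tio;
        tcgetattr(fd, &tio);
        cfmakeraw(&tio);
        cfsetispeed(&tio, B115200);                        /* assumed baud rate */
        cfsetospeed(&tio, B115200);
        tio.c_cc[VMIN]  = 2;                               /* block until 2 bytes arrive */
        tio.c_cc[VTIME] = 0;
        tcsetattr(fd, TCSANOW, &tio);

        unsigned char out[2] = { 0xAA, 0x55 }, in[2];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (write(fd, out, sizeof out) != 2)
            return 1;
        if (read(fd, in, sizeof in) != 2)                  /* blocks until both bytes arrived */
            return 1;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                  (t1.tv_nsec - t0.tv_nsec) / 1000L;
        printf("round trip: %ld us\n", us);
        close(fd);
        return 0;
    }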
What kind of driver is this? I assume it's a driver in kernel space and not UIO.
Independent of your specific issue, you could start by looking at how long it takes to get from a hardware interrupt to the kernel driver, and from there to user space.
Here [1] is an ancient test case that can be hacked a bit so you can compare interrupt latencies on "standard" Linux, preempt-rt-patched Linux, and maybe something like Xenomai as well (although the Xenomai solution would require rewriting your driver).
You might want to have a look at [2], cyclictest and friends, and maybe try to drill into your system with perf to see more details system-wide.
Last but not least, have a look at LTTng [3], which lets you instrument your code and already ships with many instrumentation points.
[1] http://www.denx.de/wiki/DULG/AN2008_03_Xenomai_gpioirqbench
[2] http://cgit.openembedded.org/openembedded-core/tree/meta/recipes-rt/rt-tests/
[3] http://lttng.org/
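As a concrete starting point for the ftrace suggestion, one cheap trick is to drop a user-space marker into the trace buffer the moment read() returns, and compare its timestamp with the preceding hard-IRQ event in the same trace. A minimal sketch, assuming the usual tracefs mount point and that the irq tracepoints (e.g. irq:irq_handler_entry) have already been enabled:

    /* Drop a user-space marker into the ftrace buffer right after read() returns,
     * so it can be compared with the hard-IRQ timestamps in the same trace.
     * Path assumes the usual tracefs mount point; requires root. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int marker_fd;

    static void trace_mark(const char *msg)
    {
        write(marker_fd, msg, strlen(msg));
    }

    int main(void)
    {
        marker_fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
        if (marker_fd < 0) {
            perror("trace_marker");
            return 1;
        }

        /* ... open and configure the serial port as in the earlier sketch ... */
        /* read(fd, buf, 2);  -- blocking read of the two reply bytes */

        trace_mark("serial bytes visible in user space\n");
        /* Afterwards, read /sys/kernel/debug/tracing/trace and compare this
         * marker's timestamp with the preceding irq_handler_entry event. */
        close(marker_fd);
        return 0;
    }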

Latency in ioread

Suppose you have a PCIe device presenting a single BAR and one DMA area declared with pci_alloc_consistent(..). The BAR's flags indicate a non-prefetchable, non-cacheable memory region.
What are the principal causes of latency when reading the DMA area, and similarly, what are the causes of latency when reading the BAR?
Thank you for answering this simple question :D!
This smells a bit like homework but I suspect the concepts are not well understood by many so I'll add an answer.
The best way to think through this is to consider what needs to happen in order for a read to complete. The CPU and the device are on separate sides of the PCIe link. It's helpful to view PCI-Express as a mini network. Each link is point-to-point (like your PC connected to another PC). There may also be intermediate switches (aka bridges in PCI). In that case, it's like your PC is connected to a switch that is in turn connected to the other PC.
So, if the CPU wants to read its own memory (the "DMA" region you allocated), it's relatively fast. It has a high speed bus that is designed to make that happen fast. Also, there are multiple layers of caching built in to keep frequently (or recently) used data "close" to the CPU.
But if the CPU wants to read from the BAR in the device, the CPU (actually the PCIe root complex integrated with the CPU) must compose a PCIe read request, send the request, and wait while the device decodes the request, accesses the BAR location and sends back the requested data. Tick tock. Your CPU is doing nothing else while it waits for this to complete.
This is pretty much analogous to asking for a web page from another computer. You formulate an HTTP request, send it and wait while the web server accesses the content, formulates a return packet and sends it to you.
If the device wishes to access memory residing "in" the CPU, it's pretty much the exact same thing in reverse. ("Direct memory access" just means that it doesn't need to interrupt the CPU to handle it, but something [the root complex here] is still responsible for decoding the request, fulfilling the read and sending back the resulting data.)
Also, if there are intermediate PCIe switches between CPU and device, those may add additional buffering/queuing delays (exactly as a switch or router might in a network). And any such delays are doubled since they're incurred in both directions.
Of course PCIe is very fast, so all of that happens in mere nanoseconds, but that's still orders of magnitude slower than a "local" read.
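To make the two paths concrete, here is a heavily abbreviated probe-function sketch (error handling mostly omitted, sizes assumed) contrasting a read from the coherent DMA buffer with a read from the mapped BAR, using the same 2.6-era API the question mentions:

    /* Fragment of a PCI driver probe routine: the DMA buffer lives in ordinary
     * system RAM and is read over the CPU's local memory bus, while every BAR
     * access becomes a read request that travels over the PCIe link. */
    #include <linux/module.h>
    #include <linux/pci.h>

    static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
        void *dma_buf;
        dma_addr_t dma_handle;
        void __iomem *bar;
        u32 a, b;

        if (pci_enable_device(pdev))
            return -ENODEV;

        /* Coherent buffer in system RAM: the CPU reads this locally,
         * no PCIe transaction is needed. */
        dma_buf = pci_alloc_consistent(pdev, 4096, &dma_handle);
        a = ((u32 *)dma_buf)[0];

        /* BAR 0: each ioread32() turns into a non-posted PCIe read request;
         * the CPU stalls until the device's completion comes back. */
        bar = pci_iomap(pdev, 0, 0);
        b = ioread32(bar);

        pr_info("dma word %08x, bar word %08x\n", a, b);

        pci_iounmap(pdev, bar);
        pci_free_consistent(pdev, 4096, dma_buf, dma_handle);
        return 0;
    }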

How to (almost) prevent FT232R (uart) receive data loss?

I need to transfer data from a bare metal microcontroller system to a linux PC with 2 MBaud.
The linux PC is currently running a 32 bit Kubuntu 14.04.
To achieve this, I tried to use an FT232R-based USB-UART adapter, but I sometimes observed lost data.
As long as the Linux PC is mostly idle it seems to work most of the time; however, I still see rare data loss.
But when I put load on the CPU (e.g. rebuild my project), the data loss increases significantly.
After some research I read here that the FT232R has a receive buffer with a capacity of only 384 bytes. This means that the FT232R has to be read out (USB-polled) at least every 1.9 ms. FTDI recommends using flow control, but because of the microcontroller system in use, I cannot use any flow control.
I can live with the fact that there is no absolute guarantee against data loss. But the amount of data loss I am seeing is far too high for my needs.
So I tried to find a way to increase the priority of the "FT232 driver" on my Linux system, but could not find out how to do this. It's not described in the
AN220 FTDI Drivers Installation Guide for Linux
and the document
AN107 FTDI Advanced Driver Options
has a chapter about "Changing the Driver Priority", but only for Windows.
So, does anybody know how to increase the FT232R driver priority in linux?
Any other ideas to solve this problem?
BTW: According to the FT232H datasheet, that part comes with a 1 KiB RX buffer. I've ordered one and will check out its behaviour. Edit: No significant improvement.
If you want reliable data transfer, there is absolutely no way to use any USB-to-serial bridge correctly without hardware flow control, and without dedicating at least all remaining RAM in your microcontroller to the serial buffer (or at least enough to store ~1 s worth of data).
I've been using FTDI devices since FT232AM was a hot new thing, and here's how I implement them:
(At least) four lines go between the bridge and the MCU: RXD, TXD, RTS#, CTS#.
Flow control is enabled on the PC side of things (a minimal Linux example is sketched after this answer).
Flow control is enabled on the MCU side of things.
The MCU code only sends a reply when it can fit a complete reply packet into its buffer. Otherwise it lets the PC side time out and retry the request. For requests that stream data back, the entire frame is dropped if it can't fit into the transmit buffer at the time the frame is ready.
If you wish the PC to be reliably notified of new data, say after every N complete samples/frames, you must use event characters to flush the FTDI buffers to the host, and encode your data. HDLC works great for that purpose and is documented in free standards (RFCs and the ITU X and Q series - all free!).
The VCP driver, or the D2XX port bring-up, is set up with transfer sizes and latencies chosen for the needs of the application.
The communication protocol is framed, with CRCs. I usually use a cut-down version of X.25/Q.921/HDLC, limited to SNRM(E) mode for simple "dumb" command-and-respond devices, and SABM(E) for devices that stream data.
The size of FTDI buffers is immaterial, your MCU should have at least an order of magnitude more storage available to buffer things.
If you're running hard real-time code, such as signal processing, make sure that you account for the overhead of lots of transmit interrupts running "back-to-back". Once the FTDI device purges its buffers after a USB transfer, and indicates that it's ready to receive more data from your MCU, your code can potentially transmit a full FTDI buffer's worth of data at once.
If you're close to running out of cycles in your real-time code, you can use a timer as a source of transmit interrupts instead of the UART interrupt. You can then set the timer rate much lower than the UART speed. This allows you to pace the transmission more slowly without lowering the baud rate. If you're running in setup/preoperational mode or with a lower real-time task load, you can then trivially raise the transmit rate without changing the baud rate. You can use a similar trick to pace the receives by flipping the RTS# output on the MCU under timer control. Of course this isn't a problem if you use DMA or a sufficiently fast MCU.
If you're out of timers, note that many other peripherals can also be repurposed as a source of timer interrupts.
This advice applies no matter what the USB host is.
Sidebar: Admittedly, Linux USB serial driver "architecture" is in the state of suspended animation as far as I can tell, so getting sensible results there may require a lot of work. It's not a matter of a simple kernel thread priority change, I'm afraid. Part of the reason is that funding for a lot of Linux work focuses on server/enterprise applications, and there the USB performance is a matter of secondary interest at best. It works well enough for USB storage, but USB serial is a mess nobody really cares enough to overhaul, and overhaul it needs. Just look at the amount of copy-pasta in that department...
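For the PC side, here is a minimal sketch of opening the port at 2 MBaud with hardware (RTS#/CTS#) flow control enabled; the device name /dev/ttyUSB0 and the VMIN/VTIME choices are assumptions, and the point is simply to read in large chunks with CRTSCTS set:

    /* Open an FTDI-backed port raw, at 2 MBaud, with RTS/CTS flow control.
     * Device name is an assumption; error handling kept minimal. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <termios.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/ttyUSB0", O_RDWR | O_NOCTTY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct termios tio;
        tcgetattr(fd, &tio);
        cfmakeraw(&tio);                 /* no line discipline processing */
        cfsetispeed(&tio, B2000000);     /* 2 MBaud (Linux-specific Bxxx constant) */
        cfsetospeed(&tio, B2000000);
        tio.c_cflag |= CRTSCTS;          /* hardware RTS#/CTS# flow control */
        tio.c_cc[VMIN]  = 255;           /* hand over data in large chunks ... */
        tio.c_cc[VTIME] = 1;             /* ... or when the line goes quiet for 100 ms */
        tcsetattr(fd, TCSANOW, &tio);

        unsigned char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n <= 0)
                break;
            /* ... hand the n received bytes to the framing/CRC layer ... */
        }
        close(fd);
        return 0;
    }

The large VMIN is just one choice; the idea is to let the driver deliver data in big chunks rather than byte by byte.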

What is the least amount of (manageable) samples I can give to a PCM buffer?

Some APIs, like this one, can create a PCM buffer from an array of samples (each represented by a number).
Say I want to generate and play some audio in (near) real time. I could generate a PCM buffer with 100 samples and send it off to the sound card, using my magic API functions. While those 100 samples are playing, 100 more samples are generated, and then the buffers are switched. Finally, I can repeat the write / play / switch process to create a constant stream of audio.
Now, for my question: what is the smallest buffer size (in samples) I can use with the write / play / switch approach without a perceivable pause occurring in the audio stream? I understand the answer will depend on sample rate, processor speed, and transfer time to the sound card - so please provide a "rule of thumb" answer if that's more appropriate!
(I'm a bit new to audio stuff, so please feel free to point out any misconceptions I might have!)
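The question doesn't name an API, but as one concrete illustration, here is a minimal sketch of that generate/write loop using ALSA on Linux, with the 100-sample chunk from the question; the device name, tone, and latency target are assumptions (link with -lasound -lm):

    /* Generate a 440 Hz tone in 100-sample chunks and hand each chunk to ALSA
     * as soon as the previous one has been queued. */
    #include <alsa/asoundlib.h>
    #include <math.h>

    #define RATE   48000
    #define CHUNK  100          /* samples per write, as in the question */

    int main(void)
    {
        snd_pcm_t *pcm;
        short buf[CHUNK];
        double phase = 0.0, step = 2.0 * M_PI * 440.0 / RATE;

        snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0);
        /* Ask for a ~10 ms total buffer; ALSA will pick something it can do. */
        snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE, SND_PCM_ACCESS_RW_INTERLEAVED,
                           1 /* mono */, RATE, 1 /* allow resampling */, 10000 /* us */);

        for (int n = 0; n < RATE / CHUNK; n++) {        /* roughly 1 second of audio */
            for (int i = 0; i < CHUNK; i++, phase += step)
                buf[i] = (short)(3000.0 * sin(phase));
            snd_pcm_sframes_t w = snd_pcm_writei(pcm, buf, CHUNK);
            if (w < 0)                                  /* underrun: recover and go on */
                snd_pcm_recover(pcm, (int)w, 0);
        }
        snd_pcm_drain(pcm);
        snd_pcm_close(pcm);
        return 0;
    }

snd_pcm_writei() blocks when the device buffer is full, so the loop is naturally paced by playback.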
TL;DR: 1 ms buffers are easily achievable on desktop operating systems if care is taken, though they may not be desirable from a performance and energy-usage perspective.
The lower limit on buffer size (and thus on output latency) is set by the worst-case scheduling latency of your operating system.
The sequence of events is:
The audio hardware progressively outputs samples from its buffer
At some point, it reaches a low-water-mark and generates an interrupt, signalling that the buffer needs replenishing with more samples
The operating system services the interrupt and marks the thread as ready to run
The operating system schedules the thread to run on a CPU
The thread computes (or otherwise obtains) samples and writes them into the output buffer.
The scheduling latency is the time between steps 2 and 4 above, and is dictated largely by the design of the host operating system. If using a hard RTOS such as VxWorks or eCos with pre-emptive priority scheduling, the worst case can be on the order of fractions of a microsecond.
General-purpose desktop operating systems are generally less slick. MacOSX supports real-time user-space scheduling and is easily capable of servicing 1 ms buffers. The Linux kernel can be configured for pre-emptive real-time threads, with bottom-half interrupt handlers handled by kernel threads. You ought to be able to achieve 1 ms buffer sizes there too. I can't comment on the capabilities of recent versions of the NT kernel.
It's also possible to take a (usually bad) latency hit in step 5, when your process fills the buffer, if it takes a page fault. The usual practice is to allocate all of the heap and stack memory you require up front and mlock() it, along with your program code and data, into physical memory.
Absolutely forget about achieving low latency in an interpreted or JITed language runtime. You have far too little control over what the language runtime is doing, and no realistic prospect of preventing page faults (e.g. from memory allocation). I suspect 10 ms is pushing your luck in these cases.
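On Linux, the lock-and-elevate step described above boils down to something like the sketch below; the priority value is arbitrary, and both calls need suitable privileges (rtprio and memlock limits):

    /* Pin the process's pages into RAM and switch to a real-time scheduling
     * class before entering the audio loop. Priority 70 is just an example. */
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Lock current and future pages so the render thread never page-faults. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* Pre-emptive fixed-priority scheduling for this process. */
        struct sched_param sp = { .sched_priority = 70 };
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            perror("sched_setscheduler");

        /* ... allocate all audio buffers up front, then run the render loop ... */
        return 0;
    }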
It's worth noting that rendering short buffers has a significant impact on system performance (and energy consumption) due to the high rate of interrupts and context switches. These destroy L1 cache locality in a way that's disproportionate with the work they actually do.
While 1ms audio buffers are possible they are not necessarily desirable. The tick rate of modern Windows, for example, is between 10ms and 30ms. That means that, usually at the audio driver end, you need to keep a ring buffer of a bunch of those 1 ms packets to deal with buffer starvation conditions, in case the CPU gets pulled out from under you by some other thread.
All modern audio engines use powers of 2 for their audio buffer sizes. Start with 256 samples per frame and see how that works for you. Every piece of hardware is different, and you can't depend on how a Mac or PC gives you time slices. The user might be calculating pi on some other process while you are running your audio program. It's safer to leave the buffer size large, like 2048 samples per frame, and let the user turn it down if the latency bothers them.
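To put those numbers into time units (assuming a 48 kHz sample rate, which neither answer pins down): 256 samples is 256 / 48000 ≈ 5.3 ms per buffer, while 2048 samples is ≈ 42.7 ms, which is why the larger default is so much more tolerant of scheduling hiccups.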

Trouble setting up reliable DMA transfer between 2 TSI148 VMEbus controllers

I am seeking help, most importantly from VMEbus experts.
I am working on a project that aims to set up a communication channel from a real-time PowerPC controller (Emerson MVME4100), running VxWorks 6.8, to a Linux Intel computer (Xembedded XVME6300), running Debian 6 with kernel 2.6.32.
This channel runs over the VME bus; both computers sit in the same VME enclosure and both use the Tundra Tsi148 chipset. The Intel computer is explicitly configured as the system controller; the real-time computer is explicitly not.
Setup:
For the Intel computer I wrote a custom driver that creates a 4MB kernel buffer, and shares it over the VME bus by means of a slave window;
For the real-time computer I set up a DMA transfer that repeatedly forwards blocks of exactly 48640 bytes, filled with test data (zeros, ones, twos, etc.), in quick succession (once every 32 milliseconds, if possible);
On the Intel computer I read the kernel buffer from the driver with a hand-started Python program, to see whether the data arrives correctly.
Expectation:
I am expecting to see the same data (zeros, ones etc) from the Python program.
I am expecting transmission times roughly corresponding to the chosen bus speed (typically 290 us or 145 us, depending on bus speed), plus a reasonable DMA setup overhead (up to 10us? I am willing to accept larger numbers, say hundreds of usecs, if that is what the bus normally needs)
Result:
Sometimes data does not arrive at all, and "transmission" time is ~2000 us
Sometimes data arrives reliably, but transmission time is ~98270us, or 98470us, depending on the chosen bus speed.
Questions:
How can I make the transmission reliable and bring down these awful latencies?
What general direction should I search next?
(I would like to tag with VMEbus if I could)
Many thanks
My comments on the question describe how I got the bus working:
- ensure 2eSST320 on both sides of the bus
- ensure that the DMA transaction used a valid block size (the largest valid was 4096 bytes)
I achieved an effective speed of 150MBytes/s (the bus can achieve 320MBytes/s but the tsi148 chip is known for causing significant overhead). This is good enough for me.
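For scale: at that effective 150 MByte/s, one 48640-byte block takes roughly 48640 / 150e6 ≈ 0.32 ms on the wire, comfortably inside the 32 ms repetition period, which suggests the ~98 ms times seen earlier came from the misconfiguration (transfer mode and DMA block size) rather than from raw bus bandwidth.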

Resources