Fast Audio Input/Output - audio

Here's what I want to do:
I want to allow the user to give my program some sound data (through a mic input), then hold it for 250ms, then output it back out through the speakers.
I have done this already using Java Sound API. The problem is that it's sorta slow. It takes a minimum of about 1-2 seconds from the time the sound is made to the time the sound is heard again from the speakers, and I haven't even tried to implement delay logic yet. Theoretically there should be no delay, but there is. I understand that you have to wait for the sound card to fill up its buffer or whatever, and the sample size and sampling rate have something to do with this.
My question is this: Should I continue down the Java path trying to do this? I want to get the delay down to like 100ms if possible. Does anyone have experience using the ASIO driver with Java? Supposedly it's faster.
Also, I'm a .NET guy. Does this make sense to do with .NET instead? What about C++? I'm looking for the right technology to use here, and maybe a good example of how to read/write to audio input/output streams using your suggested technology platform. Thanks for your help!

I've used JavaSound in the past and found it wonderfully flaky (and it keeps changing between VM releases). If you like C#, use it; just use the DirectX APIs. Here's an example of doing roughly what you want using DirectSound and C#. You could use the Effects plugins to perform your 250 ms echo.
http://blogs.microsoft.co.il/blogs/tamir/archive/2008/12/25/capturing-and-streaming-sound-by-using-directsound-with-c.aspx

You may want to look into JACK, an audio API designed for low-latency sound processing. Additionally, Google turns up this nifty presentation [PDF] about using JACK with Java.
"Theoretically there should be no delay, but there is."
Well, it's impossible to have zero delay. The best you can hope for is an unnoticeable delay (in terms of human perception). It might help if you describe your basic algorithm for reading & writing the sound data, so people can identify possible problems.
A potential issue with using a garbage-collected language like Java is that the GC will periodically run, interrupting your processing for some arbitrary amount of time. However, I'd be surprised if it's >100ms in normal usage. If GC is a problem, most JVMs provide alternate collection algorithms you can try.

If you choose to go down the C/C++ path, I highly recommend using PortAudio ( http://portaudio.com/ ). It works with almost everything on multiple platforms, and it gives you low-level control of the sound drivers without actually having to deal with each platform's native sound driver technology.
I've used PortAudio on multiple projects, and it is a real joy to use. And the license is permissive.

If low latency is your goal, you can't beat C.
libsoundio is a low-level C library for real-time audio input and output. It even comes with an example program that does exactly what you want - piping the microphone input to the speaker output.

It's possible with JavaSound to get end-to-end latency in the ballpark of 100-150ms.
The primary cause of latency is the buffer sizes of the capture and playback lines. The bufferSize is set when opening the lines:
capture: TargetDataLine#open(AudioFormat format, int bufferSize)
playback: SourceDataLine#open(AudioFormat format, int bufferSize)
If the buffer is too big it will cause excess latency, but if it's too small it will cause stuttery playback. So you need to find a balance between your application's needs and your computing power.
The default buffer size can be checked with DataLine#getBufferSize when calling #open(AudioFormat format). The default size will vary based on the AudioFormat and seems to be geared for high-latency, stutter-free playback applications (e.g. internet streaming). If you're developing a low-latency application, the default buffer size is much too large and should be changed.
In my testing with a 16-bit PCM AudioFormat, a buffer size of 1024 bytes has been pretty close to ideal for low latency.
The second and often overlooked cause of audio latency is any other activity being done in the capture or playback threads. For example, logging messages to the console can introduce tens of milliseconds of latency. Turn it off.
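Putting those pieces together, here is a minimal sketch of the capture-delay-playback loop the original question describes, using the JavaSound calls mentioned above. The format, the 1024-byte line buffers, the 256-byte chunk size, and the ArrayDeque used as a delay queue are all assumptions to tune for your own hardware, not a drop-in solution:

    import java.util.ArrayDeque;
    import javax.sound.sampled.*;

    public class DelayedLoopback {
        public static void main(String[] args) throws Exception {
            AudioFormat format = new AudioFormat(44100f, 16, 1, true, false); // 16-bit mono PCM

            // Open both lines with an explicitly small buffer instead of the default.
            TargetDataLine in = AudioSystem.getTargetDataLine(format);
            SourceDataLine out = AudioSystem.getSourceDataLine(format);
            in.open(format, 1024);   // ~11.6 ms of 16-bit mono audio at 44.1 kHz
            out.open(format, 1024);

            // A software FIFO pre-filled with 250 ms of silence implements the delay:
            // each captured chunk is enqueued and the oldest chunk is played, so audio
            // comes back out roughly 250 ms after it went in.
            ArrayDeque<byte[]> fifo = new ArrayDeque<>();
            int chunkSize = 256;
            int delayChunks = (int) (format.getFrameRate() * format.getFrameSize() * 0.25) / chunkSize;
            for (int i = 0; i < delayChunks; i++) {
                fifo.addLast(new byte[chunkSize]);
            }

            in.start();
            out.start();

            while (true) {
                byte[] chunk = new byte[chunkSize];
                in.read(chunk, 0, chunkSize);                // blocks until a chunk is captured
                fifo.addLast(chunk);
                out.write(fifo.removeFirst(), 0, chunkSize); // blocks if the playback buffer is full
            }
        }
    }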

Related

Is it practical to use the "rude big hammer" approach to parallelize a MacOS/CoreAudio real-time audio callback?

First, some relevant background info: I've got a CoreAudio-based low-latency audio processing application that does various mixing and special effects on audio that is coming from an input device on a purpose-dedicated Mac (running the latest version of MacOS) and delivers the results back to one of the Mac's local audio devices.
In order to obtain the best/most reliable low-latency performance, this app is designed to hook in to CoreAudio's low-level audio-rendering callback (via AudioDeviceCreateIOProcID(), AudioDeviceStart(), etc) and every time the callback-function is called (from the CoreAudio's realtime context), it reads the incoming audio frames (e.g. 128 frames, 64 samples per frame), does the necessary math, and writes out the outgoing samples.
This all works quite well, but from everything I've read, Apple's CoreAudio implementation has an unwritten de-facto requirement that all real-time audio operations happen in a single thread. There are good reasons for this which I acknowledge (mainly that outside of SIMD/SSE/AVX instructions, which I already use, almost all of the mechanisms you might employ to co-ordinate parallelized behavior are not real-time-safe and therefore trying to use them would result in intermittently glitchy audio).
However, my co-workers and I are greedy, and nevertheless we'd like to do many more math-operations per sample-buffer than even the fastest single core could reliably execute in the brief time-window that is necessary to avoid audio-underruns and glitching.
My co-worker (who is fairly experienced at real-time audio processing on embedded/purpose-built Linux hardware) tells me that under Linux it is possible for a program to requisition exclusive access for one or more CPU cores, such that the OS will never try to use them for anything else. Once he has done this, he can run "bare metal" style code on that CPU that simply busy-waits/polls on an atomic variable until the "real" audio thread updates it to let the dedicated core know it's time to do its thing; at that point the dedicated core will run its math routines on the input samples and generate its output in a (hopefully) finite amount of time, at which point the "real" audio thread can gather the results (more busy-waiting/polling here) and incorporate them back into the outgoing audio buffer.
My question is, is this approach worth attempting under MacOS/X? (i.e. can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores, and if so, will big ugly busy-waiting/polling loops on those cores (including the polling-loops necessary to synchronize the CoreAudio callback-thread relative to their input/output requirements) yield results that are reliably real-time enough that you might someday want to use them in front of a paying audience?)
It seems like something that might be possible in principle, but before I spend too much time banging my head against whatever walls might exist there, I'd like some input about whether this is an avenue worth pursuing on this platform.
"can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores"
I don't know about that, but you can use as many cores / real-time threads as you want for your calculations, using whatever synchronisation methods you need to make it work, then pass the audio to your IOProc using a lock free ring buffer, like TPCircularBuffer.
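TPCircularBuffer itself is a C library, but to make the idea concrete, here is a rough sketch of the single-producer/single-consumer pattern it implements (written in Java purely for illustration; the class name and float sample type are made up). The worker thread only ever advances the write index and the audio callback only ever advances the read index, so neither side needs a lock and neither side ever blocks:

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative single-producer/single-consumer ring buffer: one worker thread
    // writes processed samples, one real-time audio thread reads them.
    class SpscRingBuffer {
        private final float[] data;
        private final AtomicLong writePos = new AtomicLong();
        private final AtomicLong readPos = new AtomicLong();

        SpscRingBuffer(int capacity) { data = new float[capacity]; }

        // Called only by the worker thread; returns false instead of blocking when full.
        boolean write(float[] src, int count) {
            long w = writePos.get(), r = readPos.get();
            if (w - r + count > data.length) return false;             // not enough free space
            for (int i = 0; i < count; i++) data[(int) ((w + i) % data.length)] = src[i];
            writePos.lazySet(w + count);                                // publish after copying
            return true;
        }

        // Called only by the audio thread; returns false instead of blocking when empty.
        boolean read(float[] dst, int count) {
            long w = writePos.get(), r = readPos.get();
            if (w - r < count) return false;                            // not enough data yet
            for (int i = 0; i < count; i++) dst[i] = data[(int) ((r + i) % data.length)];
            readPos.lazySet(r + count);                                 // free the space after copying
            return true;
        }
    }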
But your question reminded me of a new macOS 11/iOS 14 API I've been meaning to try, the Audio Workgroups API (2020 WWDC Video).
My understanding is that this API lets you "bless" your non-IOProc real-time threads with audio real-time thread properties, or at least lets them cooperate better with the audio thread.
The documents distinguish between threads working in parallel (this sounds like your case) and working asynchronously (this sounds like my proposal); I don't know which case is better for you.
I still don't know what happens in practice when you use Audio Workgroups, whether they opt you in to good stuff or opt you out of bad stuff, but if they're not the hammer you're seeking, they may have some useful hammer-like properties.

Which microcontroller for fast high quality audio switching and playback

I'm building a device which will play high quality sound samples and will switch between samples within 5ms when a signal is applied.
I'm after a microcontroller which can allow this - I need 4 I/O pins for triggering the transitions between sounds, as well as the output pin(s) for the audio. The duration of the audio files will be 50ms or so, but ideally there would be enough storage to allow the files to be 1 second or longer. It will loop the current file until told to change. I don't want audible pops or suchlike when switching files or running other commands, but there shouldn't be a need for anything complex to run beside it; it's purely audio playing and switching.
I've looked at various microcontrollers in the Arduino family but they don't seem optimal for this purpose (I tried, for example, the Mozzi library for Arduino, but it's not fantastic quality). Ideally I could do it all on the chip (whatever it is, it doesn't need to be Arduino) without needing external storage or RAM modules, but if that's necessary I'll do it. The device has to fit in a 2cm wide cylinder (with no length constraint), so ideally everything stays within that - no SD card modules or the like. Language-wise, I'm fairly new to them all, but I can learn whatever would be best.
Audio: 44.1kHz CD-quality WAV, although I could obviously switch to a different format if necessary. If playing sound at such high quality is totally impossible, then the quality could be lower.
Thank you for your help
For a simple application like this you would be best to just use a small ARM Cortex M device hooked up to an external SPI FLASH chip. Most microcontrollers scale processing power and RAM with FLASH storage so keeping it all on one chip will result in a grotesquely over-powered solution. Serial FLASH memory is very cheap, easy to use, and you can change the size in the future if you need to add more samples.
For the audio side, if you really want CD quality you'll have to look at getting an external audio DAC, as I don't know of any microcontrollers that integrate a CD-quality codec. External DACs aren't expensive or complex to use, but they add to the physical size and BOM cost. Many Cortex chips have built-in 12-bit DACs though, so if the audio has a reasonably small dynamic range you might find this is suitable for your needs.
In terms of minimising pops and clicks the Cortex devices will have enough power for some basic filtering to deal with this. I would recommend against Arduino though as you will quickly come up against processing power limitations and I doubt you will want to dive into assembler optimisations.

Upgrading audio code to new WASAPI standard

We have an application that uses waveXXX() and mixerXXX() functions to handle the audio I/O to and from some instruments (think: oscilloscope or electronics rather than musical instruments, not that it much matters). It's finally time to stop deploying it on Windows XP, and move it to Windows 7 and/or 8.
From reading a variety of material on WASAPI, it sounds like the bulk of the application (based on waveXXX() functions) might actually work fine, but the mixerXXX() stuff used to set master output volume, line-in volume, and mute the microphone will definitely have to change and use IAudioEndpointVolume calls instead.
Is it possible to change only the mixerXXX() calls? Is it desirable?
Logically, this application requires exclusive use of its audio endpoints (speaker out, line in). If I want to ensure exclusive access through software, would that force me to rewrite all the waveXXX() code too? (The alternative is to warn users that other audio applications may interfere with this one).
My recommendation:
If you need exclusive access, convert everything to WASAPI
If you are using line-in, convert everything to WASAPI
If you have time, convert everything to WASAPI
If you are strictly only using speaker and microphone in shared mode, replace mixerXXX() with the ISimpleAudioVolume interface (and several other interfaces to get to it), then test whether existing waveXXX() code behaves as you need it to. Then test each time hardware, OS or audio drivers change. Better still, just convert to WASAPI.
In my case, exclusive speaker output is critical - this drives the instrument that generates a related input signal. I guess I don't mind if another application wants to share access to that incoming signal, but logically it is a system that wants an exclusive contract with its audio endpoints.
That exclusivity requires that I obtain an IMMDevice instance for both speaker output and line-in input, Activate() the IAudioClient interface on them and Initialize() both using AUDCLNT_SHAREMODE_EXCLUSIVE (see also this answer).
But have I actually selected line-in by such a process? Probably not. All I can be sure of is annoying any other applications that were previously sharing my endpoints by cutting them off.
Having done this much, it's really not clear what will happen to waveInXXX() calls - maybe they'll take from line-in, maybe from the microphone - maybe it depends on how the hardware vendor implements their end of the deal. It's also never been clear to me whether line-in and microphone are always multiplexed (i.e. selectable), always mixed (i.e. you can only simulate selection by muting the other one), or whether there is no standard one can rely on.
Because of factors like that, it's a gamble not to use WASAPI throughout.

How do media timeline apps work under the hood?

There are numerous examples of apps that are time-based and send out events or carry out complex processing with very fine resolution and accuracy. Think MIDI sequencing apps, or audio and video editing apps.
So I'm curious, at their core level, how do these apps do what they do so accurately from a programming point of view?
MIDI and media playback are entirely different in nature, and are handled in different ways.
For MIDI, there is very little data to process. A thread with a high priority is created to handle MIDI I/O. That's all that is needed.
For audio, accuracy isn't a problem but latency is. There is a buffer on the sound interface that is regularly written to by the software playing back audio. For a typical media player, this buffer has storage for about 300ms of audio. The software just writes the PCM-encoded audio waveform to the buffer. The sound interface is constantly reading from this buffer and playing back at a constant rate.
For low-latency audio applications, this buffer size can be very small, handling as little as 5 or 10ms of audio. The software generating the audio data must again be handled by a thread with high priority, and often has many optimizations that keep it running in the event the rest of the software (effects and what not) cannot keep up. Buffer underruns are common. Special drivers are often used to skip unneeded software in the signal chain. ASIO and DirectX are common on Windows. Windows Vista/7 and OSX both call their audio APIs "core audio", and provide low-latency features without special drivers.
Video is an entirely different beast. Decoding the video is handled by hardware, where possible. This is how a slow device such as a cell phone is able to play back 720p video. If the hardware can handle the codec, the software just needs to send it the data. In cases where the codec is unsupported, the video must be decoded in software, which is much slower. Even on modern PCs, software decoding often leads to choppy or laggy video.
Synchronization of audio to video is also a problem. I don't know much about it, but it is my understanding that the audio is the master clock, and video is synchronized to it. You cannot simply start playback and expect the timing to work out, as different sound interfaces will have different ideas as to what 44.1kHz (or any other sample rate) is. You can prove this yourself by playing back the same audio on two different devices simultaneously, and listening to them drift apart over time.

Making a real-time audio application with software synthesizers

I'm looking into making some software that makes the keyboard function like a piano (e.g., the user presses the 'W' key and the speakers play a D note). I'll probably be using OpenAL. I understand the basics of digital audio, but playing real-time audio in response to key presses poses some problems I'm having trouble solving.
Here is the problem: Let's say I have 10 audio buffers, and each buffer holds one second of audio data. If I have to fill buffers before they are played through the speakers, then I would be filling buffers one or two seconds before they are played. That means that whenever the user tries to play a note, there will be a one or two second delay between pressing the key and the note being played.
How do you get around this problem? Do you just make the buffers as small as possible, and fill them as late as possible? Is there some trick that I am missing?
Most software synthesizers don't use multiple buffers at all.
They just use one single, small ringbuffer that is constantly played.
A high-priority thread checks the current play position as often as possible and fills the free part of the ring buffer (i.e. the part that has already been played since the last time your thread ran) with fresh sound data.
This gives you a constant latency that is bounded only by the size of your ring buffer and the output latency of your sound card (usually not that much).
You can lower your latency even further:
In case of a new note to be played (e.g. the user has just pressed a key) you check the current play position within the ring-buffer, add some samples for safety, and then re-render the sound data with the new sound-settings applied.
This becomes tricky if you have time-based effects running (delay lines, reverb and so on), but it's doable. Just keep track of the last 10 states of your time based effects every millisecond or so. That'll make it possible to get back 10 milliseconds in time.
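As a rough illustration of that ring-buffer scheme, here is a sketch using a JavaSound SourceDataLine opened with a small buffer as the constantly-played buffer (the question mentions OpenAL, so treat this as a translation of the idea rather than of any particular API; the buffer size, block size, and sine-wave rendering are placeholder assumptions). The high-priority render thread keeps topping the buffer up in small blocks, and because write() blocks while the buffer is full, worst-case latency stays bounded by the buffer size:

    import javax.sound.sampled.*;

    public class SynthLoop {
        public static void main(String[] args) throws Exception {
            AudioFormat format = new AudioFormat(44100f, 16, 1, true, false); // 16-bit mono PCM
            SourceDataLine line = AudioSystem.getSourceDataLine(format);
            line.open(format, 2048);               // ~23 ms of audio: the constantly-played buffer
            line.start();

            Thread render = new Thread(() -> {
                double phase = 0, freq = 293.66;   // D note, as in the keyboard-piano example
                byte[] block = new byte[256];      // ~3 ms per block
                while (true) {
                    // Render the next small block; changing freq here (e.g. on a key press)
                    // becomes audible within at most one buffer length (~23 ms).
                    for (int i = 0; i < block.length; i += 2) {
                        short s = (short) (Math.sin(phase) * 8000);
                        block[i] = (byte) s;             // little-endian low byte
                        block[i + 1] = (byte) (s >> 8);  // high byte
                        phase += 2 * Math.PI * freq / 44100;
                        if (phase > 2 * Math.PI) phase -= 2 * Math.PI;
                    }
                    line.write(block, 0, block.length);  // blocks while the buffer is full
                }
            });
            render.setPriority(Thread.MAX_PRIORITY);     // high-priority render thread
            render.start();
            render.join();
        }
    }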
With the WinAPI, you can only get so far in terms of latency. Usually you can't get below 40-50ms, which is quite nasty. The solution is to implement ASIO support in your app and make the user run something like Asio4All in the background. This brings the latency down to 5ms, but at a cost: other apps can't play sound at the same time.
I know this because I'm a FL Studio user.
The solution is small buffers, filled frequently by a real-time thread. How small you make the buffers (or how full you let the buffer become with a ring buffer) is constrained by the scheduling latency of your operating system. You'll probably find 10ms to be acceptable.
There are some nasty gotchas in here for the uninitiated - particularly with regard to software architecture and thread safety.
You could try having a look at Juce - which is a cross-platform framework for writing audio software, and in particular - audio plugins such as SoftSynths and effects. It includes software for both sample plug-ins and hosts. It is in the host that issues with threading are mostly dealt with.
