Upgrading audio code to new WASAPI standard


We have an application that uses waveXXX() and mixerXXX() functions to handle the audio I/O to and from some instruments (think: oscilloscope or electronics rather than musical instruments, not that it much matters). It's finally time to stop deploying it on Windows XP, and move it to Windows 7 and/or 8.
From reading a variety of material on WASAPI, it sounds like the bulk of the application (based on waveXXX() functions) might actually work fine, but the mixerXXX() stuff used to set master output volume, line in volume, and mute the microphone will definitely have to change, and use IAudioEndpointVolume calls instead.
Is it possible to change only the mixerXXX() calls? Is it desirable?
Logically, this application requires exclusive use of its audio endpoints (speaker out, line in). If I want to ensure exclusive access through software, would that force me to rewrite all the waveXXX() code too? (The alternative is to warn users that other audio applications may interfere with this one).

My recommendation:
If you need exclusive access, convert everything to WASAPI
If you are using line-in, convert everything to WASAPI
If you have time, convert everything to WASAPI
If you are strictly only using speaker and microphone in shared mode, replace mixerXXX() with the ISimpleAudioVolume interface (and several other interfaces to get to it), then test whether existing waveXXX() code behaves as you need it to. Then test each time hardware, OS or audio drivers change. Better still, just convert to WASAPI.
In my case, exclusive speaker output is critical - this drives the instrument that generates a related input signal. I guess I don't mind if another application wants to share access to that incoming signal, but logically it is a system that wants an exclusive contract with its audio endpoints.
That exclusivity requires that I obtain an IMMDevice instance for both speaker output and line-in input, Activate() the IAudioClient interface on them and Initialize() both using AUDCLNT_SHAREMODE_EXCLUSIVE (see also this answer).
But have I actually selected line-in by such a process? Probably not. All I can be sure of is annoying any other applications who were previously sharing my endpoints by cutting them off.
Having done this much, it's really not clear what will happen to waveInXXX() calls - maybe they'll take from line-in, maybe from microphone - maybe it depends on how the hardware vendor implements their end of the deal. It's also never been clear to me whether line-in and microphone are always multiplexed (i.e. selectable), always mixed (i.e. you can only simulate selection by muting the other one) or there is no standard one can rely on.
Because of factors like that, it's a gamble not to use WASAPI throughout.

Related

Is it practical to use the "rude big hammer" approach to parallelize a MacOS/CoreAudio real-time audio callback?

First, some relevant background info: I've got a CoreAudio-based low-latency audio processing application that does various mixing and special effects on audio that is coming from an input device on a purpose-dedicated Mac (running the latest version of MacOS) and delivers the results back to one of the Mac's local audio devices.
In order to obtain the best/most reliable low-latency performance, this app is designed to hook in to CoreAudio's low-level audio-rendering callback (via AudioDeviceCreateIOProcID(), AudioDeviceStart(), etc) and every time the callback-function is called (from the CoreAudio's realtime context), it reads the incoming audio frames (e.g. 128 frames, 64 samples per frame), does the necessary math, and writes out the outgoing samples.
This all works quite well, but from everything I've read, Apple's CoreAudio implementation has an unwritten de-facto requirement that all real-time audio operations happen in a single thread. There are good reasons for this which I acknowledge (mainly that outside of SIMD/SSE/AVX instructions, which I already use, almost all of the mechanisms you might employ to co-ordinate parallelized behavior are not real-time-safe and therefore trying to use them would result in intermittently glitchy audio).
However, my co-workers and I are greedy, and nevertheless we'd like to do many more math-operations per sample-buffer than even the fastest single core could reliably execute in the brief time-window that is necessary to avoid audio-underruns and glitching.
My co-worker (who is fairly experienced at real-time audio processing on embedded/purpose-built Linux hardware) tells me that under Linux it is possible for a program to requisition exclusive access for one or more CPU cores, such that the OS will never try to use them for anything else. Once he has done this, he can run "bare metal" style code on that CPU that simply busy-waits/polls on an atomic variable until the "real" audio thread updates it to let the dedicated core know it's time to do its thing; at that point the dedicated core will run its math routines on the input samples and generate its output in a (hopefully) finite amount of time, at which point the "real" audio thread can gather the results (more busy-waiting/polling here) and incorporate them back into the outgoing audio buffer.
My question is, is this approach worth attempting under MacOS/X? (i.e. can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores, and if so, will big ugly busy-waiting/polling loops on those cores (including the polling-loops necessary to synchronize the CoreAudio callback-thread relative to their input/output requirements) yield results that are reliably real-time enough that you might someday want to use them in front of a paying audience?)
It seems like something that might be possible in principle, but before I spend too much time banging my head against whatever walls might exist there, I'd like some input about whether this is an avenue worth pursuing on this platform.

can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores
I don't know about that, but you can use as many cores / real-time threads as you want for your calculations, using whatever synchronisation methods you need to make it work, then pass the audio to your IOProc using a lock free ring buffer, like TPCircularBuffer.
But your question reminded me of a new macOS 11/iOS 14 API I've been meaning to try, the Audio Workgroups API (2020 WWDC Video).
My understanding is that this API lets you "bless" your non-IOProc real-time threads with audio real-time thread properties or at least cooperate better with the audio thread.
The documents distinguish between the threads working in parallel (this sounds like your case) and working asynchronously (this sounds like my proposal), I don't know which case is better for you.
I still don't know what happens in practice when you use Audio Workgroups, whether they opt you in to good stuff or opt you out of bad stuff, but if they're not the hammer you're seeking, they may have some useful hammer-like properties.

What bluetooth to use (2.1 or 4.0) and how?

The title seems to be too general (I can't think of a good title). I'll try to be specific in the question description.
I was required to make an industrial control box that collects data periodically (maybe 10-20 bytes of data per 5 seconds). The operator will use a laptop or mobile phone to collect the data (without opening the box) via Bluetooth, weekly or monthly or probably at even longer period.
I will be in charge of selecting proper modules/chips, doing the PCB, and doing the embedded software too. Because the box is not produced in high volume, I have freedom (modules/chips to use, prices, capabilities, etc.) in designing different components.
The whole application requires a USART port to read in data when available (maybe every 5-10 seconds), an SPI port for data storage (SD Card reader/writer), several GPIO pins for LED indicators or maybe buttons (whether we need buttons and how many is up to my design).
I am new to Bluetooth. I read through the wiki and some pages found via Google, so I now know about pairing, the class 1 and class 2 differences, and the 2.1 and 4.0 differences.
But several points are still unclear to me when deciding what Bluetooth module/chip to use.
A friend of mine mentioned TI CC2540 to me. I checked and it only supports 4.0 BLE mode. And from Google, BT4.0 has payload of at most 20 bytes. Is BT4.0 suitable for my application when bulk data will need to be collected every month or several months? Or it's better to use BT2.1 with EDR for this application? BT4.0 BLE mode seems to have faster pairing speed but lower throughput?
I read through CC2540, and found that it is not a BT only chip, it has several GPIO pins and uart pins (I am not sure about SPI). Can I say that CC2540 itself is powerful enough to hold the whole application? Including bluetooth, data receiving via UART, and SD card reading/writing?
My original design was to use an ARM cortex-M/AVR32 MCU. The program is just a loop to serve each task/events in rounds (or I can even install Linux). There will be a Bluetooth module. The module will automatically take care of pairing. I will only need to send the module what data to send to the other end. Of course there might be some other controlling, such as to turn the module to low power mode because the Bluetooth will only be used once per month or something like that. But after some study of Bluetooth, I am not sure whether such BT module exists or not. Is programming chips like CC2540 a must?
In my understanding, my designed device will be a BT slave, the laptop/phone will be the master. My device will periodically probe (maybe with longer period to save power) the existence of master and pair with it. Once it's paired, it will start sending data. Is my understanding on the procedure correct? Is there any difference in pairing/data sending for 2.1 and 4.0?
How should authentication be designed? I of course want unlimited phones/laptops to pair with the device, but only if they can prove they are the operator.
It's a bit messy. I would appreciate it if you have read through the above questions. The following is the summary:
2.1 or 4.0 to use?
Which one is the better choice? Meaning that suitable for the application, and easy to develop.
ARM/avr32 + CC2540 (or the like)
CC2540 only or the like (if possible)
ARM/avr32 + some BT modules ( such as Bluegiga https://www.bluegiga.com/en-US/products/ )
Should Linux be used?
How the pairing and data sending should be like for power saving? Are buttons useful to facilitate the sleep mode and active pairing and data sending mode for power saving?
How the authentication should be done? Only operators are allowed but he can use any laptops/phones.
Thanks for reading.

My two cents (I've done a fair amount of Bluetooth work, and I've designed consumer products currently in the field)... Please note that I'll be glossing over A LOT of design decisions to keep this 'concise'.
2.1 or 4.0 to use?
Looking over your estimates, it sounds like you're looking at about 2MB of data per week, maybe 8MB per month. The question of what technology to use here comes down to how long people are willing to wait to collect data.
For BLE (BT 4.0) - assume your data transfer is in the 2-4kB/s range. For 2.1, assume it's in the 15-30kB/s range. This depends on a ton of factors, but has generally been my experience between Android/iOS and BLE devices.
At 2MB, either one of those is going to take a long time to transfer. Is it possible to get data more frequently? Or maybe have a wifi-connected base station somewhere that collects data frequently? (I'm not sure of the application)
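As a rough sanity check of those numbers, here is a small sketch. It assumes the worst case from the question (20 bytes logged every 5 seconds) and treats 1 kB as 1000 bytes; the function names are made up for illustration and the rates are just the rough ranges quoted above, not measurements.

```python
# Back-of-the-envelope estimate only; rates are the rough ranges quoted above.
SECONDS_PER_WEEK = 7 * 24 * 3600


def bytes_logged(bytes_per_sample, sample_period_s, duration_s):
    """Total bytes stored when one sample is logged every sample_period_s."""
    return bytes_per_sample * duration_s // sample_period_s


def transfer_minutes(total_bytes, rate_bytes_per_s):
    """Minutes needed to move total_bytes at a sustained rate."""
    return total_bytes / rate_bytes_per_s / 60


# Assumed worst case: 20 bytes every 5 seconds for one week.
weekly = bytes_logged(20, 5, SECONDS_PER_WEEK)
print(f"weekly log: {weekly / 1e6:.2f} MB")

for name, rate in [("BLE at 2 kB/s", 2_000), ("BLE at 4 kB/s", 4_000),
                   ("BT 2.1 at 15 kB/s", 15_000), ("BT 2.1 at 30 kB/s", 30_000)]:
    print(f"{name}: {transfer_minutes(weekly, rate):.1f} min")
```

Under these assumptions a week of data is about 2.4 MB, which is roughly 10 to 20 minutes over BLE versus about 1.5 to 3 minutes over BT 2.1.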
Which one is the better choice? Meaning that suitable for the application, and easy to develop.
ARM/avr32 + CC2540 (or the like)
CC2540 only or the like (if possible)
ARM/avr32 + some BT modules ( such as Bluegiga https://www.bluegiga.com/en-US/products/ )
Should Linux be used?
This is a tricky question to answer without knowing more about the complexity of the system, the sensors, the data storage capacity, BOM costs, etc...
In my experience, a Linux OS is HUGE overkill for a simple GPIO/UART/I2C based system (unless you're super comfortable with Linux). Chips that can run Linux, plus the additional RAM they need, are usually costly (e.g. a cheap ARM Cortex M0 is like 50 cents in decent volume and sounds like all you need to use).
The question usually comes down to 'external MCU or not' - as in, do you try to get an all-in-one BT module, which has application space to program on. There is a size and cost saving in using one, but it adds risks and unknowns vs a braindead BT module + external MCU.
One thing I would like to clarify is that you mention the TI CC2540 a few times (actually, CC2541 is the newer version). Anyways, that chip is an IC level component, so using it directly means doing the antenna design yourself and putting it through FCC intentional radiator certification (the certs are between 1k-10k usually - and I'm assuming you're in the US when I say FCC).
What I think you're looking for is a ready-made, certified module. For example, the BLE113 from Bluegiga is an FCC certified module which contains the CC2541 internally (plus some bells and whistles). That one specifically also has an interpreted language called BGScript to speed up development. It's great for very simple applications - and has nice, baked in low-power modes.
So, the BLE113 is an example of a module that can be used with/without an external MCU depending on application complexity.
If you're willing to go through FCC intentional radiator certification, then the TI CC2541 is common, as well as the Nordic NRF51822 (this chip has a built in ARM core that you can program on as well, so you don't need an external MCU).
An example of a BLE module which requires an external MCU would be the Bobcats (AMS001) from AckMe. They have a BLE stack running on the chip which communicates to an external MCU using UART.
As with a comment above, if you need iOS compatibility, using Bluetooth 2.1 (BT Classic) is a huge pain due to the MFI program (which I have gone through - pure misery). So, unless iOS is REALLY necessary, I would only use BT Classic with Android/PC.
Some sample BT classic chips might be the Roving Networks RN42, the AmpedRF BT 33 (or 43 or 53). If you're interested, I did a throughput test on iOS devices with a Bluetooth classic device (https://stackoverflow.com/a/22441949/992509)
How the pairing and data sending should be like for power saving? Are buttons useful to facilitate the sleep mode and active pairing and data sending mode for power saving?
If the radio is only on every week or month when the operator is pulling data down, there's very little to do other than to put the BT module into reset mode to ensure there is no power used. BT Classic tends to use more power during transfer, but if you're streaming data constantly, the differences can be minimal if you choose the right modules (e.g. lower throughput on BLE for a longer period of time, vs higher throughput on BT 2.1 for less time - it works itself out in the wash).
The one thing I would do here is to have a button trigger the ability to pair to the BT modules, because that way they're not always on and advertising, and are just asleep or in reset.
How the authentication should be done? Only operators are allowed but he can use any laptops/phones.
Again, not knowing the environment, this can be tricky. If it's in a secure environment, that should be enough (e.g. behind locked doors that you need to be inside to press a button to wake up the BT module).
Otherwise, at a bare minimum, have the standard BT pairing code enabled, rather than allowing anyone to pair. Additionally, you could bake extra authentication and security in (e.g. you can't download data without a passcode or something).
If you wanted to get really crazy, you could encrypt the data (this might involve using Linux though) and make sure only trusted people had decryption keys.

We use both protocols on different products.
Bluetooth 4.0 BLE aka Smart
low battery consumption
low data rates (I got up to 20 bytes every 40 ms. As I remember, Apple's minimum interval is 18 ms and other handset makers adopted that interval)
you have to use Bluetooth's characteristics mechanism
you have to implement chaining if your data packages are longer
great distances 20-100m
new technology with a lot of awful premature implementations. Getting better slowly.
we used a chip from Bluegiga that allowed a script language for programming. But still many limitations and bugs are built in.
we had a greater learning curve to implement BLE than using 2.1
Bluetooth 2.1
good for high data rates, depending on the baud rate used. The bottleneck here was the buffer in the controller.
weak distances 2-10 m
It's much easier to stream data
Did not notice a big time difference in pairing and connecting with both technologies.
Here are two examples of devices which clearly demand either 2.1 or BLE. Maybe your use case is closer to one of those examples:
Humidity sensors attached to trees in a forest. Each week the ranger walks through the forest and collects the data.
Wireless stereo headsets
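The "chaining" mentioned in the BLE list can be sketched roughly as below. This is only an illustration, assuming a 20-byte limit per characteristic write and a made-up 2-byte header (packet index, packet count); it is not the format of any particular BLE stack, and the function names are hypothetical.

```python
import struct

MAX_PAYLOAD = 20                 # bytes per BLE packet, as quoted above
HEADER = struct.Struct("<BB")    # hypothetical header: (packet index, packet count)
CHUNK = MAX_PAYLOAD - HEADER.size


def split_message(data):
    """Split data into packets of at most MAX_PAYLOAD bytes, each with a header."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    if len(chunks) > 255:
        raise ValueError("message too long for a one-byte packet count")
    return [HEADER.pack(i, len(chunks)) + c for i, c in enumerate(chunks)]


def join_message(packets):
    """Reassemble packets (in any order) into the original message."""
    parts = {}
    total = None
    for packet in packets:
        index, total = HEADER.unpack_from(packet)
        parts[index] = packet[HEADER.size:]
    if total is None or len(parts) != total:
        raise ValueError("missing packets")
    return b"".join(parts[i] for i in range(total))


message = bytes(range(100))
packets = split_message(message)
print(len(packets), "packets,", len(packets) * 40, "ms at one packet per 40 ms")
```

With 18 bytes of data per packet, the 40 ms interval mentioned above works out to well under 500 bytes per second of useful payload, which is why BLE suits small, infrequent readings better than bulk transfers.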

Making a real-time audio application with software synthesizers

I'm looking into making some software that makes the keyboard function like a piano (e.g., the user presses the 'W' key and the speakers play a D note). I'll probably be using OpenAL. I understand the basics of digital audio, but playing real-time audio in response to key presses poses some problems I'm having trouble solving.
Here is the problem: Let's say I have 10 audio buffers, and each buffer holds one second of audio data. If I have to fill buffers before they are played through the speakers, then I would be filling buffers one or two seconds before they are played. That means that whenever the user tries to play a note, there will be a one or two second delay between pressing the key and the note being played.
How do you get around this problem? Do you just make the buffers as small as possible, and fill them as late as possible? Is there some trick that I am missing?

Most software synthesizers don't use multiple buffers at all.
They just use one single, small ringbuffer that is constantly played.
A high priority thread will as often as possible check the current play-position and fill the free part (e.g. the part that has been played since the last time your thread was running) of the ringbuffer with sound data.
This will give you a constant latency that is only bound by the size of your ring-buffer and the output latency of your soundcard (usually not that much).
You can lower your latency even further:
In case of a new note to be played (e.g. the user has just pressed a key) you check the current play position within the ring-buffer, add some samples for safety, and then re-render the sound data with the new sound-settings applied.
This becomes tricky if you have time-based effects running (delay lines, reverb and so on), but it's doable. Just keep track of the last 10 states of your time based effects every millisecond or so. That'll make it possible to get back 10 milliseconds in time.
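The basic fill loop described above can be simulated without any audio hardware. The sketch below is an assumption-laden toy: the class name, the 48 kHz rate and the sine voice are all made up, and the play position is passed in by hand where a real program would query it from the sound API.

```python
import math


class RingBufferSynth:
    """Toy model of 'refill whatever the device has already played'."""

    def __init__(self, size_frames, sample_rate=48_000, freq=440.0):
        self.buf = [0.0] * size_frames   # starts out as silence
        self.write_pos = 0               # next ring index to render into
        self.rendered = 0                # total frames rendered (keeps phase continuous)
        self.sample_rate = sample_rate
        self.freq = freq

    def fill(self, play_pos):
        """Render into the region between write_pos and play_pos.

        play_pos is the device's current read index in the ring. When it
        equals write_pos the buffer is treated as full. Returns frames written.
        """
        size = len(self.buf)
        free = (play_pos - self.write_pos) % size
        for _ in range(free):
            t = self.rendered / self.sample_rate
            self.buf[self.write_pos] = math.sin(2 * math.pi * self.freq * t)
            self.write_pos = (self.write_pos + 1) % size
            self.rendered += 1
        return free


synth = RingBufferSynth(size_frames=480)   # 480 frames at 48 kHz is 10 ms
first = synth.fill(play_pos=100)           # device has played 100 frames
second = synth.fill(play_pos=50)           # device wrapped around the ring
print(first, second, synth.write_pos)
```

The worst-case latency added by the ring is its length divided by the sample rate, so the 480-frame buffer here is bounded at 10 ms on top of whatever the sound card adds.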

With the WinAPI, you can only get so far in terms of latency. Usually you can't get below 40-50ms which is quite nasty. The solution is to implement ASIO support in your app, and make the user run something like Asio4All in the background. This brings the latency down to 5ms but at a cost: other apps can't play sound at the same time.
I know this because I'm a FL Studio user.

The solution is small buffers, filled frequently by a real-time thread. How small you make the buffers (or how full you let the buffer become with a ring-buffer) is constrained by scheduling latency of your operating system. You'll probably find 10ms to be acceptable.
There are some nasty gotchas in here for the uninitiated - particularly with regards to software architecture and thread-safety.
You could try having a look at Juce - which is a cross-platform framework for writing audio software, and in particular - audio plugins such as SoftSynths and effects. It includes software for both sample plug-ins and hosts. It is in the host that issues with threading are mostly dealt with.

Is forcing I2C communication safe?

For a project I'm working on I have to talk to a multi-function chip via I2C. I can do this from linux user-space via the I2C /dev/i2c-1 interface.
However, it seems that a driver is talking to the same chip at the same time. This causes my I2C_SLAVE accesses to fail with an errno value of EBUSY. Well - I can override this via the ioctl I2C_SLAVE_FORCE. I tried it, and it works. My commands reach the chip.
Question: Is it safe to do this? I know for sure that the address-ranges that I write are never accessed by any kernel-driver. However, I am not sure if forcing I2C communication that way may confuse some internal state-machine or so.(I'm not that into I2C, I just use it...)
For reference, the hardware facts:
OS: Linux
Architecture: TI OMAP3 3530
I2C-Chip: TWL4030 (does power, audio, usb and lots of other things..)
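For readers unfamiliar with the two ioctls, the fallback described above looks roughly like this. It is a sketch only: the request numbers are the ones defined in linux/i2c-dev.h, but the function name and the injectable ioctl parameter (there so the logic can be exercised without hardware) are made up for illustration.

```python
import errno
import fcntl

# ioctl request numbers from <linux/i2c-dev.h>
I2C_SLAVE = 0x0703
I2C_SLAVE_FORCE = 0x0706


def select_slave(fd, address, allow_force=False, ioctl=fcntl.ioctl):
    """Bind an open /dev/i2c-N descriptor to a 7-bit slave address.

    Tries the normal I2C_SLAVE request first. If the kernel answers EBUSY
    (a driver already owns the address) and allow_force is set, retries
    with I2C_SLAVE_FORCE. Returns True if the forced request was used.
    """
    try:
        ioctl(fd, I2C_SLAVE, address)
        return False
    except OSError as exc:
        if exc.errno != errno.EBUSY or not allow_force:
            raise
    ioctl(fd, I2C_SLAVE_FORCE, address)
    return True
```

Typical use would be to open the bus with os.open("/dev/i2c-1", os.O_RDWR) and pass the descriptor plus the chip's address. Making the forced path opt-in keeps the override visible in the calling code instead of silently bypassing the kernel driver.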

I don't know that particular chip, but often you have commands that require a sequence of writes, first to one address to set a certain mode, then you read or write another address -- where the function of the second address changes based on what you wrote to the first one. So if the driver is in the middle of one of those operations, and you interrupt it (or vice versa), you have a race condition that will be difficult to debug. For a reliable solution, you better communicate through the chip's driver...

I mostly agree with @Wim. But I would like to add that this can definitely cause irreversible problems, or destruction, depending on the device.
I know of a Gyroscope (L3GD20) that requires that you don't write to certain locations. The way that the chip is setup, these locations contain manufacturer's settings which determine how the device functions and performs.
This may seem like an easy problem to avoid, but if you think about how I2C works, all of the bytes are passed one bit at a time. If you interrupt in the middle of the transmission of another byte, results can not only be truly unpredictable, but they can also increase the risk of permanent damage exponentially. This is, of course, entirely up to the chip on how to handle the problem.
Since microcontrollers tend to operate at speeds much faster than the bus speeds allowed on I2C, and since the bus speeds themselves are dynamic based on the speeds at which devices process the information, the best bet is to insert pauses or loops between transmissions that wait for things to finish. If you have to, you can even insert a timeout. If these pauses aren't working, then something is wrong with the implementation.

Fast Audio Input/Output

Here's what I want to do:
I want to allow the user to give my program some sound data (through a mic input), then hold it for 250ms, then output it back out through the speakers.
I have done this already using Java Sound API. The problem is that it's sorta slow. It takes a minimum of about 1-2 seconds from the time the sound is made to the time the sound is heard again from the speakers, and I haven't even tried to implement delay logic yet. Theoretically there should be no delay, but there is. I understand that you have to wait for the sound card to fill up its buffer or whatever, and the sample size and sampling rate have something to do with this.
My question is this: Should I continue down the Java path trying to do this? I want to get the delay down to like 100ms if possible. Does anyone have experience using the ASIO driver with Java? Supposedly it's faster..
Also, I'm a .NET guy. Does this make sense to do with .NET instead? What about C++? I'm looking for the right technology to use here, and maybe a good example of how to read/write to audio input/output streams using your suggested technology platform. Thanks for your help!

I've used JavaSound in the past and found it wonderfully flaky (and it keeps changing between VM releases). If you like C#, use it, just use the DirectX APIs. Here's an example of doing kind of what you want to do using DirectSound and C#. You could use the Effects plugins to perform your 250 ms echo.
http://blogs.microsoft.co.il/blogs/tamir/archive/2008/12/25/capturing-and-streaming-sound-by-using-directsound-with-c.aspx

You may want to look into JACK, an audio API designed for low-latency sound processing. Additionally, Google turns up this nifty presentation [PDF] about using JACK with Java.
Theoretically there should be no delay, but there is.
Well, it's impossible to have zero delay. The best you can hope for is an unnoticeable delay (in terms of human perception). It might help if you describe your basic algorithm for reading & writing the sound data, so people can identify possible problems.
A potential issue with using a garbage-collected language like Java is that the GC will periodically run, interrupting your processing for some arbitrary amount of time. However, I'd be surprised if it's >100ms in normal usage. If GC is a problem, most JVMs provide alternate collection algorithms you can try.

If you choose to go down the C/C++ path, I highly recommend using PortAudio ( http://portaudio.com/ ). It works with almost everything on multiple platforms and it gives you low-level control of the sound drivers without actually having to deal with the various sound driver technology that is around.
I've used PortAudio on multiple projects, and it is a real joy to use. And the license is permissive.

If low latency is your goal, you can't beat C.
libsoundio is a low-level C library for real-time audio input and output. It even comes with an example program that does exactly what you want - piping the microphone input to the speakers output.

It's possible with JavaSound to get end-to-end latency in the ballpark of 100-150ms.
The primary cause of latency is the buffer sizes of the capture and playback lines. The bufferSize is set when opening the lines:
capture: TargetDataLine#open(AudioFormat format, int bufferSize)
playback: SourceDataLine#open(AudioFormat format, int bufferSize)
If the buffer is too big it will cause excess latency, but if it's too small it will cause stuttery playback. So you need to find a balance for your application's needs and your computing power.
The default buffer size can be checked with DataLine#getBufferSize when calling #open(AudioFormat format). The default size will vary based on the AudioFormat and seems to be geared for high latency, stutter free playback applications (e.g. internet streaming). If you're developing a low latency application, the default buffer size is much too large and should be changed.
In my testing with a 16-bit PCM AudioFormat, a buffer size of 1024 bytes has been pretty close to ideal for low latency.
The second and often overlooked cause of audio latency is any other activity being done in the capture or playback threads. For example, logging messages to the console can introduce tens of ms of latency. Turn it off.
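To see why the buffer size matters so much, the time a buffer represents can be worked out from the audio format. This is a small illustrative calculation, not part of JavaSound; the 44.1 kHz sample rate and the channel counts are assumptions, since only "16-bit PCM" is stated above.

```python
def buffer_latency_ms(buffer_bytes, sample_rate_hz, bits_per_sample, channels):
    """Milliseconds of audio held by a line buffer of the given size."""
    bytes_per_frame = (bits_per_sample // 8) * channels
    frames = buffer_bytes / bytes_per_frame
    return 1000.0 * frames / sample_rate_hz


# Assumed format: 16-bit PCM at 44.1 kHz.
mono = buffer_latency_ms(1024, 44_100, 16, channels=1)
stereo = buffer_latency_ms(1024, 44_100, 16, channels=2)
one_second = buffer_latency_ms(88_200, 44_100, 16, channels=1)
print(f"1024 bytes mono:   {mono:.1f} ms")
print(f"1024 bytes stereo: {stereo:.1f} ms")
print(f"88200 bytes mono:  {one_second:.0f} ms")
```

Under these assumptions a 1024-byte buffer holds only about 6 to 12 ms of audio per line. The capture and playback buffers each contribute their own share, and the OS and driver add more on top, which is consistent with measured end-to-end figures being well above the raw buffer time.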
