Realtime MIDI input and synchronisation with audio

I have built a standalone app version of a project that until now was just a VST/audiounit. I am providing audio support via rtaudio.
I would like to add MIDI support using rtmidi but it's not clear to me how to synchronise the audio and MIDI parts.
In VST/audiounit land, I am used to MIDI events that have a timestamp indicating their offset in samples from the start of the audio block.
rtmidi provides a delta time in seconds since the previous event, but I am not sure how I should grab those events and how I can work out their time in relation to the current sample in the audio thread.
How do plugin hosts do this?
I can understand how events can be sample accurate on playback, but it's not clear how they could be sample accurate when using realtime input.
rtaudio gives me a callback function. I will run at a low block size (32 samples). I guess I will pass a pointer to an rtmidi instance as the userdata part of the callback and then call midiin->getMessage( &message ); inside the audio callback, but I am not sure whether this is thread-safe.
Many thanks for any tips you can give me

In your case, you don't need to worry about it. Your program should send the MIDI events to the plugin with a timestamp of zero as soon as they arrive. I think you have perhaps misunderstood the idea behind what it means to be "sample accurate".
As @Brad noted in his comment on your question, MIDI is indeed very slow. But that's only part of the problem: when you are working in a block-based environment, incoming MIDI events cannot be processed by the plugin until the start of a block. When computers were slower and block sizes of 512 (or, god forbid, more than 1024) were common, this introduced a non-trivial amount of latency, which made the arrangement sound less "tight". Sequencers therefore came up with a clever way around this problem. Since the MIDI events are already known ahead of time, they can be sent to the instrument one block early with an offset in sample frames. The plugin then receives these events at the start of the block and knows not to start processing them until N samples have passed. This is what "sample accurate" means in sequencers.
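As a rough illustration of that scheduling step, here is a minimal Python sketch. It assumes the sequencer knows each event's absolute position in sample frames; the function name schedule and the example positions are hypothetical, not part of any host API.

```python
def schedule(event_times, block_size):
    """Map absolute event times (in sample frames) to (block_index, offset) pairs."""
    return [divmod(t, block_size) for t in event_times]

# Three pre-sequenced notes at absolute frame positions, rendered in 512-frame blocks.
placements = schedule([0, 700, 1536], block_size=512)
print(placements)  # [(0, 0), (1, 188), (3, 0)]
```

The second note lands in block 1 with an offset of 188 frames, so the plugin would receive it at the start of that block and wait 188 samples before acting on it.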
However, if you are dealing with live input from a keyboard or some sort of other MIDI device, there is no way to "schedule" these events. In fact, by the time you receive them, the clock is already ticking! Therefore these events should just be sent to the plugin at the start of the very next block with an offset of 0. Sequencers such as Ableton Live, which allow a plugin to simultaneously receive both pre-sequenced and live events, simply send any live events with an offset of 0 frames.
Since you are using a very small block size, the worst-case scenario is a latency of about 0.7 ms (32 samples at a typical 44.1 kHz sample rate), which isn't too bad at all. In the case of rtmidi, the timestamp does not represent an offset which you need to schedule around, but rather the time at which the event was captured. But since you only intend to receive live events (you aren't writing a sequencer, are you?), you can simply pass any incoming MIDI to the plugin right away.
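One possible shape for the hand-off between the MIDI thread and the audio callback is sketched below in Python. It assumes the MIDI library delivers messages on its own thread through a callback; on_midi_message, audio_callback, and the 44.1 kHz sample rate are assumptions made for illustration. A C++ implementation would more likely use a lock-free FIFO than a Python deque.

```python
from collections import deque

SAMPLE_RATE = 44100   # assumed sample rate
BLOCK_SIZE = 32       # block size mentioned in the question

midi_queue = deque()  # filled by the MIDI thread, drained by the audio callback

def on_midi_message(message):
    """Hypothetical MIDI-thread callback: only enqueue, never process here."""
    midi_queue.append(message)

def audio_callback():
    """Hypothetical audio callback: hand every pending event to the plugin at offset 0."""
    events = []
    while midi_queue:
        events.append((0, midi_queue.popleft()))  # (offset_in_frames, message)
    return events

# One block of delay at most: roughly 0.73 ms for 32 frames at 44.1 kHz.
worst_case_latency_ms = 1000.0 * BLOCK_SIZE / SAMPLE_RATE

on_midi_message([0x90, 60, 100])  # note on
on_midi_message([0x80, 60, 0])    # note off
block_events = audio_callback()
```

Keeping the MIDI callback down to a single enqueue means the audio thread never waits on MIDI I/O.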


lv2 plugin development - how to read MIDI time and note simultaneously

I'm using Moony to prototype some components for a MIDI-only LV2 plugin I'm building. I've been trying to work out how to get some sort of song position value from a noteOn event, meaning that I need to know the beat and bar that the note belongs to when it invokes the midiResponder. Even a total time or total frame count would do for the calculation. The way Moony works with timeResponder and midiResponder callbacks means I can know the time position or the note, but not both simultaneously. Looking at the LV2 MIDI spec, it looks like only the event type, note number and velocity are properties of a noteOn event atom, so will I face the same issue when I rewrite this in C++ and integrate the code into my LV2 plugin? Is this right? Is there a workaround?
The spec you looked at there describes the payload of an LV2 MIDI event, which is literally MIDI. The time stamp is available, but it is in the (generic) Event which contains the MIDI. In this way, all events are time stamped (in frames relative to the buffer), regardless of their payload type.
If you were to write this in C++, you would get a single buffer of events which includes time changes, MIDI events, and whatever other events the plugin may support. So, all the information is available, but managing this state so it is available where you want is up to you.
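The state management described above could look roughly like the following Python sketch. It is not the LV2 API: the (frame, kind, payload) tuples, the "beat" and "bpm" keys, and the 48 kHz sample rate are all assumptions standing in for whatever the real event buffer contains.

```python
SAMPLE_RATE = 48000  # assumed sample rate

def notes_with_beats(events, beat=0.0, bpm=120.0):
    """Walk one time-ordered buffer of (frame, kind, payload) events.

    'time' events update the remembered transport state; 'midi' events are
    tagged with the beat position extrapolated from that state.
    """
    last_frame = 0
    tagged = []
    for frame, kind, payload in events:
        # Advance the beat position up to this event's frame offset.
        beat += (frame - last_frame) * bpm / (60.0 * SAMPLE_RATE)
        last_frame = frame
        if kind == "time":
            beat = payload.get("beat", beat)
            bpm = payload.get("bpm", bpm)
        elif kind == "midi":
            tagged.append((payload, beat))
    return tagged

buffer_events = [
    (0, "time", {"beat": 4.0, "bpm": 120.0}),  # transport update at frame 0
    (24000, "midi", (0x90, 60, 100)),          # note on, half a second later
]
tagged_notes = notes_with_beats(buffer_events)
print(tagged_notes)  # [((144, 60, 100), 5.0)]
```

A real plugin would also carry beat and bpm over from one buffer to the next, since a time event is not guaranteed to appear in every buffer.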

When does WASAPI GetNextPacketSize return 0

The sample code for WASAPI capture on MSDN loops until GetNextPacketSize returns 0.
I just want to understand when this will happen:
Will it happen if silence is registered on the microphone? (In that case, will it loop forever if I keep making noise into the microphone?)
Or does it depend on some fundamental audio capture concept that I am missing? (I am quite new to audio APIs :))
The API helps in determining the size of the data buffer to be captured, so that the API client does not need to guess or allocate an oversized buffer. The API returns zero when there is no data to capture yet (not a single frame). This can happen in an ongoing capture session if you call the API too early, and the caller is expected to try again later, since new data can still be generated.
In some conditions a zero return might indicate the end of the stream. Specifically, if you capture from a loopback device and there are no active playback sessions that can generate data for loopback delivery, the capture API might keep delivering no data until a new playback session emerges.
The sample code's loop checks for a zero packet size in conjunction with a Sleep call. The loop expects at least some data to be generated during the sleep, so under normal conditions of continuous audio generation the first call within each pass of the outer loop does not return zero. The inner loop then reads as many non-empty buffers as possible, until a zero indicates that all data that was ready for delivery has been returned to the client.
The outer loop keeps running until the sink signals end of capture through the bDone variable. There is a catch: according to the sample code, the inner loop could keep rolling without ever breaking out to the outer loop, in which case capture is not stopped correctly. The sample assumes that the sink processes data fast enough for the inner loop to consume all currently available data and break out to reach the Sleep call. That is, the WASAPI calls are all non-blocking, and the assumption is that the loops run fast enough that audio data is processed faster than it is captured, so the thread spends most of its time in the Sleep call. Perhaps not the best sample code for beginners. You can make it more reliable by checking bDone in the inner loop as well.
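To make the loop structure concrete, here is a small Python simulation. FakeCaptureClient is a stand-in written for this sketch and is not the WASAPI API; its wait method only models the Sleep call, and the done flag plays the role of bDone.

```python
import time

class FakeCaptureClient:
    """Stand-in for a capture client (NOT the WASAPI API). Packets become
    available in bursts; a packet size of 0 means nothing is ready yet."""
    def __init__(self, bursts):
        self.bursts = list(bursts)  # each burst is a list of packets
        self.ready = []

    def wait(self):
        # Models the Sleep call: the next burst arrives while we sleep.
        time.sleep(0.001)
        if self.bursts:
            self.ready.extend(self.bursts.pop(0))

    def next_packet_size(self):
        return len(self.ready[0]) if self.ready else 0

    def read_packet(self):
        return self.ready.pop(0)

def capture(client, max_packets):
    received = []
    done = False
    while not done:                              # outer loop
        client.wait()
        while client.next_packet_size() != 0:    # inner loop: drain what is ready
            received.append(client.read_packet())
            done = len(received) >= max_packets  # the sink signals end of capture
            if done:
                break                            # checked in the inner loop too
    return received

client = FakeCaptureClient([[b"aa", b"bb"], [], [b"cc", b"dd", b"ee"]])
packets = capture(client, max_packets=4)
```

The empty second burst shows the zero-size case: the inner loop is skipped and the outer loop simply sleeps again. The break stops the capture after the fourth packet even though a fifth is still queued.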

Audio synthesis in Haskell using reactive-banana

I'm trying to get started with reactive-banana and want to create a simple synthesizer. There are lots of GUI examples, but I have trouble applying them to audio. Since audio APIs have callbacks that say "give me n samples of audio" I figure I should fire an event each callback (using the snd part of what newAddHandler returns) that contains the number of samples to generate, a pointer where they should be written, and timing info to coordinate MIDI events. The IO action passed to reactimate would then write the samples to the pointer. MIDI events would be similarly fired from another callback and also contain timing info.
This is where I get stuck however. I guess the audio signal is supposed to be a behaviour, but how do I "run" a behaviour for the right amount of time to obtain the samples? The right amount of course depends on MIDI events that might occur between two audio callbacks.
Presuming the intention is to do something live, I think firing an event for each callback is going to be extremely limiting. Most audio APIs expect that these callbacks will return very quickly (e.g. typically you would never call malloc or do blocking IO in one). Firing an FRP event may work for very simple processing, but I think if you try to do anything more complex you'll get dropouts in the audio stream.
I would expect a more viable approach is to fire events yourself (by a clock, or in response to GUI events, etc) and generate a buffer of audio, and have the callback API read from that buffer. I know that some audio APIs (e.g. portaudio) have a buffered mode which handles some of this automatically. Although if all you have is a callback API, it's not too hard to add a buffer on top of that.
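A buffer layered on top of a callback API might be sketched as follows in Python. The class name SampleFifo and its methods are hypothetical, and a deque stands in for whatever thread-safe queue the real audio binding would need.

```python
from collections import deque

class SampleFifo:
    """FIFO between a producer (e.g. the FRP network) and the audio callback."""
    def __init__(self):
        self.samples = deque()

    def write(self, block):
        # Called by the producer, outside the audio callback.
        self.samples.extend(block)

    def read(self, n):
        # Called from the audio callback: never blocks; pads with silence on underrun.
        return [self.samples.popleft() if self.samples else 0.0 for _ in range(n)]

fifo = SampleFifo()
fifo.write([0.1, 0.2, 0.3])
first = fifo.read(2)   # [0.1, 0.2]
second = fifo.read(2)  # [0.3, 0.0], one sample of silence because of the underrun
```

Padding with silence keeps the callback fast at the price of an audible gap, which is usually preferable to stalling the audio thread.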
To approach problems like this, I find it useful to take a semantic viewpoint: What is an audio signal? What type can I use to represent it?
Essentially, an audio signal is a time-varying amplitude
Audio = Time -> Double
which suggests the representation as a behavior
type Audio = Behavior Double
Then, we can use the <@> combinator to query the amplitude at a particular moment in time, namely whenever an event happens.
However, for reasons of efficiency, audio data is generally stored in blocks of 64 samples (or 128, 256). After all, processing needs to be fast and it's important to use tight inner loops. This suggests modelling audio data as a behavior
type Audio = Behavior (Vector Double)
whose values are 64-sample blocks of audio data and which changes whenever the time period corresponding to 64 samples is over.
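The step from a time-varying amplitude to fixed-size blocks can be illustrated outside Haskell as well. The Python sketch below treats a signal as a plain function from time in seconds to amplitude; render_block, sine_440, and the 44.1 kHz sample rate are names and values assumed for the example.

```python
import math

def render_block(signal, start_frame, block_size, sample_rate):
    """Sample a continuous signal (time in seconds -> amplitude) into one block."""
    return [signal((start_frame + i) / sample_rate) for i in range(block_size)]

def sine_440(t):
    return math.sin(2.0 * math.pi * 440.0 * t)

# Two consecutive blocks of the same signal.
block0 = render_block(sine_440, 0, 64, 44100)
block1 = render_block(sine_440, 64, 64, 44100)
```

Each block is one value of the block-valued behavior, and the next block takes over once the previous block's time span has elapsed.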
Connecting to other APIs is done only after the semantic model has been clarified. In this case, it seems a good idea to write the audio data from the behavior into a buffer, whose contents are then presented whenever the external API calls your callback.
By the way, I don't know whether reactive-banana-0.8 is fast enough yet to be useful for sample-level audio processing. It shouldn't be too bad, but you may have to choose a rather large block size.

Streaming output from program to an arbitrary number of programs under Linux?

How should I stream the output from one program to an undefined number of programs in such a fashion that the data isn't buffered anywhere and that the application where the stream originates from doesn't block even if there's nothing reading the stream, but the programs reading the stream do block if there's no output from the first-mentioned program?
I've been trying to Google around for a while now, but all I find is methods where the program does block if nothing is reading the stream.
"How should I stream the output from one program to an undefined number of programs in such a fashion that the data isn't buffered anywhere and that the application where the stream originates from doesn't block even if there's nothing reading the stream"
Your requirements as stated cannot possibly be satisfied without some form of buffer.
The most straightforward option is to write the output to a file and let consumers read that file.
Another option is to have a ring buffer in the form of a memory-mapped file. As the capacity of a ring buffer is normally fixed, there needs to be a policy for dealing with slow consumers. The options are: block the producer; terminate the slow consumer; or let the slow consumer somehow recover when it has missed data.
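The third policy (let the slow consumer recover) can be sketched in Python as below. This is an in-process illustration only: a plain list stands in for the memory-mapped file, and BroadcastRing with its read and write methods is a hypothetical design, not an existing library.

```python
class BroadcastRing:
    """Fixed-capacity ring: the writer never blocks; each reader keeps its own position."""
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.written = 0  # total number of items ever written

    def write(self, item):
        self.slots[self.written % len(self.slots)] = item
        self.written += 1

    def read(self, position):
        """Return (items, new_position, missed) for a reader at `position`."""
        oldest = max(0, self.written - len(self.slots))
        missed = max(0, oldest - position)  # items overwritten before being read
        start = max(position, oldest)
        items = [self.slots[i % len(self.slots)] for i in range(start, self.written)]
        return items, self.written, missed

ring = BroadcastRing(4)
for i in range(3):
    ring.write(i)
fast_items, fast_pos, fast_missed = ring.read(0)   # the fast reader keeps up
for i in range(3, 6):
    ring.write(i)
fast_again = ring.read(fast_pos)
slow_items, slow_pos, slow_missed = ring.read(0)   # the slow reader lost items 0 and 1
```

The writer only ever overwrites the oldest slot, so it cannot be held up by any reader. A reader that falls behind learns how many items it missed and resumes from the oldest item still available. Making readers block while no new data exists would need an extra signalling mechanism that this sketch leaves out.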
Many years ago I wrote something like what you describe for an audio stream processing app. It's on GitHub as splitter.cpp and has a small man page.
The splitter program currently does not support dynamically changing the set of output programs. The output programs are fixed when the command is started.
Without knowing exactly what sort of data you are talking about (how large it is, what format it is in, and so on), it is hard to come up with a concrete answer. Say, for example, you want a "ticker-tape" application that sends out information about share purchases on the stock exchange: you could quite easily have a server that accepts a socket from each application, starts a thread, and sends the relevant data as it appears from the recorder at the stock market. I'm not aware of any "multiplexer" that exists today (but Greg's one may be a starting point). If you use (for example) XML to package the data, you could send the second half of a packet, and the client code would detect that it's incomplete and throw it away.
If, on the other hand, you are sending out high detail live update weather maps for the whole country, the data is probably large enough that you don't want to wait for a full new one to arrive, so you need some sort of lock'n'load protocol that sets the current updated map, and then sends that one out until (say) 1 minute later you have a new one. Again, it's not that complex to write some code to do this, but it's quite a different set of code to the "ticker tape" solution above, because the packet of data is larger, and getting "half a packet" is quite wasteful and completely useless.
If you are streaming live video from the 2016 Olympics in Brazil, then you probably want yet another solution, as timing is everything with video: the client needs to buffer, pick up key-frames, throw away "stale" frames, and so on, and the server will have to be different too.

Has anybody some advice on programming realtime audio synthesis?

I'm currently working on a personal project: creating a library for realtime audio synthesis in Flash. In short: tools to connect wave generators, filters, mixers, etc. with each other and supply the sound card with raw (realtime) data. Something like Max/MSP or Reaktor.
I already have some working stuff, but I'm wondering if the basic setup that I wrote is right. I don't want to run into problems later on that force me to change the core of my app (although that can always happen).
Basically, what I do now is start at the end of the chain, at the place where the (raw) sound data goes 'out' (to the sound card). To do that, I need to write chunks of bytes (ByteArrays) to an object, and to get that chunk I ask whatever module is connected to my 'Sound Out' module to give me its chunk. That module makes the same request to the module that's connected to its input, and that keeps happening until the start of the chain is reached.
Is this the right approach? I can imagine running into problems if there's a feedback loop, or if there's another module with no output: if I were to connect a spectrum analyzer somewhere, that would be a dead end in the chain (a module with no outputs, just an input). In my current setup, such a module wouldn't work because I only start calculating from the sound-output module.
Does anyone have experience with programming something like this? I'd be very interested in some thoughts about the right approach. (For clarity: I'm not looking for specific Flash implementations, and that's why I didn't tag this question under flash or actionscript.)
I did a similar thing a while back, and I used the same approach as you do - start at the virtual line out, and trace the signal back to the top. I did this per sample though, not per buffer; if I were to write the same application today, I might choose per-buffer instead though, because I suspect it would perform better.
The spectrometer was designed as an insert module, that is, it would only work if both its input and its output were connected, and it would pass its input to the output unchanged.
To handle feedback, I had a special helper module that introduced a 1-sample delay and would only fetch its input once per cycle.
Also, I think doing all your internal processing with floats, and thus arrays of floats as the buffers, would be a lot easier than byte arrays, and it would save you the extra effort of converting between integers and floats all the time.
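A minimal Python sketch of this pull model, per buffer and with float samples, might look like the following. The module classes (Constant, Gain, PeakMeter) are invented for illustration; PeakMeter plays the role of the insert-style analyser that passes its input through unchanged.

```python
class Constant:
    """Source module: produces a block of identical float samples."""
    def __init__(self, value):
        self.value = value

    def pull(self, n):
        return [self.value] * n

class Gain:
    """Processing module: pulls from its source and scales the block."""
    def __init__(self, source, factor):
        self.source, self.factor = source, factor

    def pull(self, n):
        return [s * self.factor for s in self.source.pull(n)]

class PeakMeter:
    """Insert-style analyser: records the peak, passes its input through unchanged."""
    def __init__(self, source):
        self.source, self.peak = source, 0.0

    def pull(self, n):
        block = self.source.pull(n)
        self.peak = max(abs(s) for s in block)
        return block

# Chain: Constant -> Gain -> PeakMeter -> Gain (line out)
meter = PeakMeter(Gain(Constant(0.5), 0.5))
line_out = Gain(meter, 2.0)
block = line_out.pull(4)
```

Because the meter sits in the signal path, pulling from the line out is enough to keep it updated, so no separate sink is needed.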
In later versions you may have different packet rates in different parts of your network. One example would be if you extend it to transfer data to or from disk. Another example would be that low data rate control variables such as one controlling echo-delay may, later, become a part of your network. You probably don't want to process control variables with the same frequency that you process audio packets, but they are still 'real time' and part of the function network. They may for example need smoothing to avoid sudden transitions.
As long as you are calling all your functions at the same rate, and all the functions are essentially taking constant-time, your pull-the-data approach will work fine. There will be little to choose between pulling data and pushing. Pulling is somewhat more natural for playing audio, pushing is somewhat more natural for recording, but either works and ends up making the same calls to the underlying audio processing functions.
For the spectrometer you've got the issue of multiple sinks for data, but it is not a problem. Introduce a dummy link to it from the real sink. The dummy link can cause a request for data that is not honoured. As long as the dummy link knows it is a dummy and does not care about the lack of data, everything will be OK. This is a standard technique for reducing multiple sinks or sources to a single one.
With this kind of network you do not want to do the same calculation twice in one complete update. For example if you mix a high-passed and low-passed version of a signal you do not want to evaluate the original signal twice. You must do something like record a timer tick value with each buffer, and stop propagation of pulls when you see the current tick value is already present. This same mechanism will also protect you against feedback loops in evaluation.
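The tick mechanism could be sketched in Python like this. The Node class and its fields are hypothetical; the point is only that a node remembers the tick of its last evaluation and returns its cached buffer when asked again within the same tick.

```python
class Node:
    """Pull node that computes at most once per tick, however many consumers it has."""
    def __init__(self, compute, *inputs):
        self.compute, self.inputs = compute, inputs
        self.tick, self.buffer, self.evaluations = None, None, 0

    def pull(self, tick, n):
        if self.tick != tick:
            self.tick = tick  # mark first, so a pull arriving via a feedback loop stops here
            if self.buffer is None:
                self.buffer = [0.0] * n  # silence for re-entrant pulls on the first tick
            blocks = [node.pull(tick, n) for node in self.inputs]
            self.buffer = self.compute(n, *blocks)
            self.evaluations += 1
        return self.buffer

# One source feeding two filters (stand-ins for low-pass and high-pass) that are mixed.
source = Node(lambda n: [1.0] * n)
low = Node(lambda n, x: [s * 0.25 for s in x], source)
high = Node(lambda n, x: [s * 0.75 for s in x], source)
mix = Node(lambda n, a, b: [p + q for p, q in zip(a, b)], low, high)

mixed = mix.pull(tick=1, n=4)
```

After this pull, source.evaluations is 1 even though two branches requested its data. In a feedback loop the re-entrant pull receives the previous tick's buffer, which amounts to a one-buffer delay.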
So, those two issues of concern to you are easily addressed within your current framework.
Rate matching where there are different packet rates in different parts of the network is where the problems with the current approach will start. If you are writing audio to disk then for efficiency you'll want to write large chunks infrequently. You don't want to be blocking your servicing of the more frequent small audio input and output processing packets during those writes. A single rate pulling or pushing strategy on its own won't be enough.
Just accept that at some point you may need a more sophisticated way of updating than a single rate network. When that happens you'll need threads for the different rates that are running, or you'll write your own simple scheduler, possibly as simple as calling less frequently evaluated functions one time in n, to make the rates match. You don't need to plan ahead for this. Your audio functions are almost certainly already delegating responsibility for ensuring their input buffers are ready to other functions, and it will only be those other functions that need to change, not the audio functions themselves.
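The "one time in n" scheduler mentioned above can be as small as the following Python sketch. RateDivider, flush, and the two lists are hypothetical names; the written list merely stands in for large, infrequent disk writes.

```python
class RateDivider:
    """Runs `task` once every `n` audio blocks: a minimal single-thread scheduler."""
    def __init__(self, task, n):
        self.task, self.n, self.count = task, n, 0

    def on_block(self):
        self.count += 1
        if self.count % self.n == 0:
            self.task()

pending = []  # small audio blocks accumulated between slow writes
written = []  # stands in for large, infrequent disk writes

def flush():
    written.append(list(pending))
    pending.clear()

divider = RateDivider(flush, n=4)
for block_id in range(8):
    pending.append(block_id)  # per-block work at the audio rate
    divider.on_block()        # the slow task runs on every fourth block
```

This only matches the rates. A task that really blocks, such as a disk write, would still stall the audio loop if run on the same thread, which is where the separate threads mentioned above come in.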
The one thing I would advise at this stage is to be careful to centralise audio buffer allocation, noticing that buffers are like fenceposts. They don't belong to an audio function, they lie between the audio functions. Centralising the buffer allocation will make it easy to retrospectively modify the update strategy for different rates in different parts of the network.
