DirectShow: rate matching, time stamps and the DirectSound Audio Renderer

Can anyone give me a concise explanation of how and why the DirectShow DirectSound Audio Renderer will adjust the rate when my custom capture filter does not expose a clock?
I cannot make any sense of it at all. When audio starts, I assign an rtStart of zero plus the duration of the sample (numbytes / m_wfx.nAvgBytesPerSec). Then the next sample has a start time of the end of the previous sample, and so on.
Some time later, the capture filter senses that DirectShow is consuming samples too rapidly, and tries to set a timestamp some time in the future, which the audio renderer completely ignores. I can, as a test, suddenly tell a sample it must not be rendered until 20 seconds in the future (StreamTime() + UNITS), and again the renderer just ignores it. However, the Null Audio Renderer does what it is told, and the whole graph freezes for 20 seconds, which is the expected behaviour.
In a nutshell, then, I want the audio renderer to use either my capture clock (or its own, or the graph's, I don't care), but I do need it to obey the time stamps I'm sending it. What I need it to do is squish or stretch samples, ever so subtly, to make up for the difference between the rates of DSound and the incoming stream (whose rate I cannot control).

MSDN explains the technology here: Live Sources. I suppose you are aware of this documentation topic.
Rate matching takes place when your source is live; otherwise the audio renderer does not need to bother, and it expects the source to keep the input queue pre-loaded with data, so that data is consumed at the rate it is needed.
It seems that your filter is capturing in real time (it is a capture filter, and you mention you don't control the rate of the data you obtain externally). So you need to make sure your capture filter is recognized as a live source, and then you choose the clock for playback and, overall, the mode of operation. I suppose you want the behavior described under AM_PUSHSOURCECAPS_PRIVATE_CLOCK:
The source filter is using a private clock to generate time stamps. In this case, the audio renderer matches rates against the time stamps.
This is what you describe above:
you time stamp according to the external source
playback uses the audio device clock
the audio renderer does rate matching to reconcile the two
To see how exactly rate matching takes place, you need to open the audio renderer's property pages, the Advanced page:
Data under Slaving Info will show the rate matching details (48000/48300 matching in my example). The same data is also available programmatically via IAMAudioRendererStats::GetStatParam.
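For completeness, the same slaving information can be read from code. A minimal sketch, assuming pAudioRenderer is the IBaseFilter* of the DirectSound renderer already added to your graph, and with error handling trimmed:

```cpp
#include <dshow.h>   // IAMAudioRendererStats comes in via strmif.h

// Query the audio renderer for its current rate-matching ("slaving") rate.
void DumpSlaveRate(IBaseFilter *pAudioRenderer)
{
    IAMAudioRendererStats *pStats = NULL;
    if (SUCCEEDED(pAudioRenderer->QueryInterface(IID_IAMAudioRendererStats,
                                                 (void **)&pStats)))
    {
        DWORD slaveRate = 0, reserved = 0;
        // This is the same figure the Advanced property page shows under
        // "Slaving Info", e.g. 48300 against a nominal 48000.
        if (SUCCEEDED(pStats->GetStatParam(AM_AUDREND_STAT_PARAM_SLAVE_RATE,
                                           &slaveRate, &reserved)))
        {
            // Log or display slaveRate here.
        }
        pStats->Release();
    }
}
```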


Add, remove and tune filters on running ffmpeg

Preface
I have been fiddling around with ffmpeg and ffplay using command line adding and tuning filters and effects.
It quickly becomes rather tiresome to:
1. start playback with some audio file
2. stop
3. modify the command
4. go back to 1.
when, for example, fine-tuning noise reduction or adding effects and an equalizer.
I have played around with using zmq to tune filters by executing commands in a different terminal, but this also becomes somewhat cumbersome.
I want some interface where I can add, remove and tune filters during runtime / while listening to the changes taking effect.
FFMPEG
I use filter to mean effect / filter from here on out. For example afftdn, rubberband, ...
ffmpeg is somewhat intimidating. It's powerful but also complex, at least when starting to dig into it. :0
Looking at the library and examples, I am looking at the API example for audio decoding and filtering, which at least at first glance looks promising as a starting point.
Output
I imagine it would be best to have multiple sinks or some container with multiple audio tracks:
Raw audio
Audio with effects applied
Optionally:
Raw audio
Audio with all filters
Audio with filter group 1
Audio with filter group 2
... etc.
Processing
I imagine the routine would have to be something like:
1. Read a packet from the stream/file/URL
2. Unpack the sample
3. Copy / duplicate the sample for each filter group (or one copy for all filters)
4. Apply the filter(s) to these "effect" samples
5. Write raw audio, filtered audio 1, filtered audio 2, ..., filtered audio N to the output
Or, for steps 3-5 (since one would only be listening to one track at a time, though this is perhaps not ideal if one decides to jump back/forth in the audio stream):
Apply the currently active filter(s)
Write raw audio and filtered audio to the output
Simultaneously one would read and check for changes to the filters via some interface. E.g. the input:
afftdn=rf=-20:nr=20
then, if afftdn is not present in the filter chain, add it; otherwise set the new values.
The idea is to also output the raw audio, i.e. use this program in a sampling and tuning phase, and then produce a line of filter options one can use with the ffmpeg tool to process the audio files once satisfied.
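As a rough sketch of what that could look like against libavfilter (an outline only, not a tested program; the 44100 Hz / fltp / stereo input format is an assumption, use whatever your decoder actually reports): the usual pattern is abuffer -> your filter description -> abuffersink, rebuilding the graph when the chain itself changes, and using avfilter_graph_send_command() to tune parameters of filters that expose them as runtime commands, which is essentially the mechanism the zmq filter relies on.

```cpp
// Sketch: (re)build "abuffer -> <desc> -> abuffersink" and tune it at runtime.
extern "C" {
#include <libavfilter/avfilter.h>
#include <libavfilter/buffersrc.h>
#include <libavfilter/buffersink.h>
#include <libavutil/mem.h>
}
#include <string>

struct FilterChain {
    AVFilterGraph   *graph = nullptr;
    AVFilterContext *src   = nullptr;   // feed decoded frames in with av_buffersrc_add_frame()
    AVFilterContext *sink  = nullptr;   // pull filtered frames out with av_buffersink_get_frame()
};

// Tear down any previous graph and build a new one from `desc`,
// e.g. "afftdn=rf=-20:nr=20", or "anull" for a pass-through (raw) chain.
static bool build_chain(FilterChain &fc, const std::string &desc)
{
    avfilter_graph_free(&fc.graph);
    fc.graph = avfilter_graph_alloc();
    if (!fc.graph) return false;

    // Input format is a placeholder; take it from the decoder in a real program.
    const char *args =
        "sample_rate=44100:sample_fmt=fltp:channel_layout=stereo:time_base=1/44100";

    if (avfilter_graph_create_filter(&fc.src, avfilter_get_by_name("abuffer"),
                                     "in", args, nullptr, fc.graph) < 0) return false;
    if (avfilter_graph_create_filter(&fc.sink, avfilter_get_by_name("abuffersink"),
                                     "out", nullptr, nullptr, fc.graph) < 0) return false;

    // Describe the loose ends so the parser can splice `desc` between src and sink.
    AVFilterInOut *outputs = avfilter_inout_alloc();
    AVFilterInOut *inputs  = avfilter_inout_alloc();
    outputs->name = av_strdup("in");  outputs->filter_ctx = fc.src;  outputs->pad_idx = 0; outputs->next = nullptr;
    inputs->name  = av_strdup("out"); inputs->filter_ctx  = fc.sink; inputs->pad_idx  = 0; inputs->next  = nullptr;

    int ret = avfilter_graph_parse_ptr(fc.graph, desc.c_str(), &inputs, &outputs, nullptr);
    avfilter_inout_free(&inputs);
    avfilter_inout_free(&outputs);
    return ret >= 0 && avfilter_graph_config(fc.graph, nullptr) >= 0;
}

// Tune an existing filter without rebuilding, provided the option is listed
// under "Commands" in that filter's documentation.
static void tune(FilterChain &fc, const char *filter, const char *opt, const char *val)
{
    char response[128] = {0};
    avfilter_graph_send_command(fc.graph, filter, opt, val, response, sizeof(response), 0);
}
```

Adding or removing a filter would then be build_chain(fc, new_desc) with the updated description, while something like tune(fc, "afftdn", "nr", "25") adjusts a live parameter, assuming your ffmpeg version exposes that option as a command.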
Questions section
Does something similar exist?
General:
Does this seem like a sensible way to do it and to use the ffmpeg library?
Can one add, remove and change filter values at runtime, or does one have to re-initialize the entire stream for each added / removed filter etc.?
Is the “Processing” part sound?
Would using a container that supports multiple audio tracks be the likely best solution? E.g. mp4.
Any container preferred over others?
Any drawbacks (e.g. jumping back / forth in the stream)?
Sub-note
The dream is to have an Arduino interfacing with this program, where I use physical rotary switches, incremental rotary encoders, buttons and whistles to tune the various filter options with physical knobs. But first I need some working sample where I use a FIFO or whatever to test ffmpeg itself.

Sync music to frame-based time

I'm making a game in which there are a series of events (which happen, say, every 30 frames in a 60 fps setting) that I want to sync with the music (at 120 bpm). In usual cases, e.g. rhythm games, syncing the events to the music is easier, because humans seem to perceive much smaller gaps in music than in video. However, in my case, the game heavily depends on frame-based time, and a lot of things will break if I change the schedule of my series of events.
After a lot of experiments, it seems to me almost impossible to tweak the music without disturbing the human ear: A jump of ~1ms is noticeable, a ~10ms discrepancy between video and audio is noticeable, a 0.5% change in the pitch is noticeable. And I don't have handy tools to speed up audio without changing the pitch.
What is the easiest way out in this circumstance? Is there any reference on this subject that I can refer to? Any advice is appreciated!
The method that I successfully use (in Java) is to route the playback signal through a path that allows the counting of PCM frames (audio frames run at rates like 44100 fps, as opposed to screen updates, which run at rates like 60 fps). I don't know about other languages, but with Java this can be done by outputting via a SourceDataLine. As the audio frame count is incremented, it can be compared to the next (pending) item in a collection of events that require triggers to other systems or threads. Java has an excellent class for handling the collection of events: ConcurrentSkipListSet. It is asynchronous, and automatically sorts elements via a Comparator set to the desired PCM frame count.
Some example code showing the counting of frames can be seen in the tutorial Using Files and Format Converters, if you search the page for the phrase "Here, do something useful with the audio data". They are counting bytes, not PCM frames, but the example does give the basic idea.
Why is counting PCM frames effective? I think this has to do with the fact that this code (in Java) is the closest we get to the point where audio data is fed to the native code controlling the sound system, and that this code employs a blocking queue. Thus, the write operations only happen when the audio system is ready to receive and play back more sound data, and audio systems have to be very accurate in how they maintain their rate of processing. The amount of time variance that occurs here (especially if the thread is given a high priority) is smaller than the time variance incurred by the choices made by the JVM as it juggles multiple threads and processes.
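The above is Java-specific (SourceDataLine, ConcurrentSkipListSet). Just to illustrate the same frame-counting idea outside Java, here is a rough C++ sketch in which writeAudio() is a hypothetical stand-in for a blocking audio write and a sorted std::multimap plays the role of the skip-list set:

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Hypothetical blocking output; a real program would hand the chunk to the audio API,
// which only returns when the device can accept more data (like SourceDataLine.write()).
static void writeAudio(const std::vector<int16_t> &chunk) { (void)chunk; }

int main()
{
    // Events keyed by the PCM frame at which they should fire (44100 frames per second here).
    std::multimap<int64_t, std::function<void()>> events;
    events.emplace(22050, [] { /* trigger the half-second event */ });
    events.emplace(44100, [] { /* trigger the one-second event  */ });

    const int64_t chunkFrames = 1024;          // frames pushed per write
    std::vector<int16_t> chunk(chunkFrames);   // mono, 16-bit for simplicity
    int64_t framesWritten = 0;

    while (!events.empty()) {
        // ... fill `chunk` with the next 1024 frames of audio here ...
        writeAudio(chunk);                     // paced by the audio device, not the video frame rate
        framesWritten += chunkFrames;

        // Fire every event whose scheduled frame has now been written.
        auto it = events.begin();
        while (it != events.end() && it->first <= framesWritten) {
            it->second();
            it = events.erase(it);
        }
    }
}
```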

pcm capture using alsa

I'm new to ALSA sound programming. I'm developing an application in C to record audio into a WAV file. I did some research on the net but am still not very clear about many topics. Please help.
This is the configuration I'm setting.
access : SND_PCM_ACCESS_RW_INTERLEAVED
format: S16_LE
rate: 16000
channel: 1
I have a few doubts:
I'm quite confused about the period size and period time settings.
What is the difference between snd_pcm_hw_params_set_period_time_near() and snd_pcm_hw_params_set_period_size_near()? Which API should be called for capture? Similarly, there are snd_pcm_hw_params_set_buffer_time_near() and snd_pcm_hw_params_set_buffer_size_near(). How do I decide between these two APIs?
How do I decide the period size value? I believe the same value is used in the snd_pcm_sw_params_set_avail_min() call.
What value should be used for the number of frames to be read in snd_pcm_readi()?
What is the importance of the snd_pcm_sw_params_set_avail_min() and snd_pcm_sw_params_set_start_threshold() APIs? Is it a must to call them?
I'm referring to the arecord implementation and another example capture code.
Thanks in advance.
The period time describes the same parameter as the period size. It might be useful if the rate is not yet known.
You get interrupts (i.e., the opportunity to get woken up if you're waiting for data) at the end of each period. If you know how much data you want to read each time, try to use that as period size.
Read as many frames as you want to process.
The avail_min parameter specifies how many frames must be available before an interrupt results in your application actually being woken up.
The start threshold specifies that the device starts automatically when you try to read that many frames.
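Putting those answers together, a minimal capture sketch with the settings from the question (most error handling trimmed; the 1024-frame period is just an illustrative choice, and writing the WAV header is left out):

```cpp
#include <alsa/asoundlib.h>
#include <cerrno>
#include <vector>

int main()
{
    // Open the default device for capture, blocking mode.
    snd_pcm_t *pcm = nullptr;
    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_CAPTURE, 0) < 0) return 1;

    snd_pcm_hw_params_t *hw = nullptr;
    snd_pcm_hw_params_alloca(&hw);
    snd_pcm_hw_params_any(pcm, hw);
    snd_pcm_hw_params_set_access(pcm, hw, SND_PCM_ACCESS_RW_INTERLEAVED);
    snd_pcm_hw_params_set_format(pcm, hw, SND_PCM_FORMAT_S16_LE);
    snd_pcm_hw_params_set_channels(pcm, hw, 1);
    unsigned int rate = 16000;
    snd_pcm_hw_params_set_rate_near(pcm, hw, &rate, nullptr);

    // Ask for a 1024-frame period (64 ms at 16 kHz); the driver may round it.
    snd_pcm_uframes_t period = 1024;
    snd_pcm_hw_params_set_period_size_near(pcm, hw, &period, nullptr);
    snd_pcm_hw_params(pcm, hw);                          // commit the configuration
    snd_pcm_hw_params_get_period_size(hw, &period, nullptr);

    std::vector<short> buf(period);                      // 1 channel, S16_LE
    for (int i = 0; i < 100; ++i) {                      // capture 100 periods, then stop
        snd_pcm_sframes_t n = snd_pcm_readi(pcm, buf.data(), period);
        if (n == -EPIPE) snd_pcm_prepare(pcm);           // overrun: recover and carry on
        else if (n > 0) { /* append n frames (2*n bytes) to the WAV data chunk here */ }
    }
    snd_pcm_close(pcm);
    return 0;
}
```

Reading exactly one period per snd_pcm_readi() call matches the "read as many frames as you want to process" advice above; avail_min and the start threshold are simply left at their defaults here.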

gnuradio phase drift of AM demodulation

I am beginning a project using GNUradio and an inexpensive SDR.
http://www.amazon.com/gp/product/B00SXZDUAQ?psc=1&redirect=true&ref_=oh_aui_search_detailpage
One portion of the project requires me to generate a reference audio tone and compare the phase of that tone to demodulated audio.
To simulate this portion of the system, I have generated a simple GNU Radio flowgraph:
I had some issues with the source and demodulated audio in that they would drift relative to each other. This occurred on the scope sync in the original flowgraph. To aid in troubleshooting, I sent the demodulated audio out through the soundcard's second channel and monitored both audio streams, in addition to the modulated RF, on an external oscilloscope:
Initially all seems well, but the demodulated audio drifts in relation to the original source and RF:
My question is: am I doing something wrong in the flowgraph or am I expecting too much performance out of an inexpensive SDR?
Thanks in advance for any insights
You cannot expect to see zero phase drift in anything short of a fully digital simulation, or a fully analog circuit with exactly one oscillator, because no two (physical) oscillators have identical frequencies.
In your case, there are two relevant oscillators involved:
The sample clock in the RTL-SDR unit.
The sample clock in your sound card output.
Within a GNU Radio flowgraph, there is no time reference per se, and everything depends on the sources and sinks which are connected to hardware.
The relevant source in your flowgraph is the RTL-SDR hardware; insofar as its oscillator is different from its nominal value (28.8 MHz, as it happens), everything it produces will be off-frequency in an absolute sense (both RF carrier frequencies and audio frequencies of demodulated output).
But you don't actually have an absolute frequency reference; you have the tone produced by your sound card. The sound card has its own oscillator, which determines the rate at which samples are converted to analog signals, and therefore the rate at which samples are consumed from the flowgraph.
Therefore, your reference signal will drift relative to your received and demodulated signal, at a rate determined by the difference in frequency error between the two oscillators.
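To put an illustrative number on that (the ±20 ppm tolerances here are assumptions, not specs of the RTL-SDR or your sound card): if each clock is off by up to 20 ppm in opposite directions, the worst-case relative error is 40 ppm, so a 1 kHz reference tone would slip one full cycle against the demodulated tone roughly every 1 / (40 × 10⁻⁶ × 1000 Hz) = 25 seconds, which is exactly the kind of slow, steady drift a scope makes very visible.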
Additionally, since your sound card will be accepting samples from the flowgraph at a slightly different real-time rate than the RTL-SDR is producing them, you will notice periodic glitches in the audio as the error accumulates and must be dealt with; they will start occurring either immediately (if the source is slower than the sink, requiring the sound card to play silence instead) or after a delay for buffers to hit their maximum size (if the source is faster than the sink, requiring the RTL-SDR to drop some samples).

How does youtube support starting playback from any part of the video?

Basically I'm trying to replicate YouTube's ability to begin video playback from any part of a hosted movie. So if you have a 60-minute video, a user could skip straight to the 30-minute mark without streaming the first 30 minutes. Does anyone have an idea how YouTube accomplishes this?
Well the player opens the HTTP resource like normal. When you hit the seek bar, the player requests a different portion of the file.
It passes a header like this:
Range: bytes=10001-
and the server serves the resource from that byte range. Depending on the codec, it will need to read until it gets to a sync frame to begin playback.
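As a sketch of what the client side amounts to, using libcurl (the URL is a placeholder), asking the server for everything from byte 10001 onward:

```cpp
#include <curl/curl.h>

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/video.mp4");
    curl_easy_setopt(curl, CURLOPT_RANGE, "10001-");   // sends "Range: bytes=10001-"
    // The partial content arrives via the default write callback (stdout here);
    // a real player would hand the bytes to its demuxer/decoder instead.
    CURLcode rc = curl_easy_perform(curl);

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```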
Video is a series of frames, played at a frame rate. That said, there are some rules about the order of what frames can be decoded.
Essentially, you have reference frames (called I-frames) and you have modification frames (called P-frames and B-frames)... It is generally true that a properly configured decoder will be able to join a stream on any I-frame (that is, start decoding), but not on P- and B-frames... So, when the user drags the slider, you're going to need to find the closest I-frame and decode from that...
This may of course be hidden under the hood of Flash for you, but that is what it will be doing...
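If you end up doing this yourself (outside Flash) with libavformat, "find the closest preceding I-frame" is what the AVSEEK_FLAG_BACKWARD flag asks for. A rough sketch, assuming fmt_ctx and stream_index come from the usual avformat_open_input() / avformat_find_stream_info() setup:

```cpp
extern "C" {
#include <libavformat/avformat.h>
}

// Seek to the keyframe at or before target_seconds so decoding can restart cleanly.
static int seek_to_keyframe(AVFormatContext *fmt_ctx, int stream_index, double target_seconds)
{
    AVStream *st = fmt_ctx->streams[stream_index];
    int64_t ts = (int64_t)(target_seconds / av_q2d(st->time_base));   // seconds -> stream time base
    return av_seek_frame(fmt_ctx, stream_index, ts, AVSEEK_FLAG_BACKWARD);
}
```

After the seek you would flush the decoder and decode forward, discarding frames until the requested time is reached.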
I don't know how YouTube does it, but if you're looking to replicate the functionality, check out Annodex. It's an open standard that is based on Ogg Theora, but with an extra XML metadata stream.
Annodex allows you to have links to named sections within the video or temporal URIs to specific times in the video. Using libannodex, the server can seek to the relevant part of the video and start serving it from there.
If I were to guess, it would be some sort of selective data retrieval, like the Range header in HTTP. That might even be what they use. You can find more about it here.
