Calculation of currentDelayMs in webrtc-internals - delay

What does currentDelayMs for a video stream actually represent?
How is currentDelayMs calculated in webrtc-internals, and why does it differ from RTT?
I tried to get an average of the delay for video and audio from the graphs. Sometimes the average delay comes out at about 2.25 s. I want to know what it represents and how it is calculated.

Related

How to get amplitude of an audio stream in an AudioGraph to build a SoundWave using Universal Windows?

I want to build a SoundWave by sampling an audio stream.
I read that a good method is to get the amplitude of the audio stream and represent it with a Polygon. But suppose we have an AudioGraph with just a DeviceInputNode and a FileOutputNode (a simple recorder).
How can I get the amplitude from a node of the AudioGraph?
What is the best way to periodize this sampling? Is a DispatcherTimer good enough?
Any help will be appreciated.
First, everything you care about is kind of here:
uwp AudioGraph audio processing
But since you have a different starting point, I'll explain some more core things.
An AudioGraph node is already periodized for you; that is generally how audio works. I think Win10 defaults to periods of 10 ms and/or 20 ms, but this can (theoretically) be set via the AudioGraphSettings.DesiredSamplesPerQuantum setting, together with AudioGraphSettings.QuantumSizeSelectionMode = QuantumSizeSelectionMode.ClosestToDesired. I believe the success of this actually depends on your audio hardware rather than the OS; my PC can only do 480 and 960. This number is how many samples of the audio signal are accumulated per channel (mono is one channel, stereo is two channels, and so on), and it also sets the callback timing as a by-product.
Win10 and most devices default to a 48000 Hz sample rate, which means they measure/output data that many times per second. So with my quantum size of 480 samples per frame of audio, I am getting 48000/480 = 100 frames every second, which means I get a frame every 10 milliseconds by default. If you set your quantum to 960 samples per frame, you would get 50 frames every second, or a frame every 20 ms.
To get a callback for that frame of audio every quantum, you need to register a handler for the AudioGraph.QuantumProcessed event. You can refer to the link above for how to do that.
So by default, a frame of data is stored in an array of 480 floats in [-1, +1]. To get the amplitude, you just average the absolute values of this data.
This part, including handling multiple channels of audio, is explained more thoroughly in my other post.
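As a minimal sketch of that amplitude step (in Python for brevity rather than the C# you would actually use in a UWP app, and assuming one quantum's floats have already been copied out of the node into a plain list):

```python
# Minimal sketch, assuming one quantum's samples are already available
# as a list of floats in [-1, +1] (480 samples per channel at 48 kHz = 10 ms).

def frame_amplitude(frame):
    """Average of the absolute sample values -- a rough loudness in [0, 1]."""
    return sum(abs(s) for s in frame) / len(frame)

# With the defaults above: 48000 Hz / 480 samples per quantum = 100 callbacks
# per second (every 10 ms); a 960-sample quantum would give 50 per second (20 ms).
```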
Have fun!

Creating audio level meter - signal normalization

I have a program which tracks an audio signal in real time. For every processed sample I can read its value, in the range [-1, 1].
I would like to create (and later display) an audio level meter. From what I understand, to do this I need to keep converting my audio signal in real time, on each channel, to dB and then display the dB values for each channel in some graphical form of bars.
I am a bit lost about how to do it, although it should be a simple matter. Would just normalizing from [-1, 1] to [0, 1] (like... (sample + 1)/2) and then calculating 20*log10 of each incoming sample do it?
You can't plot the signal directly, as it is always varying between positive and negative values.
Therefore you need to average out the strength of the signal over a block of samples.
Say you're sampling at 44.1 kHz; you might choose blocks of 4410 samples so that you're updating your display 10 times per second.
So you calculate the RMS of your 4410 samples - see http://en.wikipedia.org/wiki/Root_mean_square
The RMS value is always positive.
You can then convert this to dB:
dBV = 20 x log10(Vrms)
This assumes that your maximum signal -1 to +1 corresponds to -1 to +1 volt. You will need to do further adjustments if not.
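Putting the two steps together, a minimal sketch (Python, assuming the samples of one block are already available as floats in [-1, +1]):

```python
import math

def block_level_dbv(samples):
    """RMS of one block of samples in [-1, +1], converted to dBV.
    Assumes full scale corresponds to 1 V, as stated above."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return float("-inf")   # digital silence
    return 20.0 * math.log10(rms)

# At 44.1 kHz, feeding this blocks of 4410 samples updates the meter 10x per second.
```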

DirectShow, specifically Rate Matching, time stamps and the DirectSound Audio Renderer

Can anyone give me a concise explanation of how and why the DirectShow DirectSound Audio Renderer adjusts the rate when my custom capture filter does not expose a clock?
I cannot make any sense of it at all. When audio starts, I assign an rtStart of zero plus the duration of the sample (numbytes / m_wfx.nAvgBytesPerSec). Then the next sample has a start time equal to the end time of the previous sample, and so on...
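In other words (a rough Python-style sketch of the scheme just described; UNITS is the DirectShow reference-time constant, 10,000,000 units of 100 ns per second):

```python
# Sketch of the timestamping scheme described above; names are illustrative.
UNITS = 10_000_000   # DirectShow reference time: 100 ns units per second

def next_timestamps(rt_prev_end, num_bytes, avg_bytes_per_sec):
    """Each sample's rtStart is the previous sample's rtEnd."""
    duration = num_bytes * UNITS // avg_bytes_per_sec
    rt_start = rt_prev_end
    rt_end = rt_start + duration
    return rt_start, rt_end
```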
Some time later, the capture filter senses that DirectShow is consuming samples too rapidly and tries to set a timestamp some way in the future, which the audio renderer completely ignores. As a test, I can suddenly tell a sample it must not be rendered until 20 seconds in the future (StreamTime() + UNITS), and again the renderer just ignores it. However, the Null Audio Renderer does what it is told, and the whole graph freezes for 20 seconds, which is the expected behaviour.
In a nutshell, then, I want the audio renderer to use either my capture clock, or its own, or the graph's (I don't care), but I do need it to obey the timestamps I am sending it. What I need it to do is squish or stretch samples, ever so subtly, to make up for the difference between the DSound rate and the rate of the incoming stream (which I cannot control).
MSDN explains the technology here: Live Sources; I suppose you are aware of this documentation topic.
Rate matching takes place when your source is live; otherwise the audio renderer does not need to bother, and it expects the source to keep the input queue pre-loaded with data so that data is consumed at the rate it is needed.
It seems that your filter is capturing in real time (it is a capture filter, and you mention that you don't control the rate of the data you obtain externally). So you need to make sure your capture filter is recognized as a live source, and then choose the clock for playback and, more generally, the mode of operation. I suppose you want the behavior described here under AM_PUSHSOURCECAPS_PRIVATE_CLOCK:
the source filter is using a private clock to generate time stamps. In this case, the audio renderer matches rates against the time stamps.
This is what you describe above:
you time stamp according to an external source,
playback uses the audio device clock,
the audio renderer does rate matching to reconcile the two rates.
To see exactly how rate matching takes place, open the audio renderer's property pages, Advanced page:
The data under Slaving Info will show the rate matching details (48000/48300 matching in my example). The data is also available programmatically via IAMAudioRendererStats::GetStatParam.

Does ADPCM have some sample rate?

ADPCM is adaptive, so it has a variable sample rate. But does it have some average rate or something? Does it have frames of fixed time duration?
You have misunderstood it here :-). "Adaptive" doesn't mean that the sample rate is adjusted according to the signal it contains.
"Adaptive" means that the limited available delta steps (4 bits = only 16 possible values to encode a sample) are adapted to the signal by prediction. The codec attempts to estimate, from a given sample, what value the next sample may have, and adapts the delta steps to that.
If the signal changes little from sample to sample, the steps are chosen closer together than if the signal changes a lot. It is very unlikely that the signal goes from strongly oscillating to quiet from one sample to the next.
You can see that behavior if you encode a 100 Hz square wave with such an algorithm and re-open it in an audio editor that shows the waveform. When the waveform changes from one polarity to the other, the signal "speeds up" (the steps get further and further apart) until it reaches the other end, and then it slows down again (the steps get closer and closer together).
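To make the "adaptive step" idea concrete, here is a toy sketch of an adaptive delta quantizer. It is deliberately simplified: this is not the actual IMA/Microsoft ADPCM bitstream, and the step-adaptation rule is invented purely for illustration.

```python
def toy_adpcm_encode(samples, initial_step=0.01):
    """Encode floats in [-1, +1] into 4-bit codes with an adaptive step size."""
    predicted = 0.0
    step = initial_step
    codes = []
    for s in samples:
        diff = s - predicted
        sign = 1 if diff < 0 else 0
        magnitude = min(7, int(abs(diff) / step))      # 3 bits of magnitude
        codes.append((sign << 3) | magnitude)
        # Reconstruct the way a decoder would, so encoder and decoder stay in sync.
        predicted += (-step * magnitude) if sign else (step * magnitude)
        # Adapt: large deltas widen the steps, small deltas narrow them.
        step *= 1.6 if magnitude >= 4 else 0.9
        step = min(1.0, max(1e-4, step))
    return codes
```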
It still has a fixed sample rate: the one you give it. In RIFF WAVE, the sample rate is stored in the header.
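If you want to confirm that, here is a minimal sketch that reads the sample rate straight out of the RIFF header, parsed by hand with struct (Python's built-in wave module rejects compressed formats such as ADPCM):

```python
import struct

def wav_sample_rate(path):
    """Return nSamplesPerSec from the 'fmt ' chunk of a RIFF/WAVE file."""
    with open(path, "rb") as f:
        riff, _size, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("no 'fmt ' chunk found")
            chunk_id, size = struct.unpack("<4sI", header)
            data = f.read(size + (size & 1))     # chunks are word-aligned
            if chunk_id == b"fmt ":
                _format_tag, _channels, sample_rate = struct.unpack("<HHI", data[:8])
                return sample_rate
```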

Deciding on length of FFT

I am working on a tool to compare two wave files for similarity in their waveforms. For example, I have a wave file of duration 1 min, and I make another wave file from the first one but set each alternate 5-second block of data to 0.
Now my software reports that there is a waveform difference in the intervals 5 sec to 10 sec, 15 sec to 20 sec, 25 sec to 30 sec, and so on...
As of now, with initial development, this is working fine.
Following are 3 test sets:
I have two wave files with a sampling rate of 960 Hz, mono, with 138551 data samples (around a 1 min 12 sec file). I am using a 128-point FFT (splitting the file into 128-sample chunks) and the results are good.
When I use the same algorithm on wave files with a sampling rate of 48 kHz, 2 channels, with 6927361 data samples per channel (around a 2 min 24 sec file), the process becomes too slow. When I use a 4096-point FFT, the performance is better.
But the 4096-point FFT on files of 22050 Hz, 2 channels, with 55776 data samples per channel (around a 0.6 sec file) gives very poor results. In this case a 128-point FFT gives good results.
So I am confused about how to decide the length of the FFT so that my results are good in each case.
I guess the length should depend on the number of samples and the sampling rate.
Please give your inputs on this.
Thanks
The length of the FFT, N, will determine the resolution in the frequency domain:
resolution (Hz) = sample_rate (Hz) / N
So for example in case (1) you have resolution = 960 / 128 = 7.5 Hz. So each bin in the resulting FFT (or presumably the power spectrum derived from it) will be 7.5 Hz wide, and you will be able to differentiate between frequency components which are at least this far apart.
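A quick numeric check of that formula for the combinations mentioned in the question:

```python
# resolution (Hz) = sample_rate (Hz) / N, for the combinations mentioned above
for sample_rate, n in [(960, 128), (48000, 4096), (22050, 4096), (22050, 128)]:
    print(f"fs = {sample_rate:>5} Hz, N = {n:>4}: {sample_rate / n:7.2f} Hz per bin")
```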
Since you don't say what kind of waveforms these are, or what the purpose of your application is, it's hard to know what kind of resolution you need.
One important further point - many people using FFT for the first time are unaware that in general you need to apply a window function prior to the FFT to avoid spectral leakage.
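For example, a minimal sketch of that step (assuming NumPy is available; any FFT library works the same way):

```python
import numpy as np

def windowed_spectrum(frame):
    """Apply a Hann window before the FFT to reduce spectral leakage."""
    windowed = frame * np.hanning(len(frame))
    return np.abs(np.fft.rfft(windowed))        # magnitude spectrum of one frame
```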
I have to say I have found your question very cryptic. I think you should look into the short-time Fourier transform (STFT). The reason I say this is that you are looking at quite a large number of samples if you use a sampling frequency of 44.1 kHz over 2 min with 2 channels. One FFT across the entire signal will take quite a while indeed, not to mention that the estimate will be biased, as the signal's mean and variance will change drastically over the whole duration. To avoid this you want to frame the time-domain signal first; these frames can be as small as 20 ms - 40 ms (commonly used for speech) and are often overlapping (Welch's method of spectral estimation). Then you apply a window function such as a Hamming or Hanning window to reduce spectral leakage and calculate an N-point FFT for each frame, where N is the next power of two above the number of samples in that frame.
For example:
Fs = 8 kHz, single channel;
time = 120 sec;
no_samples = time * Fs = 960000;
frame length T_length = 20 ms;
frame length in samples N_length = 160;
frame overlap T_overlap = 10 ms;
frame overlap in samples N_overlap = 80;
number of frames N_frames = (no_samples - (N_length - N_overlap)) / N_overlap = 11999;
FFT length = 256;
So you will be processing 11999 frames in total, but your FFT length will be small. You will only need an FFT length of 256 (the next power of two above the frame length of 160). Most algorithms that implement the FFT require the signal length and the FFT length to be the same; all you have to do is append zeros to each framed signal up to 256. So pad each frame with x zeros, where x = FFT_length - N_length. My latest Android app does this on recorded speech and uses the short-time FFT data to display the spectrogram of the speech, as well as performing various spectral modification and filtering; it's called Speech Enhancement for Android.
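Here is a short sketch of that framing-plus-FFT scheme (NumPy assumed; the numbers mirror the example above):

```python
import numpy as np

def stft_frames(signal, fs=8000, frame_ms=20, hop_ms=10, n_fft=256):
    """Frame a mono signal with overlap, window each frame, and FFT it."""
    n_length = int(fs * frame_ms / 1000)    # 160 samples per frame
    n_hop = int(fs * hop_ms / 1000)         # 80-sample hop = 10 ms overlap
    window = np.hamming(n_length)
    spectra = []
    for start in range(0, len(signal) - n_length + 1, n_hop):
        frame = signal[start:start + n_length] * window
        spectra.append(np.fft.rfft(frame, n=n_fft))   # rfft zero-pads 160 -> 256
    return np.array(spectra)

# 120 s at 8 kHz -> 960000 samples -> 11999 frames, each with a 256-point FFT.
```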
