Mixing two audio streams in Windows Media Foundation - audio

I am trying to mix two audio streams in Windows Media Foundation. In MATLAB I did it by adding the streams sample by sample. In Windows Media Foundation I can access the samples using IMFSourceReader and IMFSample, which give me a chunk of data for some duration. For example, the first call to sourceReader->ReadSample() gives me the first t duration of data; the next call gives me the next t duration of data. Each t duration of data lies in a buffer of size L.
When I tried to access the audio samples with Media Foundation this way, I expected the duration t and buffer length L to be the same for every call to ReadSample(). But I get a different t and L each time I call ReadSample() on the audio files. For example, sometimes I get buffer length 16384 with duration 928798 (in 100-nanosecond units), and sometimes buffer length 8192 with duration 464399.
This is a huge problem for me, as I cannot add two streams with different buffer lengths and durations. Is it possible to get a fixed buffer size and a fixed duration for the IMFSamples? If not, how can I mix two audio streams in Media Foundation?
I first transcoded the audio files to 44100 Hz WMA files. Then I am reading the audio with this code:
// Configure the source reader's output type: 16-bit stereo PCM at 44100 Hz.
CHECK_HR(MFCreateMediaType(&spMFTypeIn));
CHECK_HR(spMFTypeIn->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Audio));
CHECK_HR(spMFTypeIn->SetGUID(MF_MT_SUBTYPE, MFAudioFormat_PCM));
CHECK_HR(spMFTypeIn->SetUINT32(MF_MT_AUDIO_BITS_PER_SAMPLE, BITS_PER_SAMPLE));
CHECK_HR(spMFTypeIn->SetUINT32(MF_MT_AUDIO_SAMPLES_PER_SECOND, 44100));
CHECK_HR(spMFTypeIn->SetUINT32(MF_MT_AUDIO_NUM_CHANNELS, 2));
CHECK_HR(spMFTypeIn->SetUINT32(MF_MT_AUDIO_PREFER_WAVEFORMATEX, 1));
CHECK_HR(spMFTypeIn->SetUINT32(MF_MT_AUDIO_BLOCK_ALIGNMENT, 16 / 8 * 2));
CHECK_HR(spMFTypeIn->SetUINT32(MF_MT_AUDIO_AVG_BYTES_PER_SECOND, 16 / 8 * 2 * 44100));
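Since ReadSample() makes no guarantee about chunk size, the usual fix is not to force fixed-size IMFSamples but to queue the decoded PCM from each reader and mix whenever both queues hold enough samples. A sketch of that buffering logic in Python (the function name, chunk sizes, and plain-int samples are all hypothetical; in the real filter these would be int16 values copied out of each IMFMediaBuffer):

```python
def mix_streams(chunks_a, chunks_b, out_chunk=4096):
    """Queue samples from two streams of variable-size chunks and yield
    fixed-size mixed chunks (plus one shorter tail chunk at end of stream)."""
    buf_a, buf_b = [], []
    it_a, it_b = iter(chunks_a), iter(chunks_b)
    done_a = done_b = False
    while True:
        while len(buf_a) < out_chunk and not done_a:
            try:
                buf_a.extend(next(it_a))
            except StopIteration:
                done_a = True
        while len(buf_b) < out_chunk and not done_b:
            try:
                buf_b.extend(next(it_b))
            except StopIteration:
                done_b = True
        n = min(len(buf_a), len(buf_b), out_chunk)
        if n == 0:
            break
        # Sample-by-sample addition, clipped to the int16 range.
        yield [max(-32768, min(32767, x + y)) for x, y in zip(buf_a[:n], buf_b[:n])]
        del buf_a[:n]
        del buf_b[:n]

# Chunks of different sizes still mix correctly:
print(list(mix_streams([[1, 2, 3], [4, 5]], [[10], [20, 30, 40, 50]], out_chunk=4)))
# [[11, 22, 33, 44], [55]]
```

The same idea carries over to C++: keep one std::deque of samples per reader, append whatever length each ReadSample() delivers, and drain both queues in lockstep.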

Related

How to find sampleCount knowing the audio file length and sampleRate?

I have been looking for a long time for how to find the sampleCount, but there is no answer. Is there an algorithm or formula for this calculation? The duration is known to be 850 ms, the file size is 37 KB, it's a wav file, and the sampleRate is 48000. As a check, the result should be a sampleCount equal to 40681, as I have in the file. I need this so that I can calculate the sampleCount for other audio files. I am waiting for your help.
I found it and I get 40800: I multiplied the sample rate by the duration in seconds.
Yes, the sample count is equal to the sample rate, multiplied by the duration.
So for an audio file that is exactly 850 milliseconds, at a 48 kHz sample rate:
0.850 s * 48000 Hz = 40800 samples
Now, with MP3s you have to be careful. There is some padding at the beginning of the file to cleanly initialize the decoder, and the amount of padding can vary based on the encoder and its configuration. (You can read all about the troubles this has caused on the Wikipedia page for "gapless playback".) Additionally, your MP3 duration will be determined at MP3 frame boundaries, not at arbitrary PCM boundaries... assuming your decoder/player does not support gapless playback.
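The formula can be checked in a couple of lines (sample_count is just an illustrative helper name):

```python
def sample_count(duration_ms, sample_rate_hz):
    # sampleCount = sampleRate * duration_in_seconds
    return round(sample_rate_hz * duration_ms / 1000)

print(sample_count(850, 48000))  # 40800
```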

Data density of audio steganography

How many bytes can be stored per minute of audio using any method of steganography, disregarding detectability and any other factor, e.g. whether the original audio begins to sound different?

How to split a long audio file (e.g. 1 hour) into multiple short (5 s) audio files using Python

I have some long audio files. I want to split each audio file into multiple short audio files using Python. For example, the audio is more than 1 hour long, and I want to split it into multiple 5 s files. I want to extract features from the whole audio file in 5 s pieces.
There are two issues in your question.
Splitting the audio
Extracting features.
and both of them hinge on the same underlying key information: the sampling frequency.
The duration of an audio signal, in seconds, and the sampling frequency used for the audio file determine the number of samples that an audio file has. An audio sample is (in simplified terms) one value of the audio signal in your hard disk or computer memory.
The number of audio samples, for a typical wav file, is calculated with the formula sr * dur, where sr is the sampling frequency in Hz (e.g. 44100 for a CD-quality signal) and dur is the duration of the audio file in seconds. For example, a CD audio file of 2 seconds has 44100 * 2 = 88200 samples.
So:
To split an audio file in Python, you first have to read it in a variable. There are plenty libraries and functions out there, for example (in a random order):
scipy.io.wavfile.read
wave module
and others. You can check this SO post for more info on reading a wav file.
Then, you just have to get N samples, e.g. my_audio_1 = whole_audio_file[0:5*sr].
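Putting the two steps together, a minimal sketch using the standard-library wave module; the synthetic 3-second in-memory file and 1-second chunks stand in for your 1-hour file and 5-second pieces (for those, use chunk_len = 5 * sr):

```python
import io
import struct
import wave

# Build a 3-second mono 16-bit test file in memory (stands in for your long file).
sr = 8000
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(sr)
    w.writeframes(struct.pack("<%dh" % (3 * sr), *range(3 * sr)))

# Read it back and slice it into fixed-length pieces.
buf.seek(0)
with wave.open(buf, "rb") as r:
    sr = r.getframerate()
    frames = r.readframes(r.getnframes())
samples = struct.unpack("<%dh" % (len(frames) // 2), frames)

chunk_len = 1 * sr  # use 5 * sr for 5-second pieces
chunks = [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]
print(len(chunks), len(chunks[0]))  # 3 8000
```

Each element of chunks can then be written back out with wave.open in "wb" mode, or fed straight into a feature extractor.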
BUT!!!
If you just want to extract features for every X seconds, then there is no need to split the audio manually. Most audio feature extraction libraries do that for you.
For example, in librosa you can control the number of FFT points, which roughly corresponds to the length of audio that you want to extract features from. You can check, for example, here: https://librosa.org/doc/latest/feature.html

What do the bytes in a .wav file represent?

When I store the data in a .wav file into a byte array, what do these values mean?
I've read that they are in two-byte representations, but what exactly is contained in these two-byte values?
You will have heard that audio signals are represented by some kind of wave. If you have ever seen one of those wave diagrams with a line going up and down -- that's basically what's inside those files. Take a look at this picture from http://en.wikipedia.org/wiki/Sampling_rate
You see your audio wave (the gray line). The current value of that wave is measured repeatedly and stored as a number; those are the numbers in those bytes. There are two things that can be adjusted here: the number of measurements you take per second (that's the sampling rate, given in Hz), and how precisely you measure. In the 2-byte case, you take two bytes for one measurement (that allows values from -32768 to 32767). With those numbers you can recreate the original wave (up to a limited quality, of course, but that's always the case when storing things digitally), and recreating the original wave is what your speaker tries to do on playback.
There are some more things you need to know. First, since each sample is two bytes, you need to know the byte order (big-endian or little-endian) to reconstruct the numbers correctly. Second, you need to know how many channels you have and how they are stored. Typically you would have mono (one channel) or stereo (two), but more are possible. If you have more than one channel, you need to know how they are stored. Often they are interleaved: you get one value for each channel for every point in time, and after that all values for the next point in time.
To illustrate: if you have 8 bytes of data for two channels and 16-bit numbers:
abcdefgh
Here a and b make up the first 16-bit number, the first value for channel 1; c and d are the first number for channel 2. e and f are the second value for channel 1, g and h the second value for channel 2. You wouldn't hear much there, because that doesn't come close to a second of data...
If you take all that information together, you can calculate the bit rate: how many bits of information are generated by the recorder per second. In our example, you generate 2 bytes per channel on every sample. With two channels, that is 4 bytes. You need about 44000 samples per second to represent the sounds a human being can normally hear. So you end up with 176000 bytes per second, which is 1408000 bits per second.
And of course, these are not 2-bit values but 2-byte values, or you would have really bad quality.
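The bit-rate arithmetic from that example, spelled out (note the answer uses a round 44000 samples/second; CD audio actually uses 44100):

```python
bytes_per_sample = 2   # 16-bit samples
channels = 2           # stereo
sample_rate = 44000    # the answer's round figure; CDs use 44100
byte_rate = bytes_per_sample * channels * sample_rate
bit_rate = byte_rate * 8
print(byte_rate, bit_rate)  # 176000 1408000
```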
The first 44 bytes are commonly a standard RIFF header, as described here:
http://tiny.systems/software/soundProgrammer/WavFormatDocs.pdf
and here: http://www.topherlee.com/software/pcm-tut-wavformat.html
Apple/OSX/macOS/iOS created .wav files might add an 'FLLR' padding chunk to the header and thus increase the size of the initial header RIFF from 44 bytes to 4k bytes (perhaps for better disk or storage block alignment of the raw sample data).
The rest is very often 16-bit linear PCM in signed 2's-complement little-endian format, representing arbitrarily scaled samples at a rate of 44100 Hz.
The WAVE (.wav) file contains a header, which indicates the formatting information of the audio file's data. Following the header is the actual raw audio data. You can check the exact meaning of each field below.
Positions Typical Value Description
1 - 4 "RIFF" Marks the file as a RIFF multimedia file.
Characters are each 1 byte long.
5 - 8 (integer) The overall file size in bytes (32-bit integer)
minus 8 bytes. Typically, you'd fill this in after
file creation is complete.
9 - 12 "WAVE" RIFF file format header. For our purposes, it
always equals "WAVE".
13-16 "fmt " Format sub-chunk marker. Includes trailing null.
17-20 16 Length of the rest of the format sub-chunk below.
21-22 1 Audio format code, a 2 byte (16 bit) integer.
1 = PCM (pulse code modulation).
23-24 2 Number of channels as a 2 byte (16 bit) integer.
1 = mono, 2 = stereo, etc.
25-28 44100 Sample rate as a 4 byte (32 bit) integer. Common
values are 44100 (CD), 48000 (DAT). Sample rate =
number of samples per second, or Hertz.
29-32 176400 (SampleRate * BitsPerSample * Channels) / 8
This is the Byte rate.
33-34 4 (BitsPerSample * Channels) / 8
1 = 8 bit mono, 2 = 8 bit stereo or 16 bit mono, 4
= 16 bit stereo.
35-36 16 Bits per sample.
37-40 "data" Data sub-chunk header. Marks the beginning of the
raw data section.
41-44 (integer) The number of bytes of the data section below this
point. Also equal to (#ofSamples * #ofChannels *
BitsPerSample) / 8
45+ The raw audio data.
I copied all of these from http://www.topherlee.com/software/pcm-tut-wavformat.html
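As a sketch, the table above can be read directly with Python's struct module. The header built here is a minimal hypothetical example; real files may contain extra chunks (such as the 'FLLR' padding mentioned earlier), so production code should walk the chunk list rather than assume a fixed 44-byte layout:

```python
import struct

def parse_wav_header(header):
    """Parse the canonical 44-byte RIFF/WAVE header laid out in the table above."""
    riff, overall_size, wave_id = struct.unpack_from("<4sI4s", header, 0)
    (fmt_id, fmt_len, audio_format, channels,
     sample_rate, byte_rate, block_align, bits) = struct.unpack_from("<4sIHHIIHH", header, 12)
    data_id, data_size = struct.unpack_from("<4sI", header, 36)
    assert riff == b"RIFF" and wave_id == b"WAVE" and fmt_id == b"fmt " and data_id == b"data"
    return {"format": audio_format, "channels": channels, "sample_rate": sample_rate,
            "byte_rate": byte_rate, "block_align": block_align,
            "bits_per_sample": bits, "data_size": data_size}

# A minimal hypothetical header: PCM, stereo, 44100 Hz, 16-bit, 8 data bytes.
header = struct.pack("<4sI4s4sIHHIIHH4sI",
                     b"RIFF", 36 + 8, b"WAVE",
                     b"fmt ", 16, 1, 2, 44100, 176400, 4, 16,
                     b"data", 8)
info = parse_wav_header(header)
print(info["sample_rate"], info["channels"], info["bits_per_sample"])  # 44100 2 16
```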
As others have pointed out, there's metadata in the wav file, but I think your question may be, specifically, what do the bytes (of data, not metadata) mean? If that's true, the bytes represent the value of the signal that was recorded.
What does that mean? Well, if you extract the two bytes (say) that represent each sample (assuming a mono recording, i.e. only one channel of sound was recorded), you've got a 16-bit value. In WAV, 16-bit samples are (almost always) signed and little-endian (AIFF, Mac OS's answer to WAV, is big-endian, by the way). So if you take the value of that 16-bit sample and divide it by 2^15 = 32768 (since it's signed data), you end up with a sample normalized to the range -1 to 1. Do this for all samples and plot them versus time (time is determined by the number of samples per second in the recording; e.g. 44.1 kHz means 44.1 samples per millisecond, so the first sample value is plotted at t=0, roughly the 44th at t=1 ms, etc.), and you've got a signal that approximately represents what was originally recorded.
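A small sketch of that normalization step (the four sample values are made up for illustration):

```python
import struct

raw = struct.pack("<4h", 0, 16384, -32768, 32767)    # four made-up samples
ints = struct.unpack("<%dh" % (len(raw) // 2), raw)  # little-endian signed 16-bit
normalized = [s / 32768.0 for s in ints]             # divide by 2^15 to get -1..1
print(normalized)  # [0.0, 0.5, -1.0, 0.999969482421875]
```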
I suppose your question is "What do the bytes in the data block of a .wav file represent?" Let us go through everything systematically.
Prelude:
Let us say we play a 5KHz sine wave using some device and record it in a file called 'sine.wav', and recording is done on a single channel (mono). Now you already know what the header in that file represents.
Let us go through some important definitions:
Sample: A sample of a signal is the amplitude of that signal at the point where the sample is taken.
Sampling rate: Many such samples can be taken within a given interval of time. Suppose we take 10 samples of our sine wave within 1 second. Each sample is spaced 0.1 second apart. So we have 10 samples per second, and thus the sampling rate is 10 Hz. Bytes 25 to 28 in the header denote the sampling rate.
Now coming to the answer of your question:
It is not practically possible to write the whole sine wave to the file, because a sine wave has infinitely many points. Instead, we fix a sampling rate and sample the wave at those intervals, recording the amplitudes. (The sampling rate is chosen such that the signal can be reconstructed with minimal distortion from the samples we are going to take. The distortion in the reconstructed signal caused by an insufficient number of samples is called 'aliasing'.)
To avoid aliasing, the sampling rate is chosen to be more than twice the frequency of our sine wave (5 kHz). (This is the 'sampling theorem', and twice the signal frequency is called the 'Nyquist rate'.) Thus we decide to go with a sampling rate of 12 kHz, which means we will sample our sine wave 12000 times in one second.
Once we start recording, if we record our 5 kHz sine wave for 5 seconds, we will have 12000 * 5 = 60000 samples (values). We take these 60000 values and put them in an array. Then we create the proper header to reflect our metadata, and convert these sample values into their binary representation. These values are then written into the data bytes of our .wav file.
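The sampling described above can be sketched in a few lines (the 16-bit scaling matches the header fields discussed earlier):

```python
import math

f, sr, seconds = 5000, 12000, 5  # sine frequency, sampling rate (> 2 * 5 kHz), duration
n = sr * seconds                 # 60000 samples in total
samples = [math.sin(2 * math.pi * f * i / sr) for i in range(n)]
pcm = [round(s * 32767) for s in samples]  # scale to signed 16-bit as stored in the data chunk
print(len(pcm))  # 60000
```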
Two-bit audio wouldn't sound very good :) Most commonly, the bytes represent sample values as 16-bit signed numbers, representing the audio waveform sampled at a frequency such as 44.1 kHz.

setting timestamps for audio samples in directshow graph

I am developing a DirectShow audio decoder filter to decode AC3 audio.
The filter is used in a live graph, decoding a TS multicast.
The demuxer (MainConcept) provides me with the demuxed audio data, but does not provide timestamps for the samples.
How can I get/compute the correct timestamps for the audio?
I found this forum post:
http://www.ureader.com/msg/14712447.aspx
In it, a member gives the following formula for calculating the timestamps for audio, given its format (sample rate, number of channels, bits per sample):
With PCM audio, duration_in_secs = 8 * buffer_size / wBitsPerSample /
nChannels / nSamplesPerSec or duration_in_secs = buffer_size /
nAvgBytesPerSec (since, for PCM audio, nAvgBytesPerSec =
wBitsPerSample * nChannels * nSamplesPerSec / 8).
The only thing you need to add is a tracking variable that tells you what sample number in the stream you are at, so you can use it to offset the start and end times by the duration (duration_in_secs) when doing linear streaming. For seek operations you would, of course, need to know or calculate the sample number into the stream.
Don't forget that the units for timestamps in DirectShow are typed as REFERENCE_TIME, a long integer or Int64. Each unit is equal to 100 nanoseconds. That is why you see in video filters the value 10,000,000 being divided by the relevant number of frames per second (FPS) to calculate timestamps for each frame because 10,000,000 equals 1 second in a REFERENCE_TIME variable.
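As a sketch, the quoted formula converted to REFERENCE_TIME units reproduces the durations from the Media Foundation question at the top of this page (16384 bytes of 16-bit stereo 44.1 kHz PCM comes out at 928798 units of 100 ns):

```python
def pcm_duration_100ns(buffer_size, bits_per_sample, channels, samples_per_sec):
    # duration_in_secs = 8 * buffer_size / wBitsPerSample / nChannels / nSamplesPerSec
    secs = 8 * buffer_size / bits_per_sample / channels / samples_per_sec
    return round(secs * 10_000_000)  # REFERENCE_TIME is in 100-ns units

print(pcm_duration_100ns(16384, 16, 2, 44100))  # 928798
print(pcm_duration_100ns(8192, 16, 2, 44100))   # 464399
```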
Each AC-3 frame embeds data for 6 * 256 samples. The sampling rate can be 32 kHz, 44.1 kHz or 48 kHz (as defined by the AC-3 specification, Digital Audio Compression Standard (AC-3, E-AC-3)). The frames themselves do not carry timestamps, so you need to assume a continuous stream and increment the timestamps accordingly. Since, as you mentioned, the source is live, you might need to re-adjust timestamps on data starvation.
Each AC-3 frame has a fixed length (which you can determine from the bitstream header), so you might also check whether the demultiplexer is giving you a single AC-3 frame or several in a batch.
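A sketch of that bookkeeping (frame_timestamps is a hypothetical helper; the start/stop times are computed per index so integer-division rounding error does not accumulate):

```python
SAMPLES_PER_AC3_FRAME = 6 * 256  # 1536 samples per AC-3 frame

def frame_timestamps(num_frames, sample_rate):
    """Yield (start, stop) REFERENCE_TIME stamps for consecutive AC-3 frames."""
    for i in range(num_frames):
        start = i * SAMPLES_PER_AC3_FRAME * 10_000_000 // sample_rate
        stop = (i + 1) * SAMPLES_PER_AC3_FRAME * 10_000_000 // sample_rate
        yield start, stop

print(list(frame_timestamps(3, 48000)))
# [(0, 320000), (320000, 640000), (640000, 960000)]
```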