FFMPEG Understanding AVFrame::linesize (Audio) - audio

As per the doucmentation of AVFrame, for audio, lineSize is size in bytes of each plane and only linesize[0] may be set. But however, am unsure whether lineszie[0] is holding per plane buffer size or is it the complete buffer size and we have to divide it by no of channels to get per plane buffer size.
For Example, when I call
int data_size = av_samples_get_buffer_size(NULL, iDesiredNoOfChannels, iAudioSamples, (AVSampleFormat)iDesiredFormat, 0) ; For iDesiredNoOfChannels = 2, iAudioSamples = 1024 & iDesiredFormat = AV_SAMPLE_FMT_FLTP data_size=8192. Pretty straightforward, as each sample is 4 bytes and since there are 2 channels total memory will be (1024 * 4 * 2) bytes. As such lineSize[0] should be 4096 for planar audio. data[0] & data[1] should be each of size 4096. However, pFrame->lineSize[0] is giving 8192. So to get the size per plane, I have to do pFrame->lineSize[0] / pFrame->channels. Isn't this behaviour different from what the documentation suggests or is my understanding of the documentaion wrong.

Old question but thought I'd answer it anyway for people who may be wondering the same thing.
In all audio AVFrames, only linesize[0] may be set and they are all required to be the same size. You should not be using linesize[1], etc. I don't know why they chose to do things this way because it's not consistent with video frames, but whatever. Just remember whether interleaved or planar only linesize[0] matters so you have to divide by channel count for planar.

Related

Difference between sampling rate, bit rate and bit depth

This is kind of a basic question which might sound too obvious to many of you , but I am getting confused so bad.
Here is what a Quora user says. Now It is clear to me what a Sampling rate is - The number of samples you take of a sound signal (in one second) is it's sampling rate.
Now my doubt here is - This rate should have nothing to do with the quantisation, right?
About bit-depth, Is the quantisation dependant on bit-depth? As in 32-bit (2^32 levels) and 64-bit (2^64 levels). Or is it something else?
and the bit-rate, is number of bits transferred in one second? If I an audio file says 320 kbps what does that really mean?
I assume the readers have got some sense on how I am panicking on where does the bit rate, and bit depth have significance?
EDIT: Also find this question if you have worked with linux OS and gstreamer framework.
Now my doubt here is - This rate should have nothing to do with the
quantisation, right?
Wrong. Sampling is a process that results in quantisation. Sampling, as the name implies, means taking samples (amplitudes) of a (usually) continuous signal (e.g audio) at regular time intervals and converting them to a different represantation thereof. In digital signal processing, this represantation is discrete (not continuous). An example of this process is a wave file (e.g recording your own voice and saving it as a wav).
About bit-depth, Is the quantisation dependant on bit-depth? As in
32-bit (2^32 levels) and 64-bit (2^64 levels). Or is it something
else?
Yes. The CD format, for example, has a bit depth of 16 (16 bits per sample). Bit depth is a part of the format of a sound (wave) file (along with the number of channels and sampling rate).
Since sound (think of a pure sine tone) has both positive and negative parts, I'd argue that you can represent (2^16 / 2) amplitude levels using 16 bits.
and the bit-rate, is number of bits transferred in one second? If I an
audio file says 320 kbps what does that really mean?
Yes. Bit rates are usually meaningful in the context of network transfers. 320 kbps == 320 000 bits per second. (for kilobit you multiply by 1000, rather than 1024)
Let's take a worked example 'Red-book' CD audio
The Bit depth is 16-bit. This is the number of bits used to represent each sample. This is intimately coupled with quantisation.
The Smaple-rate is 44.1kHz
The Frame-rate is 44.1kHz (two audio channels make up a stereo pair)
The Bit-rate is therefore 16 * 44100 * 2 = 1411200 bits/sec
There are a few twists with compressed audio streams such such as MP3 or AAC. In these, there is a non-linear relationship between bit-rate, sample-rate and bit-depth. The bit-rate is generally the maximum rate per-second and the efficiency of the codec is content dependant.

What do the bytes in a .wav file represent?

When I store the data in a .wav file into a byte array, what do these values mean?
I've read that they are in two-byte representations, but what exactly is contained in these two-byte values?
You will have heard, that audio signals are represented by some kind of wave. If you have ever seen this wave diagrams with a line going up and down -- that's basically what's inside those files. Take a look at this file picture from http://en.wikipedia.org/wiki/Sampling_rate
You see your audio wave (the gray line). The current value of that wave is repeatedly measured and given as a number. That's the numbers in those bytes. There are two different things that can be adjusted with this: The number of measurements you take per second (that's the sampling rate, given in Hz -- that's how many per second you grab). The other adjustment is how exact you measure. In the 2-byte case, you take two bytes for one measurement (that's values from -32768 to 32767 normally). So with those numbers given there, you can recreate the original wave (up to a limited quality, of course, but that's always so when storing stuff digitally). And recreating the original wave is what your speaker is trying to do on playback.
There are some more things you need to know. First, since it's two bytes, you need to know the byte order (big endian, little endian) to recreate the numbers correctly. Second, you need to know how many channels you have, and how they are stored. Typically you would have mono (one channel) or stereo (two), but more is possible. If you have more than one channel, you need to know, how they are stored. Often you would have them interleaved, that means you get one value for each channel for every point in time, and after that all values for the next point in time.
To illustrate: If you have data of 8 bytes for two channels and 16-bit number:
abcdefgh
Here a and b would make up the first 16bit number that's the first value for channel 1, c and d would be the first number for channel 2. e and f are the second value of channel 1, g and h the second value for channel 2. You wouldn't hear much there because that would not come close to a second of data...
If you take together all that information you have, you can calculate the bit rate you have, that's how many bits of information is generated by the recorder per second. In our example, you generate 2 bytes per channel on every sample. With two channels, that would be 4 bytes. You need about 44000 samples per second to represent the sounds a human beeing can normally hear. So you'll end up with 176000 bytes per second, which is 1408000 bits per second.
And of course, it is not 2-bit values, but two 2 byte values there, or you would have a really bad quality.
The first 44 bytes are commonly a standard RIFF header, as described here:
http://tiny.systems/software/soundProgrammer/WavFormatDocs.pdf
and here: http://www.topherlee.com/software/pcm-tut-wavformat.html
Apple/OSX/macOS/iOS created .wav files might add an 'FLLR' padding chunk to the header and thus increase the size of the initial header RIFF from 44 bytes to 4k bytes (perhaps for better disk or storage block alignment of the raw sample data).
The rest is very often 16-bit linear PCM in signed 2's-complement little-endian format, representing arbitrarily scaled samples at a rate of 44100 Hz.
The WAVE (.wav) file contain a header, which indicates the formatting information of the audio file's data. Following the header is the actual audio raw data. You can check their exact meaning below.
Positions Typical Value Description
1 - 4 "RIFF" Marks the file as a RIFF multimedia file.
Characters are each 1 byte long.
5 - 8 (integer) The overall file size in bytes (32-bit integer)
minus 8 bytes. Typically, you'd fill this in after
file creation is complete.
9 - 12 "WAVE" RIFF file format header. For our purposes, it
always equals "WAVE".
13-16 "fmt " Format sub-chunk marker. Includes trailing null.
17-20 16 Length of the rest of the format sub-chunk below.
21-22 1 Audio format code, a 2 byte (16 bit) integer.
1 = PCM (pulse code modulation).
23-24 2 Number of channels as a 2 byte (16 bit) integer.
1 = mono, 2 = stereo, etc.
25-28 44100 Sample rate as a 4 byte (32 bit) integer. Common
values are 44100 (CD), 48000 (DAT). Sample rate =
number of samples per second, or Hertz.
29-32 176400 (SampleRate * BitsPerSample * Channels) / 8
This is the Byte rate.
33-34 4 (BitsPerSample * Channels) / 8
1 = 8 bit mono, 2 = 8 bit stereo or 16 bit mono, 4
= 16 bit stereo.
35-36 16 Bits per sample.
37-40 "data" Data sub-chunk header. Marks the beginning of the
raw data section.
41-44 (integer) The number of bytes of the data section below this
point. Also equal to (#ofSamples * #ofChannels *
BitsPerSample) / 8
45+ The raw audio data.
I copied all of these from http://www.topherlee.com/software/pcm-tut-wavformat.html here
As others have pointed out, there's metadata in the wav file, but I think your question may be, specifically, what do the bytes (of data, not metadata) mean? If that's true, the bytes represent the value of the signal that was recorded.
What does that mean? Well, if you extract the two bytes (say) that represent each sample (assume a mono recording, meaning only one channel of sound was recorded), then you've got a 16-bit value. In WAV, 16-bit is (always?) signed and little-endian (AIFF, Mac OS's answer to WAV, is big-endian, by the way). So if you take the value of that 16-bit sample and divide it by 2^16 (or 2^15, I guess, if it's signed data), you'll end up with a sample that is normalized to be within the range -1 to 1. Do this for all samples and plot them versus time (and time is determined by how many samples/second is in the recording; e.g. 44.1KHz means 44.1 samples/millisecond, so the first sample value will be plotted at t=0, the 44th at t=1ms, etc) and you've got a signal that roughly represents what was originally recorded.
I suppose your question is "What do the bytes in data block of .wav file represent?" Let us know everything systematically.
Prelude:
Let us say we play a 5KHz sine wave using some device and record it in a file called 'sine.wav', and recording is done on a single channel (mono). Now you already know what the header in that file represents.
Let us go through some important definitions:
Sample: A sample of any signal means the amplitude of that signal at the point where sample is taken.
Sampling rate: Many such samples can be taken within a given interval of time. Suppose we take 10 samples of our sine wave within 1 second. Each sample is spaced by 0.1 second. So we have 10 samples per second, thus the sampling rate is 10Hz. Bytes 25th to 28th in the header denote sampling rate.
Now coming to the answer of your question:
It is not possible practically to write the whole sine wave to the file because there are infinite points on a sine wave. Instead, we fix a sampling rate and start sampling the wave at those intervals and record the amplitudes. (The sampling rate is chosen such that the signal can be reconstructed with minimal distortion, using the samples we are going to take. The distortion in the reconstructed signal because of the insufficient number of samples is called 'aliasing'.)
To avoid aliasing, the sampling rate is chosen to be more than twice the frequency of our sine wave (5kHz)(This is called 'sampling theorem' and the rate twice the frequency is called 'nyquist rate'). Thus we decide to go with sampling rate of 12kHz which means we will sample our sine wave, 12000 times in one second.
Once we start recording, if we record the signal, which is sine wave of 5kHz frequency, we will have 12000*5 samples(values). We take these 60000 values and put it in an array. Then we create the proper header to reflect our metadata and then we convert these samples, which we have noted in decimal, to their hexadecimal equivalents. These values are then written in the data bytes of our .wav files.
Plot plotted on : http://fooplot.com
Two bit audio wouldn't sound very good :) Most commonly, they represent sample values as 16-bit signed numbers that represent the audio waveform sampled at a frequency such as 44.1kHz.

Transforming Audio Samples From Time Domain to Frequency Domain

as a software engineer I am facing with some difficulties while working on a signal processing problem. I don't have much experience in this area.
What I try to do is to sample the environmental sound with 44100 sampling rate and for fixed size windows to test if a specific frequency (20KHz) exists and is higher than a threshold value.
Here is what I do according to the perfect answer in How to extract frequency information from samples from PortAudio using FFTW in C
102400 samples (2320 ms) is gathered from audio port with 44100 sampling rate. Sample values are between 0.0 and 1.0
int samplingRate = 44100;
int numberOfSamples = 102400;
float samples[numberOfSamples] = ListenMic_Function(numberOfSamples,samplingRate);
Window size or FFT Size is 1024 samples (23.2 ms)
int N = 1024;
Number of windows is 100
int noOfWindows = numberOfSamples / N;
Splitting samples to noOfWindows (100) windows each having size of N (1024) samples
float windowSamplesIn[noOfWindows][N];
for i:= 0 to noOfWindows -1
windowSamplesIn[i] = subarray(samples,i*N,(i+1)*N);
endfor
Applying Hanning window function on each window
float windowSamplesOut[noOfWindows][N];
for i:= 0 to noOfWindows -1
windowSamplesOut[i] = HanningWindow_Function(windowSamplesIn[i]);
endfor
Applying FFT on each window (real to complex conversion done inside the FFT function)
float frequencyData[noOfWindows][samplingRate/2];
for i:= 0 to noOfWindows -1
frequencyData[i] = RealToComplex_FFT_Function(windowSamplesOut[i], samplingRate);
endfor
In the last step, I use the FFT function implemented in this link: http://www.codeproject.com/Articles/9388/How-to-implement-the-FFT-algorithm ; because I cannot implement an FFT function from the scratch.
What I can't be sure is while giving N (1024) samples to FFT function as input, samplingRate/2 (22050) decibel values is returned as output. Is it what an FFT function does?
I understand that because of Nyquist Frequency, I can detect half of sampling rate frequency at most. But is it possible to get decibel values for each frequency up to samplingRate/2 (22050) Hz?
Thanks,
Vahit
See see How do I obtain the frequencies of each value in an FFT?
From a 1024 sample input, you can get back 512 meaningful frequency-levels.
So, yes, within your window, you'll get back a level for the Nyquist frequency.
The lowest frequency level you'll see is for DC (0 Hz), and the next one up will be for SampleRate/1024, or around 44 Hz, the next for 2 * SampleRate/1024, and so on, up to 512 * SampleRate / 1024 Hz.
Since only one band is used in your FFT, I would expect your results to be tarnished by side-band effects, even with proper windowing. It might work, but you might also get false positives with some input frequencies. Also, your signal is close to your niquist, so you are assuming a fairly good signal path up to your FFT. I don't think this is the right approach.
I think a better approach to this kind of signal detection would be with a high order filter (depending on your requirements, I would guess fourth or fifth order, which isn't actually that high). If you don't know how to design a high order filter, you could use two or three second order filters in series. Designing a second order filter, sometimes called a "biquad" is described here:
http://www.musicdsp.org/files/Audio-EQ-Cookbook.txt
albeit very tersely and with some assumptions of prior knowledge. I would use a high-pass (HP) filter with corner frequency as low as you can make it, probably between 18 and 20 kHz. Keep in mind there is some attenuation at the corner frequency, so after applying a filter multiple times you will drop a little signal.
After you filter the audio, take the RMS or average amplitude (that is, the average of the absolute value), to find the average level over a time period.
This technique has several advantages over what you are doing now, including better latency (you can start detecting within a few samples), better reliability (you won't get false-positives in response to loud signals at spurious frequencies), and so on.
This post might be of relevance: http://blog.bjornroche.com/2012/08/why-eq-is-done-in-time-domain.html

Audio samples per second?

I am wondering on the relationship between a block of samples and its time equivalent. Given my rough idea so far:
Number of samples played per second = total filesize / duration.
So say, I have a 1.02MB file and a duration of 12 sec (avg), I will have about 89,300 samples played per second. Is this right?
Is there other ways on how to compute this? For example, how can I know how much a byte[1024] array is equivalent to in time?
Generally speaking for PCM samples you can divide the total length (in bytes) by the duration (in seconds) to get the number of bytes per second (for WAV files there will be some inaccuracy to account for the header). How these translate into samples depends on
the sample rate
bits used per sample, i.e. commonly
used is 16 bits = 2 bytes
number of channels, i.e. for stereo
this is 2
If you know 2) and 3) you can determine 1)
In your example 89300 bytes/second, assuming stereo and 16 bits per sample would be 89300 / 4 ~= 22Khz sample rate
In addition to #BrokenGlass's very good answer, I'll just add that for uncompressed audio with a fixed sample rate, number of channels and bits per sample, the arithmetic is fairly straightforward. E.g. for "CD quality" audio we have a 44.1 kHz sample rate, 16 bits per sample, 2 channels (stereo), therefore the data rate is:
44100 * 16 * 2
= 1,411,200 bits / sec
= 176,400 bytes / sec
= 10 MB / minute (approx)

How to estimate size of a Jpeg file before saving it with a certain quality factor?

I'v got a bitmap 24bits, I am writing application in c++, MFC,
I am using libjpeg for encoding the bitmap into jpeg file 24bits.
When this bitmap's width is M, and height is N.
How to estimate jpeg file size before saving it with certain quality factor N (0-100).
Is it possible to do this?
For example.
I want to implement a slide bar, which represent save a current bitmap with certain quality factor N.
A label is beside it. shows the approximate file size when decode the bitmap with this quality factor.
When user move the slide bar. He can have a approximate preview of the filesize of the tobe saved jpeg file.
In libjpeg, you can write a custom destination manager that doesn't actually call fwrite, but just counts the number of bytes written.
Start with the stdio destination manager in jdatadst.c, and have a look at the documentation in libjpeg.doc.
Your init_destination and term_destination methods will be very minimal (just alloc/dealloc), and your empty_output_buffer method will do the actual counting. Once you have completed the JPEG writing, you'll have to read the count value out of your custom structure. Make sure you do this before term_destination is called.
It also depends on the compression you are using and to be more specific how many bits per color pixel are you using.
The quality factor wont help you here as a quality factor of 100 can range (in most cases) from 6 bits per color pixel to ~10 bits per color pixel, maybe even more (Not sure).
so once you know that its really straight forward from there..
If you know the Sub Sampling Factor this can be estimated. That information comes from the start of frame marker.
In the same marker right before the width and height so is the bit depth.
If you let
int subSampleFactorH = 2, subSampleFactorV = 1;
Then
int totalImageBytes = (Image.Width / subSampleFactorH) * (Image.Height / subSampleFactorV);
Then you can also optionally add more bytes to account for container data also.
int totalBytes = totalImageBytes + someConstantOverhead;

Resources