Deciding on length of FFT - audio

I am working on a tool to compare two wave files for similarity in their waveforms. Ex, I have a wave file of duration 1min and i make another wave file using the first one but have made each 5sec data at an interval of 5sec to 0.
Now my software will tell that there is waveform difference at time interval 5sec to 10sec, 15sec to 20sec, 25sec to 30 sec and so on...
As of now, with initial development, this is working fine.
Following are 3 test sets:
I have two wave files with sampling rate of 960Hz, mono, with no of data samples as 138551 (arnd 1min 12sec files). I am using 128 point FFT (splitting file in 128 samples chunk) and results are good.
When I use the same algorithm on wave files of sampling rate 48KHz, 2-channel with no of data samples 6927361 for each channel (arnd 2min 24 sec file), the process becomes too slow. When I use 4096 point FFT, the process is better.
But, the 4096 point FFT on files of 22050Hz, 2-channels with number of data samples 55776 for each channel (arnd 0.6sec file) gives very poor results. In this case 128 point FFT gives good result.
So, I am confused on how to decide the length of FFT so that my results are good in each case.
I guess the length should depend on number of samples and sampling rate.
Please give your inputs on this.
Thanks

The length of the FFT, N, will determine the resolution in the frequency domain:
resolution (Hz) = sample_rate (Hz) / N
So for example in case (1) you have resolution = 960 / 128 = 7.5 Hz. SO each bin in the resulting FFT (or presumably the power spectrum derived from this) will be 7.5 Hz wide, and you will be able to differentiate between frequency components which are at least this far apart.
Since you don't say what kind of waveforms these are, or what the purpose of your application is, it's hard to know what kind of resolution you need.
One important further point - many people using FFT for the first time are unaware that in general you need to apply a window function prior to the FFT to avoid spectral leakage.

I have to say I have found your question very cryptic. I think you should look into Short-time Fourier transform. The reason I say this is because you are looking at quite a large amount of samples if you use a sampling frequency of 44.1KhZ over 2mins with 2 channels. One fft across the entire amount will take quite a while indeed, not to mention the estimate will be biased as the signals mean and variance will change drastically over the whole duration. To avoid this you want to frame the time-domain signal first, these frames can be as small as 20ms-40ms (commonly used for speech) and often overlapping (Welch method of Spectral Estimation). Then you apply a window function such as Hamming or Hanning window to reduce spectral leakage and calculate an N-Point fft for each frame. Where N is the next power of two above the number of samples in that frame.
For example:
Fs = 8Khz, single channel;
time = 120sec;
no_samples = time * Fs = 960000 ;
frame length T_length= 20ms;
frame length in samples N_length = 160;
frame overlap T_overlap= 10ms;
frame overlap in samples N_overlap= 80;
Num of frames N_frames = (no_samples - (N_length-N_overlap))/N_overlap = 11999;
FFT length = 256;
So you will be processing 11999 frames in total, but your FFT length will be small. You will only need an FFT length of 256 (next power of two above frame length 160). Most algorithms that implement the fft require the signal length and fft length to be the same. All you have to do is append zeros to your framed signal up until 256. So pad each frame with x amount of zeros, where x = FFT_length-N_length. My latest android app does this on recorded speech and uses the short-time FFT data to display the Spectrogram of speech and also performs various spectral modification and filtering, its called Speech Enhancement for Android

Related

How samples are aligned in the audio file?

I'm trying to better understand how samples are aligned in the audio file.
Let's say we have a 2s audio file with sampling rate = 3.
I think there are three possible ways to align those samples. Looking at the picture below, can you tell me which one is correct?
Also, is this a standard for all audio files or does different formats have different rules?
Cheers!
Sampling rate in audio typically tells you how many samples are in one second, a unit called Hertz. Strictly speaking, the correct answer would be (1), as you have 3 samples within one second. Assuming there's no latency, PCM and other formats dictate that audio starts at 0. Next "cycle" (next second) also starts at zero, same principle like with a clock.
To get total length of the audio (following question in the comment), you should simply take number of samples / rate. Example from a 30s WAV using soxi, one of canonical tools used in the community for sound manipulation:
Input File : 'book_00396_chp_0024_reader_11416_5_door_Freesound_validated_380721_0-door_Freesound_validated_381380_0-9IfN8dUgGaQ_snr10_fileid_1138.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:30.00 = 480000 samples ~ 2250 CDDA sectors
File Size : 960k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
480000 samples / (16000 samples / seconds) = 30 seconds exactly. Citing manual, duration is "Equivalent to number of samples divided by the sample-rate."

Difference between sampling rate, bit rate and bit depth

This is kind of a basic question which might sound too obvious to many of you , but I am getting confused so bad.
Here is what a Quora user says. Now It is clear to me what a Sampling rate is - The number of samples you take of a sound signal (in one second) is it's sampling rate.
Now my doubt here is - This rate should have nothing to do with the quantisation, right?
About bit-depth, Is the quantisation dependant on bit-depth? As in 32-bit (2^32 levels) and 64-bit (2^64 levels). Or is it something else?
and the bit-rate, is number of bits transferred in one second? If I an audio file says 320 kbps what does that really mean?
I assume the readers have got some sense on how I am panicking on where does the bit rate, and bit depth have significance?
EDIT: Also find this question if you have worked with linux OS and gstreamer framework.
Now my doubt here is - This rate should have nothing to do with the
quantisation, right?
Wrong. Sampling is a process that results in quantisation. Sampling, as the name implies, means taking samples (amplitudes) of a (usually) continuous signal (e.g audio) at regular time intervals and converting them to a different represantation thereof. In digital signal processing, this represantation is discrete (not continuous). An example of this process is a wave file (e.g recording your own voice and saving it as a wav).
About bit-depth, Is the quantisation dependant on bit-depth? As in
32-bit (2^32 levels) and 64-bit (2^64 levels). Or is it something
else?
Yes. The CD format, for example, has a bit depth of 16 (16 bits per sample). Bit depth is a part of the format of a sound (wave) file (along with the number of channels and sampling rate).
Since sound (think of a pure sine tone) has both positive and negative parts, I'd argue that you can represent (2^16 / 2) amplitude levels using 16 bits.
and the bit-rate, is number of bits transferred in one second? If I an
audio file says 320 kbps what does that really mean?
Yes. Bit rates are usually meaningful in the context of network transfers. 320 kbps == 320 000 bits per second. (for kilobit you multiply by 1000, rather than 1024)
Let's take a worked example 'Red-book' CD audio
The Bit depth is 16-bit. This is the number of bits used to represent each sample. This is intimately coupled with quantisation.
The Smaple-rate is 44.1kHz
The Frame-rate is 44.1kHz (two audio channels make up a stereo pair)
The Bit-rate is therefore 16 * 44100 * 2 = 1411200 bits/sec
There are a few twists with compressed audio streams such such as MP3 or AAC. In these, there is a non-linear relationship between bit-rate, sample-rate and bit-depth. The bit-rate is generally the maximum rate per-second and the efficiency of the codec is content dependant.

Creating audio level meter - signal normalization

I have program which tracks audio signal in real time. Every processed sample I am able to read value of it in range between <-1, 1>.
I would like to create(and later display) audio level meter. From what I understand - to do it I need to keep converting my audio signal in real time, on each channel to dB and then display dB values on each channel in some graphical form of bars.
I am a bit lost how to do it and it should be simple matter. Would just normalization from <-1, 1> to <0, 1> (like... [n-sample +1]/2) and then calculating 20*log10 from each upcoming sample make it?
You can't plot the signal directly, as it always varying positive and negative.
Therefore you need to average out the strength of the signal every so many samples.
Say you're sampling at 44.1kHz, perhaps you might choose 4410 samples so you're updating your display 10 times per second.
So you calculate the RMS of your 4410 samples - see http://en.wikipedia.org/wiki/Root_mean_square
The RMS value is always positive.
You can then convert this to Db:
dBV = 20 x log10(Vrms)
This assumes that your maximum signal -1 to +1 corresponds to -1 to +1 volt. You will need to do further adjustments if not.

Transforming Audio Samples From Time Domain to Frequency Domain

as a software engineer I am facing with some difficulties while working on a signal processing problem. I don't have much experience in this area.
What I try to do is to sample the environmental sound with 44100 sampling rate and for fixed size windows to test if a specific frequency (20KHz) exists and is higher than a threshold value.
Here is what I do according to the perfect answer in How to extract frequency information from samples from PortAudio using FFTW in C
102400 samples (2320 ms) is gathered from audio port with 44100 sampling rate. Sample values are between 0.0 and 1.0
int samplingRate = 44100;
int numberOfSamples = 102400;
float samples[numberOfSamples] = ListenMic_Function(numberOfSamples,samplingRate);
Window size or FFT Size is 1024 samples (23.2 ms)
int N = 1024;
Number of windows is 100
int noOfWindows = numberOfSamples / N;
Splitting samples to noOfWindows (100) windows each having size of N (1024) samples
float windowSamplesIn[noOfWindows][N];
for i:= 0 to noOfWindows -1
windowSamplesIn[i] = subarray(samples,i*N,(i+1)*N);
endfor
Applying Hanning window function on each window
float windowSamplesOut[noOfWindows][N];
for i:= 0 to noOfWindows -1
windowSamplesOut[i] = HanningWindow_Function(windowSamplesIn[i]);
endfor
Applying FFT on each window (real to complex conversion done inside the FFT function)
float frequencyData[noOfWindows][samplingRate/2];
for i:= 0 to noOfWindows -1
frequencyData[i] = RealToComplex_FFT_Function(windowSamplesOut[i], samplingRate);
endfor
In the last step, I use the FFT function implemented in this link: http://www.codeproject.com/Articles/9388/How-to-implement-the-FFT-algorithm ; because I cannot implement an FFT function from the scratch.
What I can't be sure is while giving N (1024) samples to FFT function as input, samplingRate/2 (22050) decibel values is returned as output. Is it what an FFT function does?
I understand that because of Nyquist Frequency, I can detect half of sampling rate frequency at most. But is it possible to get decibel values for each frequency up to samplingRate/2 (22050) Hz?
Thanks,
Vahit
See see How do I obtain the frequencies of each value in an FFT?
From a 1024 sample input, you can get back 512 meaningful frequency-levels.
So, yes, within your window, you'll get back a level for the Nyquist frequency.
The lowest frequency level you'll see is for DC (0 Hz), and the next one up will be for SampleRate/1024, or around 44 Hz, the next for 2 * SampleRate/1024, and so on, up to 512 * SampleRate / 1024 Hz.
Since only one band is used in your FFT, I would expect your results to be tarnished by side-band effects, even with proper windowing. It might work, but you might also get false positives with some input frequencies. Also, your signal is close to your niquist, so you are assuming a fairly good signal path up to your FFT. I don't think this is the right approach.
I think a better approach to this kind of signal detection would be with a high order filter (depending on your requirements, I would guess fourth or fifth order, which isn't actually that high). If you don't know how to design a high order filter, you could use two or three second order filters in series. Designing a second order filter, sometimes called a "biquad" is described here:
http://www.musicdsp.org/files/Audio-EQ-Cookbook.txt
albeit very tersely and with some assumptions of prior knowledge. I would use a high-pass (HP) filter with corner frequency as low as you can make it, probably between 18 and 20 kHz. Keep in mind there is some attenuation at the corner frequency, so after applying a filter multiple times you will drop a little signal.
After you filter the audio, take the RMS or average amplitude (that is, the average of the absolute value), to find the average level over a time period.
This technique has several advantages over what you are doing now, including better latency (you can start detecting within a few samples), better reliability (you won't get false-positives in response to loud signals at spurious frequencies), and so on.
This post might be of relevance: http://blog.bjornroche.com/2012/08/why-eq-is-done-in-time-domain.html

discrete fourier transform frequency bound?

for a 8KHz wav sound i took 20ms sample which has 160 samples of data, plotted the FFT spectrum in audacity.
It gave the magnitudes in 3000 and 4000 Hz as well, shouln't it be giving the magnitudes until
the 80Hz,because there is 160 samples of data?
For a sample rate of Fs = 8 khz the FFT will give meaningful results from DC to Nyquist (= Fs / 2), i.e. 0 to 4 kHz. The width of each FFT bin will be 1 / 20 ms = 50 Hz.
actually audacity shows the peaks as 4503Hz which means understands to 1Hz bins. by the way if I take 20ms and repeat it 50 times to make as 1s sample,is the fft going to be for 1Hz bins? and also audacity has the option for the window as far as I know If you use windowing then the components should be multiple times of 2,like 1,2,4,8,etc.. but it shows the exact frequencies,then why it uses the windowing?
The best sampling rate is 2*frequency.
in different frequencys you should to change the sampling rate.

Resources