discrete fourier transform frequency bound? - audio

for a 8KHz wav sound i took 20ms sample which has 160 samples of data, plotted the FFT spectrum in audacity.
It gave the magnitudes in 3000 and 4000 Hz as well, shouln't it be giving the magnitudes until
the 80Hz,because there is 160 samples of data?

For a sample rate of Fs = 8 khz the FFT will give meaningful results from DC to Nyquist (= Fs / 2), i.e. 0 to 4 kHz. The width of each FFT bin will be 1 / 20 ms = 50 Hz.

actually audacity shows the peaks as 4503Hz which means understands to 1Hz bins. by the way if I take 20ms and repeat it 50 times to make as 1s sample,is the fft going to be for 1Hz bins? and also audacity has the option for the window as far as I know If you use windowing then the components should be multiple times of 2,like 1,2,4,8,etc.. but it shows the exact frequencies,then why it uses the windowing?

The best sampling rate is 2*frequency.
in different frequencys you should to change the sampling rate.

Related

How samples are aligned in the audio file?

I'm trying to better understand how samples are aligned in the audio file.
Let's say we have a 2s audio file with sampling rate = 3.
I think there are three possible ways to align those samples. Looking at the picture below, can you tell me which one is correct?
Also, is this a standard for all audio files or does different formats have different rules?
Cheers!
Sampling rate in audio typically tells you how many samples are in one second, a unit called Hertz. Strictly speaking, the correct answer would be (1), as you have 3 samples within one second. Assuming there's no latency, PCM and other formats dictate that audio starts at 0. Next "cycle" (next second) also starts at zero, same principle like with a clock.
To get total length of the audio (following question in the comment), you should simply take number of samples / rate. Example from a 30s WAV using soxi, one of canonical tools used in the community for sound manipulation:
Input File : 'book_00396_chp_0024_reader_11416_5_door_Freesound_validated_380721_0-door_Freesound_validated_381380_0-9IfN8dUgGaQ_snr10_fileid_1138.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:30.00 = 480000 samples ~ 2250 CDDA sectors
File Size : 960k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
480000 samples / (16000 samples / seconds) = 30 seconds exactly. Citing manual, duration is "Equivalent to number of samples divided by the sample-rate."

What is the bit rate?

I am new to audio programming,
But I am wondering formula of bitRate,
According to wiki https://en.wikipedia.org/wiki/Bit_rate#Audio,
bit rate = sample rate X bit depth X channels
and
sample rate is the number of samples (or snapshots taken) per second obtained by a digital audio device.
bit depth is the number of bits of information in each sample.
So why bit rate = sample rate X bit depth X channels?
From my perspective, if bitDepth = 2 bit, sample rate = 3 HZ
then I can transfer 6 bit data in 1 second
For example:
Sample data = 00 //at 1/3 second.
Sample data = 01 //at 2/3 second.
Sample data = 10 //at 3/3 second.
So I transfer 000110 in 1 second, is that correct logic?
Bit-rate is the expected amount of bits per interval (eg: per second).
Sound cycles are measured in hertz, where 1 hertz == 1 second. So to get full sound data that represents that 1 second of audio, you calculate how many bits are needed to be sent (or for media players, they check the bit-rate in a file-format's settings so they can read & playback correctly).
Why is channels involved (isn't sample rate X bit-depth enough)?
In digital audio the samples are sent for each "ear" (L/R channel). There will always be double the amount of samples in a stereo sound versus if it was mono sound. Usually there is a "flag" to specify if sound is stereo or mono.
Logic Example: (without bit-depth, and assuming 1-bit per sample)...
There is speech "Hello" recorded at 200 samples/sec at bitrate of 100/sec. What happens?
If stereo flag, each ear gets 100 samples per sec (correct total of 200 played)
If mono, audio speech will sound slow by half (since only 100 samples played at expected bit-rate of 100, but remember, a full second was recorded at 200 sample/sec. You get half of "hello" in one second and the other at next second to (== slowed speech).
Taking the above example, you will find these audio gives slow/double speed adventures in your "new to audio programming" experience. The fix will be either setting channels amount or setting bit-rate correctly. Good luck.
The 'sample rate' is the rate at which each channel is sampled.
So 'sample rate X bit depth' will give you the bit rate for a single channel.
You then need to multiply that by the number of channels to get the total bit rate flowing through the system.
For example the CD standard has a sample rate of 44100 samples per second and a bit depth of 16 giving a bit rate of 705600 per channel and a total bit rate of 1411200 bits per seconds for stereo.

Relation between bandwidth and play time in a CD

I have recently read that uncompressed CD-quality audio has a bandwidth of 1.411 Mbps in case of stereo, does it mean a CD can be played to output audio at the rate of 1.411 Mbps, i mean does it play 1.411 Mbits of stereo audio every second..?
Two channels, each with 44,100 16-bit samples per second. That is 2 x 44100 x 16 = 1,411,200bps. That is 1.411Mbps. (176400 bytes per second)
Each second requires 1.411Mb. If you reduced the sample rate by half, you would double the number of seconds that can be recorded on a CD. Same if you dropped it to one channel, or 8-bit.
To imagine the impact of reducing the sample rate, lets suppose a technology that sampled every 1 second. This would be like pressing mute over and over, you would only catch parts.
Reducing the channel to one is easy to imagine, that's monaural.
Reducing to 8-bit is harder to describe. Imagine we reduced it to 1-bit. That would essentially mean the speaker has two states, fully centered and fully driven. That is not much variation. 16 bits gives 65536 positions.

How to quantize ppq time to position in audio buffer

i'm currently working on the sound sequencer. I decided to use PPQN as time grid. And now i have this problem:
Let's say that tempo is 120 BPM, PPQ is 96 (default values) and sample rate is 44100 Hz. That makes 1 pulse equal to 5.2 ms which is 229.32 Samples (5.2ms*44.1samples/ms). So the position in buffer will be [229.32]. And now i can truncate this 0.32 samples so that every third pulse will be one sample larger than other or interpolate it somehow.
Any ideas or maybe somebody knows the proper way?

Deciding on length of FFT

I am working on a tool to compare two wave files for similarity in their waveforms. Ex, I have a wave file of duration 1min and i make another wave file using the first one but have made each 5sec data at an interval of 5sec to 0.
Now my software will tell that there is waveform difference at time interval 5sec to 10sec, 15sec to 20sec, 25sec to 30 sec and so on...
As of now, with initial development, this is working fine.
Following are 3 test sets:
I have two wave files with sampling rate of 960Hz, mono, with no of data samples as 138551 (arnd 1min 12sec files). I am using 128 point FFT (splitting file in 128 samples chunk) and results are good.
When I use the same algorithm on wave files of sampling rate 48KHz, 2-channel with no of data samples 6927361 for each channel (arnd 2min 24 sec file), the process becomes too slow. When I use 4096 point FFT, the process is better.
But, the 4096 point FFT on files of 22050Hz, 2-channels with number of data samples 55776 for each channel (arnd 0.6sec file) gives very poor results. In this case 128 point FFT gives good result.
So, I am confused on how to decide the length of FFT so that my results are good in each case.
I guess the length should depend on number of samples and sampling rate.
Please give your inputs on this.
Thanks
The length of the FFT, N, will determine the resolution in the frequency domain:
resolution (Hz) = sample_rate (Hz) / N
So for example in case (1) you have resolution = 960 / 128 = 7.5 Hz. SO each bin in the resulting FFT (or presumably the power spectrum derived from this) will be 7.5 Hz wide, and you will be able to differentiate between frequency components which are at least this far apart.
Since you don't say what kind of waveforms these are, or what the purpose of your application is, it's hard to know what kind of resolution you need.
One important further point - many people using FFT for the first time are unaware that in general you need to apply a window function prior to the FFT to avoid spectral leakage.
I have to say I have found your question very cryptic. I think you should look into Short-time Fourier transform. The reason I say this is because you are looking at quite a large amount of samples if you use a sampling frequency of 44.1KhZ over 2mins with 2 channels. One fft across the entire amount will take quite a while indeed, not to mention the estimate will be biased as the signals mean and variance will change drastically over the whole duration. To avoid this you want to frame the time-domain signal first, these frames can be as small as 20ms-40ms (commonly used for speech) and often overlapping (Welch method of Spectral Estimation). Then you apply a window function such as Hamming or Hanning window to reduce spectral leakage and calculate an N-Point fft for each frame. Where N is the next power of two above the number of samples in that frame.
For example:
Fs = 8Khz, single channel;
time = 120sec;
no_samples = time * Fs = 960000 ;
frame length T_length= 20ms;
frame length in samples N_length = 160;
frame overlap T_overlap= 10ms;
frame overlap in samples N_overlap= 80;
Num of frames N_frames = (no_samples - (N_length-N_overlap))/N_overlap = 11999;
FFT length = 256;
So you will be processing 11999 frames in total, but your FFT length will be small. You will only need an FFT length of 256 (next power of two above frame length 160). Most algorithms that implement the fft require the signal length and fft length to be the same. All you have to do is append zeros to your framed signal up until 256. So pad each frame with x amount of zeros, where x = FFT_length-N_length. My latest android app does this on recorded speech and uses the short-time FFT data to display the Spectrogram of speech and also performs various spectral modification and filtering, its called Speech Enhancement for Android

Resources