I'm trying to get a qualitative handle on the amount of static or noise present in an audio stream. The normal content of the stream is voice or music.
I've been experimenting with taking the stddev of the samples, and that does give me some handle on the presence of voice vs. empty-channel noise (i.e. a high stddev usually indicates voice or music).
I was wondering if anyone else had some pointers on this.
Doesn't the peak value give you the answer? If you're looking at a signal from a good ADC, the ambient level should be in the 1's or 10's of counts, while voice or music will get up into the thousands of counts. Is there some kind of automatic gain control that makes this strategy not work?
If you need something more complex, the peak to RMS ratio might be a bit more reliable than simply the RMS level (RMS = stddev for a zero-mean signal). Pure noise will have a ratio of around 3-5, while sinusoids, for instance, have a peak to RMS ratio of about 1.4 (the square root of 2). However, you can get more discrimination by looking at the spectrum of the signal. Static is usually spectrally smooth or even flat, while voice and music are spectrally structured. So a Fourier transform might be what you're looking for. Assuming a signal x that contains, say, 0.5 seconds worth of data, here's some Matlab code:
Sx = fft(x .* hann(length(x), 'periodic'))
The HANN function applies a Hann window to reduce spectral leakage, while the FFT function quickly calculates the Fourier transform. Now you have a couple of choices. If you want to determine whether the signal x consists of static or voice/music, take the peak to RMS ratio of the spectrum:
pk2rms = max(abs(Sx))/sqrt(sum(abs(Sx).^2)/length(Sx))
I'd expect pure static to have a peak to RMS ratio around 3-5 (again), while voice/music would be at least an order of magnitude higher. This takes advantage of the fact that pure white noise has the same "structure" in time and frequency domains.
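For anyone not using Matlab, here is a rough Python sketch of the same spectral peak to RMS test. It uses only the standard library and a naive DFT, so it is only practical for short buffers. The function names and the two test signals are my own assumptions, not part of the Matlab answer above.

```python
import cmath
import math
import random

def hann_periodic(n):
    # Periodic Hann window, the counterpart of hann(n, 'periodic') in Matlab
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft(x):
    # Naive O(n^2) DFT, for illustration only; use a real FFT library in practice
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def spectral_peak_to_rms(x):
    # Window, transform, then divide the largest bin magnitude by the RMS of all bins
    window = hann_periodic(len(x))
    mags = [abs(c) for c in dft([a * w for a, w in zip(x, window)])]
    rms = math.sqrt(sum(m * m for m in mags) / len(mags))
    return max(mags) / rms

n = 256
# Hypothetical test inputs: a pure tone (20 cycles per buffer) and Gaussian noise
tone = [math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

tone_ratio = spectral_peak_to_rms(tone)
noise_ratio = spectral_peak_to_rms(noise)
print(tone_ratio, noise_ratio)
```

The tone should come out with a clearly higher ratio than the noise. With a longer buffer the gap widens, because the tone's energy stays in a few bins while the noise spreads over all of them.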
If you want to get a numerical estimate of the noise level, you can calculate the power in Sx over time, using an average:
Gxx = ((k-1)*Gxx + Sx.*conj(Sx))/k
Over time, the peaks in Gxx should come and go, but you should see a constant minimum value corresponding to the noise floor. In general, audio spectra are easier to look at on a dB (log vertical) scale.
Some notes:
1. I picked 0.5 seconds for the length of x, but I'm not sure what an optimal value here is. If you pick a value that's too short, x will not have much structure. In that case, the DC component of the signal will have a lot of energy. I expect you can still use the peak to RMS discriminator, though, if you first toss out the bin in Sx corresponding to DC.
2. I'm not sure what a good value for k is, but that equation corresponds to exponential averaging. You can probably experiment with k to figure out an optimal value. This might work best with a short x.
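As a rough illustration of the averaging step, here is a minimal Python sketch of the same update rule applied per frequency bin. The three-bin "spectra" are made-up numbers chosen so the effect is easy to see, and `update_average` is a name I picked for the example.

```python
def update_average(gxx, power, k):
    # One step of Gxx = ((k-1)*Gxx + |Sx|^2) / k, applied to every bin
    return [((k - 1) * g + p) / k for g, p in zip(gxx, power)]

# Hypothetical per-frame power spectra with 3 bins each: bin 1 holds a tone
# that comes and goes, bins 0 and 2 only ever see a steady noise floor of 1.0
frames = [
    [1.0, 100.0, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, 100.0, 1.0],
    [1.0, 1.0, 1.0],
]

gxx = frames[0]
for power in frames[1:]:
    gxx = update_average(gxx, power, k=4)

# The smallest averaged bin is a crude estimate of the noise floor
noise_floor = min(gxx)
print(gxx, noise_floor)
```

A larger k smooths more but reacts more slowly, which is the trade-off to experiment with.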
There are different kinds of noise: white, pink, brown. Noise can come from many places. Is a 60 Hz hum noise or signal?
For white noise, I'd look at the FFT and find the lowest value to see what your noise floor is.
I want to build a precise guitar tuner. This is usually done by computing an FFT and finding the peak, but that is of limited use for several reasons:
Discrete precision gives insufficient resolution for tuning a bass guitar.
High computation time and complexity when trying to increase the buffer size (and/or sampling rate), which introduces visible delay (lag).
Most of the frequency range where the FFT's precision is concentrated goes unused. Everything above 1-2 kHz is not applicable to tuning musical instruments.
There should be a simpler way for signals that have a single-frequency sinusoidal shape. Given a small enough buffer (say 256 samples at a 96 kHz sampling rate), how could you measure the base (lowest) frequency?
In simple words: how do I find the frequency F such that the difference between a sine signal of frequency F and the actually recorded signal gives a smaller error than for any other frequency? (Then we can definitely conclude that a sinusoid of frequency F is the best approximation of the recorded sound buffer.)
PS. Anything but FFT!
Here is a simple approach based on zero crossings. It relies on being able to map the instrument signal to a simple sinusoid. This may work OK when the signal to noise ratio is high, but it is not a very robust method.
Bandpass filter around the fundamental frequency of the tone you want to tune to, for example 82.41 Hz for the low E string on a guitar.
Consider a window of the last N samples. Set it to e.g. 100 ms to update the pitch estimate 10 times per second.
Perform zero-crossing detection with a threshold value T. T could be set to 10% of the signal peak, for example. Measure the periods between successive zero crossings and collect them in an array.
Take the median of the periods; its reciprocal is your pitch estimate.
You can also compute the quantiles of the periods to estimate how reliable the method is. If they give very different numbers from the median, then the method is not working well.
The approach can be extended by computing autocorrelation on the zero-crossings, as described in
https://www.cycfi.com/2018/03/fast-and-efficient-pitch-detection-bitstream-autocorrelation/
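The basic zero-crossing steps above (without the autocorrelation extension) could look roughly like this in Python. This is only a sketch: it assumes the signal has already been bandpass filtered, it treats the threshold T as a hysteresis band around zero, and the function name and test tone are made up for the example.

```python
import math
import statistics

def zero_crossing_pitch(samples, sample_rate, threshold):
    """Estimate pitch from the median spacing of upward threshold crossings."""
    crossings = []
    armed = False  # becomes True once the signal has dipped below -threshold
    for i, s in enumerate(samples):
        if s < -threshold:
            armed = True
        elif s > threshold and armed:
            crossings.append(i)  # one upward crossing per cycle
            armed = False
    periods = [b - a for a, b in zip(crossings, crossings[1:])]
    if not periods:
        return None  # not enough cycles in the window
    return sample_rate / statistics.median(periods)

sample_rate = 8000
# Hypothetical input: 100 ms of a clean 100 Hz tone with peak amplitude 1.0
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(800)]
estimate = zero_crossing_pitch(tone, sample_rate, threshold=0.1)  # T = 10% of peak
print(estimate)
```

The resolution is limited to whole samples per period, so for a real tuner you would probably interpolate the crossing positions or use a higher sample rate.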
I have been experimenting with a simple FFT using p5 sound and then plotting the bands of the spectrum visually.
One thing I noticed is that the lower frequencies appear very high in almost all tracks, while the high frequencies seem to be muted.
So, for instance, when doing an FFT with only 16 bands, most of the sound happens in the first 4 bands, and the other (higher) frequencies are reported as "muted" or just too quiet.
You can see this in this example, for instance: http://p5js.org/reference/#/p5.FFT where even with relatively high frequencies the right side of the spectrum stays totally down. The lower frequencies are reported to be the highest even though what you hear is more of a middle or higher pitched kind of sound.
It seems that some sort of transformation has to be applied to the FFT result in order to get a visual representation that better matches what we are hearing?
Am I missing something? I mean, I'm surely missing some basic information about how the FFT works and how the frequencies are reported, but is this a common problem that has a common solution?
The human auditory system is fundamentally logarithmic base-2 in nature: each subsequent octave has twice the bandwidth of the previous one. As a consequence of this, the vast majority of the frequency content of human perceivable sound is below 1 kHz, and signal power is spread more thinly between FFT bins at higher frequencies, which is precisely what your graph shows.
Spectrum plots of the kind I suspect you're expecting to see here are drawn with log(F) on the x-axis and signal power in dB on the y-axis. Your code draws a graph with both axes linear.
In addition, because you are not specifically applying a window function to the samples used to calculate the FFT, what you get by default is the rectangular window, which is very far from a good choice in this application.
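To make the log-frequency and dB idea concrete, here is a small Python sketch that groups linear FFT bins into octave-wide bands and converts each band's power to dB. The function name, its parameters and the flat test spectrum are assumptions for illustration; this is not p5 code.

```python
import math

def log_bands_db(magnitudes, sample_rate, fft_size, f_min, bands_per_octave=1):
    """Group linear FFT bins into bands of equal width in log2(frequency).

    magnitudes holds bins 0 .. fft_size/2 - 1. Returns (low_hz, high_hz, power_db).
    """
    ratio = 2.0 ** (1.0 / bands_per_octave)
    nyquist = sample_rate / 2.0
    bands = []
    lo = f_min
    while lo < nyquist:
        hi = min(lo * ratio, nyquist)
        first = math.ceil(lo * fft_size / sample_rate)  # first bin inside the band
        last = math.ceil(hi * fft_size / sample_rate)   # one past the last bin
        power = sum(m * m for m in magnitudes[first:last])
        bands.append((lo, hi, 10.0 * math.log10(power + 1e-12)))  # offset avoids log10(0)
        lo *= ratio
    return bands

# Hypothetical flat spectrum: 1024-point FFT at 1024 Hz, so bin k sits at k Hz
flat = [1.0] * 512
bands = log_bands_db(flat, sample_rate=1024, fft_size=1024, f_min=1.0)
for lo, hi, db in bands:
    print(f"{lo:6.0f}-{hi:6.0f} Hz  {db:6.2f} dB")
```

Each octave covers twice as many linear bins as the one below it, which is why a linear 16-band display crams most musical content into the first few bars.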
I'm fairly new to onset detection. I read some papers about it and know that when working only in the time domain there can be a large number of false positives and false negatives, and that it is generally advisable to work either in the frequency domain or in both the time and frequency domains.
I am a bit confused about this, because I have trouble seeing how the spectral energy or the results from the FFT bins can be used to determine note onsets. Aren't note onsets represented by sharp peaks in amplitude?
Can someone enlighten me on this? Thank you!
This is the easiest way to think about note onset:
Think of a music signal as a flat, constant signal. When an onset occurs, you look at it as a large, rapid CHANGE in the signal (a positive or negative peak).
What this means in the frequency domain:
The spectrum of a steady signal is, well, CONSTANT from one frame to the next.
When the onset event occurs, there is a rapid increase in spectral content.
While you may think "Well, you're actually talking about the peak of the onset, right?", not at all. We are not actually interested in the peak of the onset, but rather in the rising edge of the signal. When there is a sharp increase in the signal, the high frequency content increases.
One way to do this is using the spectral difference function:
Take your time domain signal and cut it up into overlapping strips (typically 50% overlap).
Apply a Hamming/Hann window (this is to reduce spectral smudging; remember that cutting up the signal into windows is like multiplying it by a pulse, which in the frequency domain is like convolving the signal with a sinc function).
Apply the FFT algorithm to two successive windows.
For each DFT bin, calculate the difference between the magnitudes of the Xn and Xn-1 bins; if it is negative, set it to zero.
Square the results and sum all the bins together.
Repeat until the end of the signal.
Look for peaks in the resulting curve using median thresholding, and there are your onset times!
Source:
https://adamhess.github.io/Onset_Detection_Nov302011.pdf
and
http://www.elec.qmul.ac.uk/people/juan/Documents/Bello-TSAP-2005.pdf
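A rough Python sketch of the spectral difference steps above, using a naive DFT so it needs no libraries. The frame size, the test signal and the function names are assumptions for the example, and it simply takes the largest value of the curve instead of doing proper median thresholding.

```python
import cmath
import math

def magnitude_spectrum(frame):
    # Hann-windowed naive DFT magnitudes for bins 0..N/2 (use a real FFT in practice)
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_difference(signal, frame_size=64):
    hop = frame_size // 2  # 50% overlap
    previous = None
    curve = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        current = magnitude_spectrum(signal[start:start + frame_size])
        if previous is not None:
            # Keep only bins whose magnitude grew, then square and sum
            curve.append(sum(max(c - p, 0.0) ** 2
                             for c, p in zip(current, previous)))
        previous = current
    return curve

# Hypothetical test signal: 256 samples of silence followed by a steady tone
signal = [0.0] * 256 + [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
curve = spectral_difference(signal)
onset_frame = curve.index(max(curve)) + 1  # curve[0] compares frame 1 with frame 0
print(onset_frame, onset_frame * 32)       # frame index and its first sample
```

The frame that is flagged is the first one containing part of the tone, so the time resolution of the detection is about one hop.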
You can look at sharp differences in amplitude at a specific frequency as suspected sound onsets. For instance if a flute switches from playing a G5 to playing a C, there will be a sharp drop in amplitude of the spectrum at around 784 Hz.
If you don't know what frequency to examine, the magnitude of an FFT vector will give you the amplitude of every frequency over some window in time (with a resolution dependent on the length of the time window). Pick your frequency, or a bunch of frequencies, and diff two FFTs of two different time windows. That might give you something that can be used as part of a likelihood estimate for a sound onset or change somewhere between the two time windows. Sliding the windows or successive approximation of their location in time might help narrow down the time of a suspected note onset or other significant change in the sound.
"Because, aren't note onsets represented by sharp peaks in amplitude?"
A: Not always. On percussive instruments (including piano) this is true, but for violin, flute, etc. notes often "slide" into each other as frequency changes without sharp amplitude increases.
If you stick to a single instrument like the piano onset detection is do-able. Generalized onset detection is a much more difficult problem. There are about a dozen primitive features that have been used for onset detection. Once you code them, you still have to decide how best to use them.
I have a program that plots the spectrum analysis (Amp/Freq) of a signal, which is pretty much the DFT converted to polar. However, this is not exactly the sort of graph that, say, Winamp (right at the top-left corner) or effectively any other audio software plots. I am not really sure what this sort of graph is called (if it has a distinct name at all), so I am not sure what to look for.
I am pretty positive about the frequency axis being base-two exponential; the amplitude axis puzzles me, though.
Any pointers?
Actually an interesting question. I know what you are saying; the frequency axis is certainly logarithmic. But what about the amplitude? In response to another poster, the amplitude can't simply be in units of dB alone, because dB has no concept of zero. This introduces the idea of quantization error, SNR, and dynamic range.
Assume that the received digitized (i.e., discrete time and discrete amplitude) time-domain signal, x[n], is equal to s[n] + e[n], where s[n] is the transmitted discrete-time signal (i.e., continuous amplitude) and e[n] is the quantization error. Suppose x[n] is represented with b bits, and for simplicity, takes values in [0,1). Then the maximum peak-to-peak amplitude of e[n] is one quantization level, i.e., 2^{-b}.
The dynamic range is then defined to be, in decibels, 20 log10 [(max peak-to-peak |s[n]|)/(max peak-to-peak |e[n]|)] = 20 log10 [1/(2^{-b})] = 20b log10 2 = 6.02b dB. For 16-bit audio, the dynamic range is about 96 dB. For 8-bit audio, the dynamic range is about 48 dB.
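As a quick check of that arithmetic, here is a tiny Python sketch of the 6.02b rule. The function name is mine; the formula is the one derived above.

```python
import math

def dynamic_range_db(bits):
    # 20 * log10(1 / 2^-b) = 20 * b * log10(2), i.e. about 6.02 dB per bit
    return 20.0 * bits * math.log10(2.0)

for bits in (8, 16, 24):
    print(f"{bits}-bit: {dynamic_range_db(bits):.1f} dB")
```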
So how might Winamp plot amplitude? My guesses:
The minimum amplitude is assumed to be -6.02b dB, and the maximum amplitude is 0 dB. Visually, Winamp draws the window with these thresholds in mind.
Another nonlinear map, such as log(1+X), is used. This function is always nonnegative, and when X is large, it approximates log(X).
Any other experts out there who know? Let me know what you think. I'm interested, too, exactly how this is implemented.
To generate a power spectrum you need to do the following steps:
apply window function to time domain data (e.g. Hanning window)
compute FFT
calculate log of FFT bin magnitudes for N/2 points of FFT (typically 10 * log10(re * re + im * im))
This gives log magnitude (i.e. dB) versus linear frequency.
If you also want a log frequency scale then you will need to accumulate the magnitude from appropriate ranges of bins (and you will need a fairly large FFT to start with).
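The three steps above might look like this in Python. This is a sketch with a naive DFT standing in for a real FFT routine, and the function name and test tone are assumptions for the example.

```python
import cmath
import math

def power_spectrum_db(samples):
    n = len(samples)
    # Step 1: apply a Hann window to the time domain data
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(samples)]
    spectrum_db = []
    for k in range(n // 2):
        # Step 2: compute DFT bin k (naive sum; a real FFT does this far faster)
        c = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        # Step 3: 10 * log10(re*re + im*im); the tiny offset avoids log10(0)
        spectrum_db.append(10.0 * math.log10(c.real ** 2 + c.imag ** 2 + 1e-20))
    return spectrum_db

n = 128
# Hypothetical input: a sine wave with exactly 10 cycles in the buffer
samples = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
db = power_spectrum_db(samples)
peak_bin = db.index(max(db))
print(peak_bin, db[peak_bin])
```

The result is dB versus linear frequency; bin k corresponds to k times the sample rate divided by N.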
Well, I'm not 100% sure what you mean, but surely it's just bucketing the data from an FFT?
If you want to get the data such that you have (for a 44 kHz file) frequency points at 22 kHz, 11 kHz, 5.5 kHz etc., then you could use a wavelet decomposition, I guess ...
This thread may help ya a bit ...
Converting an FFT to a spectogram
Same sort of information as a spectrogram I'd guess ...
What you need is power spectrum graph. You have to compute DFT of your signal's current window. Then square each value.
I have a sample held in a buffer from DirectX. It's a sample of a note played and captured from an instrument. How do I analyse the frequency of the sample (like a guitar tuner does)? I believe FFTs are involved, but I have no pointers to HOWTOs.
The FFT can help you figure out where the frequency is, but it can't tell you exactly what the frequency is. Each point in the FFT is a "bin" of frequencies, so if there's a peak in your FFT, all you know is that the frequency you want is somewhere within that bin, or range of frequencies.
If you want it really accurate, you need a long FFT with a high resolution and lots of bins (= lots of memory and lots of computation). You can also guess the true peak from a low-resolution FFT using quadratic interpolation on the log-scaled spectrum, which works surprisingly well.
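Quadratic interpolation here means fitting a parabola through the peak bin and its two neighbours on the log-magnitude spectrum and taking the vertex. A minimal Python sketch follows; the function names are mine, and the test data is a made-up parabola rather than a real spectrum.

```python
def parabolic_peak_offset(left, center, right):
    """Fractional bin offset of the vertex of the parabola through three
    log-magnitude values centered on a peak bin."""
    denom = left - 2.0 * center + right
    if denom == 0.0:
        return 0.0  # flat top, no refinement possible
    return 0.5 * (left - right) / denom

def interpolated_frequency(log_mags, peak_bin, sample_rate, fft_size):
    offset = parabolic_peak_offset(log_mags[peak_bin - 1],
                                   log_mags[peak_bin],
                                   log_mags[peak_bin + 1])
    return (peak_bin + offset) * sample_rate / fft_size

# Hypothetical log spectrum shaped like a parabola peaking at "bin 10.3"
log_mags = [-(k - 10.3) ** 2 for k in range(20)]
peak_bin = log_mags.index(max(log_mags))
freq = interpolated_frequency(log_mags, peak_bin, sample_rate=1000, fft_size=100)
print(peak_bin, freq)
```

On real data the peak is not an exact parabola, so the estimate is approximate, but it is usually much better than the raw bin spacing.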
If computational cost is most important, you can try to get the signal into a form in which you can count zero crossings, and then the more you count, the more accurate your measurement.
None of these will work if the fundamental is missing, though. :)
I've outlined a few different algorithms here, and the interpolated FFT is usually the most accurate (though this only works when the fundamental is the strongest harmonic - otherwise you need to be smarter about finding it), with zero-crossings a close second (though this only works for waveforms with one crossing per cycle). Neither of these conditions is typical.
Keep in mind that the partials above the fundamental frequency are not perfect harmonics in many instruments, like piano or guitar. Each partial is actually a little bit out of tune, or inharmonic. So the higher-frequency peaks in the FFT will not be exactly on the integer multiples of the fundamental, and the wave shape will change slightly from one cycle to the next, which throws off autocorrelation.
To get a really accurate frequency reading, I'd say to use the autocorrelation to guess the fundamental, then find the true peak using quadratic interpolation. (You can do the autocorrelation in the frequency domain to save CPU cycles.) There are a lot of gotchas, and the right method to use really depends on your application.
There are also other algorithms that are time-based, not frequency based.
Autocorrelation is a relatively simple algorithm for pitch detection.
Reference: http://cnx.org/content/m11714/latest/
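For a feel of how little code basic autocorrelation needs, here is a brute-force Python sketch. It is not taken from the reference above; the function name, search range and test tone are assumptions for illustration.

```python
import math

def autocorrelation_pitch(samples, sample_rate, f_min, f_max):
    """Return the frequency whose lag maximizes the autocorrelation,
    searching only lags that correspond to f_min..f_max."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    usable = len(samples) - max_lag  # same number of products for every lag
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(samples[i] * samples[i + lag] for i in range(usable))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

sample_rate = 8000
# Hypothetical input: a 100 Hz tone with a strong second harmonic
x = [math.sin(2 * math.pi * 100 * n / sample_rate)
     + 0.8 * math.sin(2 * math.pi * 200 * n / sample_rate)
     for n in range(2000)]
pitch = autocorrelation_pitch(x, sample_rate, f_min=60, f_max=400)
print(pitch)
```

Limiting the lag range matters: without an upper bound on the lag, multiples of the true period score just as well and you can land an octave low.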
I have written c# implementations of autocorrelation and other algorithms that are readable. Check out http://code.google.com/p/yaalp/.
http://code.google.com/p/yaalp/source/browse/#svn/trunk/csaudio/WaveAudio/WaveAudio
Lists the files, and PitchDetection.cs is the one you want.
(The project is GPL; so understand the terms if you use the code).
Guitar tuners don't use FFT's or DFT's. Usually they just count zero crossings. You might not get the fundamental frequency because some waveforms have more zero crossings than others but you can usually get a multiple of the fundamental frequency that way. That's enough to get the note although you might be one or more octaves off.
Low pass filtering before counting zero crossings can usually get rid of the excess zero crossings. Tuning the low pass filter requires some knowledge of the range of frequencies you want to detect, though.
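Here is a small Python sketch of that effect, using a plain moving average as the low pass filter. The test signal, the filter length and the function names are assumptions picked so the result is easy to check; a real tuner would use a properly designed filter.

```python
import math

def count_zero_crossings(samples):
    # Count sign changes between neighbouring samples
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

def moving_average(samples, length):
    # Crude low pass filter with nulls at multiples of sample_rate / length
    return [sum(samples[i:i + length]) / length
            for i in range(len(samples) - length + 1)]

sample_rate = 8000
# Hypothetical input: one second of a 100 Hz fundamental under a louder 400 Hz partial
x = [math.sin(2 * math.pi * 100 * n / sample_rate)
     + 1.5 * math.sin(2 * math.pi * 400 * n / sample_rate)
     for n in range(sample_rate)]

raw = count_zero_crossings(x)
filtered = moving_average(x, 20)  # 8000 / 20 = 400 Hz null, chosen for this example
smooth = count_zero_crossings(filtered)
# Two crossings per cycle, so frequency = crossings / (2 * duration in seconds)
estimate = smooth * sample_rate / (2.0 * len(filtered))
print(raw, smooth, estimate)
```

Without the filter the crossing count follows the loud partial rather than the fundamental.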
FFTs (Fast Fourier Transforms) would indeed be involved. An FFT lets you approximate a sampled signal as a sum of simple sine waves of fixed frequencies and varying amplitudes. What you'll essentially be doing is taking a sample and decomposing it into frequency/amplitude pairs, and then taking the frequency that corresponds to the highest amplitude.
Hopefully another SO reader can fill the gaps I'm leaving between the theory and the code!
A little more specifically:
If you start with the raw PCM in an input array, what you basically have is a graph of wave amplitude vs time. Doing an FFT will transform that into a frequency histogram for frequencies from 0 to 1/2 the input sampling rate. The value of each entry in the result array will be the 'strength' of the corresponding sub-frequency.
So to find the root frequency given an input array of size N sampled at S samples/second:
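FFT(N, input, output);          /* output[i] = magnitude of frequency bin i */
max = 0; max_i = 0;
for (i = 0; i < N/2; i++)       /* only bins 0..N/2-1 cover 0 to S/2 */
    if (output[i] > max) { max = output[i]; max_i = i; }
root = (double)S * max_i / N;   /* bins are spaced S/N Hz apart */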
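For a version you can actually run, here is a rough Python equivalent of the loop above. It computes the bin magnitudes with a naive DFT instead of a real FFT routine and skips the DC bin; both of those choices, and the function name, are my assumptions.

```python
import cmath
import math

def dominant_frequency(samples, sample_rate):
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip DC, stop below the Nyquist bin
        mag = abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                      for t in range(n)))
        if mag > best_mag:
            best_bin, best_mag = k, mag
    return best_bin * sample_rate / n  # bins are spaced sample_rate / n apart

sample_rate = 8000
# Hypothetical input: 64 samples of a 1000 Hz sine
samples = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
root = dominant_frequency(samples, sample_rate)
print(root)
```

Note that the answer can only ever be a multiple of sample_rate / n, which is the resolution limit discussed in the other answers.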
Retrieval of fundamental frequencies in a PCM audio signal is a difficult task, and there would be a lot to say about it...
Anyway, time-based methods are usually not suitable for polyphonic signals, because a complex wave given by the sum of different harmonic components due to multiple fundamental frequencies has a zero-crossing rate which depends only on the lowest frequency component...
In the frequency domain, too, the FFT is not the most suitable method, since the frequency spacing between notes follows an exponential scale, not a linear one. This means that the constant frequency resolution of the FFT may be insufficient to resolve lower frequency notes if the analysis window in the time domain is not large enough.
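To put numbers on that, here is a short Python sketch that estimates how long the analysis window has to be before the FFT bin spacing is as fine as one semitone at a given note. The function name and the equal-temperament assumption (a semitone is a factor of 2^(1/12)) are mine.

```python
import math

def min_fft_size(note_hz, sample_rate):
    """Smallest window length whose bin spacing (sample_rate / N) is no wider
    than the gap between note_hz and the next semitone up."""
    semitone_gap = note_hz * (2.0 ** (1.0 / 12.0) - 1.0)
    return math.ceil(sample_rate / semitone_gap)

low_e = min_fft_size(82.41, 44100)     # low E string on a guitar
high_e = min_fft_size(1318.51, 44100)  # the E four octaves higher
print(low_e, high_e)
```

The low note needs a window roughly sixteen times longer than the high one, and that is only to tell neighbouring semitones apart; tuning to within a few cents needs much finer resolution still.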
A more suitable method would be a constant-Q transform, which is a DFT applied after a process of low-pass filtering and decimation by 2 (i.e. halving the sampling frequency at each step) of the signal, in order to obtain different subbands with different frequency resolutions. In this way the calculation of the DFT is optimized. The trouble is that the time resolution is also variable, and gets coarser for the lower subbands...
Finally, if we are trying to estimate the fundamental frequency of a single note, FFT/DFT methods are OK. Things change in a polyphonic context, in which partials of different sounds overlap and sum or cancel their amplitudes depending on their phase difference, so a single spectral peak could belong to different harmonic contents (belonging to different notes). Correlation in this case doesn't give good results...
Apply a DFT and then derive the fundamental frequency from the results. Googling around for DFT information will give you the information you need -- I'd link you to some, but they differ greatly in expectations of math knowledge.
Good luck.