Signal processing: FFT overlap processing resources - audio

Are there any good (if possible scientific) resources available (web or books) about overlap processing. I am not that interested in the effects of using overlap processing and windows when analyzing a signal, since the requirements are different. It is more about the following Real Time situation: (I am currently dealing with audio signals)
Dividing a signal into smaller parts.
Creating overlap windows.
FFTing the windowed chunks.
Do processing in the frequency domain.
IFFT the results.
put the chunks together to a continuous stream.
I am especially interested in the influence of the window used on the resulting error as well as the effect of the overlap length. However I couldn't find any good resources that deal with the subject in detail. Any suggestions?
Edit:
After some discussions if using a window function is appropriate, I found a decent handout explaining the overlap and add/save method. http://www.ece.tamu.edu/~deepa/ecen448/handouts/08c/10_Overlap_Save_Add_handouts.pdf
However, after doing some tests, I noticed that the windowed version would perform more accurate in most cases than the overlap & add/save method. Could anybody confirm this?
I don't want to jump to any conclusions regarding computation time though....
Edit2:
Here are some graphs from my tests:
I created a signal, which consists of three cosine waves
I used this filter function in the time domain for filtering. (It's symmetric, as it is applied to the whole output of the FFT, which also is symmetric for real input signals)
The output of the IFFT looks like this: It can be seen that low frequencies are attenuated more than frequency in the mid range.
For the overlap add/save and the windowed processing I divided the input signal into 8 chunks of 256 samples. After reassembling them they look like that. (sample 490 - 540)
It can be seen that the overlap add/save processes differ from the windowed version at the point where chunks are put together (sample 511). This is the error which leads to different results when comparing windowed process and overlap add/save. The windowed process is closer to the one processed in one big junk.
However, I have no idea why they are there or if they shouldn't be there at all.

This is fairly well-known area of signal processing, and generally speaking if you are doing processing along the lines of FFT -> spectral processing -> IFFT you need to use the "overlap and add" approach. Cross-correlation of two inputs is a classic example, done much more easily in the spectral domain than the time domain.
Here's a short paper I found right away via Google (I just searched for "fft overlap and add"): http://www.coe.montana.edu/ee/rmaher/ee477/ee477_fftlab_sp07.pdf
I would recommend you invest in a good Signal Processing book, such as the classic Rabiner & Gold "Theory and application of digital signal processing" (Prentice-Hall ISBN 0-13-914101-4). That should cover the concept of overlap-and-add processing.

When using an FFT for overlap-add or overlap-save fast convolution filtering, normally you don't want to use a windowing function. The circular windowing artifacts cancel out when combining successive FFT frames in canonical overlap add/save filtering.
ADDED:
If you do use a non-rectangular window, you might want to make sure that all the overlapped frames of windows sum to DC, otherwise your resulting filtered signal will have amplitude scalloping. Rectangular windows and raised-cosine (von Hann) windows will sum to DC if the overlap amount is an exact submultiple of the window width (except, of course, at the very start and end of the overlap sequence).

I have been playing with this attempting to answer the question for myself as to why one would use a window. My only references to a synthesis window are this:
https://ccrma.stanford.edu/~jos/sasp/Inverse_FFT_Synthesis.html
http://recherche.ircam.fr/anasyn/roebel/amt_audiosignale/VL2.pdf
http://www.dspdimension.com/tutorials/
Stephan Bernsee has some good overview information. His smbpitchshift code uses a synthesis window -- He uses the raised cosine on the input block, then applies it again on the output block, but this I believe is necessary because the pitch shifting algorithm is not a linear filtering operation, so it is certain there may be discontinuous artifacts on the window boundaries, thus a synthesis window is used to create a smooth transition between frames.
I think the reason there is not much information specifically addressing windowing for frequency domain real-time convolution is because it doesn't have a practical application unless you also need to do some analysis (ie, and adaptive filter of some sort), then the topics related to spectral spreading is again of interest.
I have plotted outputs from a filtered signal using both a raised cosine window as well as overlap-add method, and the end result is an identical IR, and identical signals. It comes as no surprise since the same operations performed in the time domain yield the same results.
On the other hand, if I implement a broken filter kernel, a smooth windowing function can help mask artifacts. This in a sense windows the broken IR so there is a more cohesive transition between frames. It would still be better to have an IR that is limited to length nfft/2 in the time domain. If you need to obtain a filter response with an IR longer than nfft/2, then you should consider either using a larger FFT size (if latency is not a problem) or use a partitioned convolution scheme:
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CB4QFjAA&url=http%3A%2F%2Fpcfarina.eng.unipr.it%2FPublic%2FPapers%2F164-Mohonk2001.PDF&ei=qtH0TorDEoKziQKAloHEDg&usg=AFQjCNGDmz79DiuG1kmPXifbWJ7M-gr9rQ&sig2=CMopEcGc1VArZ3gipWTr_w
or
http://www.music.miami.edu/programs/mue/Research/jvandekieft/jvchapter2.htm
I hope that is helpful to somebody reading this
I hope those links help, even though it doesn't directly address windowing as used in real-time Frequency domain filtering.

Related

Efficient generation of sampled waveforms without aliasing artifacts

For a project of mine I am working with sampled sound generation and I need to create various waveforms at various frequencies. When the waveform is sinusoidal, everything is fine, but when the waveform is rectangular, there is trouble: it sounds as if it came from the eighties, and as the frequency increases, the notes sound wrong. On the 8th octave, each note sounds like a random note from some lower octave.
The undesirable effect is the same regardless of whether I use either one of the following two approaches:
The purely mathematical way of generating a rectangular waveform as sample = sign( secondsPerHalfWave - (timeSeconds % secondsPerWave) ) where secondsPerWave = 1.0 / wavesPerSecond and secondsPerHalfWave = secondsPerWave / 2.0
My preferred way, which is to describe one period of the wave using line segments and to interpolate along these lines. So, a rectangular waveform is described (regardless of sampling rate and regardless of frequency) by a horizontal line from x=0 to x=0.5 at y=1.0, followed by another horizontal line from x=0.5 to x=1.0 at y=-1.0.
From what I gather, the literature considers these waveform generation approaches "naive", resulting in "aliasing", which is the cause of all the undesirable effects.
What this all practically translates to when I look at the generated waveform is that the samples-per-second value is not an exact multiple of the waves-per-second value, so each wave does not have an even number of samples, which in turn means that the number of samples at level 1.0 is often not equal to the number of samples at level -1.0.
I found a certain solution here: https://www.nayuki.io/page/band-limited-square-waves which even includes source code in Java, and it does indeed sound awesome: all undesirable effects are gone, and each note sounds pure and at the right frequency. However, this solution is entirely unsuitable for me, because it is extremely computationally expensive. (Even after I have replaced sin() and cos() with approximations that are ten times faster than Java's built-in functions.) Besides, when I look at the resulting waveforms they look awfully complex, so I wonder whether they can legitimately be called rectangular.
So, my question is:
What is the most computationally efficient method for the generation of periodic waveforms such as the rectangular waveform that does not suffer from aliasing artifacts?
Examples of what the solution could entail:
The computer audio problem of generating correct sample values at discrete time intervals to describe a sound wave seems to me somewhat related to the computer graphics problem of generating correct integer y coordinates at discrete integer x coordinates for drawing lines. The Bresenham line generation algorithm is extremely efficient, (even if we disregard for a moment the fact that it is working with integer math,) and it works by accumulating a certain error term which, at the right time, results in a bump in the Y coordinate. Could some similar mechanism perhaps be used for calculating sample values?
The way sampling works is understood to be as reading the value of the analog signal at a specific, infinitely narrow point in time. Perhaps a better approach would be to consider reading the area of the entire slice of the analog signal between the last sample and the current sample. This way, sampling a 1.0 right before the edge of the rectangular waveform would contribute a little to the sample value, while sampling a -1.0 considerable time after the edge would contribute a lot, thus naturally yielding a point which is between the two extreme values. Would this solve the problem? Does such an algorithm exist? Has anyone ever tried it?
Please note that I have posted this question here as opposed to dsp.stackexchange.com because I do not want to receive answers with preposterous jargon like band-limiting, harmonics and low-pass filters, lagrange interpolations, DC compensations, etc. and I do not want answers that come from the purely analog world or the purely theoretical outer space and have no chance of ever receiving a practical and efficient implementation using a digital computer.
I am a programmer, not a sound engineer, and in my little programmer's world, things are simple: I have an array of samples which must all be between -1.0 and 1.0, and will be played at a certain rate (44100 samples per second.) I have arithmetic operations and trigonometric functions at my disposal, I can describe lines and use simple linear interpolation, and I need to generate the samples extremely efficiently because the generation of a dozen waveforms simultaneously and also the mixing of them together may not consume more than 1% of the total CPU time.
I'm not sure but you may have a few of misconceptions about the nature of aliasing. I base this on your putting the term in quotes, and from the following quote:
What this all practically translates to when I look at the generated
waveform is that the samples-per-second value is not an exact multiple
of the waves-per-second value, so each wave does not have an even
number of samples, which in turn means that the number of samples at
level 1.0 is often not equal to the number of samples at level -1.0.
The samples/sec and waves/sec don't have to be exact multiples at all! One can play back all pitches below the Nyquist. So I'm not clear what your thinking on this is.
The characteristic sound of a square wave arises from the presence of odd harmonics, e.g., with a note of 440 (A5), the square wave sound could be generated by combining sines of 440, 1320, 2200, 3080, 3960, etc. progressing in increments of 880. This begs the question, how many odd harmonics? We could go to infinity, theoretically, for the sharpest possible corner on our square wave. If you simply "draw" this in the audio stream, the progression will continue well beyond the Nyquist number.
But there is a problem in that harmonics that are higher than the Nyquist value cannot be accurately reproduced digitally. Attempts to do so result in aliasing. So, to get as good a sounding square wave as the system is able to produce, one has to avoid the higher harmonics that are present in the theoretically perfect square wave.
I think the most common solution is to use a low-pass filtering algorithm. The computations are definitely more cpu-intensive than just calculating sine waves (or doing FM synthesis, which was my main interest). I am also weak on the math for DSP and concerned about cpu expense, and so, avoided this approach for long time. But it is quite viable and worth an additional look, imho.
Another approach is to use additive synthesis, and include as many sine harmonics as you need to get the tonal quality you want. The problem then is that the more harmonics you add, the more computation you are doing. Also, the top harmonics must be kept track of as they limit the highest note you can play. For example if using 10 harmonics, the note 500Hz would include content at 10500 Hz. That's below the Nyquist value for 44100 fps (which is 22050 Hz). But you'll only be able to go up about another octave (doubles everything) with a 10-harmonic wave and little more before your harmonic content goes over the limit and starts aliasing.
Instead of computing multiple sines on the fly, another solution you might consider is to instead create a set of lookup tables (LUTs) for your square wave. To create the values in the table, iterate through and add the values from the sine harmonics that will safely remain under the Nyquist for the range in which you use the given table. I think a table of something like 1024 values to encode a single period could be a good first guess as to what would work.
For example, I am guestimating, but the table for the octave C4-C5 might use 10 harmonics, the table for C5-C6 only 5, the table for C3-C4 might have 20. I can't recall what this strategy/technique is called, but I do recall it has a name, it is an accepted way of dealing with the situation. Depending on how the transitions sound and the amount of high-end content you want, you can use fewer or more LUTs.
There may be other methods to consider. The wikipedia entry on Aliasing describes a technique it refers to as "bandpass" that seems to be intentionally using aliasing. I don't know what that is about or how it relates to the article you cite.
The Soundpipe library has the concept of a frequency table, which is a data structure that holds a precomputed waveform such as a sine. You can initialize the frequency table with the desired waveform and play it through an oscilator. There is even a module named oscmorph which allows you to morph between two or more wavetables.
This is an example of how to generate a sine wave, taken from Soundpipe's documentation.
int main() {
UserData ud;
sp_data *sp;
sp_create(&sp);
sp_ftbl_create(sp, &ud.ft, 2048);
sp_osc_create(&ud.osc);
sp_gen_sine(sp, ud.ft);
sp_osc_init(sp, ud.osc, ud.ft);
ud.osc->freq = 500;
sp->len = 44100 * 5;
sp_process(sp, &ud, write_osc);
sp_ftbl_destroy(&ud.ft);
sp_osc_destroy(&ud.osc);
sp_destroy(&sp);
return 0;
}

Ideas on filtering out consistent time series data

So I have two subsets of data that represent two situations. The one that look more consistent needs to be filtered out (they are noise) while the one looks random are kept (they are motions). The method I was using was to define a moving window = 10 and whenever the standard deviation of the data within the window was smaller than some threshold, I suppressed them. However, this method could not filter out all "consistent" noise while also hurting the inconsistent one (real motion). I was hoping to use some kinds of statistical models and not machine learning to accomplish this. Any suggestions would be appreciated!
noise
real motion
The Kolmogorov–Smirnov test is used to compare two samples to determine if they come from the same distribution. I realized that real world data would never be uniform. So instead of comparing my noise data against the uniform distribution, I used scipy.stats.ks_2samp function to compare any bursts against one real motion burst. I then muted the motion if the return p-value is significantly small, meaning I can reject the hypothesis that two samples are from the same distribution.

Theory behind Autotune/vocoder

I've been hunting all over the web for material about vocoder or autotune, but haven't got any satisfactory answers. Could someone in a simple way please explain how do you autotune a given sound file using a carrier sound file?
(I'm familiar with ffts, windowing, overlap etc., I just don't get the what do we do when we have the ffts of the carrier and the original sound file which has to be modulated)
EDIT: After looking around a bit more, I finally got to know exactly what I was looking for -- a channel vocoder. The way it works is, it takes two inputs, one a voice signal and the other a musical signal rich in frequency. The musical signal is modulated by the envelope of the voice signal, and the output signal sounds like the voice singing in the musical tone.
Thanks for your help!
Using a phase vocoder to adjust pitch is basically pitch estimation plus interpolation in the frequency domain.
A phase vocoder reconstruction method might resample the frequency spectrum at, potentially, a new FFT bin spacing to shift all the frequencies up or down by some ratio. The phase vocoder algorithm additionally uses information shared between adjacent FFT frames to make sure this interpolation result can create continuous waveforms across frame boundaries. e.g. it adjusts the phases of the interpolation results to make sure that successive sinewave reconstructions are continuous rather than having breaks or discontinuities or phase cancellations between frames.
How much to shift the spectrum up or down is determined by pitch estimation, and calculating the ratio between the estimated pitch of the source and that of the target pitch. Again, phase vocoders use information about any phase differences between FFT frames to help better estimate pitch. This is possible by using more a bit more global information than is available from a single local FFT frame.
Of course, this frequency and phase changing can smear out transient detail and cause various other distortions, so actual phase vocoder products may additionally do all kinds of custom (often proprietary) special case tricks to try and fix some of these problems.
The first step is pitch detection. There are a number of pitch detection algorithms, introduced briefly in wikipedia: http://en.wikipedia.org/wiki/Pitch_detection_algorithm
Pitch detection can be implemented in either frequency domain or time domain. Various techniques in both domains exist with various properties (latency, quality, etc.) In the F domain, it is important to realize that a naive approach is very limiting because of the time/frequency trade-off. You can get around this limitation, but it takes work.
Once you've identified the pitch, you compare it with a desired pitch and determine how much you need to actually pitch shift.
Last step is pitch shifting, which, like pitch detection, can be done in the T or F domain. The "phase vocoder" method other folks mentioned is the F domain method. T domain methods include (in increasing order of quality) OLA, SOLA and PSOLA, some of which you can read about here: http://www.scribd.com/doc/67053489/60/Synchronous-Overlap-and-Add-SOLA
Basically you do an FFT, then in the frequency domain you move the signals to the nearest perfect semitone pitch.

Note Onset Detection using Spectral Difference

Im fairly new to onset detection. I read some papers about it and know that when working only with the time-domain, it is possible that there will be a large number of false-positives/negatives, and that it is generally advisable to work with either both the time-domain and frequency-domain or the frequency domain.
Regarding this, I am a bit confused because, I am having trouble on how the spectral energy or the results from the FFT bin can be used to determine note onsets. Because, aren't note onsets represented by sharp peaks in amplitude?
Can someone enlighten me on this? Thank you!
This is the easiest way to think about note onset:
think of a music signal as a flat constant signal. When and onset occurs you look at it as a large rapid CHANGE in signal (a positive or negative peak)
What this means in the frequency domain:
the FT of a constant signal is, well, CONSTANT! and flat
When the onset event occurs there is a rapid increase in spectrial content.
While you may think "Well you're actually talking about the peak of the onset right?" not at all. We are not actually interested in the peak of the onset, but rather the rising edge of the signal. When there is a sharp increase in the signal, the high frequency content increases.
one way to do this is using the spectrial difference function:
take your time domain signal and cut it up into overlaping strips (typically 50% overlap)
apply a hamming/hann window (this is to reduce spectrial smudging) (remember cutting up the signal into windows is like multiplying it by a pulse, in the frequency domain its like convolving the signal with a sinc function)
Apply the FFT algorithm on two sucessive windows
For each DFT bin, calculate the difference between the Xn and Xn-1 bins if it is negative set it to zero
square the results and sum all th bins together
repeat till end of signal.
look for peaks in signal using median thresholding and there are your onset times!
Source:
https://adamhess.github.io/Onset_Detection_Nov302011.pdf
and
http://www.elec.qmul.ac.uk/people/juan/Documents/Bello-TSAP-2005.pdf
You can look at sharp differences in amplitude at a specific frequency as suspected sound onsets. For instance if a flute switches from playing a G5 to playing a C, there will be a sharp drop in amplitude of the spectrum at around 784 Hz.
If you don't know what frequency to examine, the magnitude of an FFT vector will give you the amplitude of every frequency over some window in time (with a resolution dependent on the length of the time window). Pick your frequency, or a bunch of frequencies, and diff two FFTs of two different time windows. That might give you something that can be used as part of a likelihood estimate for a sound onset or change somewhere between the two time windows. Sliding the windows or successive approximation of their location in time might help narrow down the time of a suspected note onset or other significant change in the sound.
"Because, aren't note onsets represented by sharp peaks in amplitude?"
A: Not always. On percussive instruments (including piano) this is true, but for violin, flute, etc. notes often "slide" into each other as frequency changes without sharp amplitude increases.
If you stick to a single instrument like the piano onset detection is do-able. Generalized onset detection is a much more difficult problem. There are about a dozen primitive features that have been used for onset detection. Once you code them, you still have to decide how best to use them.

Downsampling and applying a lowpass filter to digital audio

I've got a 44Khz audio stream from a CD, represented as an array of 16 bit PCM samples. I'd like to cut it down to an 11KHz stream. How do I do that? From my days of engineering class many years ago, I know that the stream won't be able to describe anything over 5500Hz accurately anymore, so I assume I want to cut everything above that out too. Any ideas? Thanks.
Update: There is some code on this page that converts from 48KHz to 8KHz using a simple algorithm and a coefficient array that looks like { 1, 4, 12, 12, 4, 1 }. I think that is what I need, but I need it for a factor of 4x rather than 6x. Any idea how those constants are calculated? Also, I end up converting the 16 byte samples to floats anyway, so I can do the downsampling with floats rather than shorts, if that helps the quality at all.
Read on FIR and IIR filters. These are the filters that use a coefficent array.
If you do a google search on "FIR or IIR filter designer" you will find lots of software and online-applets that does the hard job (getting the coefficients) for you.
EDIT:
This page here ( http://www-users.cs.york.ac.uk/~fisher/mkfilter/ ) lets you enter the parameters of your filter and will spit out ready to use C-Code...
You're right in that you need apply lowpass filtering on your signal. Any signal over 5500 Hz will be present in your downsampled signal but 'aliased' as another frequency so you'll have to remove those before downsampling.
It's a good idea to do the filtering with floats. There are fixed point filter algorithms too but those generally have quality tradeoffs to work. If you've got floats then use them!
Using DFT's for filtering is generally overkill and it makes things more complicated because dft's are not a contiuous process but work on buffers.
Digital filters generally come in two tastes. FIR and IIR. The're generally the same idea but IIF filters use feedback loops to achieve a steeper response with far less coefficients. This might be a good idea for downsampling because you need a very steep filter slope there.
Downsampling is sort of a special case. Because you're going to throw away 3 out of 4 samples there's no need to calculate them. There is a special class of filters for this called polyphase filters.
Try googling for polyphase IIR or polyphase FIR for more information.
Notice (in additions to the other comments) that the simple-easy-intuitive approach "downsample by a factor of 4 by replacing each group of 4 consecutive samples by the average value", is not optimal but is nevertheless not wrong, nor practically nor conceptually. Because the averaging amounts precisely to a low pass filter (a rectangular window, which corresponds to a sinc in frequency). What would be conceptually wrong is to just downsample by taking one of each 4 samples: that would definitely introduce aliasing.
By the way: practically any software that does some resampling (audio, image or whatever; example for the audio case: sox) takes this into account, and frequently lets you choose the underlying low-pass filter.
You need to apply a lowpass filter before you downsample the signal to avoid "aliasing". The cutoff frequency of the lowpass filter should be less than the nyquist frequency, which is half the sample frequency.
The "best" solution possible is indeed a DFT, discarding the top 3/4 of the frequencies, and performing an inverse DFT, with the domain restricted to the bottom 1/4th. Discarding the top 3/4ths is a low-pass filter in this case. Padding to a power of 2 number of samples will probably give you a speed benefit. Be aware of how your FFT package stores samples though. If it's a complex FFT (which is much easier to analyze, and generally has nicer properties), the frequencies will either go from -22 to 22, or 0 to 44. In the first case, you want the middle 1/4th. In the latter, the outermost 1/4th.
You can do an adequate job by averaging sample values together. The naïve way of grabbing samples four by four and doing an equal weighted average works, but isn't too great. Instead you'll want to use a "kernel" function that averages them together in a non-intuitive way.
Mathwise, discarding everything outside the low-frequency band is multiplication by a box function in frequency space. The (inverse) Fourier transform turns pointwise multiplication into a convolution of the (inverse) Fourier transforms of the functions, and vice-versa. So, if we want to work in the time domain, we need to perform a convolution with the (inverse) Fourier transform of box function. This turns out to be proportional to the "sinc" function (sin at)/at, where a is the width of the box in the frequency space. So at every 4th location (since you're downsampling by a factor of 4) you can add up the points near it, multiplied by sin (a dt) / a dt, where dt is the distance in time to that location. How nearby? Well, that depends on how good you want it to sound. It's common to ignore everything outside the first zero, for instance, or just take the number of points to be the ratio by which you're downsampling.
Finally there's the piss-poor (but fast) way of just discarding the majority of the samples, keeping just the zeroth, the fourth, and so on.
Honestly, if it fits in memory, I'd recommend just going the DFT route. If it doesn't use one of the software filter packages that others have recommended to construct the filter for you.
The process you're after called "Decimation".
There are 2 steps:
Applying Low Pass Filter on the data (In your case LPF with Cut Off at Pi / 4).
Downsampling (In you case taking 1 out of 4 samples).
There are many methods to design and apply the Low Pass Filter.
You may start here:
http://en.wikipedia.org/wiki/Filter_design
You could make use of libsamplerate to do the heavy lifting. Libsamplerate is a C API, and takes care of calculating the filter coefficients. You to select from different quality filters so that you can trade off quality for speed.
If you would prefer not to write any code, you could just use Audacity to do the sample rate conversion. It offers a powerful GUI, and makes use of libsamplerate for it's sample rate conversion.
I would try applying DFT, chopping 3/4 of the result and applying inverse DFT. I can't tell if it will sound good without actually trying tough.
I recently came across BruteFIR which may already do some of what you're interested in?
You have to apply low-pass filter (removing frequencies above 5500 Hz) and then apply decimation (leave every Nth sample, every 4th in your case).
For decimation, FIR, not IIR filters are usually employed, because they don't depend on previous outputs and therefore you don't have to calculate anything for discarded samples. IIRs, generally, depends on both inputs and outputs, so, unless a specific type of IIR is used, you'd have to calculate every output sample before discarding 3/4 of them.
Just googled an intro-level article on the subject: https://www.dspguru.com/dsp/faqs/multirate/decimation

Resources