Normalizing audio waveforms code implementation (Peak, RMS) - audio

I have some audio data (array of floats) which I use to plot a simple
waveform.
When plotted, the waveform doesn't max out at the edges.
No problem - the data just needs to be normalized. I iterate once to find the max, and then iterate again dividing each by the max. Plot again and everything looks great!
But wait: videos with a loud intro or a loud explosion cause the rest of the waveform to still be tiny.
After some research, I came across RMS, which is supposed to address this. I iterate through the samples, calculate the RMS, and again divide each sample by the RMS value. This results in considerable "clipping":
What is the best method to solve this?
Intuitively, it seems I might need to calculate a local max or average based on a moving window (rather than the entire set) but I'm not entirely sure. Help?
Note: The waveform is purely for visual purposes (the audio will not be played back to the user).

You could transform it (effectively making the y-axis non-linear; you can think of it as a form of companding).
The following assumes the signal is within the range [-1, 1].
One popular quick and simple solution is to apply the hyperbolic tangent function (tanh). This limits values to [-1, 1] and compresses larger values more than smaller ones. If you amplify the signal before applying tanh, the effect will be more pronounced.
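A minimal sketch of the tanh idea in Python, for display purposes only. The function name tanh_compress and the gain value are assumptions chosen for illustration; the final division by tanh(gain) is an extra step (not from the answer) so that the loudest sample still reaches the edge of the plot.

```python
import math

def tanh_compress(samples, gain=3.0):
    """Peak-normalize, then squash with tanh so quiet passages become visible.

    `gain` is a hypothetical tuning knob: larger values flatten loud parts more.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return [0.0 for _ in samples]
    norm = math.tanh(gain)  # rescale so a full-scale sample maps to exactly +/-1
    return [math.tanh(gain * s / peak) / norm for s in samples]

# A quiet signal with one loud spike: the quiet samples are lifted noticeably.
quiet_with_spike = [0.02, -0.03, 1.0, 0.05, -0.04]
shaped = tanh_compress(quiet_with_spike)
```

Because tanh is odd and monotonic, the sign and ordering of the samples are preserved; only the vertical scale becomes non-linear.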
Another alternative is a logarithmic transform. Because the signal changes sign, some pre-processing has to be performed.
If r is a series of sample values one approach could be something like this:
r.log1p <- log2(1.1 * (abs(r) + 1)) * sign(r)
That is, for every value take its absolute value, add one, multiply by a small constant, take the log, and finally multiply by the sign of the original value.
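For readers not using R, the same transform could be sketched in Python roughly as follows. The name log_compress and the parameter k are illustrative assumptions; k = 1.1 mirrors the constant in the R line.

```python
import math

def log_compress(samples, k=1.1):
    """Signed log transform: sign(r) * log2(k * (|r| + 1))."""
    out = []
    for r in samples:
        if r == 0:
            out.append(0.0)  # R's sign(0) is 0, so a zero sample stays at zero
        else:
            out.append(math.copysign(math.log2(k * (abs(r) + 1.0)), r))
    return out

compressed = log_compress([-1.0, -0.1, 0.0, 0.1, 1.0])
```

Note that with k greater than 1 the output jumps from 0 to about log2(k) for arbitrarily small non-zero samples; with k = 1 the curve passes through zero continuously.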
The effect can be something like this:

Related

Efficient generation of sampled waveforms without aliasing artifacts

For a project of mine I am working with sampled sound generation and I need to create various waveforms at various frequencies. When the waveform is sinusoidal, everything is fine, but when the waveform is rectangular, there is trouble: it sounds as if it came from the eighties, and as the frequency increases, the notes sound wrong. On the 8th octave, each note sounds like a random note from some lower octave.
The undesirable effect is the same regardless of whether I use either one of the following two approaches:
The purely mathematical way of generating a rectangular waveform as sample = sign( secondsPerHalfWave - (timeSeconds % secondsPerWave) ) where secondsPerWave = 1.0 / wavesPerSecond and secondsPerHalfWave = secondsPerWave / 2.0
My preferred way, which is to describe one period of the wave using line segments and to interpolate along these lines. So, a rectangular waveform is described (regardless of sampling rate and regardless of frequency) by a horizontal line from x=0 to x=0.5 at y=1.0, followed by another horizontal line from x=0.5 to x=1.0 at y=-1.0.
From what I gather, the literature considers these waveform generation approaches "naive", resulting in "aliasing", which is the cause of all the undesirable effects.
What this all practically translates to when I look at the generated waveform is that the samples-per-second value is not an exact multiple of the waves-per-second value, so each wave does not have an even number of samples, which in turn means that the number of samples at level 1.0 is often not equal to the number of samples at level -1.0.
I found a certain solution here: https://www.nayuki.io/page/band-limited-square-waves which even includes source code in Java, and it does indeed sound awesome: all undesirable effects are gone, and each note sounds pure and at the right frequency. However, this solution is entirely unsuitable for me, because it is extremely computationally expensive. (Even after I have replaced sin() and cos() with approximations that are ten times faster than Java's built-in functions.) Besides, when I look at the resulting waveforms they look awfully complex, so I wonder whether they can legitimately be called rectangular.
So, my question is:
What is the most computationally efficient method for the generation of periodic waveforms such as the rectangular waveform that does not suffer from aliasing artifacts?
Examples of what the solution could entail:
The computer audio problem of generating correct sample values at discrete time intervals to describe a sound wave seems to me somewhat related to the computer graphics problem of generating correct integer y coordinates at discrete integer x coordinates for drawing lines. The Bresenham line generation algorithm is extremely efficient, (even if we disregard for a moment the fact that it is working with integer math,) and it works by accumulating a certain error term which, at the right time, results in a bump in the Y coordinate. Could some similar mechanism perhaps be used for calculating sample values?
The way sampling works is understood to be as reading the value of the analog signal at a specific, infinitely narrow point in time. Perhaps a better approach would be to consider reading the area of the entire slice of the analog signal between the last sample and the current sample. This way, sampling a 1.0 right before the edge of the rectangular waveform would contribute a little to the sample value, while sampling a -1.0 considerable time after the edge would contribute a lot, thus naturally yielding a point which is between the two extreme values. Would this solve the problem? Does such an algorithm exist? Has anyone ever tried it?
Please note that I have posted this question here as opposed to dsp.stackexchange.com because I do not want to receive answers with preposterous jargon like band-limiting, harmonics and low-pass filters, lagrange interpolations, DC compensations, etc. and I do not want answers that come from the purely analog world or the purely theoretical outer space and have no chance of ever receiving a practical and efficient implementation using a digital computer.
I am a programmer, not a sound engineer, and in my little programmer's world, things are simple: I have an array of samples which must all be between -1.0 and 1.0, and will be played at a certain rate (44100 samples per second.) I have arithmetic operations and trigonometric functions at my disposal, I can describe lines and use simple linear interpolation, and I need to generate the samples extremely efficiently because the generation of a dozen waveforms simultaneously and also the mixing of them together may not consume more than 1% of the total CPU time.
I'm not sure, but you may have a few misconceptions about the nature of aliasing. I base this on your putting the term in quotes, and on the following quote:
What this all practically translates to when I look at the generated
waveform is that the samples-per-second value is not an exact multiple
of the waves-per-second value, so each wave does not have an even
number of samples, which in turn means that the number of samples at
level 1.0 is often not equal to the number of samples at level -1.0.
The samples/sec and waves/sec don't have to be exact multiples at all! One can play back all pitches below the Nyquist. So I'm not clear what your thinking on this is.
The characteristic sound of a square wave arises from the presence of odd harmonics. For example, for a note at 440 Hz (A4), the square wave sound could be generated by combining sines at 440, 1320, 2200, 3080, 3960 Hz, and so on, progressing in increments of 880 Hz. This raises the question: how many odd harmonics? We could go to infinity, theoretically, for the sharpest possible corner on our square wave. If you simply "draw" the square wave in the audio stream, the progression will continue well beyond the Nyquist frequency.
But there is a problem in that harmonics that are higher than the Nyquist value cannot be accurately reproduced digitally. Attempts to do so result in aliasing. So, to get as good a sounding square wave as the system is able to produce, one has to avoid the higher harmonics that are present in the theoretically perfect square wave.
I think the most common solution is to use a low-pass filtering algorithm. The computations are definitely more cpu-intensive than just calculating sine waves (or doing FM synthesis, which was my main interest). I am also weak on the math for DSP and concerned about cpu expense, and so, avoided this approach for long time. But it is quite viable and worth an additional look, imho.
Another approach is to use additive synthesis, and include as many sine harmonics as you need to get the tonal quality you want. The problem then is that the more harmonics you add, the more computation you are doing. Also, the top harmonic must be kept track of, as it limits the highest note you can play. For example, if you use the fundamental plus 10 odd overtones (so the top partial is the 21st harmonic), a 500 Hz note would include content at 10500 Hz. That's below the Nyquist value for 44100 fps (which is 22050 Hz). But you'll only be able to go up about another octave (which doubles everything), and a little more, before your harmonic content goes over the limit and starts aliasing.
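The additive approach can be sketched like this in Python. It is an illustration under assumptions, not a tuned implementation: the function names are made up here, and the 1/k amplitude weighting is the standard Fourier series of a square wave rather than something stated in the answer.

```python
import math

def highest_odd_harmonic(freq_hz, sample_rate=44100):
    """Largest odd harmonic number whose frequency stays below Nyquist.

    Assumes freq_hz itself is below Nyquist.
    """
    k = 1
    while (k + 2) * freq_hz < sample_rate / 2.0:
        k += 2
    return k

def bandlimited_square(freq_hz, num_samples, sample_rate=44100):
    """Sum odd sine harmonics (amplitude 1/k) up to the Nyquist limit."""
    top = highest_odd_harmonic(freq_hz, sample_rate)
    out = []
    for n in range(num_samples):
        t = n / sample_rate
        acc = 0.0
        for k in range(1, top + 1, 2):
            acc += math.sin(2.0 * math.pi * k * freq_hz * t) / k
        out.append(4.0 / math.pi * acc)
    return out

wave = bandlimited_square(500.0, 64)
```

At 500 Hz the loop sums 22 sines per sample (harmonics 1 through 43), which shows why this costs more as notes get lower.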
Instead of computing multiple sines on the fly, another solution you might consider is to instead create a set of lookup tables (LUTs) for your square wave. To create the values in the table, iterate through and add the values from the sine harmonics that will safely remain under the Nyquist for the range in which you use the given table. I think a table of something like 1024 values to encode a single period could be a good first guess as to what would work.
For example, and I am guesstimating, the table for the octave C4-C5 might use 10 harmonics, the table for C5-C6 only 5, and the table for C3-C4 might have 20. I can't recall what this strategy/technique is called, but I do recall it has a name and is an accepted way of dealing with the situation. Depending on how the transitions sound and the amount of high-end content you want, you can use fewer or more LUTs.
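A rough Python sketch of the per-octave lookup table idea follows. The names build_square_table and read_table, the table size of 1024, and the use of linear interpolation when reading are all assumptions for illustration. Each table is built once, with only the odd harmonics that stay below Nyquist for the highest note the table will serve.

```python
import math

def build_square_table(top_freq_hz, sample_rate=44100, size=1024):
    """Return (table, harmonic_count) for one period of a band-limited square."""
    nyquist = sample_rate / 2.0
    table = [0.0] * size
    k = 1
    while k * top_freq_hz < nyquist:
        for i in range(size):
            table[i] += math.sin(2.0 * math.pi * k * i / size) / k
        k += 2
    return [4.0 / math.pi * v for v in table], (k - 1) // 2

def read_table(table, phase):
    """Read the table at a phase in cycles, with linear interpolation."""
    pos = (phase % 1.0) * len(table)
    i = int(pos)
    frac = pos - i
    a = table[i % len(table)]
    b = table[(i + 1) % len(table)]
    return a + frac * (b - a)

# One table per octave, keyed by the highest note in that octave (C5 and C6).
table_c4_c5, harmonics_low = build_square_table(523.25)
table_c5_c6, harmonics_high = build_square_table(1046.5)
```

At playback time the per-sample cost is one table read per voice: advance the phase by freq / sample_rate each sample and call read_table on the table that covers the note.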
There may be other methods to consider. The wikipedia entry on Aliasing describes a technique it refers to as "bandpass" that seems to be intentionally using aliasing. I don't know what that is about or how it relates to the article you cite.
The Soundpipe library has the concept of a function table (ftbl), which is a data structure that holds a precomputed waveform such as a sine. You can initialize the table with the desired waveform and play it through an oscillator. There is even a module named oscmorph which allows you to morph between two or more wavetables.
This is an example of how to generate a sine wave, taken from Soundpipe's documentation.
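/* UserData (with the osc and ft members used below) and the write_osc
   callback are defined elsewhere in the Soundpipe example; not shown here. */
int main() {
    UserData ud;
    sp_data *sp;
    sp_create(&sp);
    sp_ftbl_create(sp, &ud.ft, 2048);   /* table of 2048 samples */
    sp_osc_create(&ud.osc);
    sp_gen_sine(sp, ud.ft);             /* fill the table with a sine */
    sp_osc_init(sp, ud.osc, ud.ft);
    ud.osc->freq = 500;                 /* oscillator frequency in Hz */
    sp->len = 44100 * 5;                /* number of samples to render */
    sp_process(sp, &ud, write_osc);
    sp_ftbl_destroy(&ud.ft);
    sp_osc_destroy(&ud.osc);
    sp_destroy(&sp);
    return 0;
}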

I get strange sizzle/artifacts in my audio when doing different FFT approaches

I am doing filter convolution by using fft (FFTW). I experience something I can not understand.
I have an input x(n) to which I want to apply a filter impulse response u(n), both of length N. I zero-pad both (e.g. to 2N) and take the FFT of each to get X(n) and U(n). If I just compute X(n)*U(n) and take the IFFT, I get a signal y(t). If I listen to that signal there is no sizzling; everything sounds OK. To speed up the program and save memory, I tried to take advantage of the symmetry of U(n) and X(n) by using only the first half of each and zero-padding the second half. So I did X(n0,...,n/2,0,0,0,...,N)U(n0,...,n/2,0,0,0,...,N) and took the IFFT.
The resulting output does not sound different from the result of multiplying the full-length X and U, but there is a strange, subtle sizzling noise audible on top of the output. It is most apparent on loud or resonant parts of the input signal, and sounds almost like a stage clipping. I did not change the scaling in either method, so I don't understand what's going on. Could someone help me out with an idea?
Is it wrong to use just half of U and X and zero-pad the rest? Must I use the full length? Or does this change, e.g., the scaling?
You cannot simply set part of your signal's spectrum to zero. Any real signal (with no imaginary component) has a conjugate-symmetric spectrum. I guess this is the symmetry you are talking about. If you set part of the spectrum to zero, your signal in the time domain will be complex and completely different from the original signal you started with.
If you want to speed up your computation, reduce the number of samples you are working with.
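The effect can be demonstrated with a tiny, deliberately naive DFT in pure Python (a sketch for illustration, not something to use on real audio; the test signal is made up). Zeroing the upper half of the bins leaves a complex time-domain result, while rebuilding the upper half from the lower half via X[n-k] = conj(X[k]) recovers the real signal. Real-input transforms in FFT libraries (FFTW has r2c/c2r plans, for instance) exploit the same symmetry by storing only the non-redundant half.

```python
import cmath

def dft(x):
    """Naive O(n^2) forward DFT, for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive O(n^2) inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

x = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.25, -0.5]
X = dft(x)
n = len(X)
half = n // 2

# Keep bins 0..n/2 and zero the rest: the result is no longer a real signal.
zeroed = X[:half + 1] + [0j] * (half - 1)
bad = idft(zeroed)
max_imag_bad = max(abs(v.imag) for v in bad)

# Rebuild the upper bins from the lower ones using conjugate symmetry.
rebuilt = X[:half + 1] + [X[n - k].conjugate() for k in range(half + 1, n)]
good = idft(rebuilt)
max_error_good = max(abs(g - xv) for g, xv in zip(good, x))
```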

Changing frequency amplitude with RealFFT, flickering sound

I have been trying to modify the amplitude for specific frequencies. Here is what I have done:
I get 2048 samples of raw data as a float array with values in the range [-1, 1].
I use this RealFFT algorithm http://www.lomont.org/Software/Misc/FFT/LomontFFT.html
I divide the raw data into left and right channels (this works great).
I perform RealFFT (forward enabled) on both left and right, and I use this equation to find which index corresponds to the frequency that I want: freq/(samplerate/sizeOfBuffer/2.0)
I modify the frequency that I want.
I perform RealFFT (forward disabled) to go back to the time domain.
Now when I play it back, I hear the change that I made to the frequency, but there is a flickering noise (kind of the same flickering you hear when you play an old vinyl record).
Any idea what I might be doing wrong?
It was a while ago that I took my signal processing course at university, so I might have forgotten something.
Thanks in advance!
The comments may be confusing. Here are some clarifications.
The imaginary part is not the phase. The real and imaginary parts form a vector; think of a 2-D plot where real is on the x axis and imaginary is on the y axis. The amplitude of a frequency is the length of the line from the origin to the point, so the magnitude is the square root of the sum of the squares of the real and imaginary parts. The phase is the arctangent of the imaginary part divided by the real part.
So, the first step: if you want to change the magnitude of the vector, you must scale both the real and imaginary parts.
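As a small illustration in Python (the function name scale_bin is made up here), scaling both parts by the same factor changes the magnitude and leaves the phase alone:

```python
import math

def scale_bin(re, im, gain):
    """Scale one frequency bin; both parts get the same factor."""
    return re * gain, im * gain

re, im = 3.0, 4.0
magnitude_before = math.hypot(re, im)   # sqrt(re^2 + im^2)
phase_before = math.atan2(im, re)       # arctangent of im / re, quadrant-aware

new_re, new_im = scale_bin(re, im, 0.5)
magnitude_after = math.hypot(new_re, new_im)
phase_after = math.atan2(new_im, new_re)
```

Scaling only the real part, or only the imaginary part, would rotate the vector and so change the phase as well as the magnitude.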
That's easy. The second part is much more complicated. The Fourier transform's "view" of the world is that the signal is infinitely periodic; that is, the signal appears to wrap from the end back to the beginning. Suppose you put a perfect sine tone with a period of 4096 samples into your algorithm, so that the first sample into the FFT is +1 and the last sample is -1. If you look at the spectrum from the FFT, it will appear as if there are lots of high frequencies; these come from transforming a signal that jumps from -1 to +1 at the wrap-around. The longer the FFT, the closer it gets to showing you the "real" view of the signal.
Techniques to smooth out the transitions between FFT blocks have been developed, by windowing and overlapping the FFT blocks, so that the transitions between the blocks are not so "discontinuous". A fairly common technique is to use a Hann window and overlap by a factor of 4. That is, for every 2048 samples, you actually do 4 FFTs, and every FFT overlaps the previous block by 1536. The Hann window gets mathy, but basically it has nice properties so that you can do overlaps like this and everything sums up nicely.
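The "everything sums up nicely" property can be checked numerically. The sketch below (helper names are illustrative) builds a periodic Hann window of 2048 samples, overlaps copies with a hop of 512 samples (so each block overlaps the previous one by 1536), and looks at the region where four windows overlap.

```python
import math

def hann(size):
    """Periodic Hann window: 0.5 - 0.5*cos(2*pi*n/size)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / size) for n in range(size)]

def overlap_sum(window, hop, num_blocks):
    """Add shifted copies of the window, one every `hop` samples."""
    total = [0.0] * (hop * (num_blocks - 1) + len(window))
    for b in range(num_blocks):
        for i, w in enumerate(window):
            total[b * hop + i] += w
    return total

size, hop, blocks = 2048, 512, 8
summed = overlap_sum(hann(size), hop, blocks)
# Only the region covered by four overlapping windows is in steady state.
steady = summed[3 * hop: blocks * hop]
```

In the steady region every value equals 2.0, so dividing the overlap-added output by 2 restores the original level when the window is applied once per block.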
I found this pretty fun blog showing exactly the same learning pains that you're going through: http://www.katjaas.nl/FFTwindow/FFTwindow&filtering.html
This technique is different from the Overlap-Save method that another commenter mentions, which was developed to use FFTs for FIR filtering. However, designing the FIR filter will typically be done in a mathematical package like Matlab/Octave.
If you use a series of shorter FFTs to modify a longer signal, then you should zero-pad each window so that it uses a longer FFT (longer by the length of the impulse response of the modification's spectrum), and combine the series of longer FFTs by overlap-add or overlap-save. Otherwise, waveform changes that should ripple past the end of each FFT/IFFT modification will, due to circular convolution, ripple around to the beginning of each window and cause the periodic flickering distortion you hear.
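The wrap-around can be seen without any FFT at all. In this Python sketch (function names and the toy data are assumptions), circular_convolve computes what multiplying two length-n spectra and inverting produces; only when n is at least len(x) + len(h) - 1 does it match ordinary linear convolution.

```python
def linear_convolve(x, h):
    """Ordinary convolution; the output is len(x) + len(h) - 1 samples long."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y

def circular_convolve(x, h, n):
    """Convolution modulo n, i.e. what a length-n FFT multiply computes."""
    y = [0.0] * n
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[(i + j) % n] += xv * hv
    return y

x = [1.0, 2.0, 3.0, 4.0]
h = [1.0, 1.0, 1.0]

wrapped = circular_convolve(x, h, 4)   # no padding: the tail folds onto the start
padded = circular_convolve(x, h, 6)    # padded to len(x) + len(h) - 1
reference = linear_convolve(x, h)
```

Here the two tail samples of the linear result (7 and 4) are added onto the first two samples of the unpadded block, which is the per-block distortion described above.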

DSP - Filter sweep effect

I'm implementing a 'filter sweep' effect (I don't know if it's called like that). What I do is basically create a low-pass filter and make it 'move' along a certain frequency range.
To calculate the filter cut-off frequency at a given moment I use a user-provided linear function, which yields values between 0 and 1.
My first attempt was to directly map the values returned by the linear function to the range of frequencies, as in cf = freqRange * lf(x). Although it worked OK, it seemed as if the sweep ran much faster when moving through low frequencies and then slowed down on its way to the high-frequency zone. I'm not sure why this is, but I guess it has something to do with human hearing perceiving changes in frequency in a non-linear manner.
My next attempt was to move the filter's cut-off frequency in a logarithmic way. It works much better now but I still feel that the filter doesn't move at a constant perceived speed through the range of frequencies.
How should I divide the frequency space to obtain a constant perceived sweep speed?
Thanks in advance.
The frequency sweep effect you're referring to is likely a wah-wah filter, named for the ubiquitous wah-wah pedal.
We hear frequency in terms of octaves, and sweeping through octaves with a logarithmic scale is the way to linearize it. Not to sound dismissive, but it sounds like what you're doing is physically and mathematically correct. (You should spend as much time between 200 and 400 Hz as you do between 2000 and 4000 Hz, etc.) You just don't like how it sounds. And that's quite okay on both counts -- audio is highly subjective.
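A minimal sketch of such an octave-linear mapping in Python, assuming the control value t runs from 0 to 1 and that the sweep limits (200 Hz and 3200 Hz here) are placeholders:

```python
def sweep_cutoff(t, f_min=200.0, f_max=3200.0):
    """Map t in [0, 1] to a cutoff so equal steps in t cover equal octaves."""
    return f_min * (f_max / f_min) ** t

# 200 Hz to 3200 Hz is four octaves, so each quarter of the sweep covers one.
cutoffs = [sweep_cutoff(i / 4.0) for i in range(5)]
```

With these limits the five values come out at roughly 200, 400, 800, 1600 and 3200 Hz, so the sweep spends the same time in 200-400 Hz as in 1600-3200 Hz.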
To mix things up a bit, one option would be to try the Bark scale, which is based on psychoacoustics and the structure of the ear. As I understand it, this is designed to spend equal amounts of time in each of your ear's internal "bandpass filters".
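One way to try this is to interpolate linearly in Bark and convert back to Hz. The sketch below uses Traunmüller's closed-form approximation of the Bark scale without its end corrections; that choice, the function names, and the sweep limits are all assumptions for illustration, and other Bark formulas exist.

```python
def hz_to_bark(f_hz):
    """Traunmueller approximation: z = 26.81 * f / (1960 + f) - 0.53."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def bark_sweep_cutoff(t, f_min=200.0, f_max=3200.0):
    """Map t in [0, 1] to a cutoff moving at constant speed on the Bark scale."""
    z_min, z_max = hz_to_bark(f_min), hz_to_bark(f_max)
    return bark_to_hz(z_min + t * (z_max - z_min))

bark_cutoffs = [bark_sweep_cutoff(i / 4.0) for i in range(5)]
```

Compared with the purely logarithmic mapping, this one moves more quickly through the lowest frequencies, which may or may not sound more even to you.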
You could always try a quadratic or cubic function between 0 and 1. Audio potentiometers often use a few piecewise quadratic or cubic sections to get their mapping.
Winging it, but try this:
http://en.wikipedia.org/wiki/Physics_of_music#Scales "The following table shows the ratios between the frequencies of all the notes of the just major scale and the fixed frequency of the first note of the scale."
There is then a chart showing fractional values between 1 and 2, and if you tweak your timing to match, you may get what you wish. While the overall progression is still logarithmic, the stepping between each one should divide up into equal stepped 8ths (a bit jumpy).
Put another way, every half second adjust one note up. Each octave (I think) will cover twice the frequency range of the prior octave.
EDIT: Also, you'll find the frequencies here: http://en.wikipedia.org/wiki/Middle_C#Designation_by_octave (doesn't the programmer in you wish that C0 was exactly 16hz?)

Downsampling and applying a lowpass filter to digital audio

I've got a 44 kHz audio stream from a CD, represented as an array of 16-bit PCM samples. I'd like to cut it down to an 11 kHz stream. How do I do that? From my days of engineering class many years ago, I know that the stream won't be able to describe anything over 5500 Hz accurately anymore, so I assume I want to cut everything above that out too. Any ideas? Thanks.
Update: There is some code on this page that converts from 48 kHz to 8 kHz using a simple algorithm and a coefficient array that looks like { 1, 4, 12, 12, 4, 1 }. I think that is what I need, but I need it for a factor of 4x rather than 6x. Any idea how those constants are calculated? Also, I end up converting the 16-bit samples to floats anyway, so I can do the downsampling with floats rather than shorts, if that helps the quality at all.
Read up on FIR and IIR filters. These are the filters that use a coefficient array.
If you do a Google search for "FIR or IIR filter designer" you will find lots of software and online applets that do the hard job (getting the coefficients) for you.
EDIT:
This page here ( http://www-users.cs.york.ac.uk/~fisher/mkfilter/ ) lets you enter the parameters of your filter and will spit out ready to use C-Code...
You're right that you need to apply lowpass filtering to your signal. Any content above 5500 Hz will still be present in your downsampled signal, but 'aliased' to another frequency, so you have to remove it before downsampling.
It's a good idea to do the filtering with floats. There are fixed-point filter algorithms too, but those generally involve quality tradeoffs. If you've got floats then use them!
Using DFTs for filtering is generally overkill, and it makes things more complicated because DFTs are not a continuous process but work on buffers.
Digital filters generally come in two flavors: FIR and IIR. They're generally the same idea, but IIR filters use feedback loops to achieve a steeper response with far fewer coefficients. This might be a good idea for downsampling because you need a very steep filter slope there.
Downsampling is sort of a special case. Because you're going to throw away 3 out of 4 samples there's no need to calculate them. There is a special class of filters for this called polyphase filters.
Try googling for polyphase IIR or polyphase FIR for more information.
Notice (in addition to the other comments) that the simple, intuitive approach of "downsample by a factor of 4 by replacing each group of 4 consecutive samples with their average value" is not optimal, but it is nevertheless not wrong, either practically or conceptually, because the averaging amounts precisely to a low-pass filter (a rectangular window, which corresponds to a sinc in frequency). What would be conceptually wrong is to just downsample by taking one of every 4 samples: that would definitely introduce aliasing.
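The averaging approach is short enough to sketch directly (the function name is illustrative; trailing samples that do not fill a complete group are dropped here, which is one of several reasonable choices):

```python
def downsample_by_averaging(samples, factor=4):
    """Replace each group of `factor` consecutive samples with their mean."""
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, usable, factor)]

averaged = downsample_by_averaging([1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 5.0])
# A tone at the input Nyquist frequency (alternating +1/-1) averages away.
nyquist_tone = downsample_by_averaging([1.0, -1.0] * 8)
```

Taking every 4th sample of that alternating signal instead would yield a constant +1, a frequency that was never in the input, which is the aliasing the answer warns about.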
By the way: practically any software that does some resampling (audio, image or whatever; example for the audio case: sox) takes this into account, and frequently lets you choose the underlying low-pass filter.
You need to apply a lowpass filter before you downsample the signal to avoid "aliasing". The cutoff frequency of the lowpass filter should be less than the Nyquist frequency of the new sample rate, which is half the new sample frequency.
The "best" solution possible is indeed a DFT, discarding the top 3/4 of the frequencies, and performing an inverse DFT, with the domain restricted to the bottom 1/4th. Discarding the top 3/4ths is a low-pass filter in this case. Padding to a power of 2 number of samples will probably give you a speed benefit. Be aware of how your FFT package stores samples though. If it's a complex FFT (which is much easier to analyze, and generally has nicer properties), the frequencies will either go from -22 to 22 kHz, or from 0 to 44 kHz. In the first case, you want the middle 1/4th. In the latter, the outermost 1/4th.
You can do an adequate job by averaging sample values together. The naïve way of grabbing samples four by four and doing an equal weighted average works, but isn't too great. Instead you'll want to use a "kernel" function that averages them together in a non-intuitive way.
Mathwise, discarding everything outside the low-frequency band is multiplication by a box function in frequency space. The (inverse) Fourier transform turns pointwise multiplication into a convolution of the (inverse) Fourier transforms of the functions, and vice-versa. So, if we want to work in the time domain, we need to perform a convolution with the (inverse) Fourier transform of box function. This turns out to be proportional to the "sinc" function (sin at)/at, where a is the width of the box in the frequency space. So at every 4th location (since you're downsampling by a factor of 4) you can add up the points near it, multiplied by sin (a dt) / a dt, where dt is the distance in time to that location. How nearby? Well, that depends on how good you want it to sound. It's common to ignore everything outside the first zero, for instance, or just take the number of points to be the ratio by which you're downsampling.
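A sketch of such a kernel in Python follows. It goes slightly beyond the text: the sinc is truncated at half_width samples and tapered with a Hann window (a common way to soften the truncation, assumed here rather than taken from the answer), and the taps are normalized to sum to 1 so that a constant signal keeps its level. The function name and default sizes are illustrative.

```python
import math

def sinc_lowpass_taps(factor=4, half_width=16):
    """Hann-windowed sinc with its first zeros at +/- `factor` samples."""
    taps = []
    for n in range(-half_width, half_width + 1):
        x = n / factor
        sinc = 1.0 if n == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.5 + 0.5 * math.cos(math.pi * n / (half_width + 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # unity gain at DC

taps = sinc_lowpass_taps()
```

Setting half_width to factor - 1 gives the "ignore everything outside the first zero" variant mentioned above; a wider kernel gives a steeper filter at the cost of more multiplications per output sample.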
Finally there's the piss-poor (but fast) way of just discarding the majority of the samples, keeping just the zeroth, the fourth, and so on.
Honestly, if it fits in memory, I'd recommend just going the DFT route. If it doesn't, use one of the software filter packages that others have recommended to construct the filter for you.
The process you're after is called "Decimation".
There are 2 steps:
Applying Low Pass Filter on the data (In your case LPF with Cut Off at Pi / 4).
Downsampling (in your case, taking 1 out of 4 samples).
There are many methods to design and apply the Low Pass Filter.
You may start here:
http://en.wikipedia.org/wiki/Filter_design
You could make use of libsamplerate to do the heavy lifting. Libsamplerate is a C API, and takes care of calculating the filter coefficients. It lets you select from different quality filters so that you can trade off quality for speed.
If you would prefer not to write any code, you could just use Audacity to do the sample rate conversion. It offers a powerful GUI, and makes use of libsamplerate for its sample rate conversion.
I would try applying a DFT, chopping off 3/4 of the result and applying an inverse DFT. I can't tell if it will sound good without actually trying, though.
I recently came across BruteFIR which may already do some of what you're interested in?
You have to apply a low-pass filter (removing frequencies above 5500 Hz) and then apply decimation (keep every Nth sample, every 4th in your case).
For decimation, FIR rather than IIR filters are usually employed, because they don't depend on previous outputs and therefore you don't have to calculate anything for discarded samples. IIR filters generally depend on both inputs and outputs, so unless a specific type of IIR is used, you'd have to calculate every output sample before discarding 3/4 of them.
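The point about skipping discarded samples can be sketched as follows. The function name is made up, the coefficients are passed in rather than designed here, and only positions where the whole filter fits inside the input are produced (edge handling is left out for brevity).

```python
def decimate_fir(samples, taps, factor=4):
    """FIR low-pass and decimate in one pass.

    Each output depends only on input samples, so the loop steps `factor`
    inputs at a time and never computes the outputs that would be discarded.
    """
    last = len(taps) - 1
    out = []
    for start in range(0, len(samples) - len(taps) + 1, factor):
        acc = 0.0
        for j, c in enumerate(taps):
            acc += c * samples[start + last - j]  # convolution: taps reversed
        out.append(acc)
    return out

flat = decimate_fir([1.0] * 20, [0.25, 0.25, 0.25, 0.25])
ramp = decimate_fir([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], [0.5, 0.5])
```

For 20 input samples and 4 taps this does 5 dot products instead of 17, which is where the saving over filtering first and discarding afterwards comes from.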
Just googled an intro-level article on the subject: https://www.dspguru.com/dsp/faqs/multirate/decimation
