Amplitude of audio signal harmonics in Unity3D - audio

I have managed to calculate the pitch of the audio input from a microphone using the GetSpectrumData function. But now I need to get the amplitudes of the first 7 harmonics of the audio (a project requirement).
I know very little about audio DSP. The only thing I understand is that harmonics are multiples of the fundamental frequency. But how do I get the amplitudes of the harmonics?
Thanks

First you need to figure out which FFT bin your fundamental frequency is in. Say it resides in bin 10. The harmonics will reside in integer multiples of that bin, so the 2nd harmonic will be in bin 20, the 3rd in bin 30 and so on. For each of these harmonic bins you need to compute the amplitude. Depending on the window function you used in the FFT, you will need to include a small number of neighbouring bins in the calculation (google spectral leakage if you're interested).
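double computeAmpl(double[] spectrum, int windowHalfLen, int peakBin, int harmonic)
{
    // Center the window on the harmonic's bin (harmonic 1 is the fundamental).
    int center = peakBin * harmonic;
    // Clamp the window so it stays inside the spectrum array.
    int first = Math.Max(center - windowHalfLen, 0);
    int last = Math.Min(center + windowHalfLen, spectrum.Length - 1);
    double sumOfSquares = 0.0;
    for (int bin = first; bin <= last; bin++)
    {
        sumOfSquares += spectrum[bin] * spectrum[bin];
    }
    return Math.Sqrt(sumOfSquares);
}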
As I mentioned, the window half length depends on the window function. Some common ones (window: half length in bins) are:
Blackman-Harris, 3-term: 3
Blackman-Harris, 4-term: 4
flat top: 5
Hann: 3
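For illustration, here is a minimal Python sketch of the same idea applied to the first seven harmonics in one go. It assumes `spectrum` holds linear magnitudes, that the fundamental's bin is already known, and that the harmonics sit on exact integer multiples of that bin. `harmonic_amplitudes` and its parameter names are made up for this example.

```python
import math

def harmonic_amplitudes(spectrum, fundamental_bin, half_len, count=7):
    """Return the amplitudes of harmonics 1..count from a magnitude spectrum."""
    amplitudes = []
    for harmonic in range(1, count + 1):
        center = fundamental_bin * harmonic
        # Clamp the window so it never reads outside the spectrum.
        lo = max(center - half_len, 0)
        hi = min(center + half_len, len(spectrum) - 1)
        if lo > hi:
            amplitudes.append(0.0)  # harmonic lies above the highest bin
            continue
        energy = sum(spectrum[b] ** 2 for b in range(lo, hi + 1))
        amplitudes.append(math.sqrt(energy))
    return amplitudes

# Synthetic spectrum: the fundamental leaks into bins 10 and 11,
# and the 2nd harmonic sits entirely in bin 20.
spectrum = [0.0] * 128
spectrum[10] = 3.0
spectrum[11] = 4.0
spectrum[20] = 2.0
amps = harmonic_amplitudes(spectrum, fundamental_bin=10, half_len=3)
```

In practice the true peak of a harmonic may be a bin or two away from the exact multiple, so searching for the local maximum near `center` before summing is a reasonable refinement.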

Related

How can I compute (for later uses) a wave wtih a very high frequency?

I'm running a physics simulation related to visible light, and the resulting wave function has a very, very high frequency -- cyclic frequency is on the order of 1.0e15, and the spatial frequency k is on the order of 1.0e7. Thankfully, I only use the spatial frequency, but when I calculate it for later usage (using either math or numpy), I get something that resembles a beat wave, unless I use N ~= k sample points, because I have to calculate it over a much greater range (on the order of 1.0e-3 - 1.0e-1). It produces a beat wave so consistently I spent a few hours to make sure I'm not actually calculating one. I'll also have to use fft() on the resulting wave and I'm afraid it won't work properly with a misrepresented wave.
I've tried using various amounts of sample points, but unless it's extraordinarily high (takes a good minute or two to calculate), only the prominence of beating changes. Just in case I'm misusing numpy, I tried the same thing with appending wave.value calculated by math.sin to a float array, but it had the same result.
import numpy as np
import matplotlib.pyplot as plt

mmScale = 1.0e-3
nmScale = 1.0e-9
c = 3.0e8
N = 1000

class Wave:
    def __init__(self, amplitude, wavelength):
        self.wavelength = wavelength*nmScale
        self.amplitude = amplitude
        self.omega = 2*np.pi*c/self.wavelength
        self.k = 2*np.pi/self.wavelength

    def value(self, time, travel):
        return self.amplitude*np.sin(self.omega*time - self.k*travel)

x = np.linspace(50, 250, N)*mmScale
wave = Wave(1, 400)
y = wave.value(0.1, x)
plt.plot(x,y)
plt.show()
The code above produces a graph of the function, and you can put in different values for N to see how it gives different waveforms.
Your sampling spatial frequency is:
1/Ts = 1 / (((250-50)*mmScale) / N) = 5000 [samples/meter]
Your wave's spatial frequency is:
1/Tw = 1 / wavelength = 1 / (400e-9) = 2500000 [wavelengths/meter]
You fail to satisfy the Nyquist criterion by a factor of (2*2500000) / 5000 = 1000.
Thus you must expect serious aliasing effects. See https://en.wikipedia.org/wiki/Aliasing.
Not much can be done to battle it, but there are some tricks that may help, depending on the application. One is to represent the wave as a complex envelope around the carrier frequency, which here is 1 / (400e-9) = 2500000 cycles per meter. Please provide more detail on what you do with the wave.
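The arithmetic above can be checked with a short Python sketch. It uses the numbers from the question (a 50 mm to 250 mm range, N = 1000, a 400 nm wavelength) and, like the answer, treats the sample spacing as range / N. `sampling_report` is a hypothetical helper name.

```python
import math

def sampling_report(start_m, stop_m, n_samples, wavelength_m):
    """Compare the sampling rate with the wave's spatial frequency."""
    span = stop_m - start_m
    sample_rate = n_samples / span            # samples per meter
    wave_freq = 1.0 / wavelength_m            # wavelengths per meter
    # How many times too slow the sampling is for the Nyquist criterion.
    shortfall = 2.0 * wave_freq / sample_rate
    # Roughly how many samples the same range needs to reach the Nyquist rate.
    min_samples = math.ceil(2.0 * wave_freq * span)
    return sample_rate, wave_freq, shortfall, min_samples

sample_rate, wave_freq, shortfall, min_samples = sampling_report(
    50e-3, 250e-3, 1000, 400e-9)
print(f"sampling rate : {sample_rate:.0f} samples/m")
print(f"wave frequency: {wave_freq:.0f} wavelengths/m")
print(f"shortfall     : {shortfall:.0f}x")
print(f"samples needed: about {min_samples}")
```

With these inputs the range needs on the order of a million samples, which matches the observation in the question that the beating only goes away when N is extraordinarily high.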

What is the correct audio volume slider formula?

I'm building a VoIP application. If I take the slider value and just multiply audio samples by it, I get incorrect, nonlinear sounding results. What's the correct formula to get smooth results?
The correct formula is the decibel formula solved for Prms. Here's example code in C++:
#include <cmath>
#include <cstddef>
#include <cstdint>

// level is 0 to 1, silence is dBFS at level 0
void AdjustVolume(int16_t* buffer, size_t length, float level, float silence = -96)
{
    // Convert the slider position to a linear gain: 0 dB at level 1, `silence` dB at level 0.
    float factor = std::pow(10.0f, (1 - level) * silence / 20.0f);
    for (size_t i = 0; i < length; i++)
        buffer[i] = static_cast<int16_t>(buffer[i] * factor);
}
There's one tweakable: silence. It's the noise floor, i.e. the loudness level below which you can't hear the sound because of the background noise. The theoretical floor usually quoted for 16-bit audio samples is about -96 dB (the dynamic range of 16 bits, 20*log10(2^16) ≈ 96.3 dB; a single sample with an integer value of 1 out of 32767 sits at about -90 dB). In the real world, however, there's background noise produced by the audio equipment and the surroundings of the listener, so you might want to pick a noisier silence level, like -30 dB or something. Picking the correct silence value will maximize the useful surface area of your volume slider, or minimize the amount of slider area where no perceptible change in volume occurs.
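As a rough Python sketch of the same mapping (not the original author's code): `slider_to_gain` and `adjust_volume` are hypothetical names, the slider is assumed to run from 0 to 1, and samples are rounded to the nearest integer, whereas the C++ version truncates.

```python
def slider_to_gain(level, silence_db=-96.0):
    """Map a slider position in [0, 1] to a linear amplitude factor."""
    level = min(max(level, 0.0), 1.0)      # keep the slider in range
    # 0 dB at the top of the slider, silence_db at the bottom.
    return 10.0 ** ((1.0 - level) * silence_db / 20.0)

def adjust_volume(samples, level, silence_db=-96.0):
    """Scale integer PCM samples by the gain for this slider position."""
    gain = slider_to_gain(level, silence_db)
    return [int(round(s * gain)) for s in samples]

# Halfway up a slider with a -40 dB floor is -20 dB, i.e. a factor of 0.1.
quiet = adjust_volume([1000, -1000, 32767], 0.5, silence_db=-40.0)
```

Note that level 0 gives a gain of `silence_db` rather than exactly zero, so an application that needs a true mute would have to special-case the bottom of the slider.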

How does Audacity mix audio samples?

So let's say I want to mix these 2 audio tracks:
In Audacity, I can use the "Mix and Render" option to mix them together, and I'll get this:
However, when I try to write my own code to mix, I get this:
This is essentially how I mix the samples:
private function mixSamples(sample1:UInt, sample2:UInt):UInt
{
    return (sample1 + sample2) & 0xFF;
}
(The syntax is Haxe but it should be easy to follow if you don't know it.)
These are 8-bit sample audio files, and I want the product to be 8-bit as well, hence the & 0xFF.
I do understand that by simply adding the samples, I should expect clipping. My issue is that mixing in Audacity doesn't cause clipping (at least not to the extent that my code does), and by looking at the "tail" of the second (longer) track, it doesn't seem to reduce the amplitude. It doesn't sound any softer either.
So basically, my question is this: what's Audacity doing that I'm not? I want to mix tracks to sound exactly as if they're being played on top of one another, but I (obviously) don't want this horrendous clipping.
EDIT:
Here is what I get if I sign the values before I add, then unsign the sum value, as suggested by Radiodef:
As you can see it's much better than before, but is still quite distorted and noisy compared to the result Audacity produces. So my problem still stands, Audacity must be doing something differently.
EDIT2:
I mixed the first track on itself, both with my code and Audacity, and compared the points where distortion occurs. This is Audacity's result:
And this is my result:
I think what is happening is you are summing them as unsigned. A typical sound wave is both positive and negative which is why they add together the way they do (some parts cancel). If you have some 8-bit sample that is -96 and another that is 96 and you sum them you will get 0. If what you have is unsigned audio you will instead have the samples 32 and 224 summed = 256 (offset and overflow).
What you need to do is sign them before summing. To sign 8-bit samples convert them to a signed int type and subtract 128 from all of them. I assume what you have are WAV files and you will need to unsign them again after the sum.
Audacity probably does floating point processing. I've heard some real dubious claims about floating point like that it has "infinite dynamic range" and garbage like that but it doesn't clip in the same determinate and obvious way as integers do. Floating point has a finite range of values same as integers but the largest and smallest values are much farther apart. (That's about the simplest way to put it.) Floating point can allow much greater amplitude changes in the audio but the catch is the overall signal to noise ratio is lower than integers.
With the weird distortion my best guess is it is from the mask you are doing with & 0xFF. If you want to actually clip instead of getting overflow you will need to do so yourself.
for (int i = 0; i < samplesLength; i++) {
    if (samples[i] > 127) {
        samples[i] = 127;
    } else if (samples[i] < -128) {
        samples[i] = -128;
    }
}
Otherwise say you have two samples that are 125, summing gets you 250 (11111010). Then you unsign (add 128) and get 378 (101111010). An & will get you 1111010 which is 122. Other numbers might get you results that are effectively negative or close to 0.
If you want to clip at something other than 8-bit, full scale for a bit depth n will be positive (2 ^ (n - 1)) - 1 and negative 2 ^ (n - 1) so for example 32767 and -32768 for 16-bit.
Another thing you can do instead of clipping is to search for clipping and normalize. Something like:
double[] normalize(double[] samples, int length, int destBits) {
    double fsNeg = -Math.pow(2, destBits - 1);
    double fsPos = -fsNeg - 1;
    double peak = 0;
    for (int i = 0; i < length; i++) {
        // find highest clip if there is one
        if (samples[i] < fsNeg || samples[i] > fsPos) {
            double magnitude = Math.abs(samples[i]);
            if (magnitude > peak) {
                peak = magnitude;
            }
        }
    }
    if (peak != 0) {
        // ratio to reduce to where there is not a clip
        double norm = fsPos / peak;
        for (int i = 0; i < length; i++) {
            samples[i] *= norm;
        }
    }
    return samples;
}
It's a lot simpler than you think; although your original files are 8-bit, Audacity handles them internally as 32-bit floating point. You can see this in the screenshot, in the information panel to the left of each track. This means that adding 2 tracks together means adding two floating point samples at each point, and will simply yield sample values from -2.0 to +2.0, which are then clamped to the -1 to +1 range. By comparison, adding two 8-bit integers together will yield another 8-bit number where the value overflows and wraps around. (This can apply whether you use signed or unsigned values.)
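Putting the answers above together, a minimal Python sketch of mixing two unsigned 8-bit tracks could look like this. It signs the samples, sums them in a wider integer, clamps instead of masking, and converts back to unsigned. `mix_unsigned_8bit` is a name invented for this example, and this is not a claim about how Audacity is implemented internally.

```python
def mix_unsigned_8bit(track_a, track_b):
    """Mix two sequences of unsigned 8-bit samples, clipping instead of wrapping."""
    length = max(len(track_a), len(track_b))
    mixed = []
    for i in range(length):
        # Convert to signed (-128..127); a track that has ended counts as silence.
        s1 = track_a[i] - 128 if i < len(track_a) else 0
        s2 = track_b[i] - 128 if i < len(track_b) else 0
        total = s1 + s2                      # Python ints cannot overflow here
        total = max(-128, min(127, total))   # clamp to the signed 8-bit range
        mixed.append(total + 128)            # back to unsigned (0..255)
    return mixed

cancelled = mix_unsigned_8bit([32], [224])    # -96 + 96 cancels to silence
clipped = mix_unsigned_8bit([253], [253])     # 125 + 125 clips at full scale
tail = mix_unsigned_8bit([128, 200], [128])   # the longer track keeps its tail
```

Clamping still distorts loud passages, just far less audibly than wrap-around does; lowering the gain of both tracks before summing, or normalizing afterwards, avoids it entirely.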

What is the unit of the return values (coefficients) of an FFT?

My application performs an FFT on the raw audio signal (all microphone readings are 16bit integer values in values, which is 1024 cells). It first normalizes the readings according to the 16bit. Then it extracts the magnitude of the frequency 400Hz.
int sample_rate = 22050;
int values[1024];
// omitted: code to read 16bit audio samples into values array
double doublevalues[1024];
for (int i = 0; i < 1024; i++) {
    doublevalues[i] = (double)values[i] / 32768.0; // 16bit
}
fft(doublevalues); // inplace FFT, returns only real coefficients
double magnitude = 400.0 / sample_rate * 2048;
printf("magnitude of 400Hz: %f", magnitude);
When I try this out and generate a 400Hz signal to see the value of magnitude, it is around 0 when there is no 400Hz signal and goes up to 30 or 40 when there is.
What is the unit or meaning of the magnitude field? It surprises me that it is larger than 1 even though I normalize the raw signal to be between -1..+1.
It depends on which FFT you are using, as there are different conventions on scaling. The most common convention is that the output values are scaled by N, where N is the size of the FFT. So a 1024 point FFT will have output values which are 1024 times greater than the corresponding input values. A further complication is that for real-to-complex FFTs people typically ignore the symmetric upper half of the FFT, which is fine (because it's conjugate symmetric) but you need to account for a factor of 2 if you do this.
Other common conventions for FFT scaling are (a) no scaling (i.e. the factor of N has been removed) and (b) sqrt(N), which is sometimes used for symmetric scaling behaviour of FFT versus IFFT (sqrt(N) in each direction).
Since sqrt(1024) == 32, it's possible that you're using an FFT routine with sqrt(N) scaling, since you seem to be seeing values of around 30 for a unit magnitude sine wave input.
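To see the scaling conventions side by side, here is a small Python sketch that uses a naive DFT instead of any particular FFT library, so the scaling is explicit. N = 64 and the test bin are arbitrary choices for the example, and `dft` is a hypothetical helper.

```python
import cmath
import math

def dft(x):
    """Unscaled forward DFT (output grows with the transform size N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

N = 64
k0 = 4  # the sine completes exactly 4 cycles in the block
signal = [math.sin(2 * math.pi * k0 * t / N) for t in range(N)]
spectrum = dft(signal)

raw_peak = abs(spectrum[k0])         # about N/2: the energy is split between bins k0 and N-k0
amplitude = 2.0 * raw_peak / N       # undo the factor of N and the one-sided factor of 2
ortho_peak = raw_peak / math.sqrt(N) # what a sqrt(N)-scaled transform would report
```

So for a unit amplitude sine that falls exactly on a bin, an unscaled transform reports N/2 in that bin, and dividing by N/2 recovers the amplitude in the same units as the input samples.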

How to draw a frequency spectrum from a Fourier transform

I want to plot the frequency spectrum of a music file (like they do, for example, in Audacity). Hence I want the frequency in Hertz on the x-axis and the amplitude (or decibels) on the y-axis.
I divide the song (about 20 million samples) into blocks of 4096 samples at a time. These blocks will result in 2049 (N/2 + 1) complex numbers (sine and cosine -> real and imaginary part). So now I have these thousands of individual 2049-arrays; how do I combine them?
Let's say I do the FFT 5000 times, resulting in 5000 2049-arrays of complex numbers. Do I add up all the values of the 5000 arrays and then take the magnitude of the combined 2049-array? Do I then scale the x-axis with the song's sample rate / 2 (e.g. 22050 for a 44100 Hz file)?
Any information will be appreciated.
What application are you using for this? I assume you are not doing this by hand, so here is a Matlab example:
>> fbins = fs/N * (0:(N/2 - 1)); % Where N is the number of fft samples
now you can perform
>> plot(fbins, abs(fftOfSignal(1:N/2)))
Stolen
edit: check this out http://www.codeproject.com/Articles/9388/How-to-implement-the-FFT-algorithm
Wow I've written a load about this just recently.
I even turned it into a blog post available here.
My explanation is leaning towards spectrograms but its just as easy to render a chart like you describe!
I might not be correct on this one, but as far as I'm aware, you have 2 ways to get the spectrum of the whole song.
1) Do a single FFT on the whole song, which will give you an extremely good frequency resolution, but is in practice not efficient, and you don't need this kind of resolution anyway.
2) Divide it into small chunks (like 4096-sample blocks, as you said), get the FFT for each of those and average the spectra. You will compromise on the frequency resolution, but make the calculation more manageable (and also decrease the variance of the spectrum). Wilhelmsen's link describes how to compute an FFT in C++, and I think some libraries already exist to do that, like FFTW (but I never managed to compile it, to be fair =) ).
To obtain the magnitude spectrum, average the energy (square of the magnitude) across all your chunks for every single bin. To get the result in dB, just take 10 * log10 of the results. That is of course assuming that you are not interested in the phase spectrum. I think this is known as Bartlett's method.
I would do something like this:
// At this point you have the FFT chunks
float sum[N/2+1];
// For each bin
for (int binIndex = 0; binIndex < N/2 + 1; binIndex++)
{
    sum[binIndex] = 0.0f; // start each accumulator from zero
    for (int chunkIndex = 0; chunkIndex < chunkNb; chunkIndex++)
    {
        // Get the energy (squared magnitude) of the complex number
        float re = FFTChunk[chunkIndex].bins[binIndex].real;
        float im = FFTChunk[chunkIndex].bins[binIndex].im;
        // Add the energy
        sum[binIndex] += re * re + im * im;
    }
    // Average the energy
    sum[binIndex] /= chunkNb;
}
// Then get the values in decibel
for (int binIndex = 0; binIndex < N/2 + 1; binIndex++)
{
    sum[binIndex] = 10 * log10f(sum[binIndex]);
}
Hope this answers your question.
Edit: Goz's post will give you plenty of information on the matter =)
Commonly, you would take just one of the arrays, corresponding to the point in time of the music in which you are interested. Then you would calculate the log of the magnitude of each complex array element. Plot the N/2 results as Y values, and scale the X axis from 0 to Fs/2 (where Fs is the sampling rate).
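For the chunk-averaging approach described earlier in this thread, a self-contained Python sketch might look as follows. It uses a naive DFT on tiny blocks purely to stay dependency-free (a real FFT library would be used for 4096-sample blocks), and `half_dft`, `averaged_spectrum_db` and `bin_frequencies` are names made up for the example.

```python
import cmath
import math

def half_dft(block):
    """Unscaled DFT of a real block, keeping bins 0..N/2 only."""
    n = len(block)
    return [sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def averaged_spectrum_db(samples, block_size, floor=1e-12):
    """Average the per-bin energy over all whole blocks and return it in dB."""
    n_blocks = len(samples) // block_size
    if n_blocks == 0:
        raise ValueError("need at least one full block")
    power = [0.0] * (block_size // 2 + 1)
    for b in range(n_blocks):
        block = samples[b * block_size:(b + 1) * block_size]
        for k, coeff in enumerate(half_dft(block)):
            power[k] += abs(coeff) ** 2
    # The floor keeps log10 away from zero for empty bins.
    return [10.0 * math.log10(max(p / n_blocks, floor)) for p in power]

def bin_frequencies(sample_rate, block_size):
    """Frequency in Hz of each bin, from 0 up to sample_rate / 2."""
    return [k * sample_rate / block_size for k in range(block_size // 2 + 1)]

fs = 8000
block_size = 32
# A 1000 Hz sine: exactly 4 cycles per 32-sample block at 8000 Hz.
samples = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(4 * block_size)]
spectrum_db = averaged_spectrum_db(samples, block_size)
freqs = bin_frequencies(fs, block_size)
peak_bin = max(range(len(spectrum_db)), key=lambda k: spectrum_db[k])
```

Plotting `spectrum_db` against `freqs` gives frequency in Hz on the x-axis and level in dB on the y-axis; applying a window function to each block before the transform would reduce leakage for real music.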
