Bandlimited waveform generation [closed]


Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am writing a software synthesizer and need to generate bandlimited, alias-free waveforms in real time at a 44.1 kHz sample rate. A sawtooth waveform would do for now, since I can generate a pulse wave by mixing two sawtooths together, one inverted and phase shifted.
So far I've tried the following approaches:
Precomputing one-cycle perfectly bandlimited waveform samples at different bandlimit frequencies at startup, then playing back the two closest ones mixed together. Works okay I guess, but does not feel very elegant. A lot of samples are needed or the "gaps" between them will be heard. Interpolating and mixing is also quite CPU intensive.
Integrating a train of DC compensated sinc pulses to get a sawtooth wave. Sounds great except that the wave drifts away from zero if you don't get the DC compensation exactly right (which I found to be really tricky). The DC problem can be reduced by adding a bit of leakage to the integrator, but then you lose the low frequencies.
So, my question is: What is the usual way this is done? Any suggested solution must be efficient in terms of CPU, since it must be done in real time, for many voices at once.

One fast way to generate band-limited waveforms is by using band-limited steps (BLEPs). You generate the band-limited step itself:
and store that in a wavetable, then replace each transition with a band-limited step, to create waveforms that look like this:
See the walk-through at Band-Limited Sound Synthesis.
Since this BLEP is non-causal (meaning it extends into the future), for generating real-time waveforms it's better to use the minimum-phase band-limited step, called a MinBLEP, which has the same magnitude spectrum but is causal: it starts at the discontinuity, so each output sample depends only on transitions that have already occurred:
MinBLEPs take the idea further and take a windowed sinc, perform a minimum phase reconstruction and then integrate the result and store it in a table. Now to make an oscillator you just insert a MinBLEP at each discontinuity in the waveform. So for a square wave you insert a MinBLEP where the waveform inverts, for saw wave you insert a MinBLEP where the value inverts, but you generate the ramp as normal.
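As a rough illustration of correcting only the samples around each discontinuity, here is a minimal sketch in C. Note that it swaps in a two-sample polynomial approximation of the band-limited step (often called PolyBLEP) instead of the table-based BLEP or MinBLEP described above. The names `poly_blep` and `polyblep_saw` are made up for this example, and `dt` is assumed to be the note frequency divided by the sample rate.

```c
/* Polynomial residual of a band-limited step, nonzero only within one
   sample of the phase wrap. t is the phase in [0, 1), dt the phase
   increment per sample (frequency / sample rate), assumed 0 < dt < 0.5. */
double poly_blep(double t, double dt)
{
    if (t < dt) {                 /* just after the discontinuity */
        t /= dt;
        return t + t - t * t - 1.0;
    }
    if (t > 1.0 - dt) {           /* just before the discontinuity */
        t = (t - 1.0) / dt;
        return t * t + t + t + 1.0;
    }
    return 0.0;
}

/* Sawtooth sample in [-1, 1] at phase t: the naive ramp with the
   jump at the wrap smoothed by the residual above. */
double polyblep_saw(double t, double dt)
{
    return (2.0 * t - 1.0) - poly_blep(t, dt);
}
```

A caller would advance `t` by `dt` once per sample and wrap it back into [0, 1). Away from the wrap the output is the plain ramp, so the extra cost is limited to two samples per cycle.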

There are a lot of ways to approach the bandlimited waveform generation. You will end up trading computational cost against quality as usual.
I suggest that you take a look at this site here:
http://www.musicdsp.org/
Check out the archive! It's full of good material. I just did a search on the keyword "bandlimited". The material that pops up should keep you busy for at least a week.
Btw, I don't know if that's what you're looking for, but I did alias-reduced (i.e. not really band-limited) waveform generation a couple of years ago. I just calculated the integral between the last and current sample position. For traditional synth waveforms you can do that rather easily if you split your integration interval at the singularities (e.g. where the sawtooth gets reset). The CPU load was low and the quality acceptable for my needs.
I had the same drift problems, but applying a high-pass filter with a very low cutoff frequency to the integral got rid of that effect. Real analog synths don't go down into the sub-hertz region anyway, so you won't miss much.

This is what I came up with, inspired by Nils' ideas. Pasting it here in case it is useful for someone else. I simply box filter a sawtooth wave analytically, using the change in phase from the last sample as the kernel size (or cutoff). It works fairly well: there is some audible aliasing at the very highest notes, but for normal usage it sounds great.
To reduce aliasing even more, the kernel size can be increased a bit. Making it 2*phaseChange, for example, sounds good as well, though you lose a bit of the highest frequencies.
Also, here is another good DSP resource I found when browsing SO for similar topics: The Synthesis ToolKit in C++ (STK). It's a class library that has lots of useful DSP tools. It even has ready-to-use bandlimited waveform generators. The method they use is to integrate sinc as I described in my first post (though I guess they do it better than me...).
#include <math.h>
// Declared up front because getSaw and getPulse call it before its definition
float getBoxFilteredSaw(float phase, float kernelSize);
float getSaw(float phaseChange)
{
    static float phase = 0.0f;
    phase = fmodf(phase + phaseChange, 1.0f);
    return getBoxFilteredSaw(phase, phaseChange);
}
float getPulse(float phaseChange, float pulseWidth)
{
    static float phase = 0.0f;
    phase = fmodf(phase + phaseChange, 1.0f);
    return getBoxFilteredSaw(phase, phaseChange)
         - getBoxFilteredSaw(fmodf(phase + pulseWidth, 1.0f), phaseChange);
}
float getBoxFilteredSaw(float phase, float kernelSize)
{
    float a, b;
    // Check if kernel is longer than one cycle
    if (kernelSize >= 1.0f) {
        return 0.0f;
    }
    // Remap phase and kernelSize from [0.0, 1.0] to [-1.0, 1.0]
    kernelSize *= 2.0f;
    phase = phase * 2.0f - 1.0f;
    // A zero-width kernel would divide by zero; return the unfiltered saw
    if (kernelSize <= 0.0f) {
        return phase;
    }
    if (phase + kernelSize > 1.0f) {
        // Kernel wraps around edge of [-1.0, 1.0]
        a = phase;
        b = phase + kernelSize - 2.0f;
    } else {
        // Kernel fits nicely in [-1.0, 1.0]
        a = phase;
        b = phase + kernelSize;
    }
    // Integrate and divide by kernelSize
    return (b * b - a * a) / (2.0f * kernelSize);
}

The DC offset from a BLIT can be reduced with a simple high-pass filter, much like a real analogue circuit that uses a DC-blocking capacitor!
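A minimal sketch of such a high-pass filter in C, assuming the common one-pole DC blocker form y[n] = x[n] - x[n-1] + r * y[n-1]. The type and function names are invented for this example, and the coefficient 0.995 used below is only an illustrative starting point, not a recommended setting.

```c
/* One-pole DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1].
   r just below 1 gives a very low cutoff frequency. */
typedef struct {
    double prev_in;
    double prev_out;
    double r;
} DcBlocker;

void dc_blocker_init(DcBlocker *f, double r)
{
    f->prev_in = 0.0;
    f->prev_out = 0.0;
    f->r = r;
}

/* Process one sample and return the filtered value. */
double dc_blocker_process(DcBlocker *f, double x)
{
    double y = x - f->prev_in + f->r * f->prev_out;
    f->prev_in = x;
    f->prev_out = y;
    return y;
}
```

Feeding a constant input through the filter makes the output decay towards zero, which is the behaviour wanted for removing drift. Moving `r` closer to 1 preserves more of the low end at the cost of a slower recovery from offsets.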


Possible to find velocity of person in video or camera using openpose

The question is: I want to calculate the speed of my arm for slap detection. I am using OpenPose with the body_25 model to get the body points (25 points in total), and I want to use these together with the time to deduce the speed of my arm. I searched through the OpenPose docs, Stack Overflow, and GitHub, but without success.
Velocity = Distance / Time = dx/dt
dx = frame3_bodypoints - frame_1_bodypoints;
dt = ?
I don't know how to find this from the openpose, is there a way I can find this? Any thoughts, would be great help!

I've never used OpenPose. But Newtonian physics would indicate that a slap corresponds to a sudden change in velocity of the hand.
I think it's a reasonable first approximation to assume that the Δt between frames is constant. Instantaneous variation in frame rate is called jitter. I would expect jitter to be small for modern recording devices. In any case, I don't know how to get instantaneous frame rate with the tools (OpenCV, PIL) that I am familiar with. I couldn't find any references to frame rate or time in the OpenPose docs.
For calculating velocity and delta-velocity, you have choices. Straight-up linear velocity of the hand might be the easiest. For position changes, use the Euclidean distance between positions (Δs = sqrt((x2-x1)^2 + (y2-y1)^2)).
You could also calculate an angular velocity between the hand and the elbow, but that would be a little more involved and prone to noise.

Efficient generation of sampled waveforms without aliasing artifacts

For a project of mine I am working with sampled sound generation and I need to create various waveforms at various frequencies. When the waveform is sinusoidal, everything is fine, but when the waveform is rectangular, there is trouble: it sounds as if it came from the eighties, and as the frequency increases, the notes sound wrong. On the 8th octave, each note sounds like a random note from some lower octave.
The undesirable effect is the same regardless of whether I use either one of the following two approaches:
The purely mathematical way of generating a rectangular waveform as sample = sign( secondsPerHalfWave - (timeSeconds % secondsPerWave) ) where secondsPerWave = 1.0 / wavesPerSecond and secondsPerHalfWave = secondsPerWave / 2.0
My preferred way, which is to describe one period of the wave using line segments and to interpolate along these lines. So, a rectangular waveform is described (regardless of sampling rate and regardless of frequency) by a horizontal line from x=0 to x=0.5 at y=1.0, followed by another horizontal line from x=0.5 to x=1.0 at y=-1.0.
From what I gather, the literature considers these waveform generation approaches "naive", resulting in "aliasing", which is the cause of all the undesirable effects.
What this all practically translates to when I look at the generated waveform is that the samples-per-second value is not an exact multiple of the waves-per-second value, so each wave does not have an even number of samples, which in turn means that the number of samples at level 1.0 is often not equal to the number of samples at level -1.0.
I found a certain solution here: https://www.nayuki.io/page/band-limited-square-waves which even includes source code in Java, and it does indeed sound awesome: all undesirable effects are gone, and each note sounds pure and at the right frequency. However, this solution is entirely unsuitable for me, because it is extremely computationally expensive. (Even after I have replaced sin() and cos() with approximations that are ten times faster than Java's built-in functions.) Besides, when I look at the resulting waveforms they look awfully complex, so I wonder whether they can legitimately be called rectangular.
So, my question is:
What is the most computationally efficient method for the generation of periodic waveforms such as the rectangular waveform that does not suffer from aliasing artifacts?
Examples of what the solution could entail:
The computer audio problem of generating correct sample values at discrete time intervals to describe a sound wave seems to me somewhat related to the computer graphics problem of generating correct integer y coordinates at discrete integer x coordinates for drawing lines. The Bresenham line generation algorithm is extremely efficient, (even if we disregard for a moment the fact that it is working with integer math,) and it works by accumulating a certain error term which, at the right time, results in a bump in the Y coordinate. Could some similar mechanism perhaps be used for calculating sample values?
The way sampling works is understood to be as reading the value of the analog signal at a specific, infinitely narrow point in time. Perhaps a better approach would be to consider reading the area of the entire slice of the analog signal between the last sample and the current sample. This way, sampling a 1.0 right before the edge of the rectangular waveform would contribute a little to the sample value, while sampling a -1.0 considerable time after the edge would contribute a lot, thus naturally yielding a point which is between the two extreme values. Would this solve the problem? Does such an algorithm exist? Has anyone ever tried it?
Please note that I have posted this question here as opposed to dsp.stackexchange.com because I do not want to receive answers with preposterous jargon like band-limiting, harmonics and low-pass filters, lagrange interpolations, DC compensations, etc. and I do not want answers that come from the purely analog world or the purely theoretical outer space and have no chance of ever receiving a practical and efficient implementation using a digital computer.
I am a programmer, not a sound engineer, and in my little programmer's world, things are simple: I have an array of samples which must all be between -1.0 and 1.0, and will be played at a certain rate (44100 samples per second.) I have arithmetic operations and trigonometric functions at my disposal, I can describe lines and use simple linear interpolation, and I need to generate the samples extremely efficiently because the generation of a dozen waveforms simultaneously and also the mixing of them together may not consume more than 1% of the total CPU time.

I'm not sure, but you may have a few misconceptions about the nature of aliasing. I base this on your putting the term in quotes, and on the following quote:
What this all practically translates to when I look at the generated waveform is that the samples-per-second value is not an exact multiple of the waves-per-second value, so each wave does not have an even number of samples, which in turn means that the number of samples at level 1.0 is often not equal to the number of samples at level -1.0.
The samples/sec and waves/sec don't have to be exact multiples at all! One can play back all pitches below the Nyquist. So I'm not clear what your thinking on this is.
The characteristic sound of a square wave arises from the presence of odd harmonics, e.g., with a note of 440 Hz (A4), the square wave sound could be generated by combining sines of 440, 1320, 2200, 3080, 3960, etc., progressing in increments of 880. This raises the question: how many odd harmonics? We could go to infinity, theoretically, for the sharpest possible corner on our square wave. If you simply "draw" this in the audio stream, the progression will continue well beyond the Nyquist frequency.
But there is a problem in that harmonics that are higher than the Nyquist value cannot be accurately reproduced digitally. Attempts to do so result in aliasing. So, to get as good a sounding square wave as the system is able to produce, one has to avoid the higher harmonics that are present in the theoretically perfect square wave.
I think the most common solution is to use a low-pass filtering algorithm. The computations are definitely more cpu-intensive than just calculating sine waves (or doing FM synthesis, which was my main interest). I am also weak on the math for DSP and concerned about cpu expense, and so, avoided this approach for long time. But it is quite viable and worth an additional look, imho.
Another approach is to use additive synthesis, and include as many sine harmonics as you need to get the tonal quality you want. The problem then is that the more harmonics you add, the more computation you are doing. Also, you must keep track of the top harmonic, as it limits the highest note you can play. For example, if using the fundamental plus 10 odd harmonics above it, a 500 Hz note would include content up to 10500 Hz (the 21st harmonic). That's below the Nyquist frequency for 44100 samples per second (which is 22050 Hz). But you'll only be able to go up about another octave (which doubles everything) and a little more before the harmonic content goes over the limit and starts aliasing.
Instead of computing multiple sines on the fly, another solution you might consider is to instead create a set of lookup tables (LUTs) for your square wave. To create the values in the table, iterate through and add the values from the sine harmonics that will safely remain under the Nyquist for the range in which you use the given table. I think a table of something like 1024 values to encode a single period could be a good first guess as to what would work.
For example, I am guesstimating, but the table for the octave C4-C5 might use 10 harmonics, the table for C5-C6 only 5, and the table for C3-C4 might have 20. I can't recall what this strategy is called, but I do recall that it has a name and is an accepted way of dealing with the situation. Depending on how the transitions sound and the amount of high-end content you want, you can use fewer or more LUTs.
There may be other methods to consider. The wikipedia entry on Aliasing describes a technique it refers to as "bandpass" that seems to be intentionally using aliasing. I don't know what that is about or how it relates to the article you cite.

The Soundpipe library has the concept of a function table (ftbl), which is a data structure that holds a precomputed waveform such as a sine. You can initialize the function table with the desired waveform and play it through an oscillator. There is even a module named oscmorph which allows you to morph between two or more wavetables.
This is an example of how to generate a sine wave, taken from Soundpipe's documentation.
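/* UserData (holding the osc and ft pointers) and the write_osc callback
   are defined elsewhere in the documentation example. */
int main() {
    UserData ud;
    sp_data *sp;
    sp_create(&sp);
    sp_ftbl_create(sp, &ud.ft, 2048);   /* table of 2048 samples */
    sp_osc_create(&ud.osc);
    sp_gen_sine(sp, ud.ft);             /* fill the table with one sine period */
    sp_osc_init(sp, ud.osc, ud.ft);
    ud.osc->freq = 500;
    sp->len = 44100 * 5;                /* five seconds at 44.1 kHz */
    sp_process(sp, &ud, write_osc);
    sp_ftbl_destroy(&ud.ft);
    sp_osc_destroy(&ud.osc);
    sp_destroy(&sp);
    return 0;
}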

Theory behind Autotune/vocoder

I've been hunting all over the web for material about vocoder or autotune, but haven't got any satisfactory answers. Could someone in a simple way please explain how do you autotune a given sound file using a carrier sound file?
(I'm familiar with FFTs, windowing, overlap etc.; I just don't get what we do once we have the FFTs of the carrier and of the original sound file that has to be modulated.)
EDIT: After looking around a bit more, I finally got to know exactly what I was looking for -- a channel vocoder. The way it works is, it takes two inputs, one a voice signal and the other a musical signal rich in frequency. The musical signal is modulated by the envelope of the voice signal, and the output signal sounds like the voice singing in the musical tone.
Thanks for your help!
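The channel vocoder described in the edit above can be sketched for a single band in C. This is only an outline under stated assumptions: a real channel vocoder splits both signals into many frequency bands and sums the results, whereas this code assumes the band-pass filtering has already been done elsewhere. The names `EnvFollower`, `env_follow`, and `vocode_band` are invented for illustration.

```c
#include <math.h>
#include <stddef.h>

/* Envelope follower: rectify, then smooth with a one-pole low-pass.
   coeff in (0, 1]; smaller values give a slower, smoother envelope. */
typedef struct {
    double env;
    double coeff;
} EnvFollower;

double env_follow(EnvFollower *e, double x)
{
    e->env += e->coeff * (fabs(x) - e->env);
    return e->env;
}

/* One vocoder channel: scale the carrier by the envelope of the voice.
   Both inputs are assumed to be already band-pass filtered to the same band. */
void vocode_band(const double *voice, const double *carrier,
                 double *out, size_t n, double coeff)
{
    EnvFollower e = { 0.0, coeff };
    for (size_t i = 0; i < n; i++)
        out[i] = carrier[i] * env_follow(&e, voice[i]);
}
```

Running this once per band and adding the band outputs together gives the overall effect: the carrier supplies the pitch and timbre, while the voice supplies the time-varying loudness of each band.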

Using a phase vocoder to adjust pitch is basically pitch estimation plus interpolation in the frequency domain.
A phase vocoder reconstruction method might resample the frequency spectrum at, potentially, a new FFT bin spacing to shift all the frequencies up or down by some ratio. The phase vocoder algorithm additionally uses information shared between adjacent FFT frames to make sure this interpolation result can create continuous waveforms across frame boundaries. e.g. it adjusts the phases of the interpolation results to make sure that successive sinewave reconstructions are continuous rather than having breaks or discontinuities or phase cancellations between frames.
How much to shift the spectrum up or down is determined by pitch estimation, and by calculating the ratio between the estimated pitch of the source and the target pitch. Again, phase vocoders use information about any phase differences between FFT frames to help better estimate pitch. This is possible by using a bit more global information than is available from a single local FFT frame.
Of course, this frequency and phase changing can smear out transient detail and cause various other distortions, so actual phase vocoder products may additionally do all kinds of custom (often proprietary) special case tricks to try and fix some of these problems.

The first step is pitch detection. There are a number of pitch detection algorithms, introduced briefly in wikipedia: http://en.wikipedia.org/wiki/Pitch_detection_algorithm
Pitch detection can be implemented in either frequency domain or time domain. Various techniques in both domains exist with various properties (latency, quality, etc.) In the F domain, it is important to realize that a naive approach is very limiting because of the time/frequency trade-off. You can get around this limitation, but it takes work.
Once you've identified the pitch, you compare it with a desired pitch and determine how much you need to actually pitch shift.
Last step is pitch shifting, which, like pitch detection, can be done in the T or F domain. The "phase vocoder" method other folks mentioned is the F domain method. T domain methods include (in increasing order of quality) OLA, SOLA and PSOLA, some of which you can read about here: http://www.scribd.com/doc/67053489/60/Synchronous-Overlap-and-Add-SOLA

Basically you do an FFT, then in the frequency domain you move the signals to the nearest perfect semitone pitch.

Programmatically increase the pitch of an array of audio samples

Hello kind people of the audio computing world,
I have an array of samples that represent a recording. Let us say that it is 5 seconds at 44100Hz. How would I play this back at an increased pitch? And is it possible to increase and decrease the pitch dynamically? Like have the pitch slowly increase to double the speed and then back down.
In other words I want to take a recording and play it back as if it is being 'scratched' by a d.j.
Pseudocode is always welcomed. I will be writing this up in C.
Thanks,
EDIT 1
Allow me to clarify my intentions. I want to keep the playback at 44100Hz and so therefore I need to manipulate the samples before playback. This is also because I would want to mix the audio that has an increased pitch with audio that is running at a normal rate.
Expressed in another way, maybe I need to shrink the audio over the same number of samples somehow? That way when it is played back it will sound faster?
EDIT 2
Also, I would like to do this myself. No libraries please (unless you feel I could pick through the code and find something interesting).
EDIT 3
A sample piece of code written in C that takes 2 arguments (array of samples and pitch factor) and then returns an array of the new audio would be fantastic!
PS I've started a bounty on this, not because I think the answers already given aren't valid; I just thought it would be good to get more feedback on the subject.
AWARD OF BOUNTY
Honestly I wish I could distribute the bounty over several different answers, as there were quite a few that I thought were super helpful. Special shoutout to Daniel for passing me some code and AShelly and Hotpaw2 for putting in such detailed responses.
Ultimately though I used an answer from another SO question referenced by datageist and so the award goes to him.
Thanks again everyone!

Take a look at the "Elephant" paper in Nosredna's answer to this (very similar) SO question:
How do you do bicubic (or other non-linear) interpolation of re-sampled audio data?
Sample implementations are provided starting on page 37, and for reference, AShelly's answer corresponds to linear interpolation (on that same page). With a little tweaking, any of the other formulas in the paper could be plugged into that framework.
For evaluating the quality of a given interpolation method (and understanding the potential problems with using "cheaper" schemes), take a look at this page:
http://www.discodsp.com/highlife/aliasing/
For more theory than you probably want to deal with (with source code), this is a good reference as well:
https://ccrma.stanford.edu/~jos/resample/

One way is to keep a floating point index into the original wave, and mix interpolated samples into the output wave.
// Simulate scratching of `inwave`:
// `inputLen` is the number of samples in `inwave` (must be at least 2).
// `rate` is the speedup/slowdown factor.
// Result is mixed into `outwave`, which must have room for about inputLen / rate samples.
// "Sample" is a typedef for the raw audio type.
void ScratchMix(Sample* outwave, Sample* inwave, int inputLen, float rate)
{
    float index = 0;
    while (index < inputLen - 1) // stop early so inwave[i+1] stays in range
    {
        int i = (int)index;
        float frac = index - i; // will be between 0 and 1
        Sample s1 = inwave[i];
        Sample s2 = inwave[i+1];
        *outwave++ += s1 + (s2 - s1) * frac; // do clipping here if needed
        index += rate;
    }
}
If you want to change rate on the fly, you can do that too.
If this creates noisy artifacts when rate > 1, try replacing *outwave++ += s1 + (s2-s1)*frac; with this technique (from this question). It reads inwave[i-1] through inwave[i+2], so the loop must be restricted to 1 <= i < inputLen - 2:
*outwave++ += InterpolateHermite4pt3oX(inwave+i-1, frac);
where
static float InterpolateHermite4pt3oX(Sample* x, float t)
{
    // x[1] and x[2] bracket the output position; x[0] and x[3] are their outer neighbours
    float c0 = x[1];
    float c1 = .5F * (x[2] - x[0]);
    float c2 = x[0] - (2.5F * x[1]) + (2 * x[2]) - (.5F * x[3]);
    float c3 = (.5F * (x[3] - x[0])) + (1.5F * (x[1] - x[2]));
    return (((((c3 * t) + c2) * t) + c1) * t) + c0;
}
Example of using the linear interpolation technique on "Windows Startup.wav" with a factor of 1.1. The original is on top, the sped-up version is on the bottom:
It may not be mathematically perfect, but it sounds like it should, and ought to work fine for the OP's needs.
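Changing the rate on the fly, as mentioned above, might look like the following sketch in C. It keeps the same linear interpolation but lets the rate move linearly from one value to another over the output buffer. The function name, the `float` sample type, and the linear ramp are assumptions made for this example; both rates are assumed to be positive.

```c
#include <stddef.h>

/* Resample in[0..in_len) into out[0..out_cap) with linear interpolation.
   The playback rate moves linearly from rate_start to rate_end over
   out_cap output samples. Returns the number of samples written. */
size_t resample_ramp(const float *in, size_t in_len,
                     float *out, size_t out_cap,
                     double rate_start, double rate_end)
{
    double index = 0.0;
    size_t n = 0;
    while (n < out_cap && in_len >= 2 && index < (double)(in_len - 1)) {
        size_t i = (size_t)index;
        double frac = index - (double)i;            /* between 0 and 1 */
        out[n] = (float)(in[i] + (in[i + 1] - in[i]) * frac);
        double t = (double)n / (double)out_cap;     /* progress, 0 up to 1 */
        index += rate_start + (rate_end - rate_start) * t;
        n++;
    }
    return n;
}
```

Calling it once with rates going from 1.0 to 2.0 and then again with 2.0 back to 1.0 would give the "speed up to double and back down" effect from the question. Unlike `ScratchMix`, this version overwrites the output instead of mixing into it.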

Yes, it is possible.
But this is not a small amount of pseudo code. You are asking for a time pitch modification algorithm, which is a fairly large and complicated amount of DSP code for decent results.
Here's a Time Pitch stretching overview from DSP Dimensions. You can also Google for phase vocoder algorithms.
ADDED:
If you want to "scratch", as a DJ might do with an LP on a physical turntable, you don't need time-pitch modification. Scratching changes the pitch and the speed of play by the same amount (not independently as would require time-pitch modification).
And the resulting array won't be of the same length, but will be shorter or longer by the amount of pitch/speed change.
You can change the pitch, as well as make the sound play faster or slower by the same ratio, by just resampling the signal using properly filtered interpolation. Just move each sample point, instead of by 1.0, by floating point addition by your desired rate change, then filter and interpolate the data at that point. Interpolation using a windowed Sinc interpolation kernel, with a low-pass filter transition frequency below the lower of the original and interpolated local sample rate, will work fairly well. Searching for "windowed Sinc interpolation" on the web returns lots of suitable results.
You need an interpolation method that includes a low-pass filter, or else you will hear horrible aliasing noise. (The exception to this might be if your original sound file is already severely low-pass filtered a decade or more below the sample rate.)

If you want this done easily, see AShelly's suggestion [edit: as a matter of fact, try it first anyway]. If you need good quality, you basically need a phase vocoder.
The very basic idea of a phase vocoder is to find the frequencies that the sound consists of, change those frequencies as needed and resynthesize the sound. So a brutal simplification would be:
run FFT
change all frequencies by a factor
run inverse FFT
If you're going to implement this yourself, you definitely should read a thorough explanation of how a phase vocoder works. The algorithm really needs many more considerations than the three-step simplification above.
Of course, ready-made implementations exist, but from the question I gather you want to do this yourself.

To decrease and increase the pitch is as simple as playing the sample back at a lower or higher rate than 44.1kHz. This will produce the slower/faster record sound but you'll need to add the 'scratchiness' of real records.

This helped me with resampling, which is the same thing you need, just viewed from the opposite side.
If you can't find code, ping me, I have a nice C routine for this.

How do you analyse the fundamental frequency of a PCM or WAV sample? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a sample held in a buffer from DirectX. It's a sample of a note played and captured from an instrument. How do I analyse the frequency of the sample (like a guitar tuner does)? I believe FFTs are involved, but I have no pointers to HOWTOs.

The FFT can help you figure out where the frequency is, but it can't tell you exactly what the frequency is. Each point in the FFT is a "bin" of frequencies, so if there's a peak in your FFT, all you know is that the frequency you want is somewhere within that bin, or range of frequencies.
If you want it really accurate, you need a long FFT with a high resolution and lots of bins (= lots of memory and lots of computation). You can also guess the true peak from a low-resolution FFT using quadratic interpolation on the log-scaled spectrum, which works surprisingly well.
If computational cost is most important, you can try to get the signal into a form in which you can count zero crossings, and then the more you count, the more accurate your measurement.
None of these will work if the fundamental is missing, though. :)
I've outlined a few different algorithms here, and the interpolated FFT is usually the most accurate (though this only works when the fundamental is the strongest harmonic - otherwise you need to be smarter about finding it), with zero-crossings a close second (though this only works for waveforms with one crossing per cycle). Neither of these conditions is typical.
Keep in mind that the partials above the fundamental frequency are not perfect harmonics in many instruments, like piano or guitar. Each partial is actually a little bit out of tune, or inharmonic. So the higher-frequency peaks in the FFT will not be exactly on the integer multiples of the fundamental, and the wave shape will change slightly from one cycle to the next, which throws off autocorrelation.
To get a really accurate frequency reading, I'd say to use the autocorrelation to guess the fundamental, then find the true peak using quadratic interpolation. (You can do the autocorrelation in the frequency domain to save CPU cycles.) There are a lot of gotchas, and the right method to use really depends on your application.

There are also other algorithms that are time-based, not frequency based.
Autocorrelation is a relatively simple algorithm for pitch detection.
Reference: http://cnx.org/content/m11714/latest/
I have written c# implementations of autocorrelation and other algorithms that are readable. Check out http://code.google.com/p/yaalp/.
http://code.google.com/p/yaalp/source/browse/#svn/trunk/csaudio/WaveAudio/WaveAudio
Lists the files, and PitchDetection.cs is the one you want.
(The project is GPL; so understand the terms if you use the code).
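For readers who want the gist without opening the linked project, here is a small, unoptimized sketch of autocorrelation-based pitch detection in C. It is not taken from that project. The function name and the lag-range parameters are assumptions for this example, and `min_lag` is assumed to be at least 1.

```c
#include <stddef.h>

/* Estimate pitch by finding the lag with the highest mean product
   x[n] * x[n + lag], for min_lag <= lag <= max_lag (in samples).
   Returns 0.0 if no lag could be evaluated. */
double autocorr_pitch_hz(const float *x, size_t len,
                         size_t min_lag, size_t max_lag, double sample_rate)
{
    size_t best_lag = 0;
    double best = 0.0;
    for (size_t lag = min_lag; lag <= max_lag && lag < len; lag++) {
        double sum = 0.0;
        for (size_t n = 0; n + lag < len; n++)
            sum += (double)x[n] * x[n + lag];
        double mean = sum / (double)(len - lag);   /* normalise by term count */
        if (best_lag == 0 || mean > best) {
            best = mean;
            best_lag = lag;
        }
    }
    return best_lag ? sample_rate / (double)best_lag : 0.0;
}
```

The lag range should bracket the expected period: for example, lags of 40 to 1000 samples at 44100 samples per second cover roughly 44 Hz to 1100 Hz. Restricting the range also reduces the chance of picking a multiple of the true period.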

Guitar tuners don't use FFT's or DFT's. Usually they just count zero crossings. You might not get the fundamental frequency because some waveforms have more zero crossings than others but you can usually get a multiple of the fundamental frequency that way. That's enough to get the note although you might be one or more octaves off.
Low-pass filtering before counting zero crossings can usually get rid of the excess zero crossings. Tuning the low-pass filter requires some knowledge of the range of frequencies you want to detect, though.
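Counting zero crossings might be sketched in C as below. This version counts only upward crossings and refines each crossing time by linear interpolation, which is an addition for accuracy and not something the answer above specifies. The function name is made up, and any low-pass filtering is assumed to have been applied beforehand.

```c
#include <stddef.h>

/* Estimate frequency from upward zero crossings. Crossing times are
   refined by linear interpolation between the two surrounding samples.
   Returns 0.0 if fewer than two crossings were found. */
double zero_crossing_hz(const float *x, size_t len, double sample_rate)
{
    size_t count = 0;
    double first = 0.0, last = 0.0;
    for (size_t n = 1; n < len; n++) {
        if (x[n - 1] < 0.0f && x[n] >= 0.0f) {
            double frac = -x[n - 1] / (double)(x[n] - x[n - 1]);
            double t = (double)(n - 1) + frac;   /* crossing time in samples */
            if (count == 0)
                first = t;
            last = t;
            count++;
        }
    }
    if (count < 2)
        return 0.0;
    /* count - 1 full periods lie between the first and last crossing */
    return (double)(count - 1) * sample_rate / (last - first);
}
```

Measuring the span between the first and last crossing, instead of dividing the count by the buffer length, avoids an error of up to one period at the buffer edges.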

FFTs (Fast Fourier Transforms) would indeed be involved. FFTs allow you to approximate any analog signal with a sum of simple sine waves of fixed frequencies and varying amplitudes. What you'll essentially be doing is taking a sample and decomposing it into (frequency, amplitude) pairs, and then taking the frequency that corresponds to the highest amplitude.
Hopefully another SO reader can fill the gaps I'm leaving between the theory and the code!

A little more specifically:
If you start with the raw PCM in an input array, what you basically have is a graph of wave amplitude vs time. Doing an FFT will transform that to a frequency histogram for frequencies from 0 to 1/2 the input sampling rate. The value of each entry in the result array will be the 'strength' of the corresponding sub-frequency.
So to find the root frequency given an input array of size N sampled at S samples/second:
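FFT(N, input, output);
max = 0;
max_i = 0;
for (i = 0; i < N; i++)
    if (output[i] > max) { max = output[i]; max_i = i; }  // remember the largest value as well as its index
// Assumes output[] holds N magnitudes spanning 0 to S/2, as described above.
// For a standard N-point FFT, scan only the first N/2 bins and use S * max_i / N.
root = S/2.0 * max_i/N;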

Retrieval of fundamental frequencies in a PCM audio signal is a difficult task, and there would be a lot to talk about it...
Anyway, time-based methods are usually not suitable for polyphonic signals, because a complex wave formed by summing harmonic components from multiple fundamental frequencies has a zero-crossing rate that depends only on the lowest frequency component...
Also, in the frequency domain the FFT is not the most suitable method, since the frequency spacing between notes follows an exponential scale, not a linear one. This means that a constant frequency resolution, as used in the FFT method, may be insufficient to resolve lower-frequency notes if the size of the analysis window in the time domain is not large enough.
A more suitable method would be a constant-Q transform, which is a DFT applied after a process of low-pass filtering and decimation by 2 (i.e. halving the sampling frequency at each step) of the signal, in order to obtain different subbands with different frequency resolutions. In this way the calculation of the DFT is optimized. The trouble is that the time resolution is also variable, and becomes coarser for the lower subbands...
Finally, if we are trying to estimate the fundamental frequency of a single note, FFT/DFT methods are OK. Things change in a polyphonic context, in which partials of different sounds overlap and sum or cancel depending on their phase difference, so a single spectral peak could belong to different harmonic contents (belonging to different notes). Correlation in this case doesn't give good results...

Apply a DFT and then derive the fundamental frequency from the results. Googling around for DFT information will give you the information you need -- I'd link you to some, but they differ greatly in expectations of math knowledge.
Good luck.
