Is it possible to, as accurately as possible, decompose an audio into MIDI, given the SoundFont that was used?

Is it possible to, as accurately as possible, decompose an audio into MIDI, given the SoundFont that was used? - audio

If I know the SoundFont that a MIDI to audio track has used, can I theoretically reverse the audio back into it's (most likely) MIDI components? If so, what would be one of the best approach to doing this?
The end goal is to try encoding audio (even voice samples) into MIDI such that I can reproduce the original audio in MIDI format better than, say, BearFileConverter. Hopefully with better results than just bandpass filters or FFT.
And no, this is not for any lossy audio compression or sheet transcription, this is mostly for my curiosity.

For monophonic music only, with no background sound, and if your SoundFont synthesis engine and your record sample rates are exactly matched (synchronized to 1ppm or better, have no additional effects, also both using a known A440 reference frequency, known intonation, etc.), then you can try using a set of cross correlations of your recorded audio against a set of synthesized waveform samples at each MIDI pitch from your a-priori known font to create a time line of statistical likelihoods for each MIDI note. Find the local maxima across your pitch range, threshold, and peak pick to find the most likely MIDI note onset times.
Another possibility is sliding sound fingerprinting, but at an even higher computational cost.
This fails in real life due to imperfectly matched sample rates plus added noise, speaker and room acoustic effects, multi-path reverb, and etc. You might also get false positives for note waveforms that are very similar to their own overtones. Voice samples vary even more from any template.
Forget bandpass filters or looking for FFT magnitude peaks, as this works reliably only for close to pure sinewaves, which very few musical instruments or interesting fonts sound like (or are as boring as).

Related

I need to analyse many audio WAV files for characteristic noise, ideas?

I need to be able to analyze (search thru) hundreds of WAV files and detect but not remove static noise. As done currently now, I must listen to each conversation and find the characteristic noise/static manually, which takes too much time. Ideally, I would need a program that can read each new WAV file and be able to detect characteristic signatures of the static noise such as periods of bursts of white noise or full audio band, high amplitude noise (like AM radio noise over phone conversation such as a wall of white noise) or bursts of peek high frequency high amplitude (as in crackling on the phone line) in a background of normal voice. I do not need to remove the noise but simply detect it and flag the recording for further troubleshooting. Ideas?
I can listen to the recordings and find the static or crackling but this takes time. I need an automated or batch process that can run on its own and flag the troubled call recordings (WAV files for a phone PBX). These are SIP and analog conversations depending on the leg of the conversation so RTSP/SIP packet analysis might be an option, but the raw WAV file is the simplest. I can use Audacity, but this still requires opening each file and looking at the visual representation of the audio spectrometry and is only a little faster than listening to each call but still cumbersome.
I currently have no code or methods for this task. I simply listen to each call wav file to find the noise.
I need a batch Wav file search that can render wav file recordings that contain the characteristic noise or static or crackling over the recording phone conversation.

Unless you can tell the program how the noise looks like, it's going to be challenging to run any sort of batch processing. I was facing a similar challenge and that prompted me to develop (free and open source) software to help user in audio exploration, analysis and signal separation:
App: https://audioexplorer.online/
Docs: https://tracek.github.io/audio-explorer/
Source code: https://github.com/tracek/audio-explorer
Essentially, it visualises audio as a 2d scatter plot rather than only "linear", as in waveform or spectrogram. When you upload audio the following happens:
Onsets are detected (based on high-frequency content algorithm from aubio) according to the threshold you set. Set it to None if you want all.
Per each audio fragment, calculate audio features based on your selection. There's no universal best set of features, all depends on the application. You might try for starter with e.g. Pitch statistics. Consider setting proper values for bandpass filter and sample length (that's the length of audio fragment we're going to use). Sample length could be in future established dynamically. Check docs for more info.
The result is that for each fragment you have many features, e.g. 6 or 60. That means we have then k-dimensional (where k is number of features) structure, which we then project to 2d space with dimensionality reduction algorithm of your selection. Uniform Manifold Approximation and Projection is a sound choice.
In theory, the resulting embedding should be such that similar sounds (according to features we have selected) are closely together, while different further apart. Your noise should be now separated from your "not noise" and form cluster.
When you hover over the graph, in right-upper corner a set of icons appears. One is lasso selection. Use it to mark points, inspect spectrogram and e.g. download table with features that describe that signal. At that moment you can also reduce the noise (extra button appears) in a similar way to Audacity - it analyses the spectrum and reduces these frequencies with some smoothing.
It does not completely solve your problem right now, but could severely cut the effort. Going through hundreds of wavs could take better part of the day, but you will be done. Want it automated? There's CLI (command-line interface) that I am developing at the same time. In not-too-distant future it should take what you have labelled as noise and signal and then use supervised machine learning to go through everything in batch mode.
Suggestions / feedback? Drop an issue on GitHub.

How to recognize if an audio sample has been compressed and then decompressed?

Some years ago I made a music audio recording, and I can't find the original WAV files, I have only compressed MP3s. Now I found an audio CD, but I don't know if it was made using the original, uncompressed WAVs, or if it was made from compressed MP3 or OGG files.
Is there a way how to detect if an audio sample has been compressed and decompressed using a lossy compression such as MP, OGG, ..., without having the original to compare to?
Update:
Trying #MisterHenson's suggestion, I plotted the spectra of the two samples, with obvious differences in the graphs:
The sample from the CD:
The sample from the MP3:
This practically solves solves my current problem, but still I have these open questions:
If the spectra were visually indistinguishable, I wouldn't know if there is a real difference, or that I just can't distinguish them (i.e. the compression would be of better quality). What else could I try?
Similarly what would I do if I didn't have the MP3 file to compare to, just a single audio sample?
Is there an automated method, that'd answer the question with a reasonable probability?

I made an example to stress the topology of all MP3 transcodes, the source material being a Chopin nocturne. MP3 on top, Lossless on bottom. All recordings have background noise of some amplitude, and that noise is faintly visible here. What the MP3 transcode (Lame's V2 preset in this case) does is create a hard limit at ~16kHz. On a 320kbps bitrate 44.1kHz sample rate MP3, this hard limit appears at around 20kHz, but it would still be visibly different in this image.
You can pick out this shelf without having the original lossless file for comparison. I'm willing to say all music has amplitude at frequencies above even 19kHz. Here's an example for which I do not have the lossless source file, just a 320kbps MP3. You can see the very hard limit at 20kHz as well as a milder cutoff at 19kHz. Were it lossless, that red blob in the middle would extend all the way up to 22kHz since the sample rate is 44.1kHz.
I would say this process is probably automatable, but I do not know of any attempts to automate it. If this were automated, though, I'd say it could pick Lossy from Lossless with much higher accuracy than you or I, by virtue of it being able to analyze the entire spectrum as opposed to just the high frequency cutoffs.
Full res images:
http://i.imgur.com/dezONol.jpg
http://i.imgur.com/1qokxAN.jpg

The above approaches sound very promising although maybe a little complicated -- you might first try something easy, like check the distribution of the least significant bit. In a natural sample, LSB should be an almost exact 50/50 distribution between zeroes and ones (actually across many samples would have some variance following a binomial distribution but with millions or billions of bits this will be ridiculously close to 50/50 in any given sample). In a lossy sample, you will find an unlikely distribution in the LSB.
Something like this:
1 -- extract LSB from each data point
2 -- apply chi-squared test to judge if distribution is unusual

Here is the deal.
A raw sample (or a raw piece of sound) is encoded in a certain quality.
Some sound cards can go further with 64bit sampling.
But let's assume that we have sound files of a certain KNOWN quality.
CD quality is okay for the human ear.
A studio, would make use of more quality samples though. Like 24bit as a standard.
So you got a waveform filename.wav that really has a sample rate 44100 Hz.
What does that mean?
It means the computer can take a huge amount of different samples per second to represent almost the exact sound.
Is the sound original? Depends on how it was made.
If it was made by your computer and a piece of software using a 16bit default sound card yes it is.
If it was from an analogue recording though, it loses some of its quality on the digitization at 44100 Hz fortunately not so significant for the human ear.
NOTE THAT mp3 recordings is a bad idea for professional recording.
But since mp3 recording do exist... this adds complexity to your question. :P
So some sound quality is lost on digitization with a 16bit sound card.
Now similar thing can happen when you encode something to mp3.
Check out your picture. Above 17000 there is no sound. It was butchered to make the sound file significant smaller, without making any significant damage to the audio quality. Is it the same piece of sound? No. It sounds the same though. But a sound engineer LOVES original and good quality samples, because of the information that is NOT cut.
Imagine me, making an original sound, so balanced and compressed that even after an mp3 converting it is hard to tell if it is original sound or not. Imagine me using equalizers to cut any sharp edges, and gate effects to extremely normalize it. Also, my sound generators are some 8bit oscillators passing through some fx and filters.
If I convert it back to wavetable, there might be no difference.
For instance:
[UNCHANGED FREQUENCIES][CUT FREQUENCIES]
Waveform: =================================
mp3: =======================
Waveform: =======================
Waveform:
[UNCHANGED FREQUENCIES][CUT FREQUENCIES]
Waveform: =================
mp3 =================
Waveform: =================
The following seems impossible to me (except if the converter has bugs thing that can be heard)
[UNCHANGED FREQUENCIES][CUT FREQUENCIES]
Waveform: =========================
mp3 =======================
Waveform: =============================
So your question depends on the original source you used in the first waveform.
Good news is that a sample is RARELY THAT limited and compressed.
So it seems to me that the CD you used will probably sound like original waveform,
while as you can see, the mp3 has cut out frequencies.
To be sure of course you need a frequency analyzer and spectrum as MischaNix already has shown.
There are many mp3 encodings too. Some are static, some dynamic, some cut more and some cut less sound information. Some are also bigger than others for that reason.
Now there are lossless formats too.
And then there is ogg that is small enough and also has great quality.
So this question can become a huge topic for no reason here. I will not talk about all these.
If the issue is giving an original sample, your pictures show me significant differences between the two samples. I mean, making a waveform out of the mp3 cut variation, should look like that cut variation. You can not get information out of nothing.
Burn the mp3 on a cd, then get the wave, compare the new waveform with the old and the mp3 waveform. It will probably not be the same thing so you might hit the jackpot here. It is possible you got an original backup on your hands.
From now on though, try sampling raw material and store them in a CD or DVD before discarding them.
Or at least keep good uncompressed samples in a backup.
Open questions:
If the spectra were visually indistinguishable, I wouldn't know if there is a real difference, or that I just can't distinguish them.
Correct. But this would occur seldom without intention on sampling.
Why asking such a question? :) Do you have steganography in mind?
If yes, make sure to keep in mind the nature of the piece of sound you are gonna use. Samples are not appropriate. "Finished songs" are!
Similarly what would I do if I didn't have the MP3 file to compare to, just a single audio sample?
Since there are many mp3 encoding settings of different qualities, you can check if the lowest quality was used. If not there is uncertainty because of the compression capabilities. If this applies to the whole sample, then you got to see if compression was needed. That's why you can not be certain on a song. You don't record with SO hard compression in the first place. I guess this is another meta-reason why you need a natural sound. So if its about a recording you might be lucky.
Now about a finished mastered song... things get rough once again. It is about the nature, the type, of the sound. A recording is easier to figure out what is going on if you knew you used waveform recording. An mp3 recording of course is a waste of time. On the other hand a finished song, usually nowadays makes compressors, limiters, gates and chain compressors burn out. The amount of use of this techniques in modern mastering is enormous. So... you will really need luck to find out if the original piece was compressed before, before having an original waveform to begin with.
Is there an automated method, that'd answer the question with a reasonable probability?
None that I know. Sorry. :(
But that doesn't mean than nobody can make one.
BUT!
A stereo sample is usually split out to two channels. Left and right.
Now if you got a spectrum analyzer in a Digital Audio Workstation,
and take a look only on the left channels of two different samples, you can on the fly see
if they are the same or not I guess.
In order to understand what I mean, take a look at THIS link.
Go at 05:00 and just watch the interface.
Phew. Hope this will help you further, since it took some time. :P
Cheers.
Edit: Fixing some stuff here and there.

I found a description of the problem, a solution and an implementation in Python by Maurits van der Schee, that works with a FLAC though.
From the sample only the first 30 seconds are analyzed. For every
second the frequency spectrum of the sample is computed by applying a
Hanning Window and doing a Fast Fourier Transform. These spectrums are
added, so that eventually you end up with 30 stacked spectrums. These
are divided by 30 to get the average spectrum. Then the spectrum is
normalized using log10. After that we applied a rolling average on the
spectrum with a window size of 1/100th of the frequency, being
44100/100=441 samples.
If there is an unnatural cutoff in the frequency spectrum, this cutoff
is the thing we need to find. We sweep the spectrum from 44100th back
to the 1st frequency, where the variable frequency is f. As soon as
the magnitude at f-220 is more than 1.25 higher than the magnitude at
f and the magnitude at f is no bigger than 1.1x the magnitude at 44100
we have found the cutoff point. The cutoff point is multiplied by 100
and divided by the frequency to get to the percentage of the spectrum
not cut off.

Things to look for:
Cut-off frequency changing on frame boundaries (not going to be a 100% hard cut, but look for "audible" to "inaudible" and vice versa)
Frequencies disappearing or appearing on frame boundaries (again, not 100%)
Noise levels changing on frame boundaries (actually pretty solid for lossy codecs)
For MP3, the frame boundaries are precisely every 1152 samples, though you might be able to "see" the granules every 576 samples.
For Vorbis, the frame boundaries are typically every 128 or 1024 samples depending on transients the encoder "saw". You can probably get away with doing every 128 samples...
You'll have to research the other formats to know their frame sizes (I don't know them offhand).

Extracting pitch from singing voice

I'd like to extract the pitch from a singing voice. The track in question contains only a single voice and no other sounds.
I want to know the loudness and perceived pitch frequency at a given point in time. So something like the following:
0.0sec 400Hz -20dB
0.1sec 401Hz -9dB
0.2sec 403Hz -10dB
0.3sec 403Hz -10dB
0.4sec 404Hz -11dB
0.5sec 406Hz -13dB
0.6sec 410Hz -15dB
0.7sec 411Hz -16dB
0.8sec 409Hz -20dB
0.9sec 407Hz -24dB
1.0sec 402Hz -34dB
How might I achieve such an output? I'm interested in slight changes in frequency as apposed to a specific note value. I have some DSP knowledge and I can program in C++ and python but I'd like to avoid reinventing the wheel if possible.

Note that slight changes in frequency in Hz and perceived pitch may not be the same thing. Perceived pitch resolution seems to vary with absolute frequency, duration, and loudness. If you want more accuracy than this, there might be some research papers on estimating the time between each glottal closure (probably using a deconvolution or pattern matching technique), which would give you some sort of pitch period. The simplest pitch estimate might be some form of weighted autocorrelation, for which lots of canned algorithms and code is available.
Since dB is log scale, this measure might be somewhat closer to perceived loudness, but has to be spectrally weighted with some perceptual frequency response curve over some duration of measurement.
There seem to be research papers on both of these topics, as well as many textbooks on human audio perception as well as on common audio DSP techniques.

I suggest you read this article
http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf
. This is one of the simplest methods of pitch detection, and it works very well.
Also, for measuring the instantaneous power of the signal, you can just take the absolute value of the signal and divide by 1/√2 (Gives the RMS value) and then smooth it (usually a first order low pass filter). I hope this helps. Good luck!

Basic unit of Sound?

If we consider computer graphics to be the art of image synthesis where the basic unit is a pixel.
What is the basic unit of sound synthesis?
[This relates to programming as I want to generate this via a computer program.]
Thanks!

The basic unit is a sample
In a WAVE file, the sample is just an integer specifying where to move the speaker head to.
The sample rate determines how often a new sample is fed to the speakers (I'm not entirely sure how this part works, but it does get converted to an analog signal first). The samples are typically laid out in the file one right after another.
When you plot all the samples with x-axis being time and y-axis being sample_value, you can see the waveform.
In a wave file, samples can (theoretically) be any bit-size from 0-65535, which remains constant throughout the wave file. But typically 16 or 24 bits are used.

Computer graphics can also have vector shapes as basic units, not just pixels. Generally, vector graphics are generated via computer tools while captured data tends to appear as a grid of pixels (corresponding to an array of sensors in a camera or other capture device). Obviously there is considerable crossover between those classifications.
Similarly, there are sampled (such as .WAV) and generative (such as .MIDI) forms of computer audio. In the sampled case, the smallest unit is a single sample. Just like an array of pixels in the brightness, x- and y-dimensions come together to form an image, an array of samples in the loudness and time dimensions come together to form a sound. In the generative case, it will be something more like a single tone rendered in a particular voice just like vector graphics have paths drawn with particular textures.

A pixel can have a value and be encoded in digital bitmap samples. The same properties apply to sound and digital audio samples.
A pixel is a physical device that can only render the amplitudes of 3 frequencies of light (Red, Green, Blue) at a time. A speaker is a physical device that can render the amplitudes of a wide range of frequencies (~40,000) at a time. The bit resolution of a sample (number of bits used to to store the value of a sample) mainly determines how many colors/tones can be rendered - the fidelity of the physical playback device.
Also, as patterns of pixels can be encoded or compressed, most patterns of sound samples are also encoded or compressed (or both).

The fundamental unit of signal processing (of which audio is a special case) would be the sample.
The frequency at which you need to sample a signal depends on the maximum frequency present in the waveform. Sampling theorem states that it is normally sufficient to sample at twice the frequency of the maximum frequency present in the signal.
http://en.wikipedia.org/wiki/Sampling_theorem
The human ear is sensitive to sounds up to around 20kHz (the upper frequency lowers with age). This is why music on CD is sampled at 44kHz.
It is often more useful to think of music as being comprised of individual frequencies.
http://www.phys.unsw.edu.au/jw/sound.spectrum.html
Most sound analysis and creation is based on this idea.
Related concepts:
Psychoacoustics: Human perception of sound. Relates to modern sound compression techniques such as mp3.
Fourier series: How complex waveforms are composed of individual frequencies.

I would say the basic unit of sound synthesis is the sine wave. But your definition of synthesis is perhaps different to what audio people would refer to sound synthesis. Sound systhesis is the creation of sound using the fundamental components of sound.
With sine waves, we can synthesise sounds using many techniques such as substractive synthesis, additive synthesis or FM synthesis.
Fourier theory states that every sound is a summation of sine waves of differing phases, frequencies and amplitudes.
OK, so how do we represent a sine wave on a computer? well, a sine wave will be generated using a buffer(array) of 'samples' that have been generated by a function or read from a table. The same technique applies to any sound captured on a computer.
A 'sample' is typically represented as number between -1 and 1 that directly correlates to the amplitude of a sound at a given moment in time. A typical sound recorded at 16 bit depth, would have 65536 (2pow16) possible amplitude values. When being recorded, typically, a sample will be captured 44.1k per second of sound. This is called the sampling frequency rate, or simply the sample rate.
Upon playback from you computer, each sample will pass though an Digital to Analogue converter and generate a vibration on your pc speaker and will in turn cause your ear to percieve the recorded sound.

Sound can be expressed as several different units, but the most common in synthesis/computer music is decibels (dB), which are a relative logarithmic measure of amplitude. Specifically they are normally relative to the maximum amplitude of the audio system.
When measuring sound in "real life", the units are normally A-weighted Decibels or dB(A).
The frequency of a sound (i.e. its pitch) is its amplitude over time, or in the digital world, its amplitude over samples. The number of samples per unit of real time is called the sampling rate; conventional hi-fi systems have sampling rates of 44 kHz (44,000 samples per second) and synthesis/recording software usually supports up to 96 kHz.
Everything sound in the digital domain can be represented as a waveform with the X-axis representing the time (or sample number) and the Y-axis representing the amplitude.

frequency and amplitude of the wave are what make up sound.
That is for a tone.
Music or for that matter most noise is a composite of multiple simultaneous sound waves superimposed on one another.
The unit for amplitute is the
Bel. (We use tenths of a Bel
therefore the term decibel)
The unit for frequency is the
Hertz.
That being said synthesis of music is a large field.

Bitmapped graphics are based on sampling the amplitude of light in a 2D space, where each sample is digitized to a given bit depth and often converted to a logarithmic representation at a different bit depth. The samples are always positive, since you can't be darker than pure black. Each of these samples is called a pixel.
Sound recording is most often based on sampling the magnitude of sound pressure at a microphone, where the samples are taken at constant time intervals. These samples can be positive or negative with respect to perfect silence. Most often these samples are not converted to a logarithm, even though sound is perceived in a logarithmic fashion just as light is. There is no special term to refer to these samples as there is with pixels.
The Bels and Decibels mentioned by others are useful in the context of measuring peak or average sound levels. They are not used to describe the individual sound samples.
You might also find it useful to know how sound file formats compare to image file formats. WAVE is an uncompressed format specific to Windows and is analogous to BMP. MP3 is a lossy compression analogous to JPEG. FLAC is a lossless compression analogous to 24-bit PNG.

If computer graphics are colored dots in 2 dimensional space representing a 3 dimensional space, then sound synthesis is amplitude values regularly partitioned in time representing musical events.
If you want your result to sound like music (the kind of music most people like at least), then you are either going to use some standard synthesis techniques, or literally waste decades of your life reinventing them from scratch.
The most basic techniques are additive synthesis, in which the individual elements are the frequencies, amplitudes, and phases of sine oscillators; subtractive synthesis, where you work with filter coefficients and a complex input waveform; frequency modulation synthesis, where you work with modulation depths and rates of stages of modulation; granular synthesis where short (hundredths to tenths of a second long) enveloped pieces of a recorded sound or an artificial waveform are combined in immense numbers. Each of these in practice uses parameters that evolve over the course of a note, and often you will mix elements of various techniques into a larger instrument.
I recommend this book, though it doesn't have the math for many concepts it at least lays the ground for the concepts used, and gives a nice overview of the techniques.
You wouldn't waste your time going sample by sample to do music in practice any more than you would waste your time going pixel by pixel to render 3d (in other words yeah go sample by sample if making a tool for other people to make music with, but that is way too low a level if you are interested in the task of making music).

Probably the envelope. A tone/note has a shape described by: attack decay sustain release

The byte, or word, depending on the bit-depth of the sound.

Real time pitch detection

I'm trying to do real time pitch detection of a users singing, but I'm running into alot of problems. I've tried lots of methods, including FFT (FFT Problem (Returns random results)) and autocorrelation (Autocorrelation pitch detection returns random results with mic input), but I can't seem to get any methods to give a good result. Can anyone suggest a method for real-time pitch tracking or how to improve on a method I already have? I can't seem to find any good C / C++ methods for real time pitch detection.
Thanks,
Niall.
Edit: Just to note, i've checked that the mic input data is correct, and that when using a sine wave the results are more or less the correct pitch.
Edit: Sorry this is late, but at the moment, im visualizing the autocolleration by taking the values out of the results array, and each index, and plotting the index on the X axis and the value on the Y axis (both are divided by 100000 or something, and im using OpenGL), plugging the data into a VST host and using VST plugins isn't an option to me. At the moment, it just looks like some random dots. Am i doing it correctly, or can you please point me torwards some code for doing it or help me understand how to visualize the raw audio data and autocorrelation data.

Taking a step back... To get this working you MUST figure out a way to plot intermediate steps of this process. What you're trying to do is not particularly hard, but it is error prone and fiddly. Clipping, windowing, bad wiring, aliasing, DC offsets, reading the wrong channels, the weird FFT frequency axis, impedance mismatches, frame size errors... who knows. But if you can plot the raw data, and then plot the FFT, all will become clear.

I found several open source implementations of real-time pitch tracking
dywapitchtrack uses a wavelet-based algorithm
"Realtime C# Pitch Tracker" uses a modified autocorrelation approach now removed from Codeplex - try searching on GitHub
aubio (mentioned by piem; several algorithms are available)
There are also some pitch trackers out there which might not be designed for real-time, but may be usable that way for all I know, and could also be useful as a reference to compare your real-time tracker to:
Praat is an open source package sometimes used for pitch extraction by linguists and you can find the algorithm documented at http://www.fon.hum.uva.nl/paul/praat.html
Snack and WaveSurfer also contain a pitch extractor

I know this answer isn't going to make everyone happy but here goes.
This stuff is hard, very hard. Firstly go read as many tutorials as you can find on FFT, Autocorrelation, Wavelets. Although I'm still struggling with DSP I did get some insights from the following.
https://www.coursera.org/course/audio the course isn't running at the moment but the videos are still available.
http://miracle.otago.ac.nz/tartini/papers/Philip_McLeod_PhD.pdf thesis about the development of a pitch recognition algorithm.
http://dsp.stackexchange.com a whole site dedicated to digital signal processing.
If like me you didn't do enough maths to completely follow the tutorials don't give up as some of the diagrams and examples still helped me to understand what was going on.
Next is test data and testing. Write yourself a library that generates test files to use in checking your algorithm/s.
1) A super simple pure sine wave generator. So say you are looking at writing YAT(Yet Another Tuner) then use your sine generator to create a series of files around 440Hz say from 420-460Hz in varying increments and see how sensitive and accurate your code is. Can it resolve to within 5Hz, 1Hz, finer still?
2) Then upgrade your sine wave generator so that it adds a series of weaker harmonics to the signal.
3) Next are real world variations on harmonics. So whilst for most stringed instruments you'll see a series of harmonics as simple multiples of the fundamental frequency F0, for instruments like clarinets and flutes because of the way the air behaves in the chamber the even harmonics will be missing or very weak. And for some instruments F0 is missing but can be determined from the distribution of the other harmonics. F0 being what the human ear perceives as pitch.
4) Throw in some deliberate distortion by shifting the harmonic peak frequencies up and down in an irregular manner
The point being that if you are creating files with known results then its easier to verify that what you are building actually works, bugs aside of course.
There are also a number of "libraries" out there containing sound samples.
https://freesound.org from the Coursera series mentioned above.
http://theremin.music.uiowa.edu/MIS.html
Next be aware that your microphone is not perfect and unless you have spent thousands of dollars on it will have a fairly variable frequency response range. In particular if you are working with low notes then cheaper microphones, read the inbuilt ones in your PC or Phone, have significant rolloff starting at around 80-100Hz. For reasonably good external ones you might get down to 30-40Hz. Go find the data on your microphone.
You can also check what happens by playing the tone through speakers and then recording with you favourite microphone. But of course now we are talking about 2 sets of frequency response curves.
When it comes to performance there are a number of freely available libraries out there although do be aware of the various licensing models.
Above all don't give up after your first couple of tries. Best of luck.

Here's the C++ source code for an unusual two-stage algorithm that I devised which can do Realtime Pitch Detection on polyphonic MP3 files while being played on Windows. This free application (PitchScope Player, available on web) is frequently used to detect the notes of a guitar or saxophone solo upon a MP3 recording. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a MP3 music file. Note onsets are accurately inferred by a significant change in the most dominant pitch (a musical note) at any given moment during the MP3 recording.
When a single key is pressed upon a piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies. The elements of this composite of vibrations at differing frequencies are referred to as harmonics or partials. For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics will start at 261.6 Hz as the fundamental frequency, 523 Hz would be the 2nd Harmonic, 785 Hz would be the 3rd Harmonic, 1046 Hz would be the 4th Harmonic, etc. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz ( ex: 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046 ). Linked at the bottom, is a snapshot of the actual harmonics which occur during a polyphonic MP3 recording of a guitar solo.
Instead of a FFT, I use a modified DFT transform, with logarithmic frequency spacing, to first detect these possible harmonics by looking for frequencies with peak levels (see diagram below). Because of the way that I gather data for my modified Log DFT, I do NOT have to apply a Windowing Function to the signal, nor do add and overlap. And I have created the DFT so its frequency channels are logarithmically located in order to directly align with the frequencies where harmonics are created by the notes on a guitar, saxophone, etc.
Now being retired, I have decided to release the source code for my pitch detection engine within a free demonstration app called PitchScope Player. PitchScope Player is available on the web, and you could download the executable for Windows to see my algorithm at work on a mp3 file of your choosing. The below link to GitHub.com will lead you to my full source code where you can view how I detect the harmonics with a custom Logarithmic DFT transform, and then look for partials (harmonics) whose frequencies satisfy the correct integer relationship which defines a 'pitch'.
My Pitch Detection Algorithm is actually a two-stage process: a) First the ScalePitch is detected ('ScalePitch' has 12 possible pitch values: {E, F, F#, G, G#, A, A#, B, C, C#, D, D#} ) b) and after ScalePitch is determined, then the Octave is calculated by examining all the harmonics for the 4 possible Octave-Candidate notes. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a polyphonic MP3 file. That usually corresponds to the notes of an instrumental solo. Those interested in the C++ source code for my Two-Stage Pitch Detection algorithm might want to start at the Estimate_ScalePitch() function within the SPitchCalc.cpp file at GitHub.com.
https://github.com/CreativeDetectors/PitchScope_Player
Below is the image of a Logarithmic DFT (created by my C++ software) for 3 seconds of a guitar solo on a polyphonic mp3 recording. It shows how the harmonics appear for individual notes on a guitar, while playing a solo. For each note on this Logarithmic DFT we can see its multiple harmonics extending vertically, because each harmonic will have the same time-width. After the Octave of the note is determined, then we know the frequency of the Fundamental.

I had a similar problem with microphone input on a project I did a few years back - turned out to be due to a DC offset.
Make sure you remove any bias before attempting FFT or whatever other method you are using.
It is also possible that you are running into headroom or clipping problems.
Graphs are the best way to diagnose most problems with audio.

Take a look at this sample application:
http://www.codeproject.com/KB/audio-video/SoundCatcher.aspx
I realize the app is in C# and you need C++, and I realize this is .Net/Windows and you're on a mac... But I figured his FFT implementation might be a starting reference point. Try to compare your FFT implementation to his. (His is the iterative, breadth-first version of Cooley-Tukey's FFT). Are they similar?
Also, the "random" behavior you're describing might be because you're grabbing data returned by your sound card directly without assembling the values from the byte-array properly. Did you ask your sound card to sample 16 bit values, and then gave it a byte-array to store the values in? If so, remember that two consecutive bytes in the returned array make up one 16-bit audio sample.

Java code for a real-time real detector is available at http://code.google.com/p/freqazoid/.
It works fairly well on any computer running post-2008 real-time Java. The project has been dropped and could be picked up by any interested party. Contact me if you want further details.

Check out aubio, and open source library which includes several state-of-the-art methods for pitch tracking.

I have asked a similar question here:
C/C++/Obj-C Real-time algorithm to ascertain Note (not Pitch) from Vocal Input
EDIT:
Performous contains a C++ module for realtime pitch detection
Also Yin Pitch-Tracking algorithm

You could do real time pitch detection, be it of a singer's voice, with TarsosDSP
https://github.com/JorenSix/TarsosDSP
just in case anyone hasn't heard of it yet :-)

Can you adapt anything from instrument tuners? My delightfully compact guitar tuner is able to detect the pitch of the strings pretty well. I see this reference to a piano tuner which explains an algorithm to some extent.

Here are some open source libraries that implement pitch detection:
WORLD : speech analysis/synthesis toolkit. This is especially suitable if your source signal is voice.
aubio : audio feature extraction library. Implements many pitch detection algorithms.
Pitch detection : a collection of pitch detection algorithms implemented in C++.
dywapitchtrack : a high quality pitch detection algorithm.
YIN : another implementation of the YIN algorithm in a single C++ source file.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string