Why were square waves used in old hardware instead of sine waves? - audio

This answer here postulates that to actually generate a square wave (or any other abstract wave-shape) you have to layer multiple sine waves on top of each other. Yet old hardware (Commodore, NES, etc) lacked sine wave channels and instead relied heavily on square pulse-waves, triangle waves, noise and sawtooth waves. I always assumed this was done because those waves are easier to generate than a simple sine wave. So,would genereating these wave shapes not be computationally more expensive? Why was it done anyway?

This answer here postulates that to actually generate a square wave […] you have to layer multiple sine waves on top of each other.
Not really, it just describes how a square wave can be analyzed to prove certain facts about its sound - how much energy is in each frequency band and such. This is somewhat similar to how every integer can be factored into one or more smaller prime factors (15=3×5) which is useful when analyzing algorithms but still doesn't change how we came up with the original number (maybe counting 15 sheep).
Separating a "complex" wave into sinusoidal components are very useful mathematically, but does not tell us the mechanism behind its original creation.
I always assumed this was done because those waves are easier to generate than a simple sine wave.
Your assumption here is correct. Starting with a digital circuit, the square wave is the easiest and cheapest waveform to create1. Just turn a voltage on and off using a single transistor. It is also cheaper in a mass-market manufacturing context because a sine wave generator (and even a saw-tooth) made from analog electronics will require a lot of extra components in order to not drift with temperature, age, and humidity.
It is also arguably more useful in a synthesizer context than one single sine wave because it has a lot of harmonics you can modify with a filter like in the SID.
The next step on the complexity ladder is any ramp-shape, like the triangle or saw-tooth. While you can make these using analog electronics, even back in the early eighties they were typically implemented by a simple DAC driven by a digital counter. The rate of the counter determined how fast the waveform goes from 0 to MAX and thus determined the pitch.
Once you have your DAC in your computer you could use it to generate a sine wave but it requires either impossibly expensive real-time calculations or a large table of pre-calculated sine values, so it was rarely (never?) done. When computers got some useful amount of RAM and bandwidth, they quickly switched to plain arbitrary samples and never looked back.
1) In fact, anything else is so much more complicated that today we just do everything using simple digital pulses and just filter the result in various ways (PDM, PWM, Delta-sigma)

My recollection was that a member of our team figured out we could generate sounds by turning something on and off quickly. This was the early 1980's and unfortunately I don't remember the specifics. But I think a key point is that we were flipping a switch, not computing the data for those waves. The waves that resulted were the results of a "pulsed" action. This may account for some of the early sounds but I think its also speculative. I wasn't the one directly involved, and theoretically, this at best only explains square and pulse waves, not triangle or sawtooth waves. Will be interested in what others come up with.

Related

Efficient generation of sampled waveforms without aliasing artifacts

For a project of mine I am working with sampled sound generation and I need to create various waveforms at various frequencies. When the waveform is sinusoidal, everything is fine, but when the waveform is rectangular, there is trouble: it sounds as if it came from the eighties, and as the frequency increases, the notes sound wrong. On the 8th octave, each note sounds like a random note from some lower octave.
The undesirable effect is the same regardless of whether I use either one of the following two approaches:
The purely mathematical way of generating a rectangular waveform as sample = sign( secondsPerHalfWave - (timeSeconds % secondsPerWave) ) where secondsPerWave = 1.0 / wavesPerSecond and secondsPerHalfWave = secondsPerWave / 2.0
My preferred way, which is to describe one period of the wave using line segments and to interpolate along these lines. So, a rectangular waveform is described (regardless of sampling rate and regardless of frequency) by a horizontal line from x=0 to x=0.5 at y=1.0, followed by another horizontal line from x=0.5 to x=1.0 at y=-1.0.
From what I gather, the literature considers these waveform generation approaches "naive", resulting in "aliasing", which is the cause of all the undesirable effects.
What this all practically translates to when I look at the generated waveform is that the samples-per-second value is not an exact multiple of the waves-per-second value, so each wave does not have an even number of samples, which in turn means that the number of samples at level 1.0 is often not equal to the number of samples at level -1.0.
I found a certain solution here: https://www.nayuki.io/page/band-limited-square-waves which even includes source code in Java, and it does indeed sound awesome: all undesirable effects are gone, and each note sounds pure and at the right frequency. However, this solution is entirely unsuitable for me, because it is extremely computationally expensive. (Even after I have replaced sin() and cos() with approximations that are ten times faster than Java's built-in functions.) Besides, when I look at the resulting waveforms they look awfully complex, so I wonder whether they can legitimately be called rectangular.
So, my question is:
What is the most computationally efficient method for the generation of periodic waveforms such as the rectangular waveform that does not suffer from aliasing artifacts?
Examples of what the solution could entail:
The computer audio problem of generating correct sample values at discrete time intervals to describe a sound wave seems to me somewhat related to the computer graphics problem of generating correct integer y coordinates at discrete integer x coordinates for drawing lines. The Bresenham line generation algorithm is extremely efficient, (even if we disregard for a moment the fact that it is working with integer math,) and it works by accumulating a certain error term which, at the right time, results in a bump in the Y coordinate. Could some similar mechanism perhaps be used for calculating sample values?
The way sampling works is understood to be as reading the value of the analog signal at a specific, infinitely narrow point in time. Perhaps a better approach would be to consider reading the area of the entire slice of the analog signal between the last sample and the current sample. This way, sampling a 1.0 right before the edge of the rectangular waveform would contribute a little to the sample value, while sampling a -1.0 considerable time after the edge would contribute a lot, thus naturally yielding a point which is between the two extreme values. Would this solve the problem? Does such an algorithm exist? Has anyone ever tried it?
Please note that I have posted this question here as opposed to dsp.stackexchange.com because I do not want to receive answers with preposterous jargon like band-limiting, harmonics and low-pass filters, lagrange interpolations, DC compensations, etc. and I do not want answers that come from the purely analog world or the purely theoretical outer space and have no chance of ever receiving a practical and efficient implementation using a digital computer.
I am a programmer, not a sound engineer, and in my little programmer's world, things are simple: I have an array of samples which must all be between -1.0 and 1.0, and will be played at a certain rate (44100 samples per second.) I have arithmetic operations and trigonometric functions at my disposal, I can describe lines and use simple linear interpolation, and I need to generate the samples extremely efficiently because the generation of a dozen waveforms simultaneously and also the mixing of them together may not consume more than 1% of the total CPU time.
I'm not sure but you may have a few of misconceptions about the nature of aliasing. I base this on your putting the term in quotes, and from the following quote:
What this all practically translates to when I look at the generated
waveform is that the samples-per-second value is not an exact multiple
of the waves-per-second value, so each wave does not have an even
number of samples, which in turn means that the number of samples at
level 1.0 is often not equal to the number of samples at level -1.0.
The samples/sec and waves/sec don't have to be exact multiples at all! One can play back all pitches below the Nyquist. So I'm not clear what your thinking on this is.
The characteristic sound of a square wave arises from the presence of odd harmonics, e.g., with a note of 440 (A5), the square wave sound could be generated by combining sines of 440, 1320, 2200, 3080, 3960, etc. progressing in increments of 880. This begs the question, how many odd harmonics? We could go to infinity, theoretically, for the sharpest possible corner on our square wave. If you simply "draw" this in the audio stream, the progression will continue well beyond the Nyquist number.
But there is a problem in that harmonics that are higher than the Nyquist value cannot be accurately reproduced digitally. Attempts to do so result in aliasing. So, to get as good a sounding square wave as the system is able to produce, one has to avoid the higher harmonics that are present in the theoretically perfect square wave.
I think the most common solution is to use a low-pass filtering algorithm. The computations are definitely more cpu-intensive than just calculating sine waves (or doing FM synthesis, which was my main interest). I am also weak on the math for DSP and concerned about cpu expense, and so, avoided this approach for long time. But it is quite viable and worth an additional look, imho.
Another approach is to use additive synthesis, and include as many sine harmonics as you need to get the tonal quality you want. The problem then is that the more harmonics you add, the more computation you are doing. Also, the top harmonics must be kept track of as they limit the highest note you can play. For example if using 10 harmonics, the note 500Hz would include content at 10500 Hz. That's below the Nyquist value for 44100 fps (which is 22050 Hz). But you'll only be able to go up about another octave (doubles everything) with a 10-harmonic wave and little more before your harmonic content goes over the limit and starts aliasing.
Instead of computing multiple sines on the fly, another solution you might consider is to instead create a set of lookup tables (LUTs) for your square wave. To create the values in the table, iterate through and add the values from the sine harmonics that will safely remain under the Nyquist for the range in which you use the given table. I think a table of something like 1024 values to encode a single period could be a good first guess as to what would work.
For example, I am guestimating, but the table for the octave C4-C5 might use 10 harmonics, the table for C5-C6 only 5, the table for C3-C4 might have 20. I can't recall what this strategy/technique is called, but I do recall it has a name, it is an accepted way of dealing with the situation. Depending on how the transitions sound and the amount of high-end content you want, you can use fewer or more LUTs.
There may be other methods to consider. The wikipedia entry on Aliasing describes a technique it refers to as "bandpass" that seems to be intentionally using aliasing. I don't know what that is about or how it relates to the article you cite.
The Soundpipe library has the concept of a frequency table, which is a data structure that holds a precomputed waveform such as a sine. You can initialize the frequency table with the desired waveform and play it through an oscilator. There is even a module named oscmorph which allows you to morph between two or more wavetables.
This is an example of how to generate a sine wave, taken from Soundpipe's documentation.
int main() {
UserData ud;
sp_data *sp;
sp_create(&sp);
sp_ftbl_create(sp, &ud.ft, 2048);
sp_osc_create(&ud.osc);
sp_gen_sine(sp, ud.ft);
sp_osc_init(sp, ud.osc, ud.ft);
ud.osc->freq = 500;
sp->len = 44100 * 5;
sp_process(sp, &ud, write_osc);
sp_ftbl_destroy(&ud.ft);
sp_osc_destroy(&ud.osc);
sp_destroy(&sp);
return 0;
}

Theory behind Autotune/vocoder

I've been hunting all over the web for material about vocoder or autotune, but haven't got any satisfactory answers. Could someone in a simple way please explain how do you autotune a given sound file using a carrier sound file?
(I'm familiar with ffts, windowing, overlap etc., I just don't get the what do we do when we have the ffts of the carrier and the original sound file which has to be modulated)
EDIT: After looking around a bit more, I finally got to know exactly what I was looking for -- a channel vocoder. The way it works is, it takes two inputs, one a voice signal and the other a musical signal rich in frequency. The musical signal is modulated by the envelope of the voice signal, and the output signal sounds like the voice singing in the musical tone.
Thanks for your help!
Using a phase vocoder to adjust pitch is basically pitch estimation plus interpolation in the frequency domain.
A phase vocoder reconstruction method might resample the frequency spectrum at, potentially, a new FFT bin spacing to shift all the frequencies up or down by some ratio. The phase vocoder algorithm additionally uses information shared between adjacent FFT frames to make sure this interpolation result can create continuous waveforms across frame boundaries. e.g. it adjusts the phases of the interpolation results to make sure that successive sinewave reconstructions are continuous rather than having breaks or discontinuities or phase cancellations between frames.
How much to shift the spectrum up or down is determined by pitch estimation, and calculating the ratio between the estimated pitch of the source and that of the target pitch. Again, phase vocoders use information about any phase differences between FFT frames to help better estimate pitch. This is possible by using more a bit more global information than is available from a single local FFT frame.
Of course, this frequency and phase changing can smear out transient detail and cause various other distortions, so actual phase vocoder products may additionally do all kinds of custom (often proprietary) special case tricks to try and fix some of these problems.
The first step is pitch detection. There are a number of pitch detection algorithms, introduced briefly in wikipedia: http://en.wikipedia.org/wiki/Pitch_detection_algorithm
Pitch detection can be implemented in either frequency domain or time domain. Various techniques in both domains exist with various properties (latency, quality, etc.) In the F domain, it is important to realize that a naive approach is very limiting because of the time/frequency trade-off. You can get around this limitation, but it takes work.
Once you've identified the pitch, you compare it with a desired pitch and determine how much you need to actually pitch shift.
Last step is pitch shifting, which, like pitch detection, can be done in the T or F domain. The "phase vocoder" method other folks mentioned is the F domain method. T domain methods include (in increasing order of quality) OLA, SOLA and PSOLA, some of which you can read about here: http://www.scribd.com/doc/67053489/60/Synchronous-Overlap-and-Add-SOLA
Basically you do an FFT, then in the frequency domain you move the signals to the nearest perfect semitone pitch.

Basic unit of Sound?

If we consider computer graphics to be the art of image synthesis where the basic unit is a pixel.
What is the basic unit of sound synthesis?
[This relates to programming as I want to generate this via a computer program.]
Thanks!
The basic unit is a sample
In a WAVE file, the sample is just an integer specifying where to move the speaker head to.
The sample rate determines how often a new sample is fed to the speakers (I'm not entirely sure how this part works, but it does get converted to an analog signal first). The samples are typically laid out in the file one right after another.
When you plot all the samples with x-axis being time and y-axis being sample_value, you can see the waveform.
In a wave file, samples can (theoretically) be any bit-size from 0-65535, which remains constant throughout the wave file. But typically 16 or 24 bits are used.
Computer graphics can also have vector shapes as basic units, not just pixels. Generally, vector graphics are generated via computer tools while captured data tends to appear as a grid of pixels (corresponding to an array of sensors in a camera or other capture device). Obviously there is considerable crossover between those classifications.
Similarly, there are sampled (such as .WAV) and generative (such as .MIDI) forms of computer audio. In the sampled case, the smallest unit is a single sample. Just like an array of pixels in the brightness, x- and y-dimensions come together to form an image, an array of samples in the loudness and time dimensions come together to form a sound. In the generative case, it will be something more like a single tone rendered in a particular voice just like vector graphics have paths drawn with particular textures.
A pixel can have a value and be encoded in digital bitmap samples. The same properties apply to sound and digital audio samples.
A pixel is a physical device that can only render the amplitudes of 3 frequencies of light (Red, Green, Blue) at a time. A speaker is a physical device that can render the amplitudes of a wide range of frequencies (~40,000) at a time. The bit resolution of a sample (number of bits used to to store the value of a sample) mainly determines how many colors/tones can be rendered - the fidelity of the physical playback device.
Also, as patterns of pixels can be encoded or compressed, most patterns of sound samples are also encoded or compressed (or both).
The fundamental unit of signal processing (of which audio is a special case) would be the sample.
The frequency at which you need to sample a signal depends on the maximum frequency present in the waveform. Sampling theorem states that it is normally sufficient to sample at twice the frequency of the maximum frequency present in the signal.
http://en.wikipedia.org/wiki/Sampling_theorem
The human ear is sensitive to sounds up to around 20kHz (the upper frequency lowers with age). This is why music on CD is sampled at 44kHz.
It is often more useful to think of music as being comprised of individual frequencies.
http://www.phys.unsw.edu.au/jw/sound.spectrum.html
Most sound analysis and creation is based on this idea.
Related concepts:
Psychoacoustics: Human perception of sound. Relates to modern sound compression techniques such as mp3.
Fourier series: How complex waveforms are composed of individual frequencies.
I would say the basic unit of sound synthesis is the sine wave. But your definition of synthesis is perhaps different to what audio people would refer to sound synthesis. Sound systhesis is the creation of sound using the fundamental components of sound.
With sine waves, we can synthesise sounds using many techniques such as substractive synthesis, additive synthesis or FM synthesis.
Fourier theory states that every sound is a summation of sine waves of differing phases, frequencies and amplitudes.
OK, so how do we represent a sine wave on a computer? well, a sine wave will be generated using a buffer(array) of 'samples' that have been generated by a function or read from a table. The same technique applies to any sound captured on a computer.
A 'sample' is typically represented as number between -1 and 1 that directly correlates to the amplitude of a sound at a given moment in time. A typical sound recorded at 16 bit depth, would have 65536 (2pow16) possible amplitude values. When being recorded, typically, a sample will be captured 44.1k per second of sound. This is called the sampling frequency rate, or simply the sample rate.
Upon playback from you computer, each sample will pass though an Digital to Analogue converter and generate a vibration on your pc speaker and will in turn cause your ear to percieve the recorded sound.
Sound can be expressed as several different units, but the most common in synthesis/computer music is decibels (dB), which are a relative logarithmic measure of amplitude. Specifically they are normally relative to the maximum amplitude of the audio system.
When measuring sound in "real life", the units are normally A-weighted Decibels or dB(A).
The frequency of a sound (i.e. its pitch) is its amplitude over time, or in the digital world, its amplitude over samples. The number of samples per unit of real time is called the sampling rate; conventional hi-fi systems have sampling rates of 44 kHz (44,000 samples per second) and synthesis/recording software usually supports up to 96 kHz.
Everything sound in the digital domain can be represented as a waveform with the X-axis representing the time (or sample number) and the Y-axis representing the amplitude.
frequency and amplitude of the wave are what make up sound.
That is for a tone.
Music or for that matter most noise is a composite of multiple simultaneous sound waves superimposed on one another.
The unit for amplitute is the
Bel. (We use tenths of a Bel
therefore the term decibel)
The unit for frequency is the
Hertz.
That being said synthesis of music is a large field.
Bitmapped graphics are based on sampling the amplitude of light in a 2D space, where each sample is digitized to a given bit depth and often converted to a logarithmic representation at a different bit depth. The samples are always positive, since you can't be darker than pure black. Each of these samples is called a pixel.
Sound recording is most often based on sampling the magnitude of sound pressure at a microphone, where the samples are taken at constant time intervals. These samples can be positive or negative with respect to perfect silence. Most often these samples are not converted to a logarithm, even though sound is perceived in a logarithmic fashion just as light is. There is no special term to refer to these samples as there is with pixels.
The Bels and Decibels mentioned by others are useful in the context of measuring peak or average sound levels. They are not used to describe the individual sound samples.
You might also find it useful to know how sound file formats compare to image file formats. WAVE is an uncompressed format specific to Windows and is analogous to BMP. MP3 is a lossy compression analogous to JPEG. FLAC is a lossless compression analogous to 24-bit PNG.
If computer graphics are colored dots in 2 dimensional space representing a 3 dimensional space, then sound synthesis is amplitude values regularly partitioned in time representing musical events.
If you want your result to sound like music (the kind of music most people like at least), then you are either going to use some standard synthesis techniques, or literally waste decades of your life reinventing them from scratch.
The most basic techniques are additive synthesis, in which the individual elements are the frequencies, amplitudes, and phases of sine oscillators; subtractive synthesis, where you work with filter coefficients and a complex input waveform; frequency modulation synthesis, where you work with modulation depths and rates of stages of modulation; granular synthesis where short (hundredths to tenths of a second long) enveloped pieces of a recorded sound or an artificial waveform are combined in immense numbers. Each of these in practice uses parameters that evolve over the course of a note, and often you will mix elements of various techniques into a larger instrument.
I recommend this book, though it doesn't have the math for many concepts it at least lays the ground for the concepts used, and gives a nice overview of the techniques.
You wouldn't waste your time going sample by sample to do music in practice any more than you would waste your time going pixel by pixel to render 3d (in other words yeah go sample by sample if making a tool for other people to make music with, but that is way too low a level if you are interested in the task of making music).
Probably the envelope. A tone/note has a shape described by: attack decay sustain release
The byte, or word, depending on the bit-depth of the sound.

8 bit audio samples to 16 bit

This is my "weekend" hobby problem.
I have some well-loved single-cycle waveforms from the ROMs of a classic synthesizer.
These are 8-bit samples (256 possible values).
Because they are only 8 bits, the noise floor is pretty high. This is due to quantization error. Quantization error is pretty weird. It messes up all frequencies a bit.
I'd like to take these cycles and make "clean" 16-bit versions of them. (Yes, I know people love the dirty versions, so I'll let the user interpolate between dirty and clean to whatever degree they like.)
It sounds impossible, right, because I've lost the low 8 bits forever, right? But this has been in the back of my head for a while, and I'm pretty sure I can do it.
Remember that these are single-cycle waveforms that just get repeated over and over for playback, so this is a special case. (Of course, the synth does all kinds of things to make the sound interesting, including envelopes, modulations, filters cross-fading, etc.)
For each individual byte sample, what I really know is that it's one of 256 values in the 16-bit version. (Imagine the reverse process, where the 16-bit value is truncated or rounded to 8 bits.)
My evaluation function is trying to get the minimum noise floor. I should be able to judge that with one or more FFTs.
Exhaustive testing would probably take forever, so I could take a lower-resolution first pass. Or do I just randomly push randomly chosen values around (within the known values that would keep the same 8-bit version) and do the evaluation and keep the cleaner version? Or is there something faster I can do? Am I in danger of falling into local minimums when there might be some better minimums elsewhere in the search space? I've had that happen in other similar situations.
Are there any initial guesses I can make, maybe by looking at neighboring values?
Edit: Several people have pointed out that the problem is easier if I remove the requirement that the new waveform would sample to the original. That's true. In fact, if I'm just looking for cleaner sounds, the solution is trivial.
You could put your existing 8-bit sample into the high-order byte of your new 16-bit sample, and then use the low order byte to linear interpolate some new 16 bit datapoints between each original 8-bit sample.
This would essentially connect a 16 bit straight line between each of your original 8-bit samples, using several new samples. It would sound much quieter than what you have now, which is a sudden, 8-bit jump between the two original samples.
You could also try apply some low-pass filtering.
Going with the approach in your question, I would suggest looking into hill-climbing algorithms and the like.
http://en.wikipedia.org/wiki/Hill_climbing
has more information on it and the sidebox has links to other algorithms which may be more suitable.
AI is like alchemy - we never reached the final goal, but lots of good stuff came out along the way.
Well, I would expect some FIR filtering (IIR if you really need processing cycles, but FIR can give better results without instability) to clean up the noise. You would have to play with it to get the effect you want but the basic problem is smoothing out the sharp edges in the audio created by sampling it at 8 bit resolutions. I would give a wide birth to the center frequency of the audio and do a low pass filter, and then listen to make sure I didn't make it sound "flat" with the filter I picked.
It's tough though, there is only so much you can do, the lower 8 bits is lost, the best you can do is approximate it.
It's almost impossible to get rid of noise that looks like your signal. If you start tweeking stuff in your frequency band it will take out the signal of interest.
For upsampling, since you're already using an FFT, you can add zeros to the end of the frequency domain signal and do an inverse FFT. This completely preserves the frequecy and phase information of the original signal, although it spreads the same energy over more samples. If you shift it 8bits to be a 16bit samples first, this won't be a too much of a problem. But I usually kick it up by an integer gain factor before doing the transform.
Pete
Edit:
The comments are getting a little long so I'll move some to the answer.
The peaks in the FFT output are harmonic spikes caused by the quantitization. I tend to think of them differently than the noise floor. You can dither as someone mentioned and eliminate the amplitude of the harmonic spikes and flatten out the noise floor, but you loose over all signal to noise on the flat part of your noise floor. As far as the FFT is concerned. When you interpolate using that method, it retains the same energy and spreads over more samples, this reduces the amplitude. So before doing the inverse, give your signal more energy by multipling by a gain factor.
Are the signals simple/complex sinusoids, or do they have hard edges? i.e. Triangle, square waves, etc. I'm assuming they have continuity from cycle to cycle, is that valid? If so you can also increase your FFT resolution to more precisely pinpoint frequencies by increasing the number of waveform cycles fed to your FFT. If you can precisely identify the frequencies use, assuming they are somewhat discrete, you may be able to completely recreate the intended signal.
The 16-bit to 8-bit via truncation requirement will produce results that do not match the original source. (Thus making finding an optimal answer more difficult.) Typically you would produce a fixed point waveform by attempting to "get the closest match" that means rounding to the nearest number (trunking is a floor operation). That is most likely how they were originally generated. Adding 0.5 (in this case 0.5 is 128) and then trunking the output would allow you to generate more accurate results. If that's not a worry then ok, but it definitely will have a negative effect on accuracy.
UPDATED:
Why? Because the goal of sampling a signal is to be able to as close a possible reproduce the signal. If conversion threshold is set poorly on the sampling all you're error is to one side of signal and not well distributed and centered about zero. On such systems you typically try to maximize the use the availiable dynamic range, particularly if you have low resolution such as an 8-bit ADC.
Band limited versions? If they are filtered at different frequencies, I'd suspect it was to allow you to play the same sound with out distortions when you went too far out from the other variation. Kinda like mipmapping in graphics.
I suspect the two are the same signal with different aliasing filters applied, this may be useful in reproducing the original. They should be the same base signal with different convolutions applied.
There might be a simple approach taking advantange of the periodicity of the waveforms. How about if you:
Make a 16-bit waveform where the high bytes are the waveform and the low bytes are zero - call it x[n].
Calculate the discrete Fourier transform of x[n] = X[w].
Make a signal Y[w] = (dBMag(X[w]) > Threshold) ? X[w] : 0, where dBMag(k) = 10*log10(real(k)^2 + imag(k)^2), and Threshold is maybe 40 dB, based on 8 bits being roughly 48 dB dynamic range, and allowing ~1.5 bits of noise.
Inverse transform Y[w] to get y[n], your new 16 bit waveform.
If y[n] doesn't sound nice, dither it with some very low level noise.
Notes:
A. This technique only works in the original waveforms are exactly periodic!
B. Step 5 might be replaced with setting the "0" values to random noise in Y[w] in step 3, you'd have to experiment a bit to see what works better.
This seems easier (to me at least) than an optimization approach. But truncated y[n] will probably not be equal to your original waveforms. I'm not sure how important that constraint is. I feel like this approach will generate waveforms that sound good.

Pitch recognition of musical notes on a smart phone

With limited resources such as slower CPUs, code size and RAM, how best to detect the pitch of a musical note, similar to what an electronic or software tuner would do?
Should I use:
Kiss FFT
FFTW
Discrete Wavelet Transform
autocorrelation
zero crossing analysis
octave-spaced filters
other?
In a nutshell, what I am trying to do is to recognize a single musical note, two octaves below middle-C to two octaves above, played on any (reasonable) instrument. I'd like to be within 20% of the semitone - in other words, if the user plays too flat or too sharp, I need to distinguish that. However, I will not need the accuracy required for tuning.
If you don't need that much accuracy, an FFT could be sufficient. Window the chunk of audio first so that you get well-defined peaks, then find the first significant peak.
Bin width = sampling rate / FFT size:
Fundamentals range from 20 Hz to 7 kHz, so a sampling rate of 14 kHz would be enough. The next "standard" sampling rate is 22050 Hz.
The FFT size is then determined by the precision you want. FFT output is linear in frequency, while musical tones are logarithmic in frequency, so the worst case precision will be at low frequencies. For 20% of a semitone at 20 Hz, you need a width of 1.2 Hz, which means an FFT length of 18545. The next power of two is 215 = 32768. This is 1.5 seconds of data, and takes my laptop's processor 3 ms to calculate.
This won't work with signals that have a "missing fundamental", and finding the "first significant" peak is somewhat difficult (since harmonics are often higher than the fundamental), but you can figure out a way that suits your situation.
Autocorrelation and harmonic product spectrum are better at finding the true fundamental for a wave instead of one of the harmonics, but I don't think they deal as well with inharmonicity, and most instruments like piano or guitar are inharmonic (harmonics are slightly sharp from what they should be). It really depends on your circumstances, though.
Also, you can save even more processor cycles by computing only within a specific frequency band of interest, using the Chirp-Z transform.
I've written up a few different methods in Python for comparison purposes.
If you want to do pitch recognition in realtime (and accurate to within 1/100 of a semi-tone), your only real hope is the zero-crossing approach. And it's a faint hope, sorry to say. Zero-crossing can estimate pitch from just a couple of wavelengths of data, and it can be done with a smartphone's processing power, but it's not especially accurate, as tiny errors in measuring the wavelengths result in large errors in the estimated frequency. Devices like guitar synthesizers (which deduce the pitch from a guitar string with just a couple of wavelengths) work by quantizing the measurements to notes of the scale. This may work for your purposes, but be aware that zero-crossing works great with simple waveforms, but tends to work less and less well with more complex instrument sounds.
In my application (a software synthesizer that runs on smartphones) I use recordings of single instrument notes as the raw material for wavetable synthesis, and in order to produce notes at a particular pitch, I need to know the fundamental pitch of a recording, accurate to within 1/1000 of a semi-tone (I really only need 1/100 accuracy, but I'm OCD about this). The zero-crossing approach is much too inaccurate for this, and FFT-based approaches are either way too inaccurate or way too slow (or both sometimes).
The best approach that I've found in this case is to use autocorrelation. With autocorrelation you basically guess the pitch and then measure the autocorrelation of your sample at that corresponding wavelength. By scanning through the range of plausible pitches (say A = 55 Hz thru A = 880 Hz) by semi-tones, I locate the most-correlated pitch, then do a more finely-grained scan in the neighborhood of that pitch to get a more accurate value.
The approach best for you depends entirely on what you're trying to use this for.
I'm not familiar with all the methods you mention, but what you choose should depend primarily on the nature of your input data. Are you analysing pure tones, or does your input source have multiple notes? Is speech a feature of your input? Are there any limitations on the length of time you have to sample the input? Are you able to trade off some accuracy for speed?
To some extent what you choose also depends on whether you would like to perform your calculations in time or in frequency space. Converting a time series to a frequency representation takes time, but in my experience tends to give better results.
Autocorrelation compares two signals in the time domain. A naive implementation is simple but relatively expensive to compute, as it requires pair-wise differencing between all points in the original and time-shifted signals, followed by differentiation to identify turning points in the autocorrelation function, and then selection of the minimum corresponding to the fundamental frequency. There are alternative methods. For example, Average Magnitude Differencing is a very cheap form of autocorrelation, but accuracy suffers. All autocorrelation techniques run the risk of octave errors, since peaks other than the fundamental exist in the function.
Measuring zero-crossing points is simple and straightforward, but will run into problems if you have multiple waveforms present in the signal.
In frequency-space, techniques based on FFT may be efficient enough for your purposes. One example is the harmonic product spectrum technique, which compares the power spectrum of the signal with downsampled versions at each harmonic, and identifies the pitch by multiplying the spectra together to produce a clear peak.
As ever, there is no substitute for testing and profiling several techniques, to empirically determine what will work best for your problem and constraints.
An answer like this can only scratch the surface of this topic. As well as the earlier links, here are some relevant references for further reading.
Summary of pitch detection algorithms (Wikipedia)
Pros and cons of Autocorrelation vs Harmonic Product Spectrum
A high-level overview of pitch detection methods
In my project danstuner, I took code from Audacity. It essentially took an FFT, then found the peak power by putting a cubic curve on the FFT and finding the peak of that curve. Works pretty well, although I had to guard against octave-jumping.
See Spectrum.cpp.
Zero crossing won't work because a typical sound has harmonics and zero-crossings much more than the base frequency.
Something I experimented with (as a home side project) was this:
Sample the sound with ADC at whatever sample rate you need.
Detect the levels of the short-term positive and negative peaks of the waveform (sliding window or similar). I.e. an envelope detector.
Make a square wave that goes high when the waveform goes within 90% (or so) of the positive envelope, and goes low when the waveform goes within 90% of the negative envelope. I.e. a tracking square wave with hysteresis.
Measure the frequency of that square wave with straight-forward count/time calculations, using as many samples as you need to get the required accuracy.
However I found that with inputs from my electronic keyboard, for some instrument sounds it managed to pick up 2× the base frequency (next octave). This was a side project and I never got around to implementing a solution before moving on to other things. But I thought it had promise as being much less CPU load than FFT.

Resources