Pitch recognition of musical notes on a smart phone, pt. 2

Pitch recognition of musical notes on a smart phone, pt. 2 - audio

As a follow-up to my previous question, if I want my smartphone application to detect a certain musical note, and I only need to know whether the incoming sound is that musical note or not, with a certain amount of fuzziness, to allow the note to be off-key by x cents.
Given that, is there a superior method over others for speed and accuracy? That is, by knowing that the note you are looking for is, say, a #C3, how best to tell if that note is present or not? I'm assuming that looking for a single note would be easier than separating out all waveforms, and then looking at the results for the fundamental frequency.
In the responses to my original question, one respondent suggested that autocorrelation might work well if you know that the notes are within a certain range. I wonder if autocorrelation would then work even better, if you only have to check for the presence or absence of a certain note (+/- x cents).
Those methods being:
Kiss FFT
FFTW
Discrete Wavelet Transform
autocorrelation
zero crossing analysis
octave-spaced filters
DWT
Any thoughts would be appreciated.

As you describe it, you just need to determine if a particular pitch is present. A very simple (fast) detector would just record the equivalent of one period of the waveform, then record another period and correlate them, like an oversimplified (single-lag) autocorrelation. If there's a high match, you know the waveform being recorded is repeating at around the same period, or a harmonic of it.
For instance, to detect 1 kHz, record 1 ms of audio (48 samples at 48 kHz), then record another 1 ms, and compare them (correlate = multiply all samples and sum). If they line up (correlation above some threshold), then you're listening to 1 kHz, 2 kHz, 3 kHz, or some other multiple. Doing several periods would give you more confidence on the match.
A true autocorrelation would tell you which harmonic, specifically, if that's important to you.

Related

DSP - Filter sweep effect

I'm implementing a 'filter sweep' effect (I don't know if it's called like that). What I do is basically create a low-pass filter and make it 'move' along a certain frequency range.
To calculate the filter cut-off frequency at a given moment I use a user-provided linear function, which yields values between 0 and 1.
My first attempt was to directly map the values returned by the linear function to the range of frequencies, as in cf = freqRange * lf(x). Although it worked ok it looked as if the sweep ran much faster when moving through low frequencies and then slowed down during its way to the high frequency zone. I'm not sure why is this but I guess it's something to do with human hearing perceiving changes in frequency in a non-linear manner.
My next attempt was to move the filter's cut-off frequency in a logarithmic way. It works much better now but I still feel that the filter doesn't move at a constant perceived speed through the range of frequencies.
How should I divide the frequency space to obtain a constant perceived sweep speed?
Thanks in advance.

The frequency sweep effect you're referring to is likely a wah-wah filter, named for the ubiquitous wah-wah pedal.
We hear frequency in terms of octaves, and sweeping through octaves with a logarithmic scale is the way to linearize it. Not to sound dismissive, but it sounds like what you're doing is physically and mathematically correct. (You should spent as much time between 200 and 400 Hz as you do between 2000 and 4000 Hz, etc.) You just don't like how it sounds. And that's quite okay on both counts -- audio is highly subjective.
To mix things up a bit, one option would be to try the Bark scale, which is based on psychoacoustics and the structure of the ear. As I understand it, this is designed to spend equal amounts of time in each of your ear's internal "bandpass filters".
You could always try a quadratic or cubic function between 0 and 1. Audio potentiometers often use a few piecewise quadratic or cubic sections to get their mapping.

Winging it, but try this:
http://en.wikipedia.org/wiki/Physics_of_music#Scales "The following table shows the ratios between the frequencies of all the notes of the just major scale and the fixed frequency of the first note of the scale."
There is then a chart showing fractional values between 1 and 2, and if you tweak your timing to match, you may get what you wish. While the overall progression is still logarithmic, the stepping between each one should divide up into equal stepped 8ths (a bit jumpy).
Put another way, every half second adjust one note up. Each octave (I think) will cover twice the frequency range of the prior octave.
EDIT: Also, you'll find the frequencies here: http://en.wikipedia.org/wiki/Middle_C#Designation_by_octave (doesn't the programmer in you wish that C0 was exactly 16hz?)

How can I calculate audio dB level?

I want to calculate room noise level with the computer's microphone. I record noise as an audio file, but how can I calculate the noise dB level?
I don't know how to start!

All the previous answers are correct if you want a technically accurate or scientifically valuable answer. But if you just want a general estimation of comparative loudness, like if you want to check whether the dog is barking or whether a baby is crying and you want to specify the threshold in dB, then it's a relatively simple calculation.
Many wave-file editors have a vertical scale in decibels. There is no calibration or reference measurements, just a simple calculation:
dB = 20 * log10(amplitude)
The amplitude in this case is expressed as a number between 0 and 1, where 1 represents the maximum amplitude in the sound file. For example, if you have a 16 bit sound file, the amplitude can go as high as 32767. So you just divide the sample by 32767. (We work with absolute values, positive numbers only.) So if you have a wave that peaks at 14731, then:
amplitude = 14731 / 32767
= 0.44
dB = 20 * log10(0.44)
= -7.13
But there are very important things to consider, specifically the answers given by the others.
1) As Jörg W Mittag says, dB is a relative measurement. Since we don't have calibrations and references, this measurement is only relative to itself. And by that I mean that you will be able to see that the sound in the sound file at this point is 3 dB louder than at that point, or that this spike is 5 decibels louder than the background. But you cannot know how loud it is in real life, not without the calibrations that the others are referring to.
2) This was also mentioned by PaulR and user545125: Because you're evaluating according to a recorded sound, you are only measuring the sound at the specific location where the microphone is, biased to the direction the microphone is pointing, and filtered by the frequency response of your hardware. A few feet away, a human listening with human ears will get a totally different sound level and different frequencies.
3) Without calibrated hardware, you cannot say that the sound is 60dB or 89dB or whatever. All that this calculation can give you is how the peaks in the sound file compares to other peaks in the same sound file.
If this is all you want, then it's fine, but if you want to do something serious, like determine whether the noise level in a factory is safe for workers, then listen to Paul, user545125 and Jörg.

You do need reference hardware (i.e., a reference mic) to calculate noise level (dB SPL, or sound pressure level). One thing Radio Shack sells is a $50 dB SPL meter. If you're doing scientific calculations, I wouldn't use it. But if the goal is to get a general idea of a weighted measurement (dBA or dBC) of the sound pressure in a given environment, then it might be useful. As a sound engineer, I use mine all the time to see how much sound volume I'm generating while I mix. It's usually accurate to within 2 dB.
That's my answer. The rest is FYI stuff.
Jorg is correct that dB SPL is a relative measurement. All decibel measurements are. But you've implied a reference of 0 dB SPL, or 20 micropascals, scientifically agreed to be the most quiet sound a human ear can detect (though, understandably, what a person can actually hear is very difficult to determine). This, according to Wikipedia, is about the sound of a flying mosquito from about 10 feet away (http://en.wikipedia.org/wiki/Decibel).
By assuming you don't understand decibels, I think Jorg is just trying to out-geek you. He clearly didn't give you a practical answer. :-)
Unweighted measurements (dB, instead of dBA or dBC) are rarely used, because most sound pressure is not detected by the human ear. In a given office environment, there is usually 80-100 dB SPL (sound pressure level). To give you an idea of exactly how much is not heard, in the U.S., occupational regulations limit noise exposure to 80 dBA for a given 8-hour work shift (80 dBA is about the background noise level of your average downtown street - difficult, but not impossible to talk over). 85 dBA is oppressive, and at 90, most people are trying to get away. So the difference between 80 dB and 80 dBA is very significant -- 80 dBA is difficult to talk over, and 80 dB is quite peaceful. :-)
So what is 'A' weighting? 'A' weighting compensates for the fact that we don't perceive lower frequency sounds as well as high frequency sounds (we hear 20 Hz to 20,000 Hz). There's a lot of low-end rumble that our ears/brains pretty much ignore. In addition, we're more sensitive to a certain midrange (1000 Hz to 4000 Hz). Most agree that this frequency range contains the sounds of consonants of speech (vowels happen at a much lower frequency). Imagine talking with just vowels. You can't understand anything. Thus, the ability of a human to be able to communicate (conventionally) rests in the 1kHz-5kHz bump in hearing sensitivity. Interestingly, this is why most telephone systems only transmit 300 Hz to 3000 Hz. It was determined that this was the minimal response needed to understand the voice on the other end.
But I think that's more than you wanted to know. Hope it helps. :-)

You can't easily measure absolute dB SPL, since your microphone and analogue hardware are not calibrated. You may be able to do an approximate calibration for a particular hardware set up but you would need to repeat this for every different microphone and hardware set up that you plan to support.
If you do have some kind of SPL reference source that you can use then then it gets easier:
use your reference source to generate a tone at a known dB SPL - measure this
measure the ambient noise
calculate noise level = 20 * log10 (V_noise / V_ref) + dB_ref
Of course this assumes that the frequency response of your microphone and audio hardware is reasonably flat and that you just want a flat (unweighted) noise figure. If you want a weighted (e.g. A-weight) noise figure then you'll have to do rather more processing.

According to Merchant et al. (section 3.2 in the appendix: "Measuring acoustic habitats", Methods in Ecology and Evolution, 2015), you can actually calculate absolute, calibrated SPL values using manufacturer specifications by subtracting a correction term S to your relative (scaled to maximum) SPL values:
S = M + G + 20*log10(1/Vadc) + 20*log10(2^Nbit-1)
where M is the sensitivity of the transducer (microphone) re 1 V/Pa. G is the gain applied by the user. Vadc is the zero-to-peak voltage, given by multiplying the rms ADC voltage by a conversion factor of squareroot(2). Nbit is the bit sampling depth.
The last term is necessary if your system scales the amplitude by its maximum.
The correction will be more accurate using end-to-end calibration with sound calibrators.
Note that the formula above is dependent on frequency, but you could apply it over a wider frequency range if your microphone has a flat frequency response.

You can't. dB is a relative unit, IOW it is a unit for comparing two measurements against each other. You can only say that measurement A is x dB louder than measurement B, but in your case you only have one measurement. Therefore, it simply isn't possible to calculate the dB level.

The short answer is: you cannot do sound level measurements with your laptop, nor with your cellphone, etc., for all the reasons outlined previously, plus the fact your cellphone, laptop, etc. use compression algorithms to assure that everything recorded is within the hardware capability. So, if for example you measure a sound then run it through signal processing software such as Head Artemis or LMS Test.Lab, the indicated sound pressure level will always be in the neighborhood of 80 dB(A) regardless of the true level. I can say this from having used cellphone or laptop audio to get an idea of a noise frequency spectrum, while taking level measurements using a calibrated sound level meter. Interestingly, Radio Shack used to sell a microphone intended for speech input while videoconferencing that had very flat frequency response over a broad range, and only cost about $15.

I use a sound level calibrator.
It produces 94 dB or 114dB at 1 KHz
wich is a frecuency where weighting
filters share the same level.
With calibrator at 114dB I adjust mic gain to reach almost full scale
input simply watching a sound card based virtual osciloscope.
Now I know Vref # 114dB.
I developed a simple software based SPL meter
that can be provided if needed. You can use REW too.
You hace to know that PC hardware hardly
reaches 60 dB of dynamic range so calibrating
#114 dB it wont read less than 54dB, wich
is pretty high if you consider that sleeping
is good with less than 35 dB A.
In this case you can calibrate at 94dB
and then you may measure down to 34dB
but again you will hit pc and mic self noise
wich may you prevent to reach such low levels.
Anyway, once calibrated, measures at 114dB
and 94dB should read fine.
Note: the lab standard pistonphone calibrator operates at 250 Hz.

Well! I Used RobertT's Method But It Always Giving Me Oveflow Exception, Then I Used:- int dB = -36 - (value * -1), The Exception Gone, I Don't Know Whether It's Telling dB Values, If You Knew Using Code Given Below, Please Comment Me Whether it's A dB Value or not.
VB.NET:-
Dim dB As Integer = -36 - (9 * -1)
C#:-
int dB = -36 - (9 * -1)

Pitch recognition of musical notes on a smart phone

With limited resources such as slower CPUs, code size and RAM, how best to detect the pitch of a musical note, similar to what an electronic or software tuner would do?
Should I use:
Kiss FFT
FFTW
Discrete Wavelet Transform
autocorrelation
zero crossing analysis
octave-spaced filters
other?
In a nutshell, what I am trying to do is to recognize a single musical note, two octaves below middle-C to two octaves above, played on any (reasonable) instrument. I'd like to be within 20% of the semitone - in other words, if the user plays too flat or too sharp, I need to distinguish that. However, I will not need the accuracy required for tuning.

If you don't need that much accuracy, an FFT could be sufficient. Window the chunk of audio first so that you get well-defined peaks, then find the first significant peak.
Bin width = sampling rate / FFT size:
Fundamentals range from 20 Hz to 7 kHz, so a sampling rate of 14 kHz would be enough. The next "standard" sampling rate is 22050 Hz.
The FFT size is then determined by the precision you want. FFT output is linear in frequency, while musical tones are logarithmic in frequency, so the worst case precision will be at low frequencies. For 20% of a semitone at 20 Hz, you need a width of 1.2 Hz, which means an FFT length of 18545. The next power of two is 215 = 32768. This is 1.5 seconds of data, and takes my laptop's processor 3 ms to calculate.
This won't work with signals that have a "missing fundamental", and finding the "first significant" peak is somewhat difficult (since harmonics are often higher than the fundamental), but you can figure out a way that suits your situation.
Autocorrelation and harmonic product spectrum are better at finding the true fundamental for a wave instead of one of the harmonics, but I don't think they deal as well with inharmonicity, and most instruments like piano or guitar are inharmonic (harmonics are slightly sharp from what they should be). It really depends on your circumstances, though.
Also, you can save even more processor cycles by computing only within a specific frequency band of interest, using the Chirp-Z transform.
I've written up a few different methods in Python for comparison purposes.

If you want to do pitch recognition in realtime (and accurate to within 1/100 of a semi-tone), your only real hope is the zero-crossing approach. And it's a faint hope, sorry to say. Zero-crossing can estimate pitch from just a couple of wavelengths of data, and it can be done with a smartphone's processing power, but it's not especially accurate, as tiny errors in measuring the wavelengths result in large errors in the estimated frequency. Devices like guitar synthesizers (which deduce the pitch from a guitar string with just a couple of wavelengths) work by quantizing the measurements to notes of the scale. This may work for your purposes, but be aware that zero-crossing works great with simple waveforms, but tends to work less and less well with more complex instrument sounds.
In my application (a software synthesizer that runs on smartphones) I use recordings of single instrument notes as the raw material for wavetable synthesis, and in order to produce notes at a particular pitch, I need to know the fundamental pitch of a recording, accurate to within 1/1000 of a semi-tone (I really only need 1/100 accuracy, but I'm OCD about this). The zero-crossing approach is much too inaccurate for this, and FFT-based approaches are either way too inaccurate or way too slow (or both sometimes).
The best approach that I've found in this case is to use autocorrelation. With autocorrelation you basically guess the pitch and then measure the autocorrelation of your sample at that corresponding wavelength. By scanning through the range of plausible pitches (say A = 55 Hz thru A = 880 Hz) by semi-tones, I locate the most-correlated pitch, then do a more finely-grained scan in the neighborhood of that pitch to get a more accurate value.
The approach best for you depends entirely on what you're trying to use this for.

I'm not familiar with all the methods you mention, but what you choose should depend primarily on the nature of your input data. Are you analysing pure tones, or does your input source have multiple notes? Is speech a feature of your input? Are there any limitations on the length of time you have to sample the input? Are you able to trade off some accuracy for speed?
To some extent what you choose also depends on whether you would like to perform your calculations in time or in frequency space. Converting a time series to a frequency representation takes time, but in my experience tends to give better results.
Autocorrelation compares two signals in the time domain. A naive implementation is simple but relatively expensive to compute, as it requires pair-wise differencing between all points in the original and time-shifted signals, followed by differentiation to identify turning points in the autocorrelation function, and then selection of the minimum corresponding to the fundamental frequency. There are alternative methods. For example, Average Magnitude Differencing is a very cheap form of autocorrelation, but accuracy suffers. All autocorrelation techniques run the risk of octave errors, since peaks other than the fundamental exist in the function.
Measuring zero-crossing points is simple and straightforward, but will run into problems if you have multiple waveforms present in the signal.
In frequency-space, techniques based on FFT may be efficient enough for your purposes. One example is the harmonic product spectrum technique, which compares the power spectrum of the signal with downsampled versions at each harmonic, and identifies the pitch by multiplying the spectra together to produce a clear peak.
As ever, there is no substitute for testing and profiling several techniques, to empirically determine what will work best for your problem and constraints.
An answer like this can only scratch the surface of this topic. As well as the earlier links, here are some relevant references for further reading.
Summary of pitch detection algorithms (Wikipedia)
Pros and cons of Autocorrelation vs Harmonic Product Spectrum
A high-level overview of pitch detection methods

In my project danstuner, I took code from Audacity. It essentially took an FFT, then found the peak power by putting a cubic curve on the FFT and finding the peak of that curve. Works pretty well, although I had to guard against octave-jumping.
See Spectrum.cpp.

Zero crossing won't work because a typical sound has harmonics and zero-crossings much more than the base frequency.
Something I experimented with (as a home side project) was this:
Sample the sound with ADC at whatever sample rate you need.
Detect the levels of the short-term positive and negative peaks of the waveform (sliding window or similar). I.e. an envelope detector.
Make a square wave that goes high when the waveform goes within 90% (or so) of the positive envelope, and goes low when the waveform goes within 90% of the negative envelope. I.e. a tracking square wave with hysteresis.
Measure the frequency of that square wave with straight-forward count/time calculations, using as many samples as you need to get the required accuracy.
However I found that with inputs from my electronic keyboard, for some instrument sounds it managed to pick up 2× the base frequency (next octave). This was a side project and I never got around to implementing a solution before moving on to other things. But I thought it had promise as being much less CPU load than FFT.

How do you analyse the fundamental frequency of a PCM or WAV sample? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I have a sample held in a buffer from DirectX. It's a sample of a note played and captured from an instrument. How do I analyse the frequency of the sample (like a guitar tuner does)? I believe FFTs are involved, but I have no pointers to HOWTOs.

The FFT can help you figure out where the frequency is, but it can't tell you exactly what the frequency is. Each point in the FFT is a "bin" of frequencies, so if there's a peak in your FFT, all you know is that the frequency you want is somewhere within that bin, or range of frequencies.
If you want it really accurate, you need a long FFT with a high resolution and lots of bins (= lots of memory and lots of computation). You can also guess the true peak from a low-resolution FFT using quadratic interpolation on the log-scaled spectrum, which works surprisingly well.
If computational cost is most important, you can try to get the signal into a form in which you can count zero crossings, and then the more you count, the more accurate your measurement.
None of these will work if the fundamental is missing, though. :)
I've outlined a few different algorithms here, and the interpolated FFT is usually the most accurate (though this only works when the fundamental is the strongest harmonic - otherwise you need to be smarter about finding it), with zero-crossings a close second (though this only works for waveforms with one crossing per cycle). Neither of these conditions is typical.
Keep in mind that the partials above the fundamental frequency are not perfect harmonics in many instruments, like piano or guitar. Each partial is actually a little bit out of tune, or inharmonic. So the higher-frequency peaks in the FFT will not be exactly on the integer multiples of the fundamental, and the wave shape will change slightly from one cycle to the next, which throws off autocorrelation.
To get a really accurate frequency reading, I'd say to use the autocorrelation to guess the fundamental, then find the true peak using quadratic interpolation. (You can do the autocorrelation in the frequency domain to save CPU cycles.) There are a lot of gotchas, and the right method to use really depends on your application.

There are also other algorithms that are time-based, not frequency based.
Autocorrelation is a relatively simple algorithm for pitch detection.
Reference: http://cnx.org/content/m11714/latest/
I have written c# implementations of autocorrelation and other algorithms that are readable. Check out http://code.google.com/p/yaalp/.
http://code.google.com/p/yaalp/source/browse/#svn/trunk/csaudio/WaveAudio/WaveAudio
Lists the files, and PitchDetection.cs is the one you want.
(The project is GPL; so understand the terms if you use the code).

Guitar tuners don't use FFT's or DFT's. Usually they just count zero crossings. You might not get the fundamental frequency because some waveforms have more zero crossings than others but you can usually get a multiple of the fundamental frequency that way. That's enough to get the note although you might be one or more octaves off.
Low pass filtering before counting zero crossings can usually get rid of the excess zero crossings. Tuning the low pass filter requires some knowlegde of the range of frequency you want to detect though

FFTs (Fast-Fourier Transforms) would indeed be involved. FFTs allow you to approximate any analog signal with a sum of simple sine waves of fixed frequencies and varying amplitudes. What you'll essentially be doing is taking a sample and decomposing it into amplitude->frequency pairs, and then taking the frequency that corresponds to the highest amplitude.
Hopefully another SO reader can fill the gaps I'm leaving between the theory and the code!

A little more specifically:
If you start with the raw PCM in an input array, what you basically have is a graph of wave amplitude vs time.Doing a FFT will transform that to a frequency histogram for frequencies from 0 to 1/2 the input sampling rate. The value of each entry in the result array will be the 'strength' of the corresponding sub-frequency.
So to find the root frequency given an input array of size N sampled at S samples/second:
FFT(N, input, output);
max = max_i = 0;
for(i=0;i<N;i++)
if (output[i]>max) max_i = i;
root = S/2.0 * max_i/N ;

Retrieval of fundamental frequencies in a PCM audio signal is a difficult task, and there would be a lot to talk about it...
Anyway, usually time-based method are not suitable for polyphonic signals, because a complex wave given by the sum of different harmonic components due to multiple fundamental frequencies has a zero-crossing rate which depends only from the lowest frequency component...
Also in the frequency domain the FFT is not the most suitable method, since frequency spacing between notes follow an exponential scale, not linear. This means that a constant frequency resolution, used in the FFT method, may be insufficient to resolve lower frequency notes if the size of the analysis window in the time domain is not large enough.
A more suitable method would be a constant-Q transform, which is DFT applied after a process of low-pass filtering and decimation by 2 (i.e. halving each step the sampling frequency) of the signal, in order to obtain different subbands with different frequency resolution. In this way the calculation of DFT is optimized. The trouble is that also time resolution is variable, and increases for the lower subbands...
Finally, if we are trying to estimate the fundamental frequency of a single note, FFT/DFT methods are ok. Things change for a polyphonic context, in which partials of different sounds overlap and sum/cancel their amplitude depending from their phase difference, and so a single spectral peak could belong to different harmonic contents (belonging to different notes). Correlation in this case don't give good results...

Apply a DFT and then derive the fundamental frequency from the results. Googling around for DFT information will give you the information you need -- I'd link you to some, but they differ greatly in expectations of math knowledge.
Good luck.

Downsampling and applying a lowpass filter to digital audio

I've got a 44Khz audio stream from a CD, represented as an array of 16 bit PCM samples. I'd like to cut it down to an 11KHz stream. How do I do that? From my days of engineering class many years ago, I know that the stream won't be able to describe anything over 5500Hz accurately anymore, so I assume I want to cut everything above that out too. Any ideas? Thanks.
Update: There is some code on this page that converts from 48KHz to 8KHz using a simple algorithm and a coefficient array that looks like { 1, 4, 12, 12, 4, 1 }. I think that is what I need, but I need it for a factor of 4x rather than 6x. Any idea how those constants are calculated? Also, I end up converting the 16 byte samples to floats anyway, so I can do the downsampling with floats rather than shorts, if that helps the quality at all.

Read on FIR and IIR filters. These are the filters that use a coefficent array.
If you do a google search on "FIR or IIR filter designer" you will find lots of software and online-applets that does the hard job (getting the coefficients) for you.
EDIT:
This page here ( http://www-users.cs.york.ac.uk/~fisher/mkfilter/ ) lets you enter the parameters of your filter and will spit out ready to use C-Code...

You're right in that you need apply lowpass filtering on your signal. Any signal over 5500 Hz will be present in your downsampled signal but 'aliased' as another frequency so you'll have to remove those before downsampling.
It's a good idea to do the filtering with floats. There are fixed point filter algorithms too but those generally have quality tradeoffs to work. If you've got floats then use them!
Using DFT's for filtering is generally overkill and it makes things more complicated because dft's are not a contiuous process but work on buffers.
Digital filters generally come in two tastes. FIR and IIR. The're generally the same idea but IIF filters use feedback loops to achieve a steeper response with far less coefficients. This might be a good idea for downsampling because you need a very steep filter slope there.
Downsampling is sort of a special case. Because you're going to throw away 3 out of 4 samples there's no need to calculate them. There is a special class of filters for this called polyphase filters.
Try googling for polyphase IIR or polyphase FIR for more information.

Notice (in additions to the other comments) that the simple-easy-intuitive approach "downsample by a factor of 4 by replacing each group of 4 consecutive samples by the average value", is not optimal but is nevertheless not wrong, nor practically nor conceptually. Because the averaging amounts precisely to a low pass filter (a rectangular window, which corresponds to a sinc in frequency). What would be conceptually wrong is to just downsample by taking one of each 4 samples: that would definitely introduce aliasing.
By the way: practically any software that does some resampling (audio, image or whatever; example for the audio case: sox) takes this into account, and frequently lets you choose the underlying low-pass filter.

You need to apply a lowpass filter before you downsample the signal to avoid "aliasing". The cutoff frequency of the lowpass filter should be less than the nyquist frequency, which is half the sample frequency.

The "best" solution possible is indeed a DFT, discarding the top 3/4 of the frequencies, and performing an inverse DFT, with the domain restricted to the bottom 1/4th. Discarding the top 3/4ths is a low-pass filter in this case. Padding to a power of 2 number of samples will probably give you a speed benefit. Be aware of how your FFT package stores samples though. If it's a complex FFT (which is much easier to analyze, and generally has nicer properties), the frequencies will either go from -22 to 22, or 0 to 44. In the first case, you want the middle 1/4th. In the latter, the outermost 1/4th.
You can do an adequate job by averaging sample values together. The naïve way of grabbing samples four by four and doing an equal weighted average works, but isn't too great. Instead you'll want to use a "kernel" function that averages them together in a non-intuitive way.
Mathwise, discarding everything outside the low-frequency band is multiplication by a box function in frequency space. The (inverse) Fourier transform turns pointwise multiplication into a convolution of the (inverse) Fourier transforms of the functions, and vice-versa. So, if we want to work in the time domain, we need to perform a convolution with the (inverse) Fourier transform of box function. This turns out to be proportional to the "sinc" function (sin at)/at, where a is the width of the box in the frequency space. So at every 4th location (since you're downsampling by a factor of 4) you can add up the points near it, multiplied by sin (a dt) / a dt, where dt is the distance in time to that location. How nearby? Well, that depends on how good you want it to sound. It's common to ignore everything outside the first zero, for instance, or just take the number of points to be the ratio by which you're downsampling.
Finally there's the piss-poor (but fast) way of just discarding the majority of the samples, keeping just the zeroth, the fourth, and so on.
Honestly, if it fits in memory, I'd recommend just going the DFT route. If it doesn't use one of the software filter packages that others have recommended to construct the filter for you.

The process you're after called "Decimation".
There are 2 steps:
Applying Low Pass Filter on the data (In your case LPF with Cut Off at Pi / 4).
Downsampling (In you case taking 1 out of 4 samples).
There are many methods to design and apply the Low Pass Filter.
You may start here:
http://en.wikipedia.org/wiki/Filter_design

You could make use of libsamplerate to do the heavy lifting. Libsamplerate is a C API, and takes care of calculating the filter coefficients. You to select from different quality filters so that you can trade off quality for speed.
If you would prefer not to write any code, you could just use Audacity to do the sample rate conversion. It offers a powerful GUI, and makes use of libsamplerate for it's sample rate conversion.

I would try applying DFT, chopping 3/4 of the result and applying inverse DFT. I can't tell if it will sound good without actually trying tough.

I recently came across BruteFIR which may already do some of what you're interested in?

You have to apply low-pass filter (removing frequencies above 5500 Hz) and then apply decimation (leave every Nth sample, every 4th in your case).
For decimation, FIR, not IIR filters are usually employed, because they don't depend on previous outputs and therefore you don't have to calculate anything for discarded samples. IIRs, generally, depends on both inputs and outputs, so, unless a specific type of IIR is used, you'd have to calculate every output sample before discarding 3/4 of them.
Just googled an intro-level article on the subject: https://www.dspguru.com/dsp/faqs/multirate/decimation

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string