I'd like to build an audio visualizer display using LED strips, to be used at parties. Building the display and programming the rendering engine is fairly straightforward, but I don't have any experience in signal processing, aside from rendering PCM samples.
The primary feature I'd like to implement would be animation driven by audible frequency. To keep things super simple and get the hang of it, I'd like to start by simply rendering a color according to audible frequency of the input signal (e.g. the highest audible frequency would be rendered as white).
I understand that reading input samples as PCM gives me the amplitude of air pressure (intensity) with respect to time and that using a Fourier transform outputs the signal as intensity with respect to frequency. But from there I'm lost as to how to resolve the actual frequency.
Would the numeric frequency need to be resolved as the inverse of the Fourier transform (e.g. the intensity is the argument and the frequency is the result)?
I understand there are different types of Fourier transforms that are suitable for different purposes. Which is useful for such an application?
You can transform the samples from time domain to frequency domain using DFT or FFT. It outputs frequencies and their intensities. Actually you get a set of frequencies not just one. Based on that LED strips can be lit. See DFT spectrum tracer
"The frequency", as in a single numeric audio frequency spectrum value, does not exist for almost all sounds. That's why an FFT gives you all N/2 frequency bins of the full audio spectrum, up to half the sample rate, with a resolution determined by the length of the FFT.
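To make the bin idea concrete, here is a minimal sketch of how bin indices map to frequencies and how one might pick the strongest bin to drive a color. It assumes you already have the N/2 magnitudes from some FFT routine; the function names and the logarithmic gray-level mapping are hypothetical, chosen only for illustration.

```python
import math

def bin_frequency(k, sample_rate, n_fft):
    """Centre frequency in Hz of FFT bin k for an n_fft-point transform."""
    return k * sample_rate / n_fft

def dominant_frequency(magnitudes, sample_rate, n_fft):
    """Frequency of the strongest bin, skipping the DC bin at index 0."""
    peak = max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])
    return bin_frequency(peak, sample_rate, n_fft)

def frequency_to_gray(freq_hz, f_min=20.0, f_max=20000.0):
    """Map a frequency to a 0-255 gray level on a log scale (assumed mapping)."""
    clamped = min(max(freq_hz, f_min), f_max)
    t = math.log(clamped / f_min) / math.log(f_max / f_min)
    return round(255 * t)

# Eight made-up magnitudes standing in for the N/2 bins of a 16-point FFT.
mags = [0.1, 0.2, 0.1, 5.0, 0.3, 0.1, 0.1, 0.1]
freq = dominant_frequency(mags, sample_rate=16000, n_fft=16)
gray = frequency_to_gray(freq)
```

For a real display you would more likely light several LEDs from several bins rather than reduce the spectrum to one number, since most sounds have energy in many bins at once.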
Related
I have an FFT output from a microphone and I want to detect a specific animal's howl from that (it howls in a characteristic frequency spectrum). Is there any way to implement a pattern recognition algorithm in Arduino to do that?
I already have the FFT part of it working with 128 samples @ 2 kHz sampling rate.
lookup audio fingerprinting ... essentially you probe the frequency domain output from the FFT call and take a snapshot of the range of frequencies together with the magnitude of each freq then compare this between known animal signal and unknown signal and output a measurement of those differences.
Naturally this difference will approach zero when unknown signal is your actual known signal
Here is another layer : For better fidelity instead of performing a single FFT of the entire audio available, do many FFT calls each with a subset of the samples ... for each call slide this window of samples further into the audio clip ... lets say your audio clip is 2 seconds yet here you only ever send into your FFT call 200 milliseconds worth of samples this gives you at least 10 such FFT result sets instead of just one had you gulped the entire audio clip ... this gives you the notion of time specificity which is an additional dimension with which to derive a more lush data difference between known and unknown signal ... experiment to see if it helps to slide the window just a tad instead of lining up each window end to end
To be explicit you have a range of frequencies say spread across X axis then along Y axis you have magnitude values for each frequency at different points in time as plucked from your audio clips as you vary your sample window as per above paragraph ... so now you have a two dimensional grid of data points
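A rough sketch of the sliding-window grid and a difference measure, assuming plain Python on a desktop rather than Arduino code. The naive DFT stands in for whatever FFT routine you already have, and `fingerprint_distance` (mean absolute difference) is just one hypothetical way to score the difference between two grids.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the first N/2 bins (naive DFT, stand-in for an FFT)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def spectrogram(samples, frame_len, hop):
    """One row of magnitudes per window; hop < frame_len overlaps windows."""
    return [dft_magnitudes(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, hop)]

def fingerprint_distance(spec_a, spec_b):
    """Mean absolute difference between two equally sized grids."""
    total = sum(abs(a - b)
                for row_a, row_b in zip(spec_a, spec_b)
                for a, b in zip(row_a, row_b))
    count = sum(len(row) for row in spec_a)
    return total / count

def tone(freq_hz, sample_rate, count):
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(count)]

fs = 2000  # sampling rate from the question
known = spectrogram(tone(250, fs, 256), frame_len=64, hop=32)
same = spectrogram(tone(250, fs, 256), frame_len=64, hop=32)
other = spectrogram(tone(700, fs, 256), frame_len=64, hop=32)

d_same = fingerprint_distance(known, same)    # identical clips
d_other = fingerprint_distance(known, other)  # very different clips
```

Real howls will not line up in time or loudness, so in practice you would probably normalise each grid and try several time offsets before comparing.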
Again to beef up the confidence intervals you will want to perform all of above across several different audio clips of your known source animal howl against each of your unknown signals so now you have a three dimensional parameter landscape ... as you can see each additional dimension you can muster will give more traction hence more accurate results
Start with easily distinguished known audio against a very different unknown audio ... say a 50 Hz sin curve tone for known audio signal against a 8000 Hz sin wave for the unknown ... then try as your known a single strum of a guitar and use as unknown say a trumpet ... then progress to using actual audio clips
Audacity is an excellent free audio work horse of the industry - it easily plots a WAV file to show its time domain signal or FFT spectrogram ... Sonic Visualiser is also a top shelf tool to use
This is not a simple silver bullet however each layer you add to your solution can give you better results ... it is a process you are crafting not a single dimensional trigger to squeeze.
In the video The Sound of Hydrogen (original here), the sound
is created using the NIST Atomic Spectra Database and then importing this edited data into Mathematica to modulate a Sine Wave. I was wondering how he turned the data from the website into the values shown in the video (3:47 - top of the page) because it is nothing like what is initially seen on the website.
Short answer: It's different because in the tutorial the sampling rate is 8 kHz while it's probably higher in the original video.
Long answer:
I wish you'd asked this on http://physics.stackexchange.com or http://math.stackexchange.com instead so I could use formulae... Use the bookmarklet
javascript:(function(){function%20a(a){var%20b=a.createElement('script'),c;b.src='https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML.js',b.type='text/javascript',c='MathJax.Hub.Config({tex2jax:{inlineMath:[[\'$\',\'$\']],displayMath:[[\'\\\\[\',\'\\\\]\']],processEscapes:true}});MathJax.Hub.Startup.onload();',window.opera?b.innerHTML=c:b.text=c,a.getElementsByTagName('head')[0].appendChild(b)}function%20b(b){b.MathJax===undefined?a(b.document):b.MathJax.Hub.Queue(new%20b.Array('Typeset',b.MathJax.Hub))}var%20c=document.getElementsByTagName('iframe'),d,e;b(window);for(d=0;d<c.length;d++)e=c[d].contentWindow||c[d].contentDocument,e.document||(e=e.parentNode),b(e)})()
to render the formulae with MathJax:
First of all, note how the Rydberg formula provides the resonance frequencies of hydrogen as $\nu_{nm} = c R \left(\frac1{n^2}-\frac1{m^2}\right)$ where $c$ is the speed of light and $R$ the Rydberg constant. The highest frequency is $\nu_{1\infty}\approx 3000$ THz while for $n,m\to\infty$ there is basically no lower limit, though if you restrict yourself to the Lyman series ($n=1$) and the Balmer series ($n=2$), the lower limit is $\nu_{23}\approx 400$ THz. These are electromagnetic frequencies corresponding to light (not entirely in the visual spectrum (ranging from 430–790 THz), there's some IR and lots of UV in there which you cannot see). "minutephysics" now simply considers these frequencies as sound frequencies that are remapped to the human hearing range (ca 20-20000 Hz).
But as the video stated, not all these frequencies resonate with the same strength, and the data at http://nist.gov/pml/data/asd.cfm also includes the amplitudes. For the frequency $\nu_{nm}$ let's call the intensity $I_{nm}$ (intensity is amplitude squared, I wonder if the video treated that correctly). Then your signal is simply
$f(t) = \sum\limits_{n=1}^N \sum\limits_{m=n+1}^M I_{nm}\sin(\alpha(\nu_{nm})t+\phi_{nm})$
where $\alpha$ denotes the frequency rescaling (probably something linear like $\alpha(\nu) = (20 + (\nu-400\cdot10^{12})\cdot\frac{20000-20}{(3000-400)\cdot 10^{12}})$ Hz) and the optional phase $\phi_{nm}$ is probably equal to zero.
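As a sketch of how this could be computed, the snippet below evaluates the Rydberg formula and the linear rescaling guessed above, then sums the sines. The intensities are placeholders set to 1.0 (they are not NIST data), the phases are taken as zero, and the factor $2\pi$ is my assumption for turning a frequency in Hz into an angular frequency.

```python
import math

C = 299_792_458.0    # speed of light in m/s
R = 10_973_731.568   # Rydberg constant in 1/m

def rydberg_frequency(n, m):
    """Transition frequency nu_nm in Hz for levels n < m."""
    return C * R * (1.0 / n**2 - 1.0 / m**2)

def alpha(nu):
    """Linear rescaling: 400 THz -> 20 Hz, 3000 THz -> 20000 Hz."""
    return 20.0 + (nu - 400e12) * (20000.0 - 20.0) / ((3000.0 - 400.0) * 1e12)

def hydrogen_sample(t, lines):
    """f(t) as a sum of sines; lines is a list of (nu, intensity) pairs."""
    return sum(i * math.sin(2 * math.pi * alpha(nu) * t) for nu, i in lines)

# Lyman (n=1) and Balmer (n=2) lines up to m=6, placeholder intensity 1.0.
lines = [(rydberg_frequency(n, m), 1.0) for n in (1, 2) for m in range(n + 1, 7)]

sample_rate = 44100  # assumed; high enough for the rescaled frequencies
signal = [hydrogen_sample(k / sample_rate, lines) for k in range(64)]
```

With this mapping the Lyman lines land well above 4 kHz, so they cannot be represented at an 8 kHz sampling rate, which is consistent with the two versions sounding different.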
Why does it sound slightly different? Probably the actual video did use a higher sampling rate than the 8 kHz used in the tutorial video.
I have come across the following guidelines for getting the power spectrum of an audio signal several times:
collect N samples, where N is a power of 2
apply a suitable window function to the samples, e.g. Hanning
pass the windowed samples to an FFT routine - ideally you want a real-to-complex FFT but if all you have is a complex-to-complex FFT then pass 0 for all the imaginary input parts
calculate the squared magnitude of your FFT output bins (re * re + im * im)
(optional) calculate 10 * log10 of each magnitude squared output bin to get a magnitude value in dB
Now that you have your power spectrum you just need to identify the peak(s), which should be pretty straightforward if you have a reasonable S/N ratio. Note that frequency resolution improves with larger N. For the above example of 44.1 kHz sample rate and N = 32768 the frequency resolution of each bin is 44100 / 32768 = 1.35 Hz.
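The steps above can be sketched roughly as follows. This is only an illustration: the naive DFT loop stands in for a real FFT routine (it is far too slow for N = 32768), and the sample rate, N and test tone are made-up values.

```python
import cmath
import math

def power_spectrum_db(samples):
    """Hann-window the samples, transform, and return power in dB per bin."""
    n = len(samples)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    windowed = [s * w for s, w in zip(samples, window)]
    spectrum_db = []
    for k in range(n // 2):
        # Naive DFT of bin k; a real-to-complex FFT would replace this loop.
        x = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = x.real * x.real + x.imag * x.imag   # re * re + im * im
        spectrum_db.append(10 * math.log10(power + 1e-12))  # offset avoids log(0)
    return spectrum_db

fs = 8000   # assumed sample rate
n = 128     # power of 2
samples = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(n)]

db = power_spectrum_db(samples)
peak_bin = max(range(len(db)), key=db.__getitem__)
peak_hz = peak_bin * fs / n   # each bin is fs / n = 62.5 Hz wide here
```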
But... why do I need to apply a window function to the samples? What does that really mean?
What about the power spectrum, is it the power of each frequency in the range of sample rate? (example: windows media player visualizer of sound?)
Most real world audio signals are non-periodic, meaning that real audio signals do not generally repeat exactly, over any given time span.
However, the math of the Fourier transform assumes that the signal being Fourier transformed is periodic over the time span in question.
This mismatch between the Fourier assumption of periodicity, and the real world fact that audio signals are generally non-periodic, leads to errors in the transform.
These errors are called "spectral leakage", and generally manifest as a wrongful distribution of energy across the power spectrum of the signal.
The plot below shows a closeup of the power spectrum of an acoustic guitar playing the A4 note. The spectrum was calculated with the FFT (Fast Fourier Transform), but the signal was not windowed prior to the FFT.
Notice the distribution of energy above the -60 dB line, and the three distinct peaks at roughly 440 Hz, 880 Hz, and 1320 Hz. This particular distribution of energy contains "spectral leakage" errors.
To somewhat mitigate the "spectral leakage" errors, you can pre-multiply the signal by a window function designed specifically for that purpose, like for example the Hann window function.
The plot below shows the Hann window function in the time-domain. Notice how the tails of the function go smoothly to zero, while the center portion of the function tends smoothly towards the value 1.
Now let's apply the Hann window to the guitar's audio data, and then FFT the resulting signal.
The plot below shows a closeup of the power spectrum of the same signal (an acoustic guitar playing the A4 note), but this time the signal was pre-multiplied by the Hann window function prior to the FFT.
Notice how the distribution of energy above the -60 dB line has changed significantly, and how the three distinct peaks have changed shape and height. This particular distribution of spectral energy contains fewer "spectral leakage" errors.
The acoustic guitar's A4 note used for this analysis was sampled at 44.1 kHz with a high quality microphone under studio conditions. It contains essentially zero background noise, no other instruments or voices, and no post processing.
References:
Real audio signal data, Hann window function, plots, FFT, and spectral analysis were done here:
Fast Fourier Transform, spectral analysis, Hann window function, audio data
As @cyco130 says, your samples are already windowed by a rectangular function. Since a Fourier Transform assumes periodicity, any discontinuity between the last sample and the repeated first sample will cause artefacts in the spectrum (e.g. "smearing" of the peaks). This is known as spectral leakage. To reduce the effect of this we apply a tapered window function such as a Hann window which smooths out any such discontinuity and thereby reduces artefacts in the spectrum.
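A small numeric sketch of the effect, assuming a sine that completes 10.5 cycles in a 64-sample frame (so the frame ends on a discontinuity). It compares how much power ends up in bins far away from the tone with and without a Hann window; the bin range 20 to 31 is an arbitrary choice for "far away".

```python
import cmath
import math

def dft_power(frame):
    """Squared magnitude of the first N/2 bins (naive DFT, stand-in for an FFT)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2)]

n = 64
# 10.5 cycles per frame: not periodic in the frame, so leakage is expected.
signal = [math.sin(2 * math.pi * 10.5 * t / n) for t in range(n)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]

rect_power = dft_power(signal)                                   # no window
hann_power = dft_power([s * w for s, w in zip(signal, hann)])    # Hann window

far_bins = range(20, 32)
rect_leak = sum(rect_power[k] for k in far_bins)
hann_leak = sum(hann_power[k] for k in far_bins)
```

Here `hann_leak` comes out far smaller than `rect_leak`, at the cost of a wider main peak around bins 10 and 11.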
Note that a non-rectangular window has both benefits and costs. Multiplying by a window in the time domain is equivalent to convolving the window's transform with the signal's spectrum. A typical window, such as a von Hann window, will reduce the "leakage" from any non-periodic spectral content, which will result in a less noisy looking spectrum; but, in return, the convolution will "blur" any exactly or close to periodic spectral peaks across a few adjacent bins, e.g. all the spectral peaks will become rounder looking, which may reduce frequency estimation accuracy. If you know, a priori, that there is no non-periodic content (e.g. data from some rotationally synchronous sampling system), a non-rectangular window could actually make the FFT look worse.
A non-rectangular window is also an informationally lossy process. A significant amount of spectral information near the edges of the window will be thrown away, assuming finite precision arithmetic. So non-rectangular windows are best used with overlapping window processing, and/or when one can assume that the spectrum of interest is either stationary across the entire window width, or centered in the window.
If you're not applying any windowing function, you're actually applying a rectangular windowing function. Different windowing functions have different characteristics; which one to use depends on what exactly you want.
I have a working tone detector which uses an FFT to determine whether a tone (or tone pair) of a particular frequency is present in an audio stream (if sufficiently above the noise floor). What method could I use to more precisely locate the onset time and duration of that tone? I am looking for something far more precise than the FFT frame duration (about 50 ms). The tone is assumed to be much longer than an FFT frame.
Sounds like DTMF detection. The standard technique for this is the Goertzel algorithm. You need one Goertzel detector for each frequency of interest, so you need to know the frequencies a priori.
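For reference, a minimal sketch of a single Goertzel detector. The block length of 205 samples at 8 kHz and the 770 Hz / 1209 Hz frequencies are common DTMF choices used here only as an example; this variant evaluates the spectrum at the exact target frequency rather than rounding to the nearest bin.

```python
import math

def goertzel_power(samples, target_hz, sample_rate):
    """Squared magnitude of the block's spectrum at target_hz."""
    omega = 2.0 * math.pi * target_hz / sample_rate
    coeff = 2.0 * math.cos(omega)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        # Second-order recursion, one multiply per sample.
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

fs = 8000
block = [math.sin(2.0 * math.pi * 770 * n / fs) for n in range(205)]

on_target = goertzel_power(block, 770, fs)    # tone present at this frequency
off_target = goertzel_power(block, 1209, fs)  # nothing at this frequency
```

Shorter blocks give finer time resolution but a wider frequency response, so the block length is the main thing to tune for onset timing.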
If the particular frequency is known ahead of time, you could design a bandpass filter centered around that frequency and then just use an energy detector on the output. You'd have to account for the bulk delay through the filter, and probably also the rise and fall times of the steady-state response.
If you're using the FFT output to actually detect the tone, and you have sufficient memory to keep the recent past samples, you could get a rough estimate of the onset from the FFT, go back in time a few hundred milliseconds before, and start mixing the samples by a sinusoid at the detected frequency. Then run the mixed samples through a low-pass filter. Your tone detection, mixer, and LPF frequency resolutions/bandwidths will have to match, and again you'll need to consider the LPF characteristics.
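A rough sketch of the mix-and-filter idea, assuming the tone frequency is already known from the FFT stage. A moving average serves as the low-pass filter here, which is a simplification; its half-length is subtracted as the bulk delay. The threshold of 0.25 assumes a unit-amplitude tone (whose envelope settles at 0.5) and would need scaling for real signals.

```python
import cmath
import math

def tone_envelope(samples, freq_hz, sample_rate, lpf_len):
    """Mix down by freq_hz, then smooth with a moving average of lpf_len samples."""
    envelope, history, running = [], [], 0j
    for n, x in enumerate(samples):
        mixed = x * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate)
        history.append(mixed)
        running += mixed
        if len(history) > lpf_len:
            running -= history[-lpf_len - 1]   # sample leaving the window
        envelope.append(abs(running) / lpf_len)
    return envelope

def onset_and_duration(envelope, threshold, lpf_len):
    """Onset index and duration in samples, corrected for the filter delay."""
    above = [i for i, e in enumerate(envelope) if e >= threshold]
    if not above:
        return None
    delay = lpf_len // 2
    onset = above[0] - delay
    end = above[-1] + 1 - delay
    return onset, end - onset

fs = 8000
# Test signal: silence, then a 1 kHz tone from sample 800 to 2399, then silence.
samples = [math.sin(2 * math.pi * 1000 * n / fs) if 800 <= n < 2400 else 0.0
           for n in range(3200)]

env = tone_envelope(samples, 1000, fs, lpf_len=40)   # 40 samples = 5 ms
onset, duration = onset_and_duration(env, threshold=0.25, lpf_len=40)
```

With noise present the threshold crossing will jitter, so the achievable precision depends on the signal-to-noise ratio as well as the filter length.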
If we consider computer graphics to be the art of image synthesis where the basic unit is a pixel.
What is the basic unit of sound synthesis?
[This relates to programming as I want to generate this via a computer program.]
Thanks!
The basic unit is a sample
In a WAVE file, the sample is just an integer specifying where to move the speaker cone to.
The sample rate determines how often a new sample is fed to the speakers (I'm not entirely sure how this part works, but it does get converted to an analog signal first). The samples are typically laid out in the file one right after another.
When you plot all the samples with x-axis being time and y-axis being sample_value, you can see the waveform.
In a wave file, samples can (theoretically) be any bit-size from 0-65535, which remains constant throughout the wave file. But typically 16 or 24 bits are used.
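As an illustration, here is a minimal sketch that lays out 16-bit samples one after another in a WAVE container using Python's standard `wave` module. The function name, the 440 Hz tone and the in-memory buffer are arbitrary choices for the example.

```python
import io
import math
import struct
import wave

def sine_wav_bytes(freq_hz, seconds, sample_rate=44100, amplitude=0.5):
    """Return the bytes of a mono 16-bit WAVE file containing a sine tone."""
    n_frames = int(round(seconds * sample_rate))
    frames = bytearray()
    for n in range(n_frames):
        value = amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
        # Each sample becomes one little-endian signed 16-bit integer.
        frames += struct.pack("<h", int(round(value * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buffer.getvalue()

data = sine_wav_bytes(440, 0.01)
```

Writing `data` to a file with a `.wav` extension should give something any audio player can open.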
Computer graphics can also have vector shapes as basic units, not just pixels. Generally, vector graphics are generated via computer tools while captured data tends to appear as a grid of pixels (corresponding to an array of sensors in a camera or other capture device). Obviously there is considerable crossover between those classifications.
Similarly, there are sampled (such as .WAV) and generative (such as .MIDI) forms of computer audio. In the sampled case, the smallest unit is a single sample. Just like an array of pixels in the brightness, x- and y-dimensions come together to form an image, an array of samples in the loudness and time dimensions come together to form a sound. In the generative case, it will be something more like a single tone rendered in a particular voice just like vector graphics have paths drawn with particular textures.
A pixel can have a value and be encoded in digital bitmap samples. The same properties apply to sound and digital audio samples.
A pixel is a physical device that can only render the amplitudes of 3 frequencies of light (Red, Green, Blue) at a time. A speaker is a physical device that can render the amplitudes of a wide range of frequencies (~40,000) at a time. The bit resolution of a sample (number of bits used to store the value of a sample) mainly determines how many colors/tones can be rendered - the fidelity of the physical playback device.
Also, as patterns of pixels can be encoded or compressed, most patterns of sound samples are also encoded or compressed (or both).
The fundamental unit of signal processing (of which audio is a special case) would be the sample.
The frequency at which you need to sample a signal depends on the maximum frequency present in the waveform. The sampling theorem states that it is sufficient to sample at a rate greater than twice the maximum frequency present in the signal.
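A quick sketch of what goes wrong below that rate, assuming an 8 kHz sample rate (so anything above 4 kHz is too high). A 5 kHz tone produces exactly the same sample values as a 3 kHz tone with inverted sign, so after sampling the two cannot be told apart.

```python
import math

def sample_tone(freq_hz, sample_rate, count):
    """Sample a unit sine of the given frequency."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(count)]

fs = 8000
above_limit = sample_tone(5000, fs, 32)   # above fs / 2
alias = sample_tone(3000, fs, 32)         # fs - 5000 = 3000 Hz

# The 5 kHz samples equal the negated 3 kHz samples, up to rounding error.
max_diff = max(abs(a + b) for a, b in zip(above_limit, alias))
```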
http://en.wikipedia.org/wiki/Sampling_theorem
The human ear is sensitive to sounds up to around 20 kHz (the upper limit lowers with age). This is why music on CD is sampled at 44.1 kHz.
It is often more useful to think of music as being comprised of individual frequencies.
http://www.phys.unsw.edu.au/jw/sound.spectrum.html
Most sound analysis and creation is based on this idea.
Related concepts:
Psychoacoustics: Human perception of sound. Relates to modern sound compression techniques such as mp3.
Fourier series: How complex waveforms are composed of individual frequencies.
I would say the basic unit of sound synthesis is the sine wave. But your definition of synthesis is perhaps different from what audio people would refer to as sound synthesis. Sound synthesis is the creation of sound using the fundamental components of sound.
With sine waves, we can synthesise sounds using many techniques such as subtractive synthesis, additive synthesis or FM synthesis.
Fourier theory states that every sound is a summation of sine waves of differing phases, frequencies and amplitudes.
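A small sketch of that idea in the additive direction: summing odd harmonics with amplitudes 4/(πh) approaches a square wave. The 100 Hz fundamental, 8 kHz sample rate and the cut-off at the 39th harmonic are arbitrary values for the example.

```python
import math

def additive(partials, sample_rate, count):
    """Sum of sines; partials is a list of (freq_hz, amplitude, phase) tuples."""
    return [sum(a * math.sin(2 * math.pi * f * n / sample_rate + p)
                for f, a, p in partials)
            for n in range(count)]

fundamental = 100
# Odd harmonics 1, 3, 5, ... 39, all below half the sample rate.
partials = [(fundamental * h, 4 / (math.pi * h), 0.0) for h in range(1, 40, 2)]

# One period of 100 Hz at 8 kHz is 80 samples.
wave_samples = additive(partials, 8000, 80)
```

Sample 20 (a quarter period in) sits near +1 and sample 60 near -1, with small ripples from the truncated series.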
OK, so how do we represent a sine wave on a computer? Well, a sine wave will be generated using a buffer (array) of 'samples' that have been generated by a function or read from a table. The same technique applies to any sound captured on a computer.
A 'sample' is typically represented as a number between -1 and 1 that directly correlates to the amplitude of a sound at a given moment in time. A typical sound recorded at 16 bit depth would have 65536 (2^16) possible amplitude values. When being recorded, a sample will typically be captured 44,100 times per second of sound. This is called the sampling frequency, or simply the sample rate.
Upon playback from your computer, each sample will pass through a Digital to Analogue converter and generate a vibration on your PC speaker, which will in turn cause your ear to perceive the recorded sound.
Sound can be expressed as several different units, but the most common in synthesis/computer music is decibels (dB), which are a relative logarithmic measure of amplitude. Specifically they are normally relative to the maximum amplitude of the audio system.
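For example, a minimal sketch of converting a linear amplitude to decibels relative to full scale, assuming samples are floats where 1.0 is the maximum amplitude of the system (the function name is made up):

```python
import math

def amplitude_to_dbfs(amplitude, full_scale=1.0):
    """Decibels relative to full scale; 0 dB is the loudest possible value."""
    if amplitude == 0:
        return float("-inf")   # silence has no finite dB value
    return 20 * math.log10(abs(amplitude) / full_scale)

full = amplitude_to_dbfs(1.0)   # 0 dB
half = amplitude_to_dbfs(0.5)   # about -6 dB
```

Halving the amplitude always costs about 6 dB, which is why the scale is convenient for faders and meters.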
When measuring sound in "real life", the units are normally A-weighted Decibels or dB(A).
The frequency of a sound (i.e. its pitch) is how quickly its amplitude oscillates over time, or in the digital world, over samples. The number of samples per unit of real time is called the sampling rate; conventional hi-fi systems have sampling rates of 44.1 kHz (44,100 samples per second) and synthesis/recording software usually supports up to 96 kHz.
Every sound in the digital domain can be represented as a waveform with the X-axis representing the time (or sample number) and the Y-axis representing the amplitude.
Frequency and amplitude of the wave are what make up sound.
That is for a tone.
Music or for that matter most noise is a composite of multiple simultaneous sound waves superimposed on one another.
The unit for amplitude is the Bel. (We use tenths of a Bel, hence the term decibel.)
The unit for frequency is the Hertz.
That being said synthesis of music is a large field.
Bitmapped graphics are based on sampling the amplitude of light in a 2D space, where each sample is digitized to a given bit depth and often converted to a logarithmic representation at a different bit depth. The samples are always positive, since you can't be darker than pure black. Each of these samples is called a pixel.
Sound recording is most often based on sampling the magnitude of sound pressure at a microphone, where the samples are taken at constant time intervals. These samples can be positive or negative with respect to perfect silence. Most often these samples are not converted to a logarithm, even though sound is perceived in a logarithmic fashion just as light is. There is no special term to refer to these samples as there is with pixels.
The Bels and Decibels mentioned by others are useful in the context of measuring peak or average sound levels. They are not used to describe the individual sound samples.
You might also find it useful to know how sound file formats compare to image file formats. WAVE is an uncompressed format specific to Windows and is analogous to BMP. MP3 is a lossy compression analogous to JPEG. FLAC is a lossless compression analogous to 24-bit PNG.
If computer graphics are colored dots in 2 dimensional space representing a 3 dimensional space, then sound synthesis is amplitude values regularly partitioned in time representing musical events.
If you want your result to sound like music (the kind of music most people like at least), then you are either going to use some standard synthesis techniques, or literally waste decades of your life reinventing them from scratch.
The most basic techniques are additive synthesis, in which the individual elements are the frequencies, amplitudes, and phases of sine oscillators; subtractive synthesis, where you work with filter coefficients and a complex input waveform; frequency modulation synthesis, where you work with modulation depths and rates of stages of modulation; granular synthesis where short (hundredths to tenths of a second long) enveloped pieces of a recorded sound or an artificial waveform are combined in immense numbers. Each of these in practice uses parameters that evolve over the course of a note, and often you will mix elements of various techniques into a larger instrument.
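To give a flavour of one of these, here is a minimal sketch of two-operator frequency modulation, where a modulator sine varies the phase of a carrier sine. The parameter names and the 440 Hz / 110 Hz / index 2.0 values are illustrative only; in a real instrument the index would evolve over the course of the note.

```python
import math

def fm_sample(n, sample_rate, carrier_hz, modulator_hz, index):
    """One sample of simple FM: index controls the modulation depth."""
    t = n / sample_rate
    modulation = index * math.sin(2 * math.pi * modulator_hz * t)
    return math.sin(2 * math.pi * carrier_hz * t + modulation)

fs = 44100
fm_tone = [fm_sample(n, fs, 440, 110, 2.0) for n in range(512)]
plain = [fm_sample(n, fs, 440, 110, 0.0) for n in range(512)]   # index 0: pure sine
```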
I recommend this book, though it doesn't have the math for many concepts it at least lays the ground for the concepts used, and gives a nice overview of the techniques.
You wouldn't waste your time going sample by sample to do music in practice any more than you would waste your time going pixel by pixel to render 3d (in other words yeah go sample by sample if making a tool for other people to make music with, but that is way too low a level if you are interested in the task of making music).
Probably the envelope. A tone/note has a shape described by: attack decay sustain release
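A minimal sketch of such an envelope as a piecewise-linear gain curve. It assumes the note is held at least as long as attack plus decay; the function name and the example timings are made up.

```python
def adsr(t, attack, decay, sustain_level, release, note_length):
    """Envelope gain (0..1) at time t seconds; note_length is how long the key is held."""
    if t < 0:
        return 0.0
    if t < attack:                       # ramp up from 0 to 1
        return t / attack
    if t < attack + decay:               # fall from 1 to the sustain level
        return 1.0 - (1.0 - sustain_level) * (t - attack) / decay
    if t < note_length:                  # hold while the key is down
        return sustain_level
    if t < note_length + release:        # fade out after key release
        return sustain_level * (1.0 - (t - note_length) / release)
    return 0.0

# Gain sampled every 50 ms for a 1 s note with 0.1 s attack, 0.1 s decay, 0.2 s release.
gains = [adsr(i * 0.05, 0.1, 0.1, 0.5, 0.2, 1.0) for i in range(30)]
```

Multiplying each sample of a tone by the gain at its time position shapes the note.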
The byte, or word, depending on the bit-depth of the sound.