I need to create software that can capture sound (from a NOAA satellite with an RTL-SDR). The problem is not capturing the sound; the problem is how to convert the audio, or waves, into an image. I have read about many things (the Fast Fourier Transform, the Hilbert Transform, etc.) but I don't know how.
If you can give me an idea it would be fantastic. Thank you!
Over the past year I have been writing code which makes FFT calls and have amassed 15 pages of notes, so the topic is vast; however, I can boil it down.
Open up your WAV file ... parse the 44-byte header and note the given bit depth and endianness attributes ... then read across the payload, which is everything after that header ... understand the notions of bit depth and endianness ... typically a WAV file has a bit depth of 16 bits, so each point on the audio curve is stored across two bytes ... typically a WAV file is little endian, not big endian ... knowing what that means, you take the next two bytes, bit shift the second (more significant) byte one byte to the left (if little endian), then bit OR that pair of bytes into an integer ... for 16-bit PCM that integer is signed and varies from -32768 to +32767 (8-bit WAV samples are unsigned, 0 to 255) ... convert that int into its floating point equivalent so your audio curve points vary from -1 to +1 ... do that conversion for each pair of bytes, which corresponds to each sample of your payload buffer
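The byte-level conversion described above can be sketched in Python; this assumes signed 16-bit little-endian PCM (the most common WAV payload) and skips real header parsing:

```python
def pcm16le_to_floats(payload: bytes) -> list[float]:
    """Convert raw signed 16-bit little-endian PCM bytes to floats in -1..+1."""
    samples = []
    for i in range(0, len(payload) - 1, 2):
        lo = payload[i]           # little endian: least significant byte first
        hi = payload[i + 1]       # most significant byte second
        value = (hi << 8) | lo    # bit shift one byte to the left, then bit OR
        if value >= 32768:        # 16-bit PCM is signed two's complement
            value -= 65536
        samples.append(value / 32768.0)
    return samples

# Silence (0x0000) followed by the largest positive sample (0x7FFF):
print(pcm16le_to_floats(b"\x00\x00\xff\x7f"))
```

A real parser would read the header's format fields first and branch on bit depth rather than hard-coding 16 bits.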
Once you have the WAV audio curve as a buffer of floats, which is called raw audio or PCM audio, perform your FFT API call ... all languages have such libraries ... the output of the FFT call will be a set of complex numbers ... pay attention to the notion of the Nyquist limit ... this will influence how you make use of the output of your FFT call
Now you have a collection of complex numbers ... the index from 0 to N of that collection corresponds to frequency bins ... the size of your PCM buffer determines how granular your frequency bins are: bin width = sample_rate / number_of_samples ... in general, the more samples in the PCM buffer you send to the FFT API call, the finer the granularity of the output frequency bins ... essentially this means that as you walk across this collection of complex numbers, each index increments the frequency assigned to that index
To visualize this just feed this into a 2D plot where X axis is frequency and Y axis is magnitude ... calculate this magnitude for each complex number using
curr_mag = 2.0 * sqrt(curr_real * curr_real + curr_imag * curr_imag) / number_of_samples
For simplicity we will sweep under the carpet the phase shift information available to you in your complex number buffer
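Putting the FFT and magnitude steps together, here is a minimal self-contained sketch; a naive DFT stands in for a real FFT library call (e.g. numpy.fft.rfft), and the toy 64 Hz sample rate is just to keep the numbers small:

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) DFT; real code would use an FFT library instead."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

sample_rate = 64
n = 64
# One second of a pure 8 Hz tone at a toy 64 Hz sample rate
audio = [math.sin(2 * math.pi * 8 * t / sample_rate) for t in range(n)]

spectrum = dft(audio)
# Only bins 0 .. n//2 are usable (Nyquist limit); magnitude per the formula above
for k in range(n // 2 + 1):
    mag = 2.0 * abs(spectrum[k]) / n
    if mag > 0.01:
        print(f"{k * sample_rate / n:.0f} Hz -> magnitude {mag:.2f}")
```

Only the 8 Hz bin should report a significant magnitude; every other bin stays near zero.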
This only scratches the surface of what you need to master to properly render a WAV file into a 2D plot of its frequency domain representation ... there are libraries which perform parts or all of this however now you can appreciate some of the magic involved when the rubber hits the road
A great explanation of the trade-offs between frequency resolution and the number of audio samples fed into your FFT call: https://electronics.stackexchange.com/questions/12407/what-is-the-relation-between-fft-length-and-frequency-resolution
Do yourself a favor and check out https://www.sonicvisualiser.org/ which is one of many audio workstations that can perform what I described above. Just hit menu File -> Open -> choose a local WAV file -> Layer -> Add Spectrogram ... and it will render the visual representation of the Fourier transform of your input audio file.
Related
I have an FFT output from a microphone and I want to detect a specific animal's howl from that (it howls in a characteristic frequency spectrum). Is there any way to implement a pattern recognition algorithm in Arduino to do that?
I already have the FFT part of it working with 128 samples at a 2 kHz sampling rate.
Look up audio fingerprinting ... essentially you probe the frequency domain output of the FFT call and take a snapshot of the range of frequencies together with the magnitude of each frequency, then compare this between the known animal signal and the unknown signal, and output a measurement of those differences.
Naturally this difference will approach zero when the unknown signal is in fact your known signal
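A toy version of that comparison, with made-up four-bin spectra and a Euclidean distance as one possible difference measurement:

```python
import math

def spectral_distance(known_mags, unknown_mags):
    """Euclidean distance between two equal-length magnitude snapshots;
    approaches zero as the unknown signal approaches the known one."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(known_mags, unknown_mags)))

known = [0.0, 0.9, 0.1, 0.0]     # pretend fingerprint of the known howl
match = [0.0, 0.9, 0.1, 0.0]     # unknown that is actually the same howl
other = [0.5, 0.0, 0.0, 0.7]     # unknown that is something else entirely

print(spectral_distance(known, match))  # 0.0
print(spectral_distance(known, other))
```

On an Arduino you would do the same thing with fixed-point arithmetic and a small number of bins to fit in memory.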
Here is another layer: for better fidelity, instead of performing a single FFT of the entire audio available, do many FFT calls, each with a subset of the samples ... for each call, slide this window of samples further into the audio clip ... let's say your audio clip is 2 seconds yet you only ever send 200 milliseconds worth of samples into each FFT call; this gives you at least 10 such FFT result sets instead of the single one you would get had you gulped the entire audio clip ... this gives you the notion of time specificity, an additional dimension with which to derive a richer data difference between known and unknown signals ... experiment to see if it helps to slide the window just a tad instead of lining up each window end to end
To be explicit: you have a range of frequencies spread across the X axis, and along the Y axis you have magnitude values for each frequency at different points in time, as plucked from your audio clips while you vary your sample window as per the above paragraph ... so now you have a two-dimensional grid of data points
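The sliding-window grid from the two paragraphs above might be sketched as follows; the clip, window size, and hop are toy values, and a naive DFT again stands in for an FFT library:

```python
import cmath
import math

def dft_mags(frame):
    """Magnitudes for bins 0 .. len(frame)//2 of a naive DFT of one window."""
    n = len(frame)
    return [2.0 * abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                          for t in range(n))) / n
            for k in range(n // 2 + 1)]

sample_rate = 32
clip = [math.sin(2 * math.pi * 4 * t / sample_rate) for t in range(64)]  # 2 s clip

window_size = 16  # analogous to the 200 ms window in the answer
hop = 8           # slide by half a window so successive windows overlap

grid = []  # rows = points in time, columns = frequency bins
for start in range(0, len(clip) - window_size + 1, hop):
    grid.append(dft_mags(clip[start:start + window_size]))

print(f"{len(grid)} time slices x {len(grid[0])} frequency bins")
```

Each row of `grid` is one snapshot in time; comparing two such grids cell by cell gives the time-aware difference described above.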
Again, to beef up the confidence intervals, you will want to perform all of the above across several different audio clips of your known source animal howl against each of your unknown signals, so now you have a three-dimensional parameter landscape ... as you can see, each additional dimension you can muster will give you more traction and hence more accurate results
Start with easily distinguished known audio against a very different unknown audio ... say a 50 Hz sine tone for the known audio signal against an 8000 Hz sine wave for the unknown ... then try a single strum of a guitar as your known and, say, a trumpet as your unknown ... then progress to using actual audio clips
Audacity is an excellent free audio workhorse of the industry - it easily plots a WAV file to show its time domain signal or its FFT spectrogram ... Sonic Visualiser is also a top-shelf tool to use
This is not a simple silver bullet however each layer you add to your solution can give you better results ... it is a process you are crafting not a single dimensional trigger to squeeze.
I want to know what a single sample of audio data (uncompressed PCM) represents.
It is a number, but what exactly is that number and how come it can be converted back to audio?
For example if it is a 4-bit sample, does 0 represent absolute silence and 15 represent max volume?
If it is volume, what frequency are we talking about? How is the information about the frequency stored?
In songs we can hear various instruments (frequencies) at the same time, meaning each frequency is somehow stored in a single sample. How is that done?
Audio is just a curve which wobbles up and down as time goes left to right. At a given point in time, a sample is a measure of the curve height. Silence is when the curve does not wobble ... it just goes flatline at value zero, with a sample value of 0 (more accurately, the middle value of its range from max to min) ... when the curve reaches its maximum height, up or down, that stretch of audio is the loudest possible
The notion of normalization is important ... the absolute range of curve values (maximum up or down) is arbitrary ... it could be anything ... let's say the max is 15 and the minimum is 0 ... remember silence is no wobble, so in the middle of that up/down range silence would be about 7
Curves can be encoded into any number of bits ... this roughly maps to how many horizontal lines you dice the curve into ... more lines means more bits, and so greater accuracy in the value of your sample of the curve height
A sin or cos curve is considered a pure tone ... Joseph Fourier proved that an arbitrary curve (audio or otherwise) can be stored in the form of a set of sine curves with (A) various volumes (max up/down), (B) various frequencies, and (C) various phase offsets ... interestingly, this transformation works in either direction: from a curve of arbitrary shape into a set of the above (A/B/C), or from a set of (A/B/C) back into a synthesized curve of arbitrary shape (this is how audio synthesizers work)
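The (A/B/C) synthesis direction can be sketched directly; the two components below are arbitrary illustrative choices:

```python
import math

# Each component: (A) amplitude, (B) frequency in Hz, (C) phase offset in radians
components = [(1.0, 3.0, 0.0),
              (0.5, 7.0, math.pi / 4)]

sample_rate = 100
curve = [sum(amp * math.sin(2 * math.pi * freq * t / sample_rate + phase)
             for amp, freq, phase in components)
         for t in range(sample_rate)]  # one second's worth of samples

# The sum is no longer a pure tone, yet it is still one single curve
print(len(curve), min(curve), max(curve))
```

Running a Fourier transform over `curve` would recover the two components, which is the other direction of the same transformation.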
Information about frequency storage is baked into the curve shape ... it's all about how often the curve wobbles up and down ... lazy wobbles taking a long time to cross from below to above the middle line are low frequency ... a stretch of tightly spaced squiggles implies a high frequency squawk
When a microphone records multiple people all talking at once, or various instruments each emitting their own sounds, we have many simultaneous frequencies, yet the recording somehow just works - how? Think of what happens inside the microphone (or to your flat eardrum) ... its diaphragm can be considered a flat 2D surface which can only get sloshed up or down, period ... it only ever moves back and forth ... this is an arbitrary curve ... one curve which at any point in time has a single value for its height as it varies between max and min
I'm following the Python Challenge riddles, and I now need to analyse a WAV file. I've learned there is a Python module that reads the frames, and that these frames are 16-bit or 8-bit.
What I don't understand is what these bits represent. Are these values directly transformed into a voltage applied to the speakers (say, via some scaling factor)?
The bits represent the voltage level of an electrical waveform at a specific moment in time.
To convert the electrical representation of a sound wave (an analog signal) into digital data, you sample the waveform at regular intervals.
Picture the samples as dots drawn on the waveform: each dot is the value of a four-bit number representing the height of the analog signal at that point in time (the X axis being time, and the Y axis being voltage).
In .WAV files, these points are represented by 8-bit numbers (256 different possible values) or 16-bit numbers (65536 different possible values). The more bits you have in each number, the greater the accuracy of your digital sampling.
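As a toy illustration of those sample values, a hypothetical 4-bit quantizer (matching the four-bit example above) mapping a voltage level to one of 16 codes:

```python
def quantize_4bit(level):
    """Map an analog level in -1.0 .. +1.0 to a 4-bit code 0 .. 15."""
    return max(0, min(15, round((level + 1.0) / 2.0 * 15)))

# Lowest voltage, mid-scale, and highest voltage:
print([quantize_4bit(v) for v in (-1.0, 0.0, 1.0)])  # [0, 8, 15]

# More bits per sample means more possible values, hence finer accuracy
print(2 ** 8, 2 ** 16)  # 256 65536
```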
WAV files can actually contain all sorts of things, but it is most typically linear pulse-code modulation (LPCM). Each frame contains a sample for each channel. If you're dealing with a mono file, then each frame is a single sample. The sample rate specifies how many samples per second there are per channel. CD-quality audio is 16-bit samples taken 44,100 times per second.
These samples are actually measuring the pressure level for that point in time. Imagine a speaker compressing air in front of it to create sound, vibrating back and forth. For this example, you can equate the sample level to the position of the speaker cone.
In my application I'm using the sound library Beads (this question isn't specifically about that library).
In the library there's a class WavePlayer. It takes a Buffer, and produces a sound wave by iterating over the Buffer.
Buffers simply wrap a float[].
For example, here's a beginning of a buffer:
0.0 0.0015339801 0.0030679568 0.004601926 0.0061358847 0.007669829 0.009203754 0.010737659 0.012271538 0.0138053885 0.015339206 0.016872987 0.01840673 0.019940428 0.02147408 ...
Its size is 4096 float values.
Iterating over it with a WavePlayer creates a smooth "sine wave" sound. (This buffer is actually a ready-made 'preset' in the Buffer class, i.e. Buffer.SINE).
My question is:
What kind of data does a buffer like this represent? What kind of information does it contain that allows one to iterate over it and produce an audio wave?
Read this post: What's the actual data in a WAV file?
Sound is just a curve. You can represent this curve using integers or floats.
There are two important aspects: bit depth and sample rate. First let's discuss bit depth. Each number in your list (int/float) represents the height of the sound curve at a given point in time. For simplicity, when using floats the values typically vary from -1.0 to +1.0, whereas integers may vary from, say, 0 to 2^16 - 1. Importantly, each of these numbers must be stored into a sound file or an audio buffer in memory - the resolution/fidelity you choose to represent each point of this curve influences the audio quality and the resultant sound file size. A low-fidelity recording may use 8 bits of information per curve height measurement. As you climb the fidelity spectrum, 16 bits, 24 bits, ... are dedicated to storing each curve height measurement. More bits equates to more significant digits for floats, or a broader range for integers (16 bits gives you 2^16 integers, 0 to 65535, to represent the height of any given curve point).
Now to the second aspect, sample rate. As you capture/synthesize sound, in addition to measuring the curve height you must decide how often you measure (sample) it. Typical CD quality records (samples) the curve height 44100 times per second, so the sample rate is 44.1 kHz. Lower fidelity would sample less often; ultra fidelity would sample at, say, 96 kHz or more. So the combination of curve height measurement fidelity (bit depth) coupled with how often you perform this measurement (sample rate) together defines the quality of sound synthesis/recording.
As with many things, these two attributes should be kept in balance ... if you change one you should change the other ... so if you lower the sample rate you are reducing the information load and thus lowering the audio fidelity ... once you have done this you can then lower the bit depth as well without further compromising fidelity
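Tying this back to the question: the quoted buffer values are consistent with a single-cycle sine table, which the sketch below regenerates (assuming Buffer.SINE holds one full sine cycle across its 4096 entries):

```python
import math

size = 4096
# One full sine cycle spread across 4096 entries, heights in -1.0 .. +1.0
table = [math.sin(2.0 * math.pi * i / size) for i in range(size)]

# The first entries agree with the quoted buffer values (to float32 precision)
print(table[:4])
```

Iterating over such a table at a given rate is exactly how a wavetable player like WavePlayer produces a tone: the table holds curve heights, and the playback rate determines the pitch.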
I've seen the various FFT questions on here but I'm confused about part of the implementation. Instead of performing the FFT in real time, I want to do it offline. Let's say I have the raw data in float[] audio. The sampling rate is 44100, so audio[0] to audio[44099] will contain 1 second's worth of audio. If my FFT function handles the windowing (e.g. Hanning), do I simply put the entire audio buffer into the function in one go? Or do I have to cut the audio into chunks of 4096 (my window size) and then input those into the FFT, which will then apply the windowing function on top?
You may need to copy your input data to a separate buffer and get it in the correct format, e.g. if your FFT is in-place, or if it requires interleaved complex data (real/imaginary). However if your FFT routine can take a purely real input and is not in-place (i.e. non-destructive) then you may just be able to pass a pointer to the original sample data, along with an appropriate size parameter.
Typically for 1s of audio, e.g. speech or music, you would pick an FFT size which corresponds to a reasonably stationary chunk of audio, e.g. 10 ms or 20 ms. So at 44.1 kHz your FFT size might be say 512 or 1024. You would then generate successive spectra by advancing through your buffer and doing a new FFT at each starting point. Note that it's common practice to overlap these successive buffers, typically by 50%. So if N = 1024 your first FFT would be for samples 0..1023, your second would be for samples 512..1535, then 1024..2047, etc.
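The 50%-overlap frame layout described above reduces to simple index arithmetic; a sketch for N = 1024 over 1 second at 44.1 kHz:

```python
n = 1024          # FFT size
hop = n // 2      # 50% overlap: advance by half a frame each time
total = 44100     # one second of samples at 44.1 kHz

starts = list(range(0, total - n + 1, hop))
# First frames cover samples 0..1023, 512..1535, 1024..2047, ...
print(starts[:3], "->", len(starts), "frames")
```

You would apply the window function and run one FFT per `start` index; the trailing samples that don't fill a complete frame are either dropped or zero-padded.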
The choice of whether to calculate one FFT over the entire data set (in the OP's case, 44100 samples representing 1-second of data), or whether to do a series of FFT's over smaller subsets of the full data set, depends on the data, and on the intended purpose of the FFT.
If the data is relatively static spectrally over the full data set, then one FFT over the entire data set is probably all that's needed.
However, if the data is spectrally dynamic over the data set, then multiple sliding FFT's over small subsets of the data would create a more accurate time-frequency representation of the data.
The plot below shows the power spectrum of an acoustic guitar playing an A4 note. The audio signal was sampled at 44.1 kHz and the data set contains 131072 samples, almost 3 seconds of data. This data set was pre-multiplied with a Hann window function.
The plot below shows the power spectrum of a subset of 16384 samples (0 to 16383) taken from the full data set of the acoustic guitar A4 note. This subset was also pre-multiplied with a Hann window function.
Notice how the spectral energy distribution of the subset is significantly different from the spectral energy distribution of the full data set.
If we were to extract subsets from the full data set, using a sliding 16384 sample frame, and calculate the power spectrum of each frame, we would create an accurate time-frequency picture of the full data set.
References:
Real audio signal data, Hann window function, plots, FFT, and spectral analysis were done here:
Fast Fourier Transform, spectral analysis, Hann window function, audio data
The chunk size or window length you pick controls the frequency resolution and the time resolution of the FFT result. You have to determine which you want or what trade-off to make.
Longer windows give you better frequency resolution but worse time resolution; shorter windows, vice versa. Each FFT result bin will contain a frequency bandwidth of roughly 1 to 2 times the sample rate divided by the FFT length, depending on the window shape (rectangular, von Hann, etc.), not just one single frequency. If your entire data chunk is stationary (the frequency content doesn't change), then you may not need any time resolution, and can go for 1 to 2 Hz frequency "resolution" across your 1 second of data. Averaging multiple short FFT windows can also help reduce the variance of your spectral estimates.
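To make the trade-off concrete, taking the bandwidth factor as 1 (rectangular window), here is how bin bandwidth and window duration move in opposite directions:

```python
sample_rate = 44100
for n in (512, 4096, 44100):
    bin_bw_hz = sample_rate / n         # rough frequency resolution
    window_ms = n / sample_rate * 1000  # time spanned by one window
    print(f"N={n:6d}: ~{bin_bw_hz:7.2f} Hz per bin, {window_ms:7.1f} ms per window")
```

The last row is the OP's case: one FFT over the whole second gives ~1 Hz bins but no time resolution at all.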