What does a sample of audio data represent?

I want to know what a single sample of audio data (uncompressed PCM) represents.
It is a number, but what exactly is that number and how come it can be converted back to audio?
For example, if it is a 4-bit sample, does 0 represent absolute silence and 15 represent max volume?
If it is volume, what frequency are we talking about? How is the information about the frequency stored?
In songs we can hear various instruments (frequencies) at the same time, meaning each frequency is somehow stored in a single sample. How is that done?

Audio is just a curve which wobbles up/down with time going left/right. At a given point in time a Sample is a measure of the curve height. Silence is when the curve does not wobble ... it just goes flatline ... at value zero (more accurately, at the middle value of its range from max to min) ... when the curve reaches its maximum height up or down, that stretch of audio is the loudest possible
The notion of normalization is important ... the absolute range of curve values (maximum up or down) is arbitrary ... it could be anything ... let's say the max is 15 and the minimum is 0 ... remember silence is no wobble, so silence sits at the middle of that range, around 7 or 8
Curves can be encoded into any number of bits ... this roughly maps to how many horizontal lines you dice the curve into ... more bits means more lines, so greater accuracy in the value of each Sample of curve height
A sine or cosine curve is considered a pure tone ... Joseph Fourier proved an arbitrary curve (audio or otherwise) can be stored as a set of sine curves with (A) various volumes (max up/down), (B) various frequencies, and (C) various phase offsets ... interestingly, this transformation works in either direction: from a curve of arbitrary shape into a set of the above (A/B/C), or from a set of (A/B/C) back into a synthesized curve of arbitrary shape (this is how audio synthesizers work)
Information about frequency is baked into the curve shape ... it's all about how often the curve wobbles up/down ... lazy wobbles taking a long time to cross from below to above the middle line are low frequency ... a stretch of tightly spaced squiggles implies a high frequency squawk
When a microphone records multiple people all talking at once, or various instruments all emitting their own sounds, we have many simultaneous frequencies, yet the recording somehow just works - how? Think of what happens inside the microphone (or to your eardrum) ... its diaphragm is a flat surface which can only get sloshed back and forth, period ... so however many sources are sounding, the result is one arbitrary curve, and at any point in time a Sample is just the height of that single combined curve as it moves between max and min
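To make that concrete, here is a minimal sketch (Python; the frequencies, volumes, sample rate, and duration are all arbitrary choices) of two pure tones being summed into one curve, and each curve height being quantized to a 4-bit sample in the 0..15 range the question mentions, with silence landing near the middle value:

import math

sample_rate = 8000                      # samples per second (arbitrary for this sketch)
n_samples   = int(sample_rate * 0.01)   # 10 ms of audio

samples_float = []
for n in range(n_samples):
    t = n / sample_rate
    # a 440 Hz tone plus a quieter 1000 Hz tone -> ONE combined curve
    height = 0.5 * math.sin(2 * math.pi * 440 * t) + 0.25 * math.sin(2 * math.pi * 1000 * t)
    samples_float.append(height)        # curve heights, roughly -0.75 .. +0.75

# quantize to 4 bits: map -1..+1 onto the 16 levels 0..15; silence (height 0.0) lands near 7/8
samples_4bit = [max(0, min(15, round((h + 1.0) / 2.0 * 15))) for h in samples_float]
print(samples_4bit[:16])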

Related

How can I convert an audio file (.wav) to a satellite image?

I need to create a piece of software that can capture sound (from a NOAA satellite with an RTL-SDR). The problem is not capturing the sound; the problem is how to convert the audio or waves into an image. I have read about many things (the Fast Fourier Transform, the Hilbert transform, etc.), but I don't know how to apply them.
If you can give me an idea it would be fantastic. Thank you!
Over the past year I have been writing code which makes FFT calls and have amassed 15 pages of notes, so the topic is vast, however I can boil it down
Open up your WAV file ... parse the 44 byte header and note the given bit depth and endianness attributes ... then read across the payload, which is everything after that header ... understand the notions of bit depth and endianness ... typically a WAV file has a bit depth of 16 bits, so each point on the audio curve is stored across two bytes ... typically a WAV file is little endian, not big endian ... knowing what that means, you take the next two bytes, bit shift the high byte one byte to the left (if little endian), then bit OR the pair into an integer ... that integer is a signed 16-bit value (-32768 to 32767), which you convert into its floating point equivalent so your audio curve points now vary from -1 to +1 ... do that conversion for each pair of bytes, which corresponds to each sample of your payload buffer
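A minimal sketch (Python) of that byte-level conversion, assuming the canonical 44-byte header and 16-bit little-endian signed PCM ("input.wav" is a placeholder path; a robust parser would walk the RIFF chunks rather than hard-code the offset):

with open("input.wav", "rb") as f:
    header  = f.read(44)                # skip the canonical 44-byte WAV header
    payload = f.read()                  # everything after the header is sample data

samples = []
for i in range(0, len(payload) - 1, 2):
    lo = payload[i]                     # little endian: low byte comes first
    hi = payload[i + 1]
    value = (hi << 8) | lo              # bit shift the high byte, bit OR in the low byte
    if value >= 32768:                  # 16-bit WAV samples are signed two's complement
        value -= 65536
    samples.append(value / 32768.0)     # normalize so curve points vary from -1.0 to +1.0

print(len(samples), samples[:4])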
Once you have the WAV audio curve as a buffer of floats, which is called raw audio or PCM audio, then perform your FFT api call ... all languages have such libraries ... the output of the FFT call will be a set of complex numbers ... pay attention to the notion of the Nyquist Limit ... this will influence how you make use of the output of your FFT call
Now you have a collection of complex numbers ... the index from 0 to N of that collection corresponds to frequency bins ... the size of your PCM buffer determines how granular your frequency bins are (bin width = sample rate / number of samples) ... in general, more samples in the PCM buffer you send to the FFT api call gives you finer granularity in the output frequency bins ... essentially this means that as you walk across this collection of complex numbers, each index increments the frequency assigned to that index
To visualize this, feed it into a 2D plot where the X axis is frequency and the Y axis is magnitude ... calculate the magnitude for each complex number using
curr_mag = 2.0 * math.Sqrt(curr_real*curr_real+curr_imag*curr_imag) / number_of_samples
For simplicity we will sweep under the carpet the phase shift information available to you in your complex number buffer
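A minimal sketch (Python/NumPy) of this FFT step; a 1 kHz test tone stands in for the PCM buffer read from the WAV file, and the sample rate and window size are arbitrary choices:

import numpy as np

sample_rate = 44100
n = 4096
t = np.arange(n) / sample_rate
samples = np.sin(2 * np.pi * 1000 * t)            # stand-in PCM buffer, floats in -1..+1

spectrum = np.fft.rfft(samples)                   # complex bins 0 .. n/2, up to the Nyquist limit
freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)   # bin k sits at k * sample_rate / n Hz
mags  = 2.0 * np.abs(spectrum) / n                # same magnitude formula as above, phase ignored

peak = np.argmax(mags)
print(f"strongest bin: {freqs[peak]:.1f} Hz, magnitude {mags[peak]:.3f}")
print(f"bin width (frequency resolution): {sample_rate / n:.2f} Hz")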
This only scratches the surface of what you need to master to properly render a WAV file into a 2D plot of its frequency domain representation ... there are libraries which perform parts or all of this, but now you can appreciate some of the magic involved when the rubber hits the road
A great explanation of trade offs between frequency resolution and number of audio samples fed into your call to an FFT api https://electronics.stackexchange.com/questions/12407/what-is-the-relation-between-fft-length-and-frequency-resolution
Do yourself a favor and check out https://www.sonicvisualiser.org/ which is one of many audio workstations that can perform what I described above. Just hit menu File -> Open -> choose a local WAV file -> Layer -> Add Spectrogram ... and it will render the visual representation of the Fourier Transform of your input audio file

How can an audio wave be represented in a long array of floats?

In my application I'm using the sound library Beads (this question isn't specifically about that library).
In the library there's a class WavePlayer. It takes a Buffer, and produces a sound wave by iterating over the Buffer.
Buffers simply wrap a float[].
For example, here's a beginning of a buffer:
0.0 0.0015339801 0.0030679568 0.004601926 0.0061358847 0.007669829 0.009203754 0.010737659 0.012271538 0.0138053885 0.015339206 0.016872987 0.01840673 0.019940428 0.02147408 ...
Its size is 4096 float values.
Iterating over it with a WavePlayer creates a smooth "sine wave" sound. (This buffer is actually a ready-made 'preset' in the Buffer class, i.e. Buffer.SINE).
My question is:
What kind of data does a buffer like this represent? What kind of information does it contain that allows one to iterate over it and produce an audio wave?
Read this post: What's the actual data in a WAV file?
Sound is just a curve. You can represent this curve using integers or floats.
There are two important aspects: bit-depth and sample-rate. First let's discuss bit-depth. Each number in your list (ints/floats) represents the height of the sound curve at a given point in time. For simplicity, when using floats the values typically vary from -1.0 to +1.0, whereas integers may vary from, say, 0 to 2^16 - 1. Importantly, each of these numbers must be stored into a sound file or audio buffer in memory - the resolution/fidelity you choose to represent each point of this curve influences the audio quality and the resultant sound file size. A low fidelity recording may use 8 bits of information per curve height measurement. As you climb the fidelity spectrum, 16 bits, 24 bits ... are dedicated to storing each curve height measurement. More bits equates to more significant digits for floats, or a broader range of integers (16 bits means you have 2^16 integers (0 to 65535) to represent the height of any given curve point).
Now to the second aspect, sample-rate. As you capture/synthesize sound, in addition to measuring the curve height, you must decide how often you measure (sample) the curve height. Typical CD quality records (samples) the curve height 44100 times per second, so the sample-rate would be 44.1kHz. Lower fidelity would sample less often; ultra fidelity would sample at, say, 96kHz or more. So the combination of curve height measurement fidelity (bit-depth) coupled with how often you perform this measurement (sample-rate) together define the quality of sound synthesis/recording.
As with many things, these two attributes should be in balance ... if you change one you should change the other ... so if you lower the sample rate you are reducing the information load and so lowering the audio fidelity ... once you have done this you can then lower the bit depth as well without further compromising fidelity
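For what it's worth, a minimal sketch (in Python rather than the Java used by Beads) of a buffer consistent with the values quoted in the question: one cycle of a sine wave over 4096 points, stored as floats in -1.0 .. +1.0, plus the two knobs discussed above:

import math

N = 4096
buffer = [math.sin(2 * math.pi * i / N) for i in range(N)]
print(buffer[:4])                        # 0.0, 0.00153398..., 0.00306795..., 0.00460192...

# bit-depth: the same curve heights stored as 16-bit integers instead of floats
as_int16 = [round(x * 32767) for x in buffer]

# sample-rate: a player reading this table at 44100 samples per second and stepping
# through one full cycle 440 times per second produces a 440 Hz tone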

How can I convert an audio file containing pure tones back to serialized data?

Signal processing is a new domain for me, and I don't quite know where to start looking for the solution to my problem.
I have a line graph that was converted to an audio file consisting of nothing but pure tones. I'm trying to convert it back to a line graph using Processing, although I think the language is irrelevant. I'm vaguely aware that I may need to use a Fourier transform, but it's not something I'm familiar with.
I've looked at all of the examples provided with Processing using Minim and its spectrum analysis functionality, and I'm still lost as to how I should proceed, or what I should even look for.
I imagine modems and fax machines convert serialized data to audio form and back in much the same way, though I'm not sure how they manage to convert the data back from tonal form.
The basic way to do this is, for each pixel column in the area you are drawing to, to determine which data sample or samples it represents and draw a pixel at the corresponding calculated height.
The gory details:
Disregarding the complexity of compressed audio, an audio file is a set of samples. The samples are recorded at a fixed rate. For audio, common sample rates are 44100, 48000, or 96000 samples per second. The audio file will usually specify this rate. To draw this data, you then map the audio samples to pixels.
For an easy example, say you have 1 second of ECG data recorded at 48000 samples per second. That's 48000 samples in the file. Let the samples be floating point values that range from 0 to 1, though often they are integer samples. And assume you are drawing to a 10 pixel high by 100 pixel wide rectangle.
Given all that, it means that each pixel will represent 480 samples of your data. You can average those 480 samples to get the value that you should draw in the first pixel. To figure out where to fill in the pixel you map the sample's range, 0 to 1, to the drawing rectangle, height 0 to 10. A 0 sample will draw at the bottom of your rectangle, a 1 sample will draw at the top, and a 0.5 sample will draw in the middle. Say the first 480 samples average to 0.1. Then you'd draw a dot at 1 pixel up from the bottom at the left-most pixel in the drawing area, (0,1) relative to the bottom of the drawing rectangle.
Repeat this until you've determined where to draw the pixel for each pixel in your display area.
If you have fewer samples than you have pixels to display into, you'll interpolate values for each pixel. Given the same drawing area, 10 x 100, but only 10 data samples you'll interpolate nine pixel positions for each data sample.
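A minimal sketch (Python) of that mapping, using the numbers from the example above (placeholder data stands in for the real samples):

samples = [0.5] * 48000                 # placeholder: 1 second of data at 48000 samples/second
width, height = 100, 10
per_pixel = len(samples) // width       # 480 samples contribute to each pixel column

points = []
for x in range(width):
    chunk = samples[x * per_pixel:(x + 1) * per_pixel]
    avg = sum(chunk) / len(chunk)       # average the samples belonging to this column
    y = round(avg * height)             # map the 0..1 sample range onto the 0..10 pixel range
    points.append((x, y))               # an average of 0.1 gives (x, 1), as in the example

print(points[:5])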

"Winamp style" spectrum analyzer

I have a program that plots the spectrum analysis (Amp/Freq) of a signal, which is pretty much the DFT converted to polar. However, this is not exactly the sort of graph that, say, Winamp (right at the top-left corner), or effectively any other audio software, plots. I am not really sure what this sort of graph is called (if it has a distinct name at all), so I am not sure what to look for.
I am pretty positive about the frequency axis being base-two exponential; the amplitude axis puzzles me though.
Any pointers?
Actually an interesting question. I know what you are saying; the frequency axis is certainly logarithmic. But what about the amplitude? In response to another poster, the amplitude can't simply be in units of dB alone, because dB has no concept of zero. This introduces the idea of quantization error, SNR, and dynamic range.
Assume that the received digitized (i.e., discrete time and discrete amplitude) time-domain signal, x[n], is equal to s[n] + e[n], where s[n] is the transmitted discrete-time signal (i.e., continuous amplitude) and e[n] is the quantization error. Suppose x[n] is represented with b bits, and for simplicity, takes values in [0,1). Then the maximum peak-to-peak amplitude of e[n] is one quantization level, i.e., 2^{-b}.
The dynamic range is defined to be, in decibels, 20 log10 [(max peak-to-peak |s[n]|) / (max peak-to-peak |e[n]|)] = 20 log10 (1 / 2^{-b}) = 20b log10 2 = 6.02b dB. For 16-bit audio, the dynamic range is about 96 dB. For 8-bit audio, the dynamic range is about 48 dB.
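A quick Python check of that 6.02b rule of thumb:

import math

for b in (8, 16, 24):
    dynamic_range = 20 * b * math.log10(2)     # 20 * log10(2^b)
    print(f"{b}-bit audio: dynamic range = {dynamic_range:.1f} dB")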
So how might Winamp plot amplitude? My guesses:
The minimum amplitude is assumed to be -6.02b dB, and the maximum amplitude is 0 dB. Visually, Winamp draws the window with these thresholds in mind.
Another nonlinear map, such as log(1+X), is used. This function is always nonnegative, and when X is large, it approximates log(X).
Any other experts out there who know? Let me know what you think. I'm interested, too, exactly how this is implemented.
To generate a power spectrum you need to do the following steps:
apply window function to time domain data (e.g. Hanning window)
compute FFT
calculate log of FFT bin magnitudes for N/2 points of FFT (typically 10 * log10(re * re + im * im))
This gives log magnitude (i.e. dB) versus linear frequency.
If you also want a log frequency scale then you will need to accumulate the magnitude from appropriate ranges of bins (and you will need a fairly large FFT to start with).
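A minimal sketch (Python/NumPy) of those three steps, with a synthetic test signal standing in for real time-domain data and a tiny offset added before the log to avoid log(0):

import numpy as np

fs = 44100
n = 4096
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 1000 * t)                 # stand-in time-domain data

windowed = x * np.hanning(n)                     # 1. apply a window function (Hanning)
spectrum = np.fft.fft(windowed)                  # 2. compute the FFT
half = spectrum[:n // 2]                         # keep the N/2 positive-frequency bins
power_db = 10 * np.log10(half.real**2 + half.imag**2 + 1e-12)   # 3. log of bin magnitudes

freqs = np.arange(n // 2) * fs / n               # linear frequency axis for the plot
print(f"highest power near {freqs[np.argmax(power_db)]:.1f} Hz")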
Well, I'm not 100% sure what you mean, but surely it's just bucketing the data from an FFT?
If you want to get the data such that you have (for a 44kHz file) frequency points at 22kHz, 11kHz, 5.5kHz, etc., then you could use a wavelet decomposition, I guess ...
This thread may help ya a bit ...
Converting an FFT to a spectrogram
Same sort of information as a spectrogram I'd guess ...
What you need is a power spectrum graph. You have to compute the DFT of your signal's current window, then take the squared magnitude of each complex value.

Explain the FFT to me

I want to take audio PCM data and find peaks in it. Specifically, I want to return the frequency and time at which a peak occurs.
My understanding of this is that I have to take the PCM data and dump it into an array, setting it as the real values with the complex parts set to 0. I then take the FFT, and I get an array back. If each number in the array is a magnitude value, how do I get the frequency associated with each one? Also, do I take the magnitude of the real & complex part or just discard the complex values?
Finally, if I wanted to find the peaks in a single song, do I just set a small window to FFT and slide it across all of the audio? Any suggestions on how large that window should be?
If the samplerate of your PCM data is F, then the highest frequency component in the FFT is F/2. Suppose your PCM data was sampled at 44100Hz, then your FFT values will run from 0Hz (DC) to 22050Hz. If you start with N samples, (N being a power of 2), then the FFT may return N/2 values representing all positive frequencies from 0 to F/2, or it may return N values that also include the negative frequencies from -F/2 to 0. You should check the specification of your FFT algorithm to find out to which frequency each array item is mapped.
To find the peaks, you need to look at the magnitude of the FFT values. So you need to add the squared real and imaginary parts of each complex value.
Suppose your FFT of N PCM samples returns N/2 complex values representing positive frequencies. Those N/2 values span 0 to F/2, so the distance between 2 adjacent values is F/N Hz. With F=44100Hz and N=1024 samples, this would be about 43Hz. This is your frequency resolution. If you need to find lower frequency beats, the FFT window will need to be extended.
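A minimal sketch (Python/NumPy) tying this together; a 440 Hz test tone stands in for the real PCM data, and F and N are arbitrary choices:

import numpy as np

F = 44100                                   # sample rate of the PCM data
N = 1024                                    # FFT size, a power of 2
t = np.arange(N) / F
pcm = np.sin(2 * np.pi * 440 * t)           # real input; the imaginary parts are implicitly zero

bins = np.fft.rfft(pcm)                     # positive frequencies only: 0 Hz .. F/2
magnitude = np.sqrt(bins.real**2 + bins.imag**2)   # add squared real and imaginary parts
freq_of_bin = np.arange(len(bins)) * F / N  # array index k corresponds to k * F / N Hz

k = np.argmax(magnitude[1:]) + 1            # skip the DC bin when looking for the peak
print(f"peak near {freq_of_bin[k]:.1f} Hz (resolution {F / N:.1f} Hz per bin)")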
Well,
Take a raw array of 512 samples of the input wave, treated as complex numbers with the imaginary parts set to zero (leaving only the real parts), and pass that array to the FFT. Assume a sample rate of 8192 Hz.
Now we have a 512-element array of FFT output values; each value is a complex number that encodes several useful quantities (magnitude and phase).
To get the frequency resolution (the width of each FFT bin) we divide the sample rate by the buffer size:
8192 / 512 = 16 Hz
16 Hz is the resolution of the FFT values, which means we learn which high-amplitude frequencies lie near the multiples of 16 Hz.
For example, if the input contains tones at
frequency : 3 48 23 128 Hz
Amplitude : 10 5 12 8 dB (ref = 1)
then after the FFT the energy shows up in the nearest 16 Hz bins: the 3 Hz tone appears near bin 0, the 48 Hz tone lands exactly on the 48 Hz bin, the 23 Hz tone is smeared between the 16 Hz and 32 Hz bins, and the 128 Hz tone lands exactly on the 128 Hz bin; the amplitudes stay close to the originals but leak slightly into neighbouring bins.
The FFT output is in the frequency domain, meaning it is arranged by frequency.
The time domain, on the other hand, is arranged by time: we listen to music from second zero to second N.
The FFT output, by contrast, is arranged by frequency, from the lowest bin to the highest.
The audio is not sampled continuously (that would mean an essentially infinite number of samples); instead a sample is taken every 1/sample-rate seconds. These samples are buffered (in our case 512 at a time), each buffer of 512 samples is fed to the FFT, and the output is 512 FFT values.
Because the FFT arranges its output by frequency, the notion of time inside the buffer is lost; the values are now ordered by frequency rather than by time.
The output frequencies sit on a regular grid whose step is the resolution computed above: the sample rate divided by the buffer size, in our case 8192/512 = 16 Hz.
So the power is reported every 16 Hz, and a tone's power shows up in whichever bin is nearest to its actual frequency.
Finer frequency resolution is achieved by feeding more samples into each FFT (a larger buffer); for a fixed buffer size, raising the sample rate actually makes each bin wider.
To show the frequencies we print the bin indices in ascending order, each next to its amplitude:
Amplitude = 20 * log10(output / ref)
The amplitude printed next to each index shows the power at that frequency, and it becomes more meaningful as the resolution gets finer.
In conclusion, the FFT produces an array of amplitudes; each amplitude expresses the power of its corresponding index (frequency).
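A minimal sketch (Python) of that amplitude conversion; the reference value and the small floor that avoids log(0) are arbitrary choices here:

import math

ref = 1.0
def to_db(magnitude, floor=1e-12):
    return 20 * math.log10(max(magnitude, floor) / ref)

for mag in (1.0, 0.5, 0.1, 0.0):
    print(f"{mag:4.1f} -> {to_db(mag):7.1f} dB")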
You may actually be looking for a spectrogram, which is basically an FFT of the data in a small window that's slid along the time axis. If you have software that implements this, it might save you some effort. It's what's commonly used for analysing time varying acoustic signals, and is a very useful way to look at sounds. Also, there are some tricks, for example, with windowing data for FFTs, that the spectrogram will probably get right, but will be harder (though not very hard) for you to do correctly.
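If you do roll this yourself, a minimal sliding-window sketch (Python/NumPy; the window size, hop, and stand-in data are arbitrary choices) looks roughly like this:

import numpy as np

F = 44100
pcm = np.random.randn(F * 2)                # stand-in for two seconds of PCM audio
win = 4096                                  # window: ~93 ms, ~10.8 Hz frequency resolution
hop = win // 2                              # 50% overlap between successive windows

for start in range(0, len(pcm) - win, hop):
    frame = pcm[start:start + win] * np.hanning(win)   # window the frame to reduce leakage
    mags = np.abs(np.fft.rfft(frame))
    k = np.argmax(mags[1:]) + 1                        # strongest non-DC bin in this frame
    print(f"t = {start / F:.2f} s, peak near {k * F / win:.0f} Hz")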
