What's the actual data in a WAV file?

What's the actual data in a WAV file? - audio

I'm following the python challenge riddles, and I now need to analyse a wav file. I learn there is a python module that reads the frames, and that these frames are 16bit or 8bit.
What I don't understand, is what does this bits represent? Are these values directly transformed to a voltage applied to the speakers (say via factoring)?

The bits represent the voltage level of an electrical waveform at a specific moment in time.
To convert the electrical representation of a sound wave (an analog signal) into digital data, you sample the waveform at regular intervals, like this:
Each of the blue dots indicates the value of a four-bit number that represents the height of the analog signal at that point in time (the X axis being time, and the Y axis being voltage).
In .WAV files, these points are represented by 8-bit numbers (having 256 different possible values) or 16 bit numbers (having 65536 different possible values). The more bits you have in each number, the greater the accuracy of your digital sampling.

WAV files can actually contain all sorts of things, but it is most typically linear pulse-code modulation (LPCM). Each frame contains a sample for each channel. If you're dealing with a mono file, then each frame is a single sample. The sample rate specifies how many samples per second there are per channel. CD-quality audio is 16-bit samples taken 44,100 times per second.
These samples are actually measuring the pressure level for that point in time. Imagine a speaker compressing air in front of it to create sound, vibrating back and forth. For this example, you can equate the sample level to the position of the speaker cone.

Related

How is WAV data stored in principle? [duplicate]

This question already has answers here:
What do the bytes in a .wav file represent?
(6 answers)
Closed last year.
Each WAV files depends on a Sampling Rate and a Bit Depth. The former governs how many different samples are played per second, and the latter governs how many possibilities there are for each timeslot.
For sampling rate for example 1000 Hz and the bit depth is 8 then each 1/1000 of a second the audio device plays one of a possible $2^8$ different sounds.
Hence the bulk of the WAV file is a sequence of 8-bit numbers. There is also a header which contains the Sampling Rate and Bit Depth and other specifics of how the data should be read:
The above comes from running xxd on a wav file to view it in binary on the terminal. The first column is just increments of 6 in hexadecimal. The last one seems to say where the header ends. So the data looks like this:
Each of those 8-bit numbers is a sample. So the device reads left-to right and converts the samples in order into sounds. But how in principle can each number correspond to a sound. I would think each bit should somehow encode an amplitude and a pitch, with each coming from a finite range. But I can not find any reference to for example the first half of the bits being a pitch and the second being a frequency.
I have found references to the numbers encoding "signal strength" but I do not know what this means.Can anyone explain in principle how the data is read and converted to audio?

In your example, over the course of a second, 1000 values are sent to a DAC (Digital to Analog converter) where the discrete values are smoothed out into a waveform. The pitch is determined by the rate and pattern by which the stream of values (which get smoothed out to a wave) rise and fall.
Steve W. Smith gives some good diagrams and explanations in his chapter ADC and DCA from his very helpful book The Scientists and Engineers Guide to Digital Signal Processing.

How can I convert an audio (.wav) to satellite image

I need to create a software can capture sound (from NOAA Satellite with RTL-SDR). The problem is not capture the sound, the problem is how I converted the audio or waves into an image. I read many things, Fourier Fast Transformed, Hilbert Transform, etc... but I don't know how.
If you can give me an idea it would be fantastic. Thank you!

Over the past year I have been writing code which makes FFT calls and have amassed 15 pages of notes so the topic is vast however I can boil it down
Open up your WAV file ... parse the 44 byte header and note the given bit depth and endianness attributes ... then read across the payload which is everything after that header ... understand notion of bit depth as well as endianness ... typically a WAV file has a bit depth of 16 bits so each point on the audio curve will be stored across two bytes ... typically WAV file is little endian not big endian ... knowing what that means you take the next two bytes then bit shift one byte to the left (if little endian) then bit OR that pair of bytes into an integer then convert that int which typically varies from 0 to (2^16 - 1) into its floating point equivalent so your audio curve points now vary from -1 to +1 ... do that conversion for each set of bytes which corresponds to each sample of your payload buffer
Once you have the WAV audio curve as a buffer of floats which is called raw audio or PCM audio then perform your FFT api call ... all languages have such libraries ... output of FFT call will be a set of complex numbers ... pay attention to notion of the Nyquist Limit ... this will influence how your make use of output of your FFT call
Now you have a collection of complex numbers ... the index from 0 to N of that collection corresponds to frequency bins ... the size of your PCM buffer will determine how granular your frequency bins are ... lookup this equation ... in general more samples in your PCM buffer you send to the FFT api call will give you finer granularity in the output frequency bins ... essentially this means as you walk across this collection of complex numbers each index will increment the frequency assigned to that index
To visualize this just feed this into a 2D plot where X axis is frequency and Y axis is magnitude ... calculate this magnitude for each complex number using
curr_mag = 2.0 * math.Sqrt(curr_real*curr_real+curr_imag*curr_imag) / number_of_samples
For simplicity we will sweep under the carpet the phase shift information available to you in your complex number buffer
This only scratches the surface of what you need to master to properly render a WAV file into a 2D plot of its frequency domain representation ... there are libraries which perform parts or all of this however now you can appreciate some of the magic involved when the rubber hits the road
A great explanation of trade offs between frequency resolution and number of audio samples fed into your call to an FFT api https://electronics.stackexchange.com/questions/12407/what-is-the-relation-between-fft-length-and-frequency-resolution
Do yourself a favor and checkout https://www.sonicvisualiser.org/ which is one of many audio workstations which can perform what I described above. Just hit menu File -> Open -> choose a local WAV file -> Layer -> Add Spectrogram ... and it will render the visual representation of the Fourier Transform of your input audio file as such

How can an audio wave be represented in a long array of floats?

In my application I'm using the sound library Beads (this question isn't specifically about that library).
In the library there's a class WavePlayer. It takes a Buffer, and produces a sound wave by iterating over the Buffer.
Buffers simply wrap a float[].
For example, here's a beginning of a buffer:
0.0 0.0015339801 0.0030679568 0.004601926 0.0061358847 0.007669829 0.009203754 0.010737659 0.012271538 0.0138053885 0.015339206 0.016872987 0.01840673 0.019940428 0.02147408 ...
It's size is 4096 float values.
Iterating over it with a WavePlayer creates a smooth "sine wave" sound. (This buffer is actually a ready-made 'preset' in the Buffer class, i.e. Buffer.SINE).
My question is:
What kind of data does a buffer like this represent? What kind of information does it contain that allows one to iterate over it and produce an audio wave?

read this post What's the actual data in a WAV file?
Sound is just a curve. You can represent this curve using integers or floats.
There are two important aspects : bit-depth and sample-rate. First let's discuss bit-depth. Each number in your list (int/floats) represents the height of the sound curve at a given point in time. For simplicity, when using floats the values typically vary from -1.0 to +1.0 whereas integers may vary from say 0 to 2^16 Importantly, each of these numbers must be stored into a sound file or audio buffer in memory - the resolution/fidelity you choose to represent each point of this curve influences the audio quality and resultant sound file size. A low fidelity recording may use 8 bits of information per curve height measurement. As you climb the fidelity spectrum, 16 bits, 24 bits ... are dedicated to store each curve height measurement. More bits equates with more significant digits for floats or a broader range of integers (16 bits means you have 2^16 integers (0 to 65535) to represent height of any given curve point).
Now to the second aspect sample-rate. As you capture/synthesize sound in addition to measuring the curve height, you must decide how often you measure (sample) the curve height. Typical CD quality records (samples) the curve height 44100 times per second, so sample-rate would be 44.1kHz. Lower fidelity would sample less often, ultra fidelity would sample at say 96kHz or more. So the combination of curve height measurement fidelity (bit-depth) coupled with how often you perform this measurement (sample-rate) together define the quality of sound synthesis/recording
As with many things these two attributes should be in balance ... if you change one you should change the other ... so if you lower sample rate you are reducing the information load and so are lowering the audio fidelity ... once you have done this you can then lower the bit depth as well without further compromising fidelity

Basic unit of Sound?

If we consider computer graphics to be the art of image synthesis where the basic unit is a pixel.
What is the basic unit of sound synthesis?
[This relates to programming as I want to generate this via a computer program.]
Thanks!

The basic unit is a sample
In a WAVE file, the sample is just an integer specifying where to move the speaker head to.
The sample rate determines how often a new sample is fed to the speakers (I'm not entirely sure how this part works, but it does get converted to an analog signal first). The samples are typically laid out in the file one right after another.
When you plot all the samples with x-axis being time and y-axis being sample_value, you can see the waveform.
In a wave file, samples can (theoretically) be any bit-size from 0-65535, which remains constant throughout the wave file. But typically 16 or 24 bits are used.

Computer graphics can also have vector shapes as basic units, not just pixels. Generally, vector graphics are generated via computer tools while captured data tends to appear as a grid of pixels (corresponding to an array of sensors in a camera or other capture device). Obviously there is considerable crossover between those classifications.
Similarly, there are sampled (such as .WAV) and generative (such as .MIDI) forms of computer audio. In the sampled case, the smallest unit is a single sample. Just like an array of pixels in the brightness, x- and y-dimensions come together to form an image, an array of samples in the loudness and time dimensions come together to form a sound. In the generative case, it will be something more like a single tone rendered in a particular voice just like vector graphics have paths drawn with particular textures.

A pixel can have a value and be encoded in digital bitmap samples. The same properties apply to sound and digital audio samples.
A pixel is a physical device that can only render the amplitudes of 3 frequencies of light (Red, Green, Blue) at a time. A speaker is a physical device that can render the amplitudes of a wide range of frequencies (~40,000) at a time. The bit resolution of a sample (number of bits used to to store the value of a sample) mainly determines how many colors/tones can be rendered - the fidelity of the physical playback device.
Also, as patterns of pixels can be encoded or compressed, most patterns of sound samples are also encoded or compressed (or both).

The fundamental unit of signal processing (of which audio is a special case) would be the sample.
The frequency at which you need to sample a signal depends on the maximum frequency present in the waveform. Sampling theorem states that it is normally sufficient to sample at twice the frequency of the maximum frequency present in the signal.
http://en.wikipedia.org/wiki/Sampling_theorem
The human ear is sensitive to sounds up to around 20kHz (the upper frequency lowers with age). This is why music on CD is sampled at 44kHz.
It is often more useful to think of music as being comprised of individual frequencies.
http://www.phys.unsw.edu.au/jw/sound.spectrum.html
Most sound analysis and creation is based on this idea.
Related concepts:
Psychoacoustics: Human perception of sound. Relates to modern sound compression techniques such as mp3.
Fourier series: How complex waveforms are composed of individual frequencies.

I would say the basic unit of sound synthesis is the sine wave. But your definition of synthesis is perhaps different to what audio people would refer to sound synthesis. Sound systhesis is the creation of sound using the fundamental components of sound.
With sine waves, we can synthesise sounds using many techniques such as substractive synthesis, additive synthesis or FM synthesis.
Fourier theory states that every sound is a summation of sine waves of differing phases, frequencies and amplitudes.
OK, so how do we represent a sine wave on a computer? well, a sine wave will be generated using a buffer(array) of 'samples' that have been generated by a function or read from a table. The same technique applies to any sound captured on a computer.
A 'sample' is typically represented as number between -1 and 1 that directly correlates to the amplitude of a sound at a given moment in time. A typical sound recorded at 16 bit depth, would have 65536 (2pow16) possible amplitude values. When being recorded, typically, a sample will be captured 44.1k per second of sound. This is called the sampling frequency rate, or simply the sample rate.
Upon playback from you computer, each sample will pass though an Digital to Analogue converter and generate a vibration on your pc speaker and will in turn cause your ear to percieve the recorded sound.

Sound can be expressed as several different units, but the most common in synthesis/computer music is decibels (dB), which are a relative logarithmic measure of amplitude. Specifically they are normally relative to the maximum amplitude of the audio system.
When measuring sound in "real life", the units are normally A-weighted Decibels or dB(A).
The frequency of a sound (i.e. its pitch) is its amplitude over time, or in the digital world, its amplitude over samples. The number of samples per unit of real time is called the sampling rate; conventional hi-fi systems have sampling rates of 44 kHz (44,000 samples per second) and synthesis/recording software usually supports up to 96 kHz.
Everything sound in the digital domain can be represented as a waveform with the X-axis representing the time (or sample number) and the Y-axis representing the amplitude.

frequency and amplitude of the wave are what make up sound.
That is for a tone.
Music or for that matter most noise is a composite of multiple simultaneous sound waves superimposed on one another.
The unit for amplitute is the
Bel. (We use tenths of a Bel
therefore the term decibel)
The unit for frequency is the
Hertz.
That being said synthesis of music is a large field.

Bitmapped graphics are based on sampling the amplitude of light in a 2D space, where each sample is digitized to a given bit depth and often converted to a logarithmic representation at a different bit depth. The samples are always positive, since you can't be darker than pure black. Each of these samples is called a pixel.
Sound recording is most often based on sampling the magnitude of sound pressure at a microphone, where the samples are taken at constant time intervals. These samples can be positive or negative with respect to perfect silence. Most often these samples are not converted to a logarithm, even though sound is perceived in a logarithmic fashion just as light is. There is no special term to refer to these samples as there is with pixels.
The Bels and Decibels mentioned by others are useful in the context of measuring peak or average sound levels. They are not used to describe the individual sound samples.
You might also find it useful to know how sound file formats compare to image file formats. WAVE is an uncompressed format specific to Windows and is analogous to BMP. MP3 is a lossy compression analogous to JPEG. FLAC is a lossless compression analogous to 24-bit PNG.

If computer graphics are colored dots in 2 dimensional space representing a 3 dimensional space, then sound synthesis is amplitude values regularly partitioned in time representing musical events.
If you want your result to sound like music (the kind of music most people like at least), then you are either going to use some standard synthesis techniques, or literally waste decades of your life reinventing them from scratch.
The most basic techniques are additive synthesis, in which the individual elements are the frequencies, amplitudes, and phases of sine oscillators; subtractive synthesis, where you work with filter coefficients and a complex input waveform; frequency modulation synthesis, where you work with modulation depths and rates of stages of modulation; granular synthesis where short (hundredths to tenths of a second long) enveloped pieces of a recorded sound or an artificial waveform are combined in immense numbers. Each of these in practice uses parameters that evolve over the course of a note, and often you will mix elements of various techniques into a larger instrument.
I recommend this book, though it doesn't have the math for many concepts it at least lays the ground for the concepts used, and gives a nice overview of the techniques.
You wouldn't waste your time going sample by sample to do music in practice any more than you would waste your time going pixel by pixel to render 3d (in other words yeah go sample by sample if making a tool for other people to make music with, but that is way too low a level if you are interested in the task of making music).

Probably the envelope. A tone/note has a shape described by: attack decay sustain release

The byte, or word, depending on the bit-depth of the sound.

Sound pressure display for WAVE PCM data

The digital sound is playing using DirectSound device. It is necessary to display sound activity in decibels - like analog devices do.
What is the right way to calculate sound pressure from the WAVE PCM data (44100 Hz, 16-bit)?

if you just need an "idea" of the sound pressure, you can simply compute the log-energy on some time franmes of the signal: split the signal every N samples, compute 10*log(sum(xn**2)) where x are the N samples, and you get a value in the dB domain. If you need to precisely display a measure (that is your 0 dB matches say a mixtable 0dB), it is a bit more complicated.
See here for more details:
http://music.columbia.edu/pipermail/music-dsp/2002-April/048341.html

Sound pressure is a measure of force per unit area. To determine this you would have to have information about the speaker(s) on which the audio is played. You can obtain a decibel level with respect to an arbitrary reference (as opposed to the threshold of hearing) with the algorithm proposed by cournape.
Calculate the average signal power over a time interval, compute the base-10 logarithm and multiply by 19. The average power is calculated by averaging the the square of each sample over the interval. Note that positive and negative values are necessary (i.e. it must be an AC signal). So, make sure the PCM values are either floating-point, 2's complement or offset unsigned values accordingly.
Also, by applying Parseval's theorum and the Fourier transform you can also generate signal levels for different frequency bands.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string