WAV File: How is the Data Subchunk Stored?

I'm currently trying to learn about how WAV files are processed and stored. Most of the resources I've looked at clearly explain how the header chunk is processed, but not the data (this is the one I've found the most helpful). From the WAV file I'm inspecting I get:
NumChannels = 2
SampleRate = 44100
BitsPerSample = 16
Subchunk2Size = 2056192 (11.65s audio file).
NumSamples = 514048
So from my understanding, 44100 samples are played in a second and each sample is 16 bits. There is a total of 514048 samples in this recording. But what about the number of channels? How does that affect reading the data? The resource I mentioned shows:
But I don't quite understand what this means. Isn't this showing a sample being 32-bit? And what about the right and left channels? Wouldn't they alternate? Why are they in groups of 2 before changing to the other channel?

The diagram is somewhat unclear, but this is what I understand from it, plus the other information you gave:
each ellipse contains 16 bits (two bytes, four hex digits), so one sample;
there are pairs of samples;
the label "right channel samples" points to the right-hand sample of each pair;
similarly, "left channel samples" points to the left-hand samples.
So it looks to me that the left and right channel samples do alternate.
As for the numbering, I guess the intent was to show that the first pair of samples are each "sample 2" in their respective channels, followed by a pair that are "sample 3", and so on. I would have labelled them "sample pair 2" etc.
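To tie the numbers above to that interleaved layout, here is a minimal Python sketch (Python is used purely for illustration, and "example.wav" is a placeholder filename) that reads a 16-bit stereo WAV with the standard-library wave and struct modules and splits the interleaved samples into left and right channels:

```python
import wave
import struct

# "example.wav" is a placeholder; a 16-bit stereo LPCM file is assumed.
with wave.open("example.wav", "rb") as wav:
    num_channels = wav.getnchannels()   # 2
    sample_rate = wav.getframerate()    # 44100
    sample_width = wav.getsampwidth()   # 2 bytes = 16 bits
    num_frames = wav.getnframes()       # 514048 (one sample per channel per frame)
    raw = wav.readframes(num_frames)

# Subchunk2Size = num_frames * num_channels * sample_width
#               = 514048 * 2 * 2 = 2056192 bytes
# Duration      = num_frames / sample_rate ≈ 514048 / 44100 ≈ 11.65 s

# Interleaved layout: L0, R0, L1, R1, ... (signed 16-bit little-endian integers)
samples = struct.unpack("<{}h".format(num_frames * num_channels), raw)
left = samples[0::2]
right = samples[1::2]
```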

Related

How is WAV data stored in principle?

Each WAV file depends on a Sampling Rate and a Bit Depth. The former governs how many samples are played per second, and the latter governs how many possible values there are for each timeslot.
If the sampling rate is, for example, 1000 Hz and the bit depth is 8, then every 1/1000 of a second the audio device plays one of $2^8 = 256$ possible values.
Hence the bulk of the WAV file is a sequence of 8-bit numbers. There is also a header which contains the Sampling Rate and Bit Depth and other specifics of how the data should be read:
The above comes from running xxd on a wav file to view it in binary on the terminal. The first column is just the byte offset, shown in hexadecimal and increasing in steps of 16. The last one seems to say where the header ends. So the data looks like this:
Each of those 8-bit numbers is a sample. So the device reads left-to-right and converts the samples in order into sounds. But how, in principle, can each number correspond to a sound? I would think each number should somehow encode an amplitude and a pitch, with each coming from a finite range. But I cannot find any reference to, for example, the first half of the bits encoding a pitch and the second half an amplitude.
I have found references to the numbers encoding "signal strength", but I do not know what this means. Can anyone explain, in principle, how the data is read and converted to audio?
In your example, over the course of a second, 1000 values are sent to a DAC (digital-to-analog converter), where the discrete values are smoothed out into a waveform. The pitch is determined by the rate and pattern with which the stream of values (which gets smoothed out into a wave) rises and falls.
Steven W. Smith gives some good diagrams and explanations in his chapter "ADC and DAC" from his very helpful book The Scientist and Engineer's Guide to Digital Signal Processing.
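To make "a stream of values that rises and falls" concrete, here is a small sketch (in Python, purely for illustration; the 100 Hz tone and the filename "tone.wav" are arbitrary choices) that generates one second of a sine wave at the 1000 Hz sample rate and 8-bit depth used in the example above and writes it out with the standard wave module:

```python
import math
import wave

SAMPLE_RATE = 1000   # samples per second, as in the example above
FREQ = 100           # tone frequency in Hz (must stay below SAMPLE_RATE / 2)
DURATION = 1.0       # seconds

# Each sample is just the quantised height of the waveform at that instant.
# 8-bit WAV samples are unsigned, with 128 representing silence.
samples = bytearray()
for n in range(int(SAMPLE_RATE * DURATION)):
    value = math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE)   # -1.0 .. 1.0
    samples.append(int(round(128 + 127 * value)))            # map to 1 .. 255

with wave.open("tone.wav", "wb") as out:   # placeholder filename
    out.setnchannels(1)           # mono
    out.setsampwidth(1)           # 1 byte = 8 bits per sample
    out.setframerate(SAMPLE_RATE)
    out.writeframes(bytes(samples))
```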

What's the actual data in a WAV file?

I'm following the Python Challenge riddles, and I now need to analyse a WAV file. I've learned there is a Python module that reads the frames, and that these frames are 16-bit or 8-bit.
What I don't understand is what these bits represent. Are these values directly transformed to a voltage applied to the speakers (say, via some scaling factor)?
The bits represent the voltage level of an electrical waveform at a specific moment in time.
To convert the electrical representation of a sound wave (an analog signal) into digital data, you sample the waveform at regular intervals, like this:
Each of the blue dots indicates the value of a four-bit number that represents the height of the analog signal at that point in time (the X axis being time, and the Y axis being voltage).
In .WAV files, these points are represented by 8-bit numbers (having 256 different possible values) or 16-bit numbers (having 65536 different possible values). The more bits you have in each number, the greater the accuracy of your digital sampling.
WAV files can actually contain all sorts of things, but it is most typically linear pulse-code modulation (LPCM). Each frame contains a sample for each channel. If you're dealing with a mono file, then each frame is a single sample. The sample rate specifies how many samples per second there are per channel. CD-quality audio is 16-bit samples taken 44,100 times per second.
These samples are actually measuring the pressure level for that point in time. Imagine a speaker compressing air in front of it to create sound, vibrating back and forth. For this example, you can equate the sample level to the position of the speaker cone.
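As a rough illustration of reading those frames and turning the bits into levels, here is a hedged Python sketch ("riddle.wav" is a placeholder filename, and a 16-bit LPCM file is assumed) using the standard wave and struct modules:

```python
import wave
import struct

with wave.open("riddle.wav", "rb") as wav:   # placeholder filename
    assert wav.getsampwidth() == 2           # this sketch assumes 16-bit samples
    frames = wav.readframes(wav.getnframes())

# 16-bit WAV samples are signed little-endian integers in -32768..32767.
# For a stereo file the values alternate: L0, R0, L1, R1, ...
values = struct.unpack("<{}h".format(len(frames) // 2), frames)

# Normalise to -1.0..1.0; this is, proportionally, the level the DAC
# converts to an output voltage (i.e. cone position) at each sample instant.
levels = [v / 32768.0 for v in values]
```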

Signal Processing and Audio Beat Detection

I am trying to do some work with basic Beat Detection (in C and/or Java) by following the guide from GameDev.net. I understand the logic behind the implementation of the algorithms, however I am confused as to how one would get the "sound amplitude" data for the left and right channels of a song (i.e. an mp3 or wav).
For example, he starts with the following assumption:
In this model we will detect sound energy variations by computing the average sound energy of the signal and comparing it to the instant sound energy. Let's say we are working in stereo mode with two lists of values: $(a_n)$ and $(b_n)$. $(a_n)$ contains the list of sound amplitude values captured every $T_e$ seconds for the left channel, $(b_n)$ the list of sound amplitude values captured every $T_e$ seconds for the right channel.
He then proceeds to manipulate $(a_n)$ and $(b_n)$ using the algorithms that follow. I am wondering how one would do the Signal Processing necessary to get $(a_n)$ and $(b_n)$ every $T_e$ seconds for both channels, so that I can begin to follow his guide and mess around with some simple Beat Detection in songs.
An uncompressed audio file (a .wav or .aiff, for example) is for the most part a long array of samples. Each sample consists of the amplitude at a given point in time. When music is recorded, many of these amplitude samples are taken each second.
For stereo (2-channel) audio files, the samples in the array usually alternate channels: [sample1 left, sample1 right, sample2 left, sample2 right, etc...].
Most audio parsing libraries will already have a way of returning the samples separately for each channel.
Once you have the sample array for each channel, it is easy to find the samples for a particular second, as long as you know the sample rate, or number of samples per second. For example, if the sample rate for your file is 44100 samples per second, and you want to capture the samples in the nth second, you would use the part of your vector that is between (n * 44100) and ((n + 1) * 44100).
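Here is a minimal Python sketch of that indexing (Python and the placeholder data are just for illustration; in practice the interleaved samples would come from whatever decoding library you use):

```python
SAMPLE_RATE = 44100   # samples per second per channel, as in the answer above

# Placeholder: pretend these are interleaved 16-bit samples [L0, R0, L1, R1, ...];
# in practice they would come from a WAV/MP3 decoding library.
samples = [0] * (2 * 10 * SAMPLE_RATE)   # 10 seconds of silent stereo

left = samples[0::2]    # (a_n): left-channel amplitude values
right = samples[1::2]   # (b_n): right-channel amplitude values

def nth_second(channel, n, rate=SAMPLE_RATE):
    """Return the samples of one channel covering second n (0-based)."""
    return channel[n * rate:(n + 1) * rate]

a_block = nth_second(left, 3)    # e.g. the 4th second of the left channel
b_block = nth_second(right, 3)
```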

Correctly decoding/encoding raw PCM data

I'm writing my WAVE decoder/encoder in C++. I've managed to correctly convert between different sample sizes (8, 16 and 32), but I need some help with the channels and the frequency.
Channels:
If I want to convert from stereo to mono:
do I just take the data from one channel (which one? 1 or 2?)?
or do I take the average from channel 1 and 2 for the mono channel.
If I want to convert from mono to stereo:
(I know this is not very scientific)
can I simply copy the samples from the single mono channel into both stereo channels?
is there a more scientific method to do this (e.g. interpolation)?
Sample rate:
How do I change the sample rate (resample), e.g. from 44100 Hz to 22050 Hz:
do I simply take the average of 2 sequential samples for the new (lower-frequency) value?
Are there any more scientific algorithms for this?
Stereo to mono - take the mean of the left and right samples, i.e. M = (L + R) / 2 - this works for the vast majority of stereo content, but note that there are some rare cases where you can get left/right cancellation.
Mono to stereo - put the mono sample in both left and right channels, i.e. L = R = M - this gives a sound image which is centered when played as stereo.
Resampling - for a simple integer ratio downsampling as in your example above, the process is:
low pass filter to accommodate new Nyquist frequency, e.g. 10 kHz LPF for 22.05 kHz sample rate
decimate by required ratio (i.e. drop alternate samples for your 2x downsampling example)
Note that there are third-party libraries such as libsamplerate which can handle resampling for you in the general case, so if you have more than one ratio you need to support, or you have some tricky non-integer ratio, then this might be a better approach.
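A minimal sketch of those three operations, using Python with numpy/scipy purely for illustration (the question is about C++, and the placeholder arrays stand in for samples you have already decoded); scipy.signal.decimate applies an anti-aliasing low-pass filter before dropping samples, i.e. both resampling steps above in one call:

```python
import numpy as np
from scipy.signal import decimate   # low-pass filter + downsample in one call

# Placeholder data: assume left/right are int16 arrays already decoded from the file.
left = np.zeros(44100, dtype=np.int16)
right = np.zeros(44100, dtype=np.int16)

# Stereo -> mono: M = (L + R) / 2 (widen first to avoid int16 overflow)
mono = ((left.astype(np.int32) + right.astype(np.int32)) // 2).astype(np.int16)

# Mono -> stereo: L = R = M, interleaved as [L0, R0, L1, R1, ...]
stereo = np.column_stack((mono, mono)).ravel()

# 44100 Hz -> 22050 Hz: anti-alias low-pass filter, then keep every other sample
half_rate = decimate(mono.astype(np.float64), 2)
```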

What is a "sample" in MP3?

It is said that an MP3 frame contains 1152 samples. What is a "sample", then? A pair of values for the right AND left channels? Or an individual value for the right OR left channel?
The language that is used can get a little bit confusing. The gist of it is that each frame will have 1152 samples (or 384, or 576, depending on MPEG version and layer) per audio channel. How that data actually gets stored is more complicated than a single value for each channel because of compression.
If you want to learn more I would recommend http://blog.bjrn.se/2008/10/lets-build-mp3-decoder.html for a nice, detailed blog that builds up the reader's understanding of the MP3 format for the sake of building a decoder.
You can also see http://wiki.hydrogenaudio.org/index.php?title=MP3#Polyphase_Filterbank_Formula for rather technical information. Link is anchored to a section that says specifically: "Audio is processed by frames of 1152 samples per audio channel" But the whole page describes aspects of the MP3 format.
MP3 takes in 2304 16-bit PCM samples, 1152 from each channel, and essentially performs an overlapped MDCT on them, such that you get 576 frequency-domain components per channel. Because it is half-overlapped, the next MDCT transform will include 576 new and 576 old samples per channel, and output 576 samples per channel, so you get a 1:1 sample mapping from the time to the frequency domain.
The psychoacoustic model is what performs the lossy compression, and I don't know the details. The output of this gets Huffman-coded (which is lossless compression).
Each MP3 frame contains 2 granules of 576 samples per channel (each granule corresponding to 576 new and 576 old PCM samples). That means 576 samples per channel per granule, or 1152 per channel per frame, so 2304 samples in total for stereo. Each granule contains Huffman bits for both channels and scale factors for both channels. The side information in the frame is used by the Huffman decoder.
Sample typically refers to a point in time, so this would include both the left and right channels, but you can separate them.
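As a quick sanity check on those counts, here is a tiny Python calculation (the 44.1 kHz sample rate is just an example, not something fixed by the MP3 format):

```python
SAMPLE_RATE = 44100          # Hz, example rate
SAMPLES_PER_FRAME = 1152     # per channel, MPEG-1 Layer III
CHANNELS = 2

frame_duration = SAMPLES_PER_FRAME / SAMPLE_RATE      # ≈ 0.0261 s per frame
frames_per_second = SAMPLE_RATE / SAMPLES_PER_FRAME   # ≈ 38.3 frames per second

# One stereo frame covers 2 * 1152 = 2304 PCM samples in total,
# arranged as 2 granules of 576 samples per channel.
total_pcm_samples = CHANNELS * SAMPLES_PER_FRAME      # 2304
```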
