What is a "sample" in MP3? - audio

It is said that an MP3 frame contains 1152 samples. What is a "sample" then? A pair of values for the right AND left channel? Or an individual value for the right OR left channel?

The language that is used can get a little bit confusing. The gist of it is that each frame holds 1152 (or 384, or 576, depending on MPEG version and layer) samples per audio channel. How that data actually gets stored is more complicated than a single value per channel because of compression.
If you want to learn more I would recommend http://blog.bjrn.se/2008/10/lets-build-mp3-decoder.html for a nice, detailed blog that builds up the reader's understanding of the MP3 format for the sake of building a decoder.
You can also see http://wiki.hydrogenaudio.org/index.php?title=MP3#Polyphase_Filterbank_Formula for rather technical information. The link is anchored to a section that specifically says: "Audio is processed by frames of 1152 samples per audio channel", but the whole page describes aspects of the MP3 format.
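For a concrete sense of what those per-frame sample counts mean in time, here is a rough Python sketch; the table holds the commonly cited MPEG-1/MPEG-2 figures, and the function and names are just illustrative:

```python
# Samples per frame, per audio channel, for common MPEG audio variants.
# (MPEG-1: Layer I = 384, Layers II/III = 1152; MPEG-2/2.5: Layer III drops to 576.)
SAMPLES_PER_FRAME = {
    ("MPEG-1", 1): 384,
    ("MPEG-1", 2): 1152,
    ("MPEG-1", 3): 1152,
    ("MPEG-2", 1): 384,
    ("MPEG-2", 2): 1152,
    ("MPEG-2", 3): 576,
}

def frame_duration_ms(version: str, layer: int, sample_rate: int) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 * SAMPLES_PER_FRAME[(version, layer)] / sample_rate

# An MPEG-1 Layer III frame at 44.1 kHz lasts about 26.12 ms.
print(frame_duration_ms("MPEG-1", 3, 44100))
```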

MP3 takes in 2304 16-bit PCM samples, 1152 from each channel, and essentially performs an overlapped MDCT on them, such that you get 576 frequency-domain components per channel. Because it is half overlapped, the next MDCT transform will include 576 new and 576 old samples per channel, and output 576 samples per channel, so you get a 1:1 sample mapping from the time to the frequency domain.
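To make the half-overlapped MDCT point concrete, here is a naive Python sketch of the textbook MDCT definition: 2N time samples in, N frequency coefficients out. A real encoder also applies windowing and a polyphase filterbank, which are omitted here.

```python
import math

def mdct(x):
    """Naive MDCT: 2N time-domain samples in, N frequency coefficients out.

    Consecutive transforms overlap by N samples (50%), so each block of N
    *new* samples produces N coefficients -- a 1:1 time-to-frequency mapping.
    """
    n = len(x) // 2
    return [
        sum(
            x[i] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
            for i in range(2 * n)
        )
        for k in range(n)
    ]
```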
The psychoacoustic model is what performs the lossy compression, and I don't know the details. The output of this gets Huffman coded (which is lossless compression).
Each MP3 frame contains 2 granules of 576 samples per channel (each granule corresponds to 576 new and 576 old PCM samples). That is 576 samples per channel per granule, or 1152 samples total for stereo. Each frame therefore corresponds to 1152 new PCM samples per channel, so 2304 samples in total. Each granule contains the Huffman bits and scale factors for both channels, and the side information in the frame is used by the Huffman decoder.
Sample typically refers to a point in time, so this would include both the left and right channels, but you can separate them.
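To see that distinction in code, here is a minimal sketch, assuming interleaved 16-bit little-endian stereo PCM; the function name is illustrative:

```python
import struct

def split_stereo(raw: bytes):
    """Split interleaved 16-bit little-endian stereo PCM into two channels.

    Each *sample instant* contributes one left value and one right value,
    so 1152 sample instants of stereo audio occupy 2304 16-bit values.
    """
    values = struct.unpack("<%dh" % (len(raw) // 2), raw)
    left = values[0::2]   # even positions: left channel
    right = values[1::2]  # odd positions: right channel
    return left, right
```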

Related

How is WAV data stored in principle?

Each WAV file depends on a Sampling Rate and a Bit Depth. The former governs how many samples are played per second, and the latter governs how many possible values there are for each timeslot.
If the sampling rate is, for example, 1000 Hz and the bit depth is 8, then every 1/1000 of a second the audio device plays one of $2^8 = 256$ possible values.
Hence the bulk of the WAV file is a sequence of 8-bit numbers. There is also a header which contains the Sampling Rate and Bit Depth and other specifics of how the data should be read. Running xxd on a WAV file to view it on the terminal shows this: the first column is just the byte offset in hexadecimal, and the last column (the ASCII view) seems to show where the header ends and the sample data begins.
Each of those 8-bit numbers is a sample. So the device reads left to right and converts the samples in order into sounds. But how, in principle, can each number correspond to a sound? I would think each number should somehow encode an amplitude and a pitch, each coming from a finite range. But I cannot find any reference to, for example, the first half of the bits being an amplitude and the second half being a pitch.
I have found references to the numbers encoding "signal strength", but I do not know what this means. Can anyone explain in principle how the data is read and converted to audio?
In your example, over the course of a second, 1000 values are sent to a DAC (Digital to Analog Converter), where the discrete values are smoothed out into a waveform. The pitch is determined by the rate and pattern with which that stream of values rises and falls.
Steven W. Smith gives some good diagrams and explanations in his chapter on ADC and DAC in his very helpful book The Scientist and Engineer's Guide to Digital Signal Processing.
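As a rough illustration of that idea, here is a minimal Python sketch that writes a 440 Hz tone as 8-bit samples using the standard-library wave module. Every value is just the instantaneous amplitude of the waveform; the pitch is not stored anywhere explicitly, it emerges from how quickly the values rise and fall. The file name and parameters are arbitrary:

```python
import math
import wave

SAMPLE_RATE = 8000   # samples per second
FREQ = 440           # pitch of the tone in Hz
DURATION = 1.0       # seconds

# Each sample is one unsigned 8-bit number: the amplitude of the wave
# at that instant, mapped from the range -1.0..1.0 to 0..255.
samples = bytearray()
for i in range(int(SAMPLE_RATE * DURATION)):
    value = math.sin(2 * math.pi * FREQ * i / SAMPLE_RATE)   # -1.0 .. 1.0
    samples.append(int(127.5 + 127.5 * value))               # 0 .. 255

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)          # mono
    w.setsampwidth(1)          # 8-bit samples
    w.setframerate(SAMPLE_RATE)
    w.writeframes(bytes(samples))
```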

How does an audio converter work?

I currently have the idea to code a small audio converter application (e.g. FLAC to MP3 or M4A format) in C# or Python, but my problem is that I do not know at all how audio conversion works.
After some research, I came across analog-to-digital / digital-to-analog converters, but I guess this would be digital-to-digital conversion or something like that, wouldn't it?
If someone could precisely explain how it works, it would be greatly appreciated.
Thanks.
digital audio in its raw form is called PCM, which is the fundamental format for any audio processing system ... it's uncompressed ... just a series of integers representing the height of the audio curve at each sample of the curve (the Y axis, where time is the X axis along this curve)
... this PCM audio can be compressed using some codec then bundled inside a container often together with video or meta data channels ... so to convert audio from A to B you would first need to understand the container spec as well as the compressed audio codec so you can decompress audio A into PCM format ... then do the reverse ... compress the PCM into codec of B then bundle it into the container of B
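In practice you rarely write the decode and re-encode steps yourself; most converters drive a library or tool that already understands the containers and codecs. Here is a minimal Python sketch shelling out to the ffmpeg command-line tool; it assumes ffmpeg is installed and on the PATH, and the file names are illustrative:

```python
import subprocess

def convert(src: str, dst: str) -> None:
    """Convert src to dst: ffmpeg decodes src to PCM internally,
    then re-encodes it with the codec implied by dst's extension."""
    subprocess.run(["ffmpeg", "-y", "-i", src, dst], check=True)

convert("input.flac", "output.mp3")
```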
Before venturing further into this I suggest you master the art of WAVE audio files ... the beauty of WAVE is that it's just a 44-byte header followed by the uncompressed integers of the audio curve ... write some code to read a WAVE file, then parse the header (identify bit depth, sample rate, channel count, endianness) to enable you to iterate across each audio sample for each channel ... prove that it's working by sending your bytes into an output WAVE file ... diff the input WAVE against the output WAVE, as they should be identical ... once mastered you are ready to venture into your above stated goal ... do not skip over grokking the notion of interleaving stereo audio, as well as spreading a single audio sample with a bit depth of 16 bits across two bytes of storage, and the reverse, namely stitching together multiple bytes into a single integer with a bit depth of 16, 24 or even 32 bits while keeping endianness squared away ... this may sound scary at first, however all the necessary details are on the net, as this is how I taught myself this level of detail
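As a starting point for that exercise, here is a minimal Python sketch that parses the canonical 44-byte PCM WAVE header and de-interleaves 16-bit samples. Real files can contain extra chunks, so treat it as a sketch rather than a robust parser; all names are illustrative:

```python
import struct

def read_wav(path):
    """Parse a canonical 44-byte PCM WAVE header and return
    (sample_rate, bits_per_sample, channels, frames)."""
    with open(path, "rb") as f:
        header = f.read(44)
        (riff, _size, wave_id, _fmt_id, _fmt_size, audio_format,
         channels, sample_rate, _byte_rate, _block_align,
         bits_per_sample, data_id, data_size) = struct.unpack(
            "<4sI4s4sIHHIIHH4sI", header)
        assert riff == b"RIFF" and wave_id == b"WAVE" and data_id == b"data"
        assert audio_format == 1, "only uncompressed PCM handled here"
        assert bits_per_sample == 16, "this sketch only handles 16-bit samples"
        data = f.read(data_size)

    # WAVE stores little-endian samples; 16-bit stereo is interleaved as
    # L0 R0 L1 R1 ...  -- stitch each pair of bytes into one signed integer.
    values = struct.unpack("<%dh" % (len(data) // 2), data)
    frames = list(zip(*[values[c::channels] for c in range(channels)]))
    return sample_rate, bits_per_sample, channels, frames
```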
modern audio compression algorithms leverage knowledge of how people perceive sound to discard information which is indiscernible (lossy), as opposed to lossless algorithms which retain all the informational load of the source ... Opus (http://opus-codec.org/) is a current favorite codec, untainted by patents and open source

Why review compositing work in MJPEG videos rather than (say) H.264?

I have received a request to encode DPX files to MOV/MJPEG rather than MOV/H.264 (which ffmpeg picks by default if you convert to output.mov). This is to review compositing renders (in motion), so color accuracy is critical.
Comparing a sample "ideal" MOV to the current (H.264) output I can see:
resolution: the same
ColorSpace/Primaries: Rec601 (SD) versus Rec709 (HD)
YUV: 4:2:0 versus 4:4:4
filesize: smaller
The ffmpeg default seems to be better quality and result in a smaller filesize. Is there something I'm missing?
Maybe it's because MJPEG frames are independent of each other, so any snippet of video can be decoded or copied in isolation. With an inter-frame compression algorithm like H.264, the software may have to scan data from numerous other frames to reconstruct any given one.

How to know the bit depth of a mp3 file?

An MP3 header only contains the sample rate and bit rate, so the decoder can't figure out the bit depth from the header. Maybe it can only guess from the bit rate? But the bit rate varies from frame to frame.
Here is another way to ask this question: if I encode a 24-bit WAV to MP3, how is the 24-bit info stored in the MP3?
When the source WAV is compressed, the original bit depth information is "thrown away". This is by design in any compressed audio codec since the whole point is to use the least bits possible to store the "same" audio.
Internally, MP3 uses Huffman symbols to store the processed audio data. As such, there's no real "bit depth" to report.
During the encoding process, the samples are quantized, so the original bit depth information is lost.
MP3 decoders either choose a bit depth they operate at, or allow the end user/application to dictate it. The bit depth is determined during "re-quantization".
Have a read of http://blog.bjrn.se/2008/10/lets-build-mp3-decoder.html, which is rather enlightening.
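To illustrate the re-quantization point, here is a minimal sketch assuming the decoder's internal samples are floats normalised to -1.0..1.0; the output bit depth is whatever the caller asks for, not something stored in the MP3:

```python
decoded = [0.0, 0.25, -0.5, 1.0]   # example decoder output, normalised floats

def requantize(samples, bits=16):
    """Scale normalised floats (-1.0 .. 1.0) to signed integers of the
    requested bit depth. The MP3 stream itself carries no bit depth;
    the choice is made here, at decode time."""
    peak = 2 ** (bits - 1) - 1
    return [max(-peak - 1, min(peak, round(s * peak))) for s in samples]

print(requantize(decoded, bits=16))   # [0, 8192, -16384, 32767]
print(requantize(decoded, bits=24))
```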

Correctly decoding/encoding raw PCM data

I'm writing my WAVE decoder/encoder in C++. I've managed to correctly convert between different sample sizes (8, 16 and 32), but I need some help with the channels and the frequency.
Channels:
If I want to convert from stereo to mono:
do I just take the data from one channel (which one? 1 or 2?)?
or do I take the average of channels 1 and 2 for the mono channel?
If I want to convert from mono to stereo:
(I know this is not very scientific)
can I simply put the samples from the single channel into both stereo channels?
is there a more scientific method to do this (eg: interpolation)?
Sample rate:
How do I change the sample rate (resample), eg: from 44100 Hz to 22050 Hz:
do I simply take the average of 2 sequential samples for the new (lower frequency) value?
Any more scientific algorithms for this?
Stereo to mono - take the mean of the left and right samples, i.e. M = (L + R) / 2 - this works for the vast majority of stereo content, but note that there are some rare cases where you can get left/right cancellation.
Mono to stereo - put the mono sample in both left and right channels, i.e. L = R = M - this gives a sound image which is centered when played as stereo
Resampling - for a simple integer ratio downsampling as in your example above, the process is:
low pass filter to accommodate new Nyquist frequency, e.g. 10 kHz LPF for 22.05 kHz sample rate
decimate by required ratio (i.e. drop alternate samples for your 2x downsampling example)
Note that there are third-party libraries such as libsamplerate which can handle resampling for you in the general case, so if you have more than one ratio to support, or some tricky non-integer ratio, then this might be a better approach. A rough sketch of the three operations above follows.
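Here is a rough Python sketch of those three operations, assuming per-channel lists of numeric samples; the 2-tap average stands in for a proper low-pass filter design:

```python
def stereo_to_mono(left, right):
    """M = (L + R) / 2 for each sample instant."""
    return [(l + r) / 2 for l, r in zip(left, right)]

def mono_to_stereo(mono):
    """Duplicate the mono signal into both channels (centered stereo image)."""
    return list(mono), list(mono)

def downsample_by_2(samples):
    """Crude 2x downsampling: a 2-tap average stands in for a proper
    low-pass filter, then alternate samples are dropped (decimation).
    For production use, prefer a real filter design or libsamplerate."""
    filtered = [(samples[i] + samples[i + 1]) / 2
                for i in range(len(samples) - 1)]
    return filtered[::2]

left = [0.0, 0.2, 0.4, 0.6]
right = [0.0, -0.2, -0.4, -0.6]
print(stereo_to_mono(left, right))        # [0.0, 0.0, 0.0, 0.0]
print(downsample_by_2([0, 1, 2, 3, 4]))   # [0.5, 2.5]
```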
