Are DC and AC coefficient magnitudes not Huffman compressed in JPEG?

According to what I have read:
For the DC coefficient of each block, we create a byte storing the difference magnitude category as shown in Annex F Table F.1 of ITU-T T.81. The actual DC coefficient (which stores a difference from the previous block's DC value) is stored as raw bits following this Huffman-coded magnitude category byte.
Similarly for AC coefficients,
AC coefficients are first encoded as zero-run-lengths followed by nonzero values. Then we Huffman encode bytes whose upper 4 bits are the zero-run length and whose lower 4 bits are the AC coefficient magnitude category, as shown in Annex F Table F.2 of ITU-T T.81. Each Huffman-encoded run-length/magnitude-category byte is followed by raw bits that contain the actual AC coefficient value.
My question is fundamentally this: in both cases, why do we store unencoded, uncompressed raw bits for the coefficients while the magnitude category information is Huffman encoded? WHY? This makes no sense to me.

Here's another way of looking at it. When you compress bit values of variable length you need to encode the number of bits and the bits themselves. The coefficient lengths have a relatively small range of values while the coefficients have a wide range of values.
If you were to Huffman encode the coefficient values themselves, the code lengths could be quite large and the tables hard to manage.
JPEG then Huffman encodes the length part of the coefficients but not the coefficients themselves. Half the data gets compressed at this stage.
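To make that concrete, here is a minimal Python sketch (the function names are mine, not from the spec) of how a DC difference is split into a Huffman-coded magnitude category plus appended raw bits, following the convention of Annex F of ITU-T T.81:
def magnitude_category(diff):
    # Table F.1: SSSS is the bit length of |diff| (0 when diff == 0)
    return abs(diff).bit_length()

def extra_bits(diff, ssss):
    # Raw bits appended after the Huffman code for SSSS.
    # Positive values are sent as-is; negative values are sent as the
    # low SSSS bits of diff - 1 (the one's complement of |diff|).
    if ssss == 0:
        return ""
    if diff < 0:
        diff -= 1
    return format(diff & ((1 << ssss) - 1), "0{}b".format(ssss))

for d in (0, 1, -1, 5, -5, 127):
    s = magnitude_category(d)
    print(d, s, extra_bits(d, s))   # e.g. -5 -> category 3, raw bits "010"
Only the small SSSS alphabet goes through the Huffman coder; the raw bits are emitted verbatim, exactly as the question describes.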

It does make sense to store raw bits in these situations.
When the data you're trying to compress is close enough to 'random' (a flat/uniform probability distribution), entropy coding will not give you much coding gain. This is particularly true for a simple entropy coding method such as a Huffman encoder. In this case, skipping entropy coding gives you similar compression ratios and reduces the time complexity.
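A toy Python sketch of why this matters here (the Gaussian spread of DC differences is made up purely for illustration): the category symbols are heavily skewed, so Huffman coding helps, while the appended raw bits are close to uniform, so it would not.
import collections, math, random

def entropy_bits_per_symbol(symbols):
    # Shannon entropy of the empirical symbol distribution
    counts = collections.Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

random.seed(0)
# Toy model: small DC differences are far more common than large ones,
# so the magnitude categories are heavily skewed...
categories = [abs(round(random.gauss(0, 4))).bit_length() for _ in range(100000)]
# ...while the raw bits appended within a category look essentially random.
raw_bits = [random.getrandbits(1) for _ in range(100000)]

print(entropy_bits_per_symbol(categories))  # well below log2(12): Huffman coding helps
print(entropy_bits_per_symbol(raw_bits))    # about 1.0 bit per bit: nothing to gain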

The way I see it, classifying the DC difference magnitudes into these "buckets" splits each value into a category symbol drawn from a small alphabet (DC Huffman tables encode at most 12 values), which Huffman coding compresses into a short code, followed by a string of at most 11 raw bits whose values within a category are roughly uniformly distributed.
The alternative would have been to Huffman encode the full DC coefficient difference directly. Since those values are unlikely to repeat, doing so would produce a different Huffman code for nearly every one, which wouldn't yield much compression gain.
My guess is that people writing the spec did experimental testing on some image data set and concluded 12 magnitude categories yielded good enough compression. They probably also tested what you say about being agnostic to the data format and came to the conclusion their method compressed images better. So far, I haven't read the papers backing up the specification, but maybe this experimental data can be found there.
Note: When using 12 bit sample precision, there would be 16 magnitude categories, but they can still be encoded with at most 4 bits using Huffman coding.

Related

How is a 24-bit audio stream delivered to the graph?

This is probably a very silly question, but after searching for a while, I couldn't find a straight answer.
If a source filter (such as the LAV Audio codec) is processing a 24-bit integral audio stream, how are individual audio samples delivered to the graph?
(for simplicity, let's consider a monophonic stream)
Are they stored individually on a 32-bit integer with the most-significant bits unused, or are they stored in a packed form, with the least significant bits of the next sample occupying the spare, most-significant bits of the current sample?
The format is similar to 16-bit PCM: the values are signed integers, little endian.
With 24-bit audio you normally define the format with the help of WAVEFORMATEXTENSIBLE structure, as opposed to WAVEFORMATEX (well, the latter is also possible in terms of being accepted by certain filters, but in general you are expected to use the former).
The structure has two values: number of bits per sample and number of valid bits per sample. So it's possible to have the 24-bit data represented as 24-bit values, and also as 24-bit meaningful bits of 32-bit values. The payload data should match the format.
There is no mix of bits of different samples within a byte:
However, wBitsPerSample is the container size and must be a multiple of 8, whereas wValidBitsPerSample can be any value not exceeding the container size. For example, if the format uses 20-bit samples, wBitsPerSample must be at least 24, but wValidBitsPerSample is 20.
To my best knowledge it's typical to have just 24-bit values, that is three bytes per PCM sample.
Non-PCM formats might define different packing and use the "unused" bits more efficiently, so that, for example, two samples of 20-bit audio consume 5 bytes.
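For illustration, here is a small Python sketch (not tied to DirectShow or LAV specifically) of reading the packed "three bytes per sample" layout into plain integers:
def unpack_s24le(raw):
    # Packed little-endian signed 24-bit PCM: 3 bytes per sample, no bit sharing
    samples = []
    for i in range(0, len(raw) - 2, 3):
        samples.append(int.from_bytes(raw[i:i + 3], "little", signed=True))
    return samples

print(unpack_s24le(b"\x01\x00\x00\xff\xff\xff"))   # [1, -1]
The "24 valid bits in a 32-bit container" variant would instead be read four bytes at a time, with the extra byte carrying sign extension or padding rather than bits of the next sample.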

Why can a textual representation of pi be compressed?

A random string should be incompressible.
pi = "31415..."
pi.size # => 10000
XZ.compress(pi).size # => 4540
A random hex string also gets significantly compressed. A random byte string, however, does not get compressed.
The string of pi only contains the bytes 48 through 57. With a prefix code on the integers, this string can be heavily compressed. Essentially, I'm wasting space by representing my 10 different characters in whole bytes (or 16 different characters, in the case of the hex string). Is this what's going on?
Can someone explain to me what the underlying method is, or point me to some sources?
It's a matter of information density. Compression is about removing redundant information.
In the string "314159", each character occupies 8 bits, and can therefore have any of 2^8, or 256, distinct values, but only 10 of those values are actually used. Even a painfully naive compression scheme could represent the same information using 4 bits per digit; this is known as Binary Coded Decimal. More sophisticated compression schemes can do better than that (a decimal digit is effectively log2(10), or about 3.32, bits), but at the expense of storing some extra information that allows for decompression.
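A sketch of that "painfully naive" scheme, packing decimal digits two per byte in Python:
def pack_bcd(digits):
    # Binary Coded Decimal: two decimal digits per byte, 4 bits each
    if len(digits) % 2:
        digits += "0"   # pad to an even number of digits
    return bytes((int(digits[i]) << 4) | int(digits[i + 1])
                 for i in range(0, len(digits), 2))

print(len(pack_bcd("3141592653589793")))   # 16 digits fit in 8 bytes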
In a random hexadecimal string, each 8-bit character has 4 meaningful bits, so compression by nearly 50% should be possible. The longer the string, the closer you can get to 50%. If you know in advance that the string contains only hexadecimal digits, you can compress it by exactly 50%, but of course that loses the ability to compress anything else.
In a random byte string, there is no opportunity for compression; you need the entire 8 bits per character to represent each value. If it's truly random, attempting to compress it will probably expand it slightly, since some additional information is needed to indicate that the output is compressed data.
Explaining the details of how compression works is beyond both the scope of this answer and my expertise.
In addition to Keith Thompson's excellent answer, there's another point that's relevant to LZMA (which is the compression algorithm that the XZ format uses). The number pi does not consist of a single repeating string of digits, but neither is it completely random. It does contain substrings of digits which are repeated within the larger sequence. LZMA can detect these and store only a single copy of the repeated substring, reducing the size of the compressed data.
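You can reproduce the overall effect with Python's standard lzma module (the same LZMA algorithm the XZ format uses); the exact sizes will vary, but the shape of the result should match the question:
import lzma, random

random.seed(0)
digits = "".join(random.choice("0123456789") for _ in range(10000)).encode()
noise = bytes(random.randrange(256) for _ in range(10000))

print(len(lzma.compress(digits)))   # far below 10000: only 10 of 256 byte values occur
print(len(lzma.compress(noise)))    # slightly above 10000: no redundancy, only overhead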

How can an audio wave be represented in a long array of floats?

In my application I'm using the sound library Beads (this question isn't specifically about that library).
In the library there's a class WavePlayer. It takes a Buffer, and produces a sound wave by iterating over the Buffer.
Buffers simply wrap a float[].
For example, here's a beginning of a buffer:
0.0 0.0015339801 0.0030679568 0.004601926 0.0061358847 0.007669829 0.009203754 0.010737659 0.012271538 0.0138053885 0.015339206 0.016872987 0.01840673 0.019940428 0.02147408 ...
Its size is 4096 float values.
Iterating over it with a WavePlayer creates a smooth "sine wave" sound. (This buffer is actually a ready-made 'preset' in the Buffer class, i.e. Buffer.SINE).
My question is:
What kind of data does a buffer like this represent? What kind of information does it contain that allows one to iterate over it and produce an audio wave?
Read this post: What's the actual data in a WAV file?
Sound is just a curve. You can represent this curve using integers or floats.
There are two important aspects: bit-depth and sample-rate. First let's discuss bit-depth. Each number in your list (int/float) represents the height of the sound curve at a given point in time. For simplicity, when using floats the values typically vary from -1.0 to +1.0, whereas integers may vary from, say, 0 to 2^16. Importantly, each of these numbers must be stored into a sound file or audio buffer in memory; the resolution/fidelity you choose to represent each point of this curve influences the audio quality and resultant sound file size. A low fidelity recording may use 8 bits of information per curve height measurement. As you climb the fidelity spectrum, 16 bits, 24 bits, ... are dedicated to storing each curve height measurement. More bits equates to more significant digits for floats or a broader range of integers (16 bits means you have 2^16 integers, 0 to 65535, to represent the height of any given curve point).
Now to the second aspect, sample-rate. As you capture/synthesize sound, in addition to measuring the curve height, you must decide how often you measure (sample) the curve height. Typical CD quality records (samples) the curve height 44100 times per second, so the sample-rate would be 44.1 kHz. Lower fidelity would sample less often; ultra fidelity would sample at, say, 96 kHz or more. So the combination of curve height measurement fidelity (bit-depth) coupled with how often you perform this measurement (sample-rate) together defines the quality of sound synthesis/recording.
As with many things, these two attributes should be in balance: if you change one you should change the other. So if you lower the sample rate you are reducing the information load and so are lowering the audio fidelity; once you have done this you can then lower the bit depth as well without further compromising fidelity.
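For reference, the Buffer.SINE data quoted in the question is just one cycle of a sine wave sampled 4096 times. A Python sketch that reproduces it and shows how an oscillator could step through it (the 440 Hz and 44.1 kHz figures are only an example):
import math

SIZE = 4096
sine = [math.sin(2 * math.pi * n / SIZE) for n in range(SIZE)]
print(sine[:3])   # 0.0, 0.0015339801..., 0.0030679567... - matches the quoted buffer

# A WavePlayer-style oscillator walks this table at a rate set by the desired
# pitch and the output sample rate, wrapping around at the end of the table.
sample_rate, freq = 44100.0, 440.0
step = SIZE * freq / sample_rate   # table positions to advance per output sample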

How can a jpeg encoder become more efficient

Earlier I read about mozjpeg. A project from Mozilla to create a jpeg encoder that is more efficient, i.e. creates smaller files.
As I understand (jpeg) codecs, a jpeg encoder would need to create files that use an encoding scheme that can also be decoded by other jpeg codecs. So how is it possible to improve the codec without breaking compatibility with other codecs?
Mozilla does mention that the first step for their encoder is to add functionality that can detect the most efficient encoding scheme for a certain image, which would not break compatibility. However, they intend to add more functionality, first of which is "trellis quantization", which seems to be a highly technical algorithm to do something (I don't understand).
I'm also not entirely sure this question belongs on Stack Overflow; it might also fit Super User, since the question is not specifically about programming. So if anyone feels it should be on Super User, feel free to move this question.
JPEG is somewhat unique in that it involves a series of compression steps. There are two that provide the most opportunities for reducing the size of the image.
The first is sampling. In JPEG one usually converts from RGB to YCbCr. In RGB, each component carries equal weight. In YCbCr, the Y component is much more important than the Cb and Cr components. If you sample the latter at 4 to 1, a 4x4 block of pixels gets reduced from 16+16+16 values to 16+1+1. Just by sampling you have reduced the amount of data to be compressed to a little over a third of its original size.
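A tiny sketch of that arithmetic (just the averaging idea, not a full encoder): the Y block keeps all 16 values, while each 4x4 chroma block collapses to one.
def average_4x4(block):
    # Replace a 4x4 block of chroma samples with a single averaged value
    return sum(sum(row) for row in block) / 16.0

cb_block = [[130.0, 131.0, 129.0, 130.0]] * 4
print(average_4x4(cb_block))   # 16 Cb samples -> 1 value; same for Cr; Y stays at 16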
The other is quantization. You take the sampled pixel values, divide them into 8x8 blocks, and perform the Discrete Cosine Transform on them. For 8-bit samples this takes 8x8 blocks of 8-bit data and converts them to 8x8 blocks of 16-bit data (expansion rather than compression at that point).
The DCT process tends to produce larger values in the upper left corner (the low frequencies) and values close to zero towards the lower right corner (the high frequencies). The upper left coefficients are more valuable than the lower right coefficients.
The 16-bit values are then "quantized" (division, in plain English).
The compression process defines an 8x8 quantization matrix. Each DCT coefficient is divided by the corresponding entry in the quantization matrix. Because this is integer division, the small values go to zero. Long runs of zero values are combined using run-length compression. The more consecutive zeros you get, the better the compression.
Generally, the quantization values are much higher towards the lower right than in the upper left. You try to force these DCT coefficients to zero unless they are very large.
This is where much of the loss (not all of it though) comes from in JPEG.
The trade off is to get as many zeros as you can without noticeably degrading the image.
The choice of quantization matrices is the major factor in compression. Most JPEG libraries present a "quality" setting to the user. This translates into the selection of quantization matrices in the encoder. If someone could devise better quantization matrices, you could get better compression.
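A toy Python sketch of the quantization step (real JPEG uses 8x8 blocks and standard or custom tables; the numbers here are made up just to show small coefficients collapsing to zero):
def quantize(dct_block, q_table):
    # Divide each DCT coefficient by the matching table entry and round;
    # small (mostly high-frequency) coefficients become zero
    return [[round(d / q) for d, q in zip(drow, qrow)]
            for drow, qrow in zip(dct_block, q_table)]

dct = [[-415, 12], [7, -2]]
q = [[16, 11], [12, 14]]
print(quantize(dct, q))   # [[-26, 1], [1, 0]] - the smallest coefficient is now 0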
This book explains the JPEG process in plain English:
http://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=sr_1_1?ie=UTF8&qid=1394252187&sr=8-1&keywords=0201604434
JPEG provides you multiple options. E.g. you can use the standard Huffman tables or you can generate Huffman tables optimal for a specific image. The same goes for quantization tables. You can also switch to arithmetic coding instead of Huffman coding for entropy encoding; the patents covering arithmetic coding as used in JPEG have expired. All of these options are lossless (no additional loss of data). One of the options used by Mozilla is progressive JPEG compression instead of baseline JPEG compression. You can play with how many frequencies you include in each scan (SS, spectral selection) as well as the number of bits used for each frequency (SA, successive approximation). Consecutive scans will add further frequencies and/or additional bits for each frequency. Again, all of these different options are lossless. For the standard images used for JPEG testing, switching to progressive encoding improved compression from 41 KB per image to 37 KB. But that is just for one setting of SS and SA. Given the speed of computers today, you could automatically try many, many different options and choose the best one.
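If you want to try this yourself, Pillow exposes some of these switches; this sketch assumes Pillow is installed and that "photo.png" stands in for whatever test image you have at hand, and the resulting sizes will of course differ per image:
import io
from PIL import Image

def jpeg_size(img, **options):
    # Encode to JPEG in memory and return the resulting byte count
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=75, **options)
    return buf.tell()

img = Image.open("photo.png").convert("RGB")
print(jpeg_size(img))                                    # baseline, standard Huffman tables
print(jpeg_size(img, optimize=True))                     # Huffman tables optimized per image
print(jpeg_size(img, optimize=True, progressive=True))   # progressive scans on top of that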
Although hardly used, the original JPEG standard also had a lossless mode. There were 7 different choices of predictor. Today you could compress using each of the 7 choices and pick the best one. Use the same principle as what I outlined above. And remember, none of them incurs additional loss of data. Switching between them is lossless.

Working with 24-bit audio samples

What is the "standard way" of working with 24-bit audio? Well, there are no 24-bit data types available, really. Here are the methods that come into my mind:
1. Represent 24-bit audio samples as 32-bit ints and ignore the upper eight bits.
2. Just like (1) but ignore the lower eight bits.
3. Represent 24-bit audio samples as 32-bit floats.
4. Represent the samples as structs of 3 bytes (acceptable for C/C++, but bad for Java).
How do you work this out?
Store them as 32- or 64-bit signed ints, floats, or doubles unless you are space-conscious and care about packing them into the smallest space possible.
Audio samples often appear as 24 bits to and from audio hardware since this is commonly the resolution of the DACs and ADCs, although on most computer hardware, don't be surprised to find the bottom 3 or 4 bits banging away randomly with noise.
Digital signal processing operations, which is what usually happens downstream of the acquisition of samples, all involve the addition of weighted sums of samples. A sample stored in an integer type can be considered to be fixed-point binary with an implied binary point at some arbitrary position, which you can choose strategically to maintain as many bits of precision as possible.
For instance, the sum of two 24-bit integers yields a 25-bit result. With only 8 bits of headroom in a 32-bit type, accumulating on the order of 2^8 samples would overflow it, and you would need to re-normalize by rounding and shifting right.
Therefore, if you're using integer types to store your samples, use the largest you can and start with the samples in the least significant 24 bits.
Floating point types of course take care of this detail for you, although you get less choice about when renormalisation takes place. They are the usual choice for audio processing where hardware support is available. A single precision float has a 24-bit mantissa, so can hold a 24-bit sample without loss of precision.
Usually floating point samples are stored in the range -1.0f < x < 1.0f.
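A small sketch of that convention (the exact scaling and clipping rules vary slightly between APIs, so treat this as one common choice rather than a standard):
FULL_SCALE = float(1 << 23)   # 2**23, full scale of a signed 24-bit sample

def s24_to_float(sample):
    # Map a signed 24-bit integer (-2**23 .. 2**23 - 1) into roughly -1.0 .. +1.0
    return sample / FULL_SCALE

def float_to_s24(x):
    # Map back to a signed 24-bit integer, clipping at full scale
    return max(-(1 << 23), min((1 << 23) - 1, int(round(x * FULL_SCALE))))

print(s24_to_float(0x7FFFFF))   # just under +1.0
print(float_to_s24(-0.5))       # -4194304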

Resources