Please explain how the -scans file works in libjpeg.
In progressive JPEG encoding there is a practically infinite number of possibilities for how the image can be encoded. The complexity is too great to lend itself to parameter passing or command-line arguments, so libjpeg lets you specify a file that describes how the encoding should be done.
In sequential JPEG, each component is encoded in a single scan. A scan can contain multiple components, in which case it is "interleaved".
In progressive JPEG, each component is encoded in 2 or more scans. As in sequential JPEG, a scan may or may not be interleaved.
The DCT produces 64 coefficients. The first is referred to as the "DC" coefficient. The others are the "AC" coefficients.
A progressive scan can divide the DCT data up in two ways:
1. By coefficient range (aka spectral selection). This can be either the DC coefficient or a range of contiguous AC coefficients. (You must send some DC data before sending any AC).
2. By sending the bits of the coefficients in different scans (called successive approximation)
Your choices in a scan are then:
1. Which components
2. Spectral selection (0 or a range within 1 .. 63)
3. Successive approximation (a range within 0 .. 13)
There are semantic rules as well. You must have a DC scan for each component before an AC scan. You cannot send any data twice.
If you have a grayscale image (one component), you could send the image in as many as 64 * 14 = 896 separate scans or as few as two.
There are so many choices that Libjpeg uses a file to specify them.
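To make this concrete, here is what such a scan script can look like, modeled on the sample color script in libjpeg's wizard.txt (check wizard.txt in your libjpeg distribution for the authoritative syntax and script; treat this as an illustration, not the only sensible breakdown). Each entry lists the component indexes, the spectral-selection range Ss-Se, and the successive-approximation bit positions Ah and Al, terminated by a semicolon:

```
# each scan: component-indexes : Ss-Se, Ah, Al ;
0,1,2: 0-0,   0, 1 ;   # DC of Y,Cb,Cr, minus the lowest bit
0:     1-5,   0, 2 ;   # first few Y AC coefficients, minus two low bits
2:     1-63,  0, 1 ;   # all Cr AC coefficients, minus the lowest bit
1:     1-63,  0, 1 ;   # all Cb AC coefficients, minus the lowest bit
0:     6-63,  0, 2 ;   # remaining Y AC coefficients, minus two low bits
0:     1-63,  2, 1 ;   # next-to-lowest bit of Y AC coefficients
0,1,2: 0-0,   1, 0 ;   # lowest bit of all DC coefficients
2:     1-63,  1, 0 ;   # lowest bit of Cr AC coefficients
1:     1-63,  1, 0 ;   # lowest bit of Cb AC coefficients
0:     1-63,  1, 0 ;   # lowest bit of Y AC coefficients
```

You pass such a file to cjpeg with the -scans option.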
I checked the Wikipedia page about the JPEG encoding algorithm and roughly understood it:
Transform to YCbCr, downsample
Split into 8x8 blocks
Apply DCT on blocks
Divide resulting matrices by quantization table entries
Entropy encoding
I understand that the quantization table in a file depends on what created the image: a camera manufacturer likely has its own proprietary QT algorithms, Photoshop and other editors have their own QTs, there are public ones, etc.
Now, if one opens 'real' JPEG files they may contain several quantization tables. How can this be? I'd assume the decoding algorithm looks like this:
Decode entropy encoding, receive blocks
Multiply blocks by quantization table entries
Revert the other operations
What does the second/third/... QT do/when is it used? Is there an upper limit on the number of QTs in a JPEG file? When does it happen that a second QT is added to a JPEG file?
The quantization tables are used for different color components.
As you already know, in a first step the image is transformed into the YCbCr color space. In this color space you have three components: luminance (Y), chrominance blue (Cb), and chrominance red (Cr). As the human eye is less sensitive to color but very sensitive to brightness, different quantization tables are used for the different components.
The quantization table used for the luminance consists of "lower" values, so that the dividing and rounding do not lose too much information in this component. Blue and red, on the other hand, get "higher" values, because that information is not needed as much.
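If you want to see this for yourself, Pillow exposes the tables of a parsed JPEG through the quantization attribute of a JpegImageFile; this is a minimal sketch, and the file name is just a placeholder:

```python
from PIL import Image

im = Image.open("photo.jpg")   # any JPEG you have

# Dict mapping table id -> list of 64 quantization values.
# Typically table 0 is used for luminance and table 1 for chrominance.
for table_id, values in im.quantization.items():
    print("Quantization table", table_id)
    for row in range(8):
        print(values[row * 8:(row + 1) * 8])
```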
Goal: I want to grab the best frame from an animated GIF and use it as a static preview image. I believe the best frame is one that shows the most content - not necessarily the first or last frame.
Take this GIF for example:
This is the first frame:
Here is the 28th frame:
It's clear that the 28th frame represents the entire GIF well.
How could I programmatically determine whether one frame has more pixels/content than another? Any thoughts, ideas, packages/modules, or articles that you can point me to would be greatly appreciated.
One straightforward way this could be accomplished would be to estimate the entropy of each image and choose the frame with maximal entropy.
In information theory, entropy can be thought of as the "randomness" of the image. An image of a single color is very predictable; the flatter the pixel distribution, the more random the image. This is closely related to the compression method described by Arthur-R, as entropy is the lower bound on how much data can be losslessly compressed.
Estimating Entropy
One way to estimate the entropy is to approximate the probability mass function for pixel intensities using a histogram. To generate the plot below I first convert the image to grayscale, then compute the histogram using a bin spacing of 1 (for pixel values from 0 to 255). Then, normalize the histogram so that the bins sum to 1. This normalized histogram is an approximation of the pixel probability mass function.
Using this probability mass function we can easily estimate the entropy of the grayscale image which is described by the following equation
H = E[-log(p(x))]
Where H is entropy, E is the expected value, and p(x) is the probability that any given pixel takes the value x.
Programmatically, H can be estimated by simply computing -p(x)*log(p(x)) for each value p(x) in the histogram and then adding them together (using a base-2 logarithm gives the entropy in bits).
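As a rough sketch of that calculation in Python (using Pillow and NumPy; the GIF path and function name are my own placeholders):

```python
import numpy as np
from PIL import Image, ImageSequence

def frame_entropy(frame):
    """Estimate entropy (bits/pixel) of one frame from its grayscale histogram."""
    gray = np.asarray(frame.convert("L"))
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()          # normalized histogram ~ probability mass function
    p = p[p > 0]                   # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

gif = Image.open("animation.gif")  # placeholder path
# Note: some GIFs store only the changed region per frame, so you may want
# to composite frames onto the previous one before measuring.
entropies = [frame_entropy(f) for f in ImageSequence.Iterator(gif)]
best = int(np.argmax(entropies))   # index of the highest-entropy frame
```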
A plot of entropy vs. frame number for your example shows frame 21 (the 22nd frame) having the highest entropy.
Observations
The entropy computed here is not equal to the true entropy of the image, because it assumes that each pixel is independently sampled from the same distribution. To get the true entropy we would need to know the joint distribution of the image, which we can't know without understanding the underlying random process that generated the images (which would include human interaction). However, I don't think the true entropy would be very useful anyway, and this measure should give a reasonable estimate of how much content is in the image.
This method will fail if some not-so-interesting frame contains much more noise (randomly colored pixels) than the most interesting frame, because noise results in high entropy. For example, the following image is pure uniform noise and therefore has maximum entropy (H = 8 bits), i.e. no compression is possible.
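To see that numerically, here is a quick NumPy check (using the same 200x200 size as the noise image discussed below) that a uniform-noise image's histogram-based entropy estimate comes out at essentially 8 bits per pixel:

```python
import numpy as np

# Every value 0..255 equally likely, so the histogram is flat.
noise = np.random.randint(0, 256, size=(200, 200), dtype=np.uint8)

p = np.bincount(noise.ravel(), minlength=256) / noise.size
p = p[p > 0]
print(-np.sum(p * np.log2(p)))   # ~7.99 bits per pixel
```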
Ruby Implementation
I don't know Ruby, but it looks like one of the answers to this question refers to a package for computing the entropy of an image.
From m. simon borg's comment
FWIW, using Ruby's File.size() returns 1904 bytes for the 28th frame image and 946 bytes for the first frame image – m. simon borg
File.size() should be roughly proportional to entropy.
As an aside, if you check the size of the 200x200 noise image on disk you will see that the file is 40,345 bytes even after compression, but the uncompressed data is only 40,000 bytes. Information theory tells us that no compression scheme can ever losslessly compress such images on average.
There are a couple of ways I might go about this. My first thought (this may not be the most practical solution, but it seems theoretically interesting!) would be to losslessly compress each frame; in theory, the frame with the least repeatable content (and thus the most unique content) would have the largest compressed size, so you could compare the size in bytes/bits of each compressed frame. The accuracy of this solution would probably depend heavily on the image passed in.
A more realistic/practical solution might be to grab the predominant color in the GIF (in the example, the background color) and then iterate through each pixel of every frame, incrementing a counter each time the color of the current pixel doesn't match the background color.
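A minimal sketch of that idea with Pillow (the file name is a placeholder, and the predominant color is estimated from the first frame):

```python
from collections import Counter
from PIL import Image, ImageSequence

def non_background_count(frame, background):
    """Count pixels that differ from the (assumed) background color."""
    return sum(1 for p in frame.convert("RGB").getdata() if p != background)

gif = Image.open("animation.gif")                            # placeholder path
first = gif.convert("RGB")
background = Counter(first.getdata()).most_common(1)[0][0]   # predominant color

# Note: frames that only store the changed region may need compositing first.
counts = [non_background_count(f, background) for f in ImageSequence.Iterator(gif)]
best = counts.index(max(counts))                             # frame with most "content"
```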
I'm thinking about some more optimized/sample-based solutions, and will edit my response to include them a little later if performance is a concern for you.
I think you could use an API, such as a RESTful web service, to do that, because doing it without one is quite hard.
For example, these are some well-known APIs:
https://cloud.google.com/vision/
https://www.clarifai.com/
https://vize.ai
https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/
https://imagga.com
Given n images numbered 1 to n, where n is not known in advance, I can calculate a property of every image which is a scalar quantity. Now I have to represent this property of all the images in a fixed-size vector (say 5 or 10 elements).
One naive approach could be the vector [avg, max, min, std_deviation].
And I also want to include the effect of the relative positions of the images.
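A small NumPy sketch of that naive summary vector, plus one possible way (my own assumption, not something from the question) to keep relative-position information by resampling the sequence of values to a fixed length:

```python
import numpy as np

# values[i] is the scalar property computed for image i (made-up numbers).
values = np.array([3.1, 4.0, 2.7, 5.5, 4.9, 3.3, 6.2])

# The naive fixed-size summary from the question.
summary = np.array([values.mean(), values.max(), values.min(), values.std()])

# Resample the sequence to a fixed length so relative positions survive.
fixed_len = 5
positions = np.linspace(0, len(values) - 1, fixed_len)
resampled = np.interp(positions, np.arange(len(values)), values)

feature_vector = np.concatenate([summary, resampled])   # fixed size: 4 + 5 = 9
```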
What you are looking for is called feature extraction.
There are many techniques for this. For images, try:
PCA
Auto-encoders
Convolutional auto-encoders
You could also look into conventional (older) methods like SIFT, HOG, and edge detection, but they will all need an extra step to reduce their output to a smaller fixed size.
According to what I have read:
For the DC coefficient of each block, we create a byte storing the difference magnitude category, as shown in Annex F Table F.1 of ITU-T T.81. The actual DC coefficient (which stores a difference) is stored as raw bits following this Huffman-coded magnitude-category byte.
Similarly for AC coefficients,
AC coefficients are first encoded with zero run lengths. Then we Huffman-encode bytes whose upper 4 bits are the zero run length and whose lower 4 bits are the AC coefficient magnitude category, as shown in Annex F Table F.2 of ITU-T T.81. The Huffman-encoded byte containing the zero run length and magnitude category is followed by raw bits containing the actual AC coefficient magnitude.
My question is fundamentally this: in both cases, why do we store unencoded, uncompressed raw bits for the coefficients while the magnitude category information is Huffman-encoded? Why? This makes no sense to me.
Here's another way of looking at it. When you compress bit values of variable length you need to encode the number of bits and the bits themselves. The coefficient lengths have a relatively small range of values while the coefficients have a wide range of values.
If you were to Huffman-encode the coefficient values themselves, the code lengths could be quite large and the tables hard to manage.
JPEG then Huffman encodes the length part of the coefficients but not the coefficients themselves. Half the data gets compressed at this stage.
It does make sense to store raw bits in these situations.
When the data you're trying to compress is close to 'random' (a flat/uniform probability distribution), entropy coding will not give you much coding gain. This is particularly true for a simple entropy coding method such as a Huffman encoder. In this case, skipping entropy coding gives you similar compression ratios and reduces the time complexity.
The way I see it, classifying the DC difference magnitudes into these "buckets" splits each value into a category byte that the Huffman coder can represent in just a few bits (DC Huffman coding tables encode at most 12 possible values), followed by a string of at most 11 raw bits whose values are close to uniformly distributed.
The other alternative could've been to use Huffman encoding directly on the full DC coefficient difference. If the values are unlikely to repeat, doing this would produce a different Huffman code for each one, which wouldn't produce much compression gains.
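To make the split concrete, here is a small Python sketch of the category / raw-bits encoding of a DC difference as Annex F describes it (the Huffman coding of the category byte itself is left out, and the function names are mine):

```python
def dc_magnitude_category(diff):
    """Magnitude category (SSSS): the number of bits needed for |diff|.
    Category k covers magnitudes 2^(k-1) .. 2^k - 1; category 0 is diff == 0."""
    return abs(diff).bit_length()

def dc_raw_bits(diff):
    """The raw bits appended after the Huffman-coded category.
    Positive values are sent as-is; negative values as diff + 2^SSSS - 1."""
    category = dc_magnitude_category(diff)
    if category == 0:
        return ""                       # category 0 has no appended bits
    if diff < 0:
        diff += (1 << category) - 1     # e.g. -3 -> "00" in category 2
    return format(diff, "0{}b".format(category))

print(dc_magnitude_category(5), dc_raw_bits(5))    # 3 101
print(dc_magnitude_category(-5), dc_raw_bits(-5))  # 3 010
```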
My guess is that people writing the spec did experimental testing on some image data set and concluded 12 magnitude categories yielded good enough compression. They probably also tested what you say about being agnostic to the data format and came to the conclusion their method compressed images better. So far, I haven't read the papers backing up the specification, but maybe this experimental data can be found there.
Note: When using 12 bit sample precision, there would be 16 magnitude categories, but they can still be encoded with at most 4 bits using Huffman coding.
Earlier I read about mozjpeg, a project from Mozilla to create a JPEG encoder that is more efficient, i.e. creates smaller files.
As I understand (JPEG) codecs, a JPEG encoder would need to create files that use an encoding scheme that can also be decoded by other JPEG codecs. So how is it possible to improve the codec without breaking compatibility with other codecs?
Mozilla does mention that the first step for their encoder is to add functionality that can detect the most efficient encoding scheme for a certain image, which would not break compatibility. However, they intend to add more functionality, the first of which is "trellis quantization", which seems to be a highly technical algorithm to do something (I don't understand what).
I'm also not entirely sure this question belongs on Stack Overflow; it might also fit Super User, since it is not specifically about programming. So if anyone feels it should be on Super User, feel free to move it.
JPEG is somewhat unique in that it involves a series of compression steps. There are two that provide the most opportunities for reducing the size of the image.
The first is sampling. In JPEG one usually converts from RGB to YCbCr. In RGB, each component carries equal weight. In YCbCr, the Y component is much more important than the Cb and Cr components. If you sample the latter at 4 to 1 in each direction, a 4x4 block of pixels gets reduced from 16+16+16 samples to 16+1+1. Just by sampling you have reduced the data to be compressed to roughly a third of its original size.
The other is quantization. You take the sampled pixel values, divide them into 8x8 blocks, and perform the Discrete Cosine Transform on them. With 8-bit samples this takes 8x8 8-bit data and converts it to 8x8 16-bit data (an expansion rather than compression at that point).
The DCT tends to produce larger values in the upper left corner and smaller values (close to zero) towards the lower right corner. The upper left (low-frequency) coefficients are more valuable than the lower right (high-frequency) ones.
The 16-bit values are then "quantized" (division, in plain English).
The compression process defines an 8x8 quantization matrix. Each DCT coefficient is divided by the corresponding entry in the quantization matrix. Because this is integer division, the small values go to zero, and long runs of zeros are combined using run-length compression. The more consecutive zeros you get, the better the compression.
Generally, the quantization values are much higher towards the lower right than in the upper left. The idea is to force those DCT coefficients to zero unless they are very large.
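A small NumPy illustration of that step, using the Annex K.1 luminance table and a made-up coefficient block with large low-frequency values in the upper left:

```python
import numpy as np

# Made-up 8x8 block of DCT coefficients: large in the upper left (low
# frequencies), falling off toward the lower right (high frequencies).
r, c = np.indices((8, 8))
coeffs = np.round(400 / (1 + r + c) ** 2 * np.cos(r + c)).astype(int)

# Annex K.1 luminance quantization table (the common "quality 50" base table).
quant = np.array([[16, 11, 10, 16,  24,  40,  51,  61],
                  [12, 12, 14, 19,  26,  58,  60,  55],
                  [14, 13, 16, 24,  40,  57,  69,  56],
                  [14, 17, 22, 29,  51,  87,  80,  62],
                  [18, 22, 37, 56,  68, 109, 103,  77],
                  [24, 35, 55, 64,  81, 104, 113,  92],
                  [49, 64, 78, 87, 103, 121, 120, 101],
                  [72, 92, 95, 98, 112, 100, 103,  99]])

# Divide and round: most high-frequency coefficients collapse to zero,
# which run-length coding then exploits.
quantized = np.round(coeffs / quant).astype(int)
print(quantized)
print("zero coefficients:", np.count_nonzero(quantized == 0), "of 64")
```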
This is where much of the loss (not all of it though) comes from in JPEG.
The trade off is to get as many zeros as you can without noticeably degrading the image.
The choice of quantization matrices is the major factor in compression. Most JPEG libraries present a "quality" setting to the user. This translates into the selection of quantization matrices in the encoder. If someone could devise better quantization matrices, you could get better compression.
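As a sketch of how that "quality" knob typically becomes a quantization matrix, here is the scaling scheme modeled on libjpeg's jpeg_quality_scaling() in jcparam.c (check your libjpeg source for the authoritative version), applied to the first row of the Annex K.1 luminance base table:

```python
def jpeg_quality_scaling(quality):
    """Map a 'quality' setting (1..100) to a percentage scale factor,
    modeled on libjpeg's jpeg_quality_scaling()."""
    quality = max(1, min(quality, 100))
    return 5000 // quality if quality < 50 else 200 - quality * 2

def scale_table(base_table, quality):
    """Scale a base quantization table the way libjpeg does, clamped to 1..255."""
    scale = jpeg_quality_scaling(quality)
    return [min(max((v * scale + 50) // 100, 1), 255) for v in base_table]

base_row = [16, 11, 10, 16, 24, 40, 51, 61]   # first row of the K.1 luminance table
print(scale_table(base_row, 75))   # smaller divisors -> less loss, bigger file
print(scale_table(base_row, 25))   # larger divisors  -> more zeros, smaller file
```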
This book explains the JPEG process in plain English:
http://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=sr_1_1?ie=UTF8&qid=1394252187&sr=8-1&keywords=0201604434
JPEG gives you multiple options. E.g. you can use the standard Huffman tables or you can generate Huffman tables optimal for a specific image. The same goes for quantization tables. You can also switch to arithmetic coding instead of Huffman coding for the entropy-encoding step; the patents covering arithmetic coding as used in JPEG have expired. All of these options are lossless (no additional loss of data).
One of the options used by Mozilla is to use progressive JPEG compression instead of baseline. You can play with how many frequencies you have in each scan (SS, spectral selection) as well as the number of bits used for each frequency (SA, successive approximation). Consecutive scans will have additional frequencies and/or additional bits for each frequency. Again, all of these different options are lossless. For the standard images used for JPEG testing, switching to progressive encoding improved compression from 41 KB per image to 37 KB. But that is just for one setting of SS and SA. Given the speed of computers today you could automatically try many different options and choose the best one.
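For example, with Pillow you can re-encode the same pixels with and without optimized Huffman tables and progressive scans and compare the resulting file sizes; the input path and quality value here are arbitrary placeholders:

```python
import os
from PIL import Image

img = Image.open("photo.jpg")   # placeholder input

# Same pixels and quantization, different (lossless) entropy-coding options.
img.save("baseline.jpg",    quality=90)
img.save("optimized.jpg",   quality=90, optimize=True)
img.save("progressive.jpg", quality=90, optimize=True, progressive=True)

for name in ("baseline.jpg", "optimized.jpg", "progressive.jpg"):
    print(name, os.path.getsize(name), "bytes")
```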
Although hardly used, the original JPEG standard had a lossless mode. There were 7 different choices of predictors. Today you could compress using each of the 7 choices and pick the best one, applying the same principle as I outlined above. And remember, none of them incurs additional loss of data; switching between them is lossless.