How to compare images and determine which has more content? - node.js

Goal: I want to grab the best frame from an animated GIF and use it as a static preview image. I believe the best frame is one that shows the most content - not necessarily the first or last frame.
Take this GIF for example:
[animated GIF]
This is the first frame:
[first frame]
Here is the 28th frame:
[28th frame]
It's clear that the 28th frame represents the entire GIF well.
How could I programmatically determine if one frame has more pixel/content over another? Any thoughts, ideas, packages/modules, or articles that you can point me to would be greatly appreciated.

One straightforward way this could be accomplished would be to estimate the entropy of each image and choose the frame with maximal entropy.
In information theory, entropy can be thought of as the "randomness" of the image. An image of a single color is very predictable; the flatter the distribution of pixel values, the more random the image looks. This is closely related to the compression method described by Arthur-R, since entropy is the lower bound on how far the data can be losslessly compressed.
Estimating Entropy
One way to estimate the entropy is to approximate the probability mass function for pixel intensities using a histogram. To generate the plot below I first convert the image to grayscale, then compute the histogram using a bin spacing of 1 (for pixel values from 0 to 255). Then, normalize the histogram so that the bins sum to 1. This normalized histogram is an approximation of the pixel probability mass function.
Using this probability mass function we can easily estimate the entropy of the grayscale image which is described by the following equation
H = E[-log(p(x))]
Where H is entropy, E is the expected value, and p(x) is the probability that any given pixel takes the value x.
Programmatically H can be estimated by simply computing -p(x)*log(p(x)) for each value p(x) in the histogram and then adding them together.
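A minimal Node.js sketch of this estimate, assuming each frame has already been decoded to an array (or Buffer) of grayscale values in the range 0–255 (how you extract those values, e.g. with a decoder such as get-pixels or sharp, is up to you and not shown here):

```js
// Estimate the entropy of one frame from its grayscale pixel values.
function estimateEntropy(pixels) {
  // Build a 256-bin histogram of pixel intensities.
  const histogram = new Array(256).fill(0);
  for (const value of pixels) {
    histogram[value] += 1;
  }

  // Normalize to a probability mass function and sum -p(x) * log2(p(x)).
  const total = pixels.length;
  let entropy = 0;
  for (const count of histogram) {
    if (count === 0) continue; // treat 0 * log(0) as 0
    const p = count / total;
    entropy -= p * Math.log2(p);
  }
  return entropy; // bits per pixel, between 0 and 8
}

// Pick the frame with the highest estimated entropy.
// `frames` is assumed to be an array of grayscale pixel arrays, one per frame.
function bestFrameIndex(frames) {
  const entropies = frames.map(estimateEntropy);
  return entropies.indexOf(Math.max(...entropies));
}
```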
[Plot of entropy vs. frame number for your example GIF], with frame 21 (the 22nd frame) having the highest entropy.
Observations
The entropy computed here is not equal to the true entropy of the image, because it makes the assumption that each pixel is independently sampled from the same distribution. To get the true entropy we would need to know the joint distribution of the image, which we won't be able to know without understanding the underlying random process that generated the images (which would include human interaction). However, I don't think the true entropy would be very useful, and this measure should give a reasonable estimate of how much content is in the image.
This method will fail if some not-so-interesting frame contains much more noise (randomly colored pixels) than the most interesting frame, because noise results in high entropy. For example, the following 200x200 image is pure uniform noise and therefore has maximum entropy (H = 8 bits), i.e. no compression is possible:
[200x200 uniform noise image]
Ruby Implementation
I don't know Ruby, but it looks like one of the answers to this question refers to a package for computing the entropy of an image.
From m. simon borg's comment:
"FWIW, using Ruby's File.size() returns 1904 bytes for the 28th frame image and 946 bytes for the first frame image" – m. simon borg
File.size() should be roughly proportional to entropy.
As an aside, if you check the size of the 200x200 noise image on disk you will see that the file is 40,345 bytes even after compression, but the uncompressed data is only 40,000 bytes. Information theory tells us that no compression scheme can ever losslessly compress such images on average.

There are a couple of ways I might go about this. My first thought (this may not be the most practical solution, but it seems theoretically interesting!) would be to losslessly compress each frame: in theory, the frame with the least repeatable content (and thus the most unique content) would compress to the largest size, so you could compare the compressed size in bytes of each frame. The accuracy of this approach would probably depend heavily on the images passed in.
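As a rough sketch of that compression idea using only Node's built-in zlib (this assumes each frame has already been decoded to a raw pixel Buffer; comparing the still-encoded GIF frame data would mostly reflect the GIF's own LZW compression instead):

```js
const zlib = require('zlib');

// Losslessly compress each frame's raw pixel buffer and treat the
// compressed size as a rough "content" score; the largest wins.
// `frames` is assumed to be an array of Buffers of decoded pixel data.
function mostComplexFrame(frames) {
  const sizes = frames.map(buf => zlib.deflateSync(buf).length);
  return sizes.indexOf(Math.max(...sizes));
}
```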
A more realistic/practical solution might be to grab the predominant color in the GIF (in the example, the background color) and then iterate through each pixel, incrementing a counter each time the color of the current pixel doesn't match the background color; a sketch follows below.
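A sketch of that counting approach, assuming flat RGBA pixel data (4 bytes per pixel, e.g. what a GIF decoder hands back) and taking the most frequent color as the background:

```js
// Count how many pixels differ from the predominant (background) color.
function nonBackgroundCount(pixels) {
  // Tally colors (ignoring alpha) to find the most frequent one.
  const counts = new Map();
  for (let i = 0; i < pixels.length; i += 4) {
    const key = (pixels[i] << 16) | (pixels[i + 1] << 8) | pixels[i + 2];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let background = null;
  let max = -1;
  for (const [color, count] of counts) {
    if (count > max) { max = count; background = color; }
  }

  // Count every pixel whose color is not the background color.
  let differing = 0;
  for (let i = 0; i < pixels.length; i += 4) {
    const key = (pixels[i] << 16) | (pixels[i + 1] << 8) | pixels[i + 2];
    if (key !== background) differing += 1;
  }
  return differing; // higher means more non-background "content"
}
```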
I'm thinking about some more optimized, sample-based solutions, and will edit my response to include them a little later if performance is a concern for you.

You could also use a hosted image-analysis API (a RESTful web service) for this, because doing it well from scratch is hard.
For example, these are some well-known APIs:
https://cloud.google.com/vision/
https://www.clarifai.com/
https://vize.ai
https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/
https://imagga.com

Related

How can a jpeg encoder become more efficient

Earlier I read about mozjpeg, a project from Mozilla to create a JPEG encoder that is more efficient, i.e. creates smaller files.
As I understand (jpeg) codecs, a jpeg encoder would need to create files that use an encoding scheme that can also be decoded by other jpeg codecs. So how is it possible to improve the codec without breaking compatibility with other codecs?
Mozilla does mention that the first step for their encoder is to add functionality that can detect the most efficient encoding scheme for a certain image, which would not break compatibility. However, they intend to add more functionality, first of which is "trellis quantization", which seems to be a highly technical algorithm to do something (I don't understand).
I'm also not entirely sure this question belongs on Stack Overflow; it might also fit Super User, since the question is not specifically about programming. So if anyone feels it should be on Super User, feel free to move it.
JPEG is somewhat unique in that it involves a series of compression steps. There are two that provide the most opportunities for reducing the size of the image.
The first is sampling. In JPEG one usually converts from RGB to YCbCr. In RGB, each component is equally important. In YCbCr, the Y component is much more important than the Cb and Cr components. If you sample the latter at 4 to 1 in each direction, a 4x4 block of pixels gets reduced from 16+16+16 values to 16+1+1. Just by sampling you have reduced the data to be compressed to little more than a third of its original size.
The other is quantization. You take the sampled pixel values, divide them into 8x8 blocks, and perform the Discrete Cosine Transform on them. With 8-bpp input this takes an 8x8 block of 8-bit data and converts it to an 8x8 block of 16-bit data (an expansion rather than a compression at this point).
The DCT process tends to produce larger values in the upper left corner (low frequencies) and smaller values (close to zero) towards the lower right corner (high frequencies). The upper left coefficients are more valuable than the lower right coefficients.
The 16-bit values are then "quantized" (division, in plain English).
The compression process defines an 8x8 quantization matrix. Each DCT coefficient is divided by the corresponding entry in the quantization matrix. Because this is integer division, the small values go to zero. Long runs of zero values are combined using run-length compression. The more consecutive zeros you get, the better the compression.
Generally, the quantization values are much higher in the lower right (high frequencies) than in the upper left. You try to force those DCT coefficients to zero unless they are very large.
This is where much of the loss (not all of it though) comes from in JPEG.
The trade off is to get as many zeros as you can without noticeably degrading the image.
The choice of quantization matrices is the major factor in compression. Most JPEG libraries present a "quality" setting to the user. This translates into the selection of a quantization matrix in the encoder. If someone could devise better quantization matrices, you could get better compression.
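To make the divide-and-round step concrete, here is a small sketch that quantizes one 8x8 block of DCT coefficients. The quantization matrix below is purely illustrative (it is not one of the standard JPEG tables); it just uses larger divisors toward the high-frequency corner so those coefficients tend to quantize to zero.

```js
// Quantize one 8x8 block of DCT coefficients (both arguments are flat
// arrays of 64 values, in row-major order).
function quantizeBlock(dctBlock, quantMatrix) {
  const out = new Array(64);
  for (let i = 0; i < 64; i++) {
    out[i] = Math.round(dctBlock[i] / quantMatrix[i]); // the "division" step
  }
  return out; // high-frequency entries mostly end up as 0
}

// Illustrative quantization matrix: small divisors near the low-frequency
// (upper-left) corner, large divisors near the high-frequency (lower-right).
const quantMatrix = [];
for (let row = 0; row < 8; row++) {
  for (let col = 0; col < 8; col++) {
    quantMatrix.push(10 + 6 * (row + col));
  }
}
```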
This book explains the JPEG process in plain English:
http://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=sr_1_1?ie=UTF8&qid=1394252187&sr=8-1&keywords=0201604434
JPEG provides you with multiple options. For example, you can use the standard Huffman tables, or you can generate Huffman tables that are optimal for a specific image. The same goes for quantization tables. You can also switch from Huffman coding to arithmetic coding for the entropy-encoding step; the patents covering arithmetic coding as used in JPEG have expired. All of these options are lossless (no additional loss of data).
One of the options used by Mozilla is to use progressive JPEG compression instead of baseline JPEG compression. You can play with how many frequencies you have in each scan (SS, spectral selection) as well as the number of bits used for each frequency (SA, successive approximation). Consecutive scans will have additional frequencies and/or additional bits for each frequency. Again, all of these different options are lossless. For the standard JPEG test images, switching to progressive encoding improved compression from 41 KB per image to 37 KB. But that is just for one setting of SS and SA. Given the speed of computers today, you could automatically try many different options and choose the best one.
Although it is hardly used, the original JPEG standard also had a lossless mode. There were 7 different choices of predictor. Today you could compress using each of the 7 choices and pick the best one. The same principle applies to what I outlined above. And remember, none of them incurs additional loss of data; switching between them is lossless.

Maximum file size of JPEG image with known dimensions

I'm going to let users upload images of 300x300 compressed with JPEG. Is there a way to determine what the maximum file size of such an image would be?
I can imagine this can be tried by compressing random noise at 100 quality, but is there a theoretical maximum?
Say that the image is totally incompressible random noise: could it be 3 bytes per pixel (24-bit colour) plus a margin for the metadata? Or could such an image turn out larger than the original when compressed?
From wikipedia:
For highest quality images (Q=100), about 8.25 bits per color pixel is required
http://en.wikipedia.org/wiki/JPEG#Sample_photographs
So, for Q=100 on an 300x300 image, that would result in (300 * 300) px * 8.25 bits/px = 742,500 bits = ~ 93 kB
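Just to reproduce that arithmetic as a snippet (this is only an estimate based on the Wikipedia figure, not a hard limit):

```js
// Rough size estimate for a Q=100 JPEG, using ~8.25 bits per pixel.
const width = 300;
const height = 300;
const bitsPerPixel = 8.25;

const bytes = (width * height * bitsPerPixel) / 8;
console.log(bytes);        // 92812.5 bytes
console.log(bytes / 1000); // ~92.8 kB, i.e. roughly 93 kB
```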
There are also lossless JPEG coding modes, which are practically never used (see the last sentence of the second paragraph there). Those would be at the typical RGB 24 bits/pixel.
There is no limit on jpeg metadata size, which means there's no limit to the size of a jpeg file. See this answer I've linked for an explanation of why and also for an example of a realistic situation where the metadata might get large: What is the maximum size of JPEG metadata?
So if assuming a maximum practical/realistic size suits your purpose, then you should factor that example into your calculations. In many contexts it would be fine to just reject things outside of that maximum as outside the domain of your program's intended usage.
But if you absolutely must rely on a theoretical maximum, then unfortunately it's a big bold ∞.
Note: I do not have a huge amount of personal experience with the jpeg specification, so I am going off of what people have said about repeated fields being allowed, as well as multiple comment fields. Please correct me if you find evidence to the contrary.

Help with the theory behind a pixelate algorithm?

So say I have an image that I want to "pixelate". I want this sharp image represented by a grid of, say, 100 x 100 squares. So if the original photo is 500 px X 500 px, each square is 5 px X 5 px. So each square would have a color corresponding to the 5 px X 5 px group of pixels it swaps in for...
How do I figure out what this one color, which is best representative of the stuff it covers, is? Do I just take the R G and B numbers for each of the 25 pixels and average them? Or is there some obscure other way I should know about? What is conventionally used in "pixelation" functions, say like in photoshop?
If you want to know about the 'theory' of pixelation, read up on resampling (and downsampling in particular). Pixelation algorithms are simply downsampling an image (using some downsampling method) and then upsampling it using nearest-neighbour interpolation. Note that in code these two steps may be fused into one.
For downsampling in general, to downsample by a factor of n the image is first filtered by an appropriate low-pass filter, and then one sample out of every n is taken. An "ideal" filter to use is the sinc filter, but because of issues with implementing it, the Lanczos filter is often used as a close alternative.
However, for almost all purposes when doing pixelization, using a simple box blur should work fine, and is very simple to implement. This is just an average of nearby pixels.
If you don't need to change the output size of the image, then this means you divide the image into blocks (the big resulting pixels) which are k×k pixels, and then replace all the pixels in each block with the average value of the pixels in that block.
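A sketch of that block-averaging step over flat RGBA data (4 bytes per pixel), assuming the image dimensions are exact multiples of the block size, e.g. 500x500 with 5x5 blocks as in the question:

```js
// Replace each block x block square with its average color.
function pixelate(pixels, width, height, block) {
  const out = Buffer.from(pixels); // copy; averaged values are written here
  for (let by = 0; by < height; by += block) {
    for (let bx = 0; bx < width; bx += block) {
      // Sum R, G, B, A over the block.
      let r = 0, g = 0, b = 0, a = 0;
      for (let y = by; y < by + block; y++) {
        for (let x = bx; x < bx + block; x++) {
          const i = (y * width + x) * 4;
          r += pixels[i];     g += pixels[i + 1];
          b += pixels[i + 2]; a += pixels[i + 3];
        }
      }
      const n = block * block;
      r = Math.round(r / n); g = Math.round(g / n);
      b = Math.round(b / n); a = Math.round(a / n);
      // Write the average back into every pixel of the block.
      for (let y = by; y < by + block; y++) {
        for (let x = bx; x < bx + block; x++) {
          const i = (y * width + x) * 4;
          out[i] = r; out[i + 1] = g; out[i + 2] = b; out[i + 3] = a;
        }
      }
    }
  }
  return out;
}
```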
When the source and target grids are so evenly divisible and aligned, most algorithms give similar results. If the grids are fixed, go for simple averages.
In other cases, especially when resizing by a small percentage, the quality difference is quite evident. The simplest enhancement over a simple average is weighting each source pixel by how much of it is contained in the target pixel's area.
For more algorithms, check multivariate interpolation.

"Winamp style" spectrum analyzer

I have a program that plots the spectrum analysis (amplitude vs. frequency) of a signal, which is pretty much the DFT converted to polar form. However, this is not exactly the sort of graph that, say, Winamp (right at the top-left corner), or effectively any other audio software, plots. I am not really sure what this sort of graph is called (if it has a distinct name at all), so I am not sure what to look for.
I am pretty positive the frequency axis is base-two exponential; the amplitude axis puzzles me, though.
Any pointers?
Actually an interesting question. I know what you are saying; the frequency axis is certainly logarithmic. But what about the amplitude? In response to another poster, the amplitude can't simply be in units of dB alone, because dB has no concept of zero. This introduces the idea of quantization error, SNR, and dynamic range.
Assume that the received digitized (i.e., discrete time and discrete amplitude) time-domain signal, x[n], is equal to s[n] + e[n], where s[n] is the transmitted discrete-time signal (i.e., continuous amplitude) and e[n] is the quantization error. Suppose x[n] is represented with b bits, and for simplicity, takes values in [0,1). Then the maximum peak-to-peak amplitude of e[n] is one quantization level, i.e., 2^{-b}.
The dynamic range is defined to be, in decibels, 20 log10[(max peak-to-peak |s[n]|) / (max peak-to-peak |e[n]|)] = 20 log10(1 / 2^{-b}) = 20b log10(2) ≈ 6.02b dB. For 16-bit audio, the dynamic range is about 96 dB. For 8-bit audio, the dynamic range is about 48 dB.
So how might Winamp plot amplitude? My guesses:
The minimum amplitude is assumed to be -6.02b dB, and the maximum amplitude is 0 dB. Visually, Winamp draws the window with these thresholds in mind.
Another nonlinear map, such as log(1+X), is used. This function is always nonnegative, and when X is large, it approximates log(X).
Any other experts out there who know? Let me know what you think. I'm interested, too, exactly how this is implemented.
To generate a power spectrum you need to do the following steps:
apply window function to time domain data (e.g. Hanning window)
compute FFT
calculate log of FFT bin magnitudes for N/2 points of FFT (typically 10 * log10(re * re + im * im))
This gives log magnitude (i.e. dB) versus linear frequency.
If you also want a log frequency scale then you will need to accumulate the magnitude from appropriate ranges of bins (and you will need a fairly large FFT to start with).
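A self-contained sketch of those three steps; a naive O(N²) DFT stands in for a real FFT library here purely so the example runs without dependencies (it is far too slow for real-time audio):

```js
// Compute a dB power spectrum for one window of samples:
// 1) apply a Hanning window, 2) transform, 3) 10*log10(re^2 + im^2) for N/2 bins.
function powerSpectrumDb(samples) {
  const N = samples.length;

  // 1. Hanning window.
  const windowed = Array.from(samples, (s, n) =>
    s * 0.5 * (1 - Math.cos((2 * Math.PI * n) / (N - 1)))
  );

  // 2 + 3. Naive DFT and log magnitude for the first N/2 bins.
  const spectrum = [];
  for (let k = 0; k < N / 2; k++) {
    let re = 0, im = 0;
    for (let n = 0; n < N; n++) {
      const angle = (-2 * Math.PI * k * n) / N;
      re += windowed[n] * Math.cos(angle);
      im += windowed[n] * Math.sin(angle);
    }
    spectrum.push(10 * Math.log10(re * re + im * im + 1e-12)); // small offset avoids log(0)
  }
  return spectrum; // dB magnitude versus linear frequency
}
```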
Well, I'm not 100% sure what you mean, but surely it's just bucketing the data from an FFT?
If you want the data such that you have (for a 44 kHz file) frequency points at 22 kHz, 11 kHz, 5.5 kHz, etc., then you could use a wavelet decomposition, I guess...
This thread may help you a bit:
Converting an FFT to a spectogram
Same sort of information as a spectrogram, I'd guess...
What you need is a power spectrum graph. You have to compute the DFT of your signal's current window, then take the squared magnitude of each bin.

Imaging Question: How to determine image quality?

I'm looking for ways to determine the quality of a photograph (JPEG). The first thing that came to mind was to compare the file size to the number of pixels stored within. Are there any other ways, for example checking the amount of noise in a JPEG? Does anyone have a good reading link on this topic or any experience? By the way, the project I'm working on is written in C# (.NET 3.5) and I use the Aurigma Graphics Mill for image processing.
Thanks in advance!
I'm not entirely clear what you mean by "quality". If you mean the quality setting used by the JPEG compression algorithm, then you may be able to extract it from the EXIF tags of the image (this relies on the capture device putting them in and no one else overwriting them); for your library see here:
http://www.aurigma.com/Support/DocViewer/30/JPEGFileFormat.htm.aspx
If you mean any other sort of "quality" then you need to come up with a better definition of quality. For example, over-exposure may be a problem in which case hunting for saturated pixels would help determine that specific sort of quality. Or more generally you could look at statistics (mean, standard deviation) of the image histogram in the 3 colour channels. The image may be out of focus, in which case you could look for a cutoff in the spatial frequencies of the image Fourier transform. If you're worried about speckle noise then you could try applying a median filter to the image and comparing back to the original image (more speckle noise would give a larger change) - I'm guessing a bit here.
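As one concrete example of the over-exposure check, here is a sketch that counts pixels near saturation over flat RGB data (3 bytes per pixel); the same loop translates directly to C# with Graphics Mill or GDI+. The 250 threshold and the idea of flagging, say, more than 5% saturated pixels are assumptions to tune, not established values.

```js
// Fraction of pixels with any channel at or above `threshold`.
function saturatedFraction(pixels, threshold = 250) {
  let saturated = 0;
  for (let i = 0; i < pixels.length; i += 3) {
    if (
      pixels[i] >= threshold ||
      pixels[i + 1] >= threshold ||
      pixels[i + 2] >= threshold
    ) {
      saturated += 1;
    }
  }
  return saturated / (pixels.length / 3); // e.g. > 0.05 might suggest over-exposure
}
```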
If by "quality" you mean aesthetic properties of composition etc then - good luck!
The 'quality' of an image is not measurable, because it doesn't correspond to any particular value.
If you take it as the number of pixels in an image of a specific size, that's not accurate. You might talk about a photograph taken in bad light conditions as being of 'bad quality', even though it has exactly the same number of pixels as another image taken in good light conditions. The term is often used to describe the overall effect of an image, rather than its technical specifications.
I wanted to do something similar, but wanted the "Soylent Green" option and used people to rank images by performing comparisons. See the question responses here.
I think you're asking about how to determine the quality of the compression process itself. This can be done by converting the JPEG to a BMP and comparing that BMP to the original bitmap from which the JPEG was created. You can iterate through the bitmaps pixel by pixel and calculate a pixel-to-pixel "distance" by summing the differences between the R, G and B values of each pair of pixels (i.e. the pixel in the original and the corresponding pixel in the JPEG) and dividing by the total number of pixels. This will give you a measure of the average difference between the original and the JPEG.
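A sketch of that average per-pixel distance, assuming both images have already been decoded to flat RGB buffers of equal size:

```js
// Average per-pixel |R| + |G| + |B| difference between two decoded images.
function averagePixelDistance(original, decoded) {
  let total = 0;
  for (let i = 0; i < original.length; i += 3) {
    total += Math.abs(original[i]     - decoded[i]) +
             Math.abs(original[i + 1] - decoded[i + 1]) +
             Math.abs(original[i + 2] - decoded[i + 2]);
  }
  return total / (original.length / 3); // lower means less compression loss
}
```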
Reading the number of pixels in the image can tell you the "megapixel" size (#pixels / 1,000,000), which can be a crude form of programmatic quality check, but that won't tell you whether the photo is properly focused, assuming it is supposed to be focused (think fast-moving objects, like trains), nor whether or not there is something in the picture worth looking at; that will require a human, or a pigeon if you prefer.
