I am trying to compress a 64 MB text file that contains only 7 different letters, randomly distributed and appearing at approximately equal frequencies.
I noticed an interesting behavior of various compression software programs such as 7Zip and WinRar.
Both applications achieve a compression ratio of about 35%.
When the file includes 8 different letters, also randomly distributed and appearing at approximately similar frequencies, the compression ratio is less than 0.3%!
Can anyone explain this?
Thanks.
It can only be explained by a mistake in how you constructed the file with "8 different letters". Your construction of the file with seven different letters appears correct, as the compression ratio should be log2(7)/8, which is 0.351. For the same thing with eight letters, the compression ratio is log2(8)/8, which is 0.375.
Perhaps your file has a repeating pattern in the eight letters.
Update:
You are using rand() to generate your "random" distribution. Unfortunately, the classic implementation of rand() has very poor randomness in the low few bits, with repeating patterns. Your % 7 uses all of the bits from rand(), but % 8 uses only the low three bits. % 8 is equivalent to & 7.
Use random() instead, which generates random numbers for which the low bits, as well as any of the bits, have good random behavior.
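For illustration, here is a minimal sketch (not your code; the file name is made up) of the kind of generator that triggers this, plus one workaround if random() is not available (e.g. on MinGW): take the decision bits from higher up in rand()'s output.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    FILE *f = fopen("letters.txt", "w");   /* hypothetical output file */
    if (!f) return 1;
    srand((unsigned)time(NULL));
    for (long i = 0; i < 64L * 1024 * 1024; i++) {
        /* int c = rand() % 8;   <- problematic: uses only the low 3 bits */
        int c = (rand() >> 7) % 8;   /* use better-mixed higher bits instead */
        fputc('a' + c, f);
    }
    fclose(f);
    return 0;
}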
A random string should be incompressible.
pi = "31415..."
pi.size # => 10000
XZ.compress(pi).size # => 4540
A random hex string also gets significantly compressed. A random byte string, however, does not get compressed.
The string of pi only contains the bytes 48 through 57. With a prefix code on the integers, this string can be heavily compressed. Essentially, I'm wasting space by representing my 9 different characters in bytes (or 16, in the case of the hex string). Is this what's going on?
Can someone explain to me what the underlying method is, or point me to some sources?
It's a matter of information density. Compression is about removing redundant information.
In the string "314159", each character occupies 8 bits, and can therefore have any of 2^8, or 256, distinct values, but only 10 of those values are actually used. Even a painfully naive compression scheme could represent the same information using 4 bits per digit; this is known as Binary Coded Decimal. More sophisticated compression schemes can do better than that (a decimal digit is effectively log2(10), or about 3.32, bits), but at the expense of storing some extra information that allows for decompression.
In a random hexadecimal string, each 8-bit character has 4 meaningful bits, so compression by nearly 50% should be possible. The longer the string, the closer you can get to 50%. If you know in advance that the string contains only hexadecimal digits, you can compress it by exactly 50%, but of course that loses the ability to compress anything else.
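As an illustration of the "exactly 50%" case, here is a minimal sketch (the function names are made up) that packs two hexadecimal characters into one output byte:

#include <stddef.h>

/* Map one lowercase hex character to its 4-bit value (assumes valid input). */
static unsigned nibble(char c) {
    return (c >= '0' && c <= '9') ? (unsigned)(c - '0')
                                  : (unsigned)(c - 'a' + 10);
}

/* Pack a lowercase hex string of even length n into n/2 bytes. */
size_t pack_hex(const char *hex, size_t n, unsigned char *out) {
    for (size_t i = 0; i + 1 < n; i += 2)
        out[i / 2] = (unsigned char)((nibble(hex[i]) << 4) | nibble(hex[i + 1]));
    return n / 2;
}

Decompression is just the reverse mapping, which is why this only works if you know in advance that the input is hex.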
In a random byte string, there is no opportunity for compression; you need the entire 8 bits per character to represent each value. If it's truly random, attempting to compress it will probably expand it slightly, since some additional information is needed to indicate that the output is compressed data.
Explaining the details of how compression works is beyond both the scope of this answer and my expertise.
In addition to Keith Thompson's excellent answer, there's another point that's relevant to LZMA (which is the compression algorithm that the XZ format uses). The number pi does not consist of a single repeating string of digits, but neither is it completely random. It does contain substrings of digits which are repeated within the larger sequence. LZMA can detect these and store only a single copy of the repeated substring, reducing the size of the compressed data.
Earlier I read about mozjpeg, a project from Mozilla to create a JPEG encoder that is more efficient, i.e. one that creates smaller files.
As I understand (jpeg) codecs, a jpeg encoder would need to create files that use an encoding scheme that can also be decoded by other jpeg codecs. So how is it possible to improve the codec without breaking compatibility with other codecs?
Mozilla does mention that the first step for their encoder is to add functionality that can detect the most efficient encoding scheme for a certain image, which would not break compatibility. However, they intend to add more functionality, first of which is "trellis quantization", which seems to be a highly technical algorithm to do something (I don't understand).
I'm also not entirely sure this question belongs on Stack Overflow; it might also fit Super User, since the question is not specifically about programming. So if anyone feels it should be on Super User, feel free to move it.
JPEG is somewhat unique in that it involves a series of compression steps. There are two that provide the most opportunities for reducing the size of the image.
The first is sampling. In JPEG one usually converts from RGB to YCbCr. In RGB, all three components are equally important. In YCbCr, the Y component is much more important than the Cb and Cr components. If you sample the latter two at 4:1, a 4x4 block of pixels gets reduced from 16+16+16 values to 16+1+1. Just by sampling you have reduced the amount of data to be compressed to a little over one third.
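As a rough sketch of that 4:1 sampling step (following the 4x4-block description above; real encoders more commonly sample 2:1 per axis, i.e. 4:2:0), each 4x4 block of a chroma plane is reduced to one averaged sample:

/* Reduce a w x h chroma plane to (w/4) x (h/4) by averaging each 4x4 block.
 * Assumes w and h are multiples of 4; both planes are row-major. */
void subsample_4to1(const unsigned char *in, int w, int h, unsigned char *out) {
    for (int by = 0; by < h; by += 4) {
        for (int bx = 0; bx < w; bx += 4) {
            unsigned sum = 0;
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++)
                    sum += in[(by + y) * w + (bx + x)];
            out[(by / 4) * (w / 4) + (bx / 4)] = (unsigned char)(sum / 16);
        }
    }
}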
The other is quantization. You take the sampled pixel values, divide them into 8x8 blocks and perform the Discrete Cosine Transform (DCT) on them. For 8-bit samples this takes an 8x8 block of 8-bit data and produces an 8x8 block of 16-bit coefficients (an expansion rather than a compression at that point).
The DCT process tends to produce larger values in the upper left corner and smaller values (close to zero) towards the lower right corner. The upper-left coefficients are more valuable than the lower-right coefficients.
The 16-bit values are then "quantized" (division, in plain English).
The compression process defines an 8x8 quantization matrix. Each DCT coefficient is divided by the corresponding entry in the quantization matrix. Because this is integer division, the small values go to zero. Long runs of zero values are then combined using run-length compression. The more consecutive zeros you get, the better the compression.
Generally, the quantization values are much higher at the lower right than in the upper left. You try to force those DCT coefficients to zero unless they are very large.
This is where much of the loss (not all of it though) comes from in JPEG.
The trade off is to get as many zeros as you can without noticeably degrading the image.
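A minimal sketch of that quantization step (the table values come from the encoder's quality setting; this ignores the rounding a real encoder applies):

/* Divide each DCT coefficient by the matching quantization table entry.
 * Small coefficients collapse to zero, which run-length coding then exploits. */
int quantize_block(const short dct[64], const unsigned char qtable[64], short out[64]) {
    int zeros = 0;
    for (int i = 0; i < 64; i++) {
        out[i] = (short)(dct[i] / qtable[i]);   /* integer division */
        if (out[i] == 0)
            zeros++;
    }
    return zeros;   /* more zeros -> longer runs -> better compression */
}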
The choice of quantization matrices is the major factor in compression. Most JPEG libraries present a "quality" setting to the user. This translates into the selection of a quantization matrix in the encoder. If someone could devise better quantization matrices, you could get better compression.
This book explains the JPEG process in plain English:
http://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=sr_1_1?ie=UTF8&qid=1394252187&sr=8-1&keywords=0201604434
JPEG provides multiple options. For example, you can use the standard Huffman tables or you can generate Huffman tables that are optimal for a specific image. The same goes for quantization tables. You can also switch from Huffman coding to arithmetic coding for the entropy-encoding step; the patents covering arithmetic coding as used in JPEG have expired. All of these options are lossless (no additional loss of data). One of the options used by Mozilla is to use progressive JPEG compression instead of baseline JPEG compression. You can play with how many frequencies you include in each scan (SS, spectral selection) as well as the number of bits used for each frequency (SA, successive approximation). Consecutive scans will have additional frequencies and/or additional bits for each frequency. Again, all of these different options are lossless. For the standard test images used for JPEG, switching to progressive encoding improved compression from 41 KB per image to 37 KB. But that is just for one setting of SS and SA. Given the speed of computers today, you could automatically try many different options and choose the best one.
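For concreteness, here is roughly how those lossless options are switched on with a libjpeg-style encoder. This is a sketch, not mozjpeg's actual code; it assumes the standard libjpeg API, in particular the optimize_coding flag and jpeg_simple_progression().

#include <stdio.h>
#include <jpeglib.h>   /* assumes a libjpeg-compatible library */

/* Compress an RGB buffer with per-image Huffman tables and a progressive
 * scan script -- both of the lossless options discussed above. */
void write_jpeg(const char *path, unsigned char *rgb, int w, int h, int quality) {
    struct jpeg_compress_struct cinfo;
    struct jpeg_error_mgr jerr;
    FILE *out = fopen(path, "wb");
    if (!out) return;

    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_compress(&cinfo);
    jpeg_stdio_dest(&cinfo, out);

    cinfo.image_width = w;
    cinfo.image_height = h;
    cinfo.input_components = 3;
    cinfo.in_color_space = JCS_RGB;
    jpeg_set_defaults(&cinfo);
    jpeg_set_quality(&cinfo, quality, TRUE);

    cinfo.optimize_coding = TRUE;      /* Huffman tables tuned to this image */
    jpeg_simple_progression(&cinfo);   /* progressive (multi-scan) encoding */
    /* cinfo.arith_code = TRUE;          arithmetic coding, if the library supports it */

    jpeg_start_compress(&cinfo, TRUE);
    while (cinfo.next_scanline < cinfo.image_height) {
        JSAMPROW row = rgb + (size_t)cinfo.next_scanline * w * 3;
        jpeg_write_scanlines(&cinfo, &row, 1);
    }
    jpeg_finish_compress(&cinfo);
    jpeg_destroy_compress(&cinfo);
    fclose(out);
}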
Although hardly used, the original JPEG standard also had a lossless mode. There were 7 different choices of predictor. Today you would compress using each of the 7 choices and pick the best one. The same principle applies to what I outlined above. And remember, none of them introduces additional loss of data; switching between them is lossless.
I'm testing string search algorithms from this site: EXACT STRING MATCHING ALGORITHMS, Christian Charras, Thierry Lecroq. The test text is a random sequence of DNA bases (ACGT), 1 GB in size. The test patterns are a list of random sequences of random size (1 kB max). The test system is an AMD Phenom II X4 955 at 3.2 GHz with 4 GB of RAM, running 64-bit Windows 7. The code is written in C and compiled with MinGW with the -O3 flag.
The naive search algorithm takes from 4 seconds for short patterns to 8 seconds for 1 kB patterns. The deterministic finite state machine takes from 2 seconds for short patterns to 4 seconds for 1 kB patterns. The Boyer-Moore algorithm takes 4 seconds for very short patterns, about half a second for short patterns and 2 seconds for 1 kB patterns. The remaining algorithms perform worse than the naive search algorithm.
How can the naive search algorithm be faster than most of the other algorithms?
How can a deterministic finite state machine implemented with a transition table (always O(n) execution time) be 2 to 8 times slower than the Boyer-Moore algorithm? Yes, BM's best case is O(n/m), but its average case is O(n) and its worst case is O(nm).
There is no perfect string matching algorithm which is best for all circumstances.
Boyer-Moore (and Horspool, Sunday, etc.) work by creating jump tables ("How far can I move the search pointer when the characters do not match?"). The more distinct letters in the strings, the bigger the positive impact. You can imagine that a string with only 4 distinct letters creates a jump table with a maximum of about 3 shifts per mismatch, whereas a case-sensitive search for an English word may result in a jump table with (A-Z + a-z + punctuation) a maximum of approximately 55 shifts per mismatch.
On the other hand, there is a negative impact from both the preparation (i.e. calculating the jump tables) and the loop itself. So these algorithms perform poorly on short strings (the preparation creates an overhead) and on strings with only a few distinct letters (as mentioned before).
The naive search algorithm is very compact and there are very few operations inside the loop, so the loop runs fast. As there is no overhead, it performs better when searching short strings.
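For reference, the naive loop being described is roughly this (a sketch):

/* Naive search: index of the first occurrence of pat (length m) in text
 * (length n), or -1. No preprocessing, and a tiny inner loop. */
long naive_search(const char *text, long n, const char *pat, long m) {
    for (long i = 0; i + m <= n; i++) {
        long j = 0;
        while (j < m && text[i + j] == pat[j])
            j++;
        if (j == m)
            return i;
    }
    return -1;
}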
The quite complex loop operations of a BM algorithm (compared to the naive search) take much longer per loop run. This (partly) offsets the performance gain from the jump tables.
So although you are using long strings, the small alphabet (= small jump tables) makes BM perform relatively poorly. A KMP has less overhead in the loop (its jump table is smaller in general, but behaves similarly to BM's with a small alphabet), and so KMP performs reasonably well.
Theoretically good algorithms (those with lower time complexity) often have high bookkeeping costs that can overwhelm those of a naive algorithm for small problem sizes. Implementation details also matter; by optimizing an implementation you can sometimes improve the runtime by a factor of 2 or more.
The naive implementation actually has linear expected running time (the same as BM/KMP, etc.) for random input data. I can't write a full proof here, but it can be found in Algorithms: Design Techniques and Analysis.
Most exact matching algorithms are optimized versions of the naive implementation, designed to avoid being slowed down by certain patterns. For instance, suppose we are searching for:
aaaaaaaaaaaaaaaaaaaaaaaab
on a stream of:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
It fails at the b many times. KMP/BM implementations are designed to prevent repeatedly comparing the a's. However, if the sequence is itself random, such conditions are almost impossible to encounter, and the naive implementation is likely to work better due to its lower bookkeeping overhead and possibly better spatial/temporal locality.
And, yeah, I'm not sure whether DNA sequences are random, or whether repetitions are common in them. Anyway, there's no way to examine this carefully without representative data.
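To make the "avoid re-comparing the a's" point concrete, here is a sketch of KMP: the failure table lets the search resume inside the pattern after a mismatch instead of backing up in the text.

/* Build the KMP failure table: fail[i] = length of the longest proper
 * prefix of pat[0..i] that is also a suffix of it. */
void build_failure(const char *pat, long m, long *fail) {
    fail[0] = 0;
    long k = 0;
    for (long i = 1; i < m; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }
}

/* Return the index of the first match of pat in text, or -1. */
long kmp_search(const char *text, long n, const char *pat, long m, const long *fail) {
    long k = 0;   /* number of pattern characters currently matched */
    for (long i = 0; i < n; i++) {
        while (k > 0 && text[i] != pat[k])
            k = fail[k - 1];   /* fall back inside the pattern, not the text */
        if (text[i] == pat[k])
            k++;
        if (k == m)
            return i - m + 1;
    }
    return -1;
}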
What is the "standard way" of working with 24-bit audio? Well, there are no 24-bit data types available, really. Here are the methods that come into my mind:
Represent 24-bit audio samples as 32-bit ints and ignore the upper eight bits.
Just like (1) but ignore the lower eight bits.
Represent 24-bit audio samples as 32-bit floats.
Represent the samples as structs of 3 bytes (acceptable for C/C++, but bad for Java).
How do you work this out?
Store them as 32- or 64-bit signed ints, floats, or doubles, unless you are space-conscious and care about packing them into the smallest space possible.
Audio samples often appear as 24 bits to and from audio hardware, since this is commonly the resolution of the DACs and ADCs - although on most computer hardware, don't be surprised to find the bottom 3 or 4 bits banging away randomly with noise.
Digital signal processing operations - which is what usually happens downstream from the acquisition of samples - all involve weighted sums of samples. A sample stored in an integer type can be considered to be fixed-point binary with an implied binary point at some arbitrary position - which you can choose strategically to maintain as many bits of precision as possible.
For instance, the sum of two 24-bit integers yields a result of 25 bits. After 8 such additions, a 32-bit type would overflow and you would need to re-normalize by rounding and shifting right.
Therefore, if you're using integer types to store your samples, use the largest you can and start with the samples in the least significant 24 bits.
Floating point types of course take care of this detail for you, although you get less choice about when renormalisation takes place. They are the usual choice for audio processing where hardware support is available. A single precision float has a 24-bit mantissa, so can hold a 24-bit sample without loss of precision.
Usually floating point samples are stored in the range -1.0f < x < 1.0f.
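A sketch of the conversions implied above, assuming signed little-endian 24-bit PCM packed as 3 bytes per sample:

#include <stdint.h>

/* Unpack one signed little-endian 24-bit sample into a 32-bit int,
 * keeping the value in the least significant 24 bits (sign-extended). */
int32_t s24le_to_s32(const uint8_t b[3]) {
    int32_t v = (int32_t)b[0] | ((int32_t)b[1] << 8) | ((int32_t)b[2] << 16);
    if (v & 0x800000)      /* sign bit of the 24-bit value is set */
        v -= 0x1000000;    /* sign-extend into the full 32 bits */
    return v;
}

/* The same sample as a float in roughly -1.0f .. +1.0f. */
float s24le_to_float(const uint8_t b[3]) {
    return (float)s24le_to_s32(b) / 8388608.0f;   /* divide by 2^23 */
}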
I am trying to use Excel's (2007) built-in FFT feature; however, it requires that I have 2^n data points - which I do not have.
I have tried two things, both give different results:
1. Pad the data with zeros so that N (the number of data points) reaches the closest power of 2.
2. Use a divide-and-conquer approach, i.e. if I have 112 data points, then I do an FFT for 64, then 32, then 16 (112 = 64 + 32 + 16).
Which is the better approach? I am comfortable writing VBA macros, but I am looking for an algorithm that does not have the constraint of N being a power of 2. Can anyone help?
Splitting your data into smaller bits will result in erroneous output, especially for smaller numbers of data points.
Padding with zeros is a much better idea, and it is the general approach for FFTs. If you are interested in an alternative way of doing the FFT, Octave will do it for you, and most of the MATLAB documentation applies, so you should have no trouble with it.
Padding with zeros is the right direction, but keep in mind that if you're doing the transform in order to estimate frequency content, you will need a window function, and that should be applied to the short block (i.e., if you have 2000 points, apply a 2000 point Hann window, then pad to 2048 and calculate the transform).
If you're developing an add-in, you might consider using one of the many FFT libraries out there. I'm a big fan of KISS FFT by Mark Borgerding. It offers fast transforms for many block sizes - essentially any block size that can be factored into 2, 3, 4, and/or 5. It doesn't handle prime-number-sized blocks, though. It's written in very plain C, so it should be easy to port to C#. Or, this SO question suggests some libraries that can be used in .NET.
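Along those lines, here is a sketch using KISS FFT's kiss_fft_alloc/kiss_fft interface (window first, then zero-pad, as described above; the 4096 buffer limit is arbitrary):

#include <math.h>
#include <string.h>
#include "kiss_fft.h"   /* assumes the KISS FFT library is on the include path */

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Apply a Hann window to n real samples, zero-pad to nfft (a power of two
 * >= n), and compute the complex FFT into out[0..nfft-1]. */
void fft_window_pad(const double *samples, int n, int nfft, kiss_fft_cpx *out) {
    kiss_fft_cpx in[4096];   /* assumes nfft <= 4096 */
    memset(in, 0, (size_t)nfft * sizeof(kiss_fft_cpx));

    for (int i = 0; i < n; i++) {
        double w = 0.5 * (1.0 - cos(2.0 * M_PI * i / (n - 1)));   /* Hann window */
        in[i].r = (kiss_fft_scalar)(samples[i] * w);
        in[i].i = 0;
    }

    kiss_fft_cfg cfg = kiss_fft_alloc(nfft, 0, NULL, NULL);   /* forward FFT */
    kiss_fft(cfg, in, out);
    kiss_fft_free(cfg);
}

So for 112 points you would pass n = 112 and nfft = 128.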
pad out with zeros
2^n is a requirement of the radix-2 FFT algorithm.
Maybe run a test with a known time series (e.g., a simple sine or cosine of a single frequency). When you FFT that, you should get a single frequency peak (a Dirac delta function); anything else is an error. Try it with an integer power of two, padded with zeros, etc.
You can pad with zeros, or you can use an FFT library that supports arbitrary sizes. One such library is https://github.com/altomani/XL-FFT.
It implements the FFT as a pure formula with LAMBDA functions (i.e. without any VBA code).
For power-of-two lengths it uses a recursive radix-2 Cooley-Tukey algorithm,
and for other lengths a version of Bluestein's algorithm that reduces the calculation to a power-of-two case.