I'm trying to figure out how to prove that the Lempel-Ziv 77 (LZ77) compression algorithm really gives optimal compression.
I found the following info:
So how well does the Lempel-Ziv algorithm work? In these notes, we'll calculate two quantities: first, how well it works in the worst case, and second, how well it works in the random case where each letter of the message is chosen independently at random, the ith letter appearing with probability pi. In both cases, the compression is asymptotically optimal. That is, in the worst case, the length of the encoded string of bits is n + o(n). Since there is no way to compress all length-n strings to fewer than n bits, this can be counted as asymptotically optimal. In the second case, the source is compressed to length

H(p1, p2, ..., pα) n + o(n) = n ∑_{i=1}^{α} (-pi log2 pi) + o(n),

which is to first order the Shannon bound.
What is meant here?
And why is there no way to compress all length-n strings to fewer than n bits?
Thanks all.
There are 2^n different strings of length n. In order to decompress them, the compression algorithm must map them all to different compressed versions: if two different length-n strings compressed to the same sequence, there would be no way to tell which of them to decompress to. If all of them were compressed to strings of some length k < n, there would be only 2^k < 2^n different compressed strings, so there would have to be cases where two different strings compressed to the same value.
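A tiny counting check of that pigeonhole argument (my own sketch in Python, not part of the original answer): even if the compressor were allowed to use every output length shorter than n, there still would not be enough distinct outputs to go around.

n = 16

inputs = 2 ** n                                   # all bit strings of length n
shorter_outputs = sum(2 ** k for k in range(n))   # all bit strings of length 0 .. n-1

print(inputs, "inputs vs", shorter_outputs, "possible shorter outputs")
# 65536 vs 65535: at least two inputs must share an output,
# so lossless decompression would fail for that pair.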
Note that there is no practical scheme that is guaranteed to be optimal in all circumstances. If I know that a long, apparently random sequence is the output of a stream cipher with a key I also know, I can describe it by giving only the design of the cipher and the key; but it might take a compression algorithm a great deal of time to work out that the apparently random sequence can be compressed hugely in this way.
I'm using this with a length of 20 for a UUID. Is it common practice not to check whether a generated UUID has already been used, if it's used as a persistent unique value?
Or is it best practice to verify that it's not already being used by some part of your application, if it's essential to retain uniqueness?
You can calculate the probability of a collision using this formula from Wikipedia:

n(p; H) ≈ √(2H · ln(1 / (1 − p)))

where n(p; H) is the smallest number of samples you have to choose in order to find a collision with a probability of at least p, given H possible outputs with equal probability.
The same article also provides Python source code that you can use to calculate this value:
from math import log1p, sqrt

def birthday(probability_exponent, bits):
    # How many samples can be drawn from 2**bits equally likely outputs before
    # the collision probability reaches 10**probability_exponent?
    probability = 10. ** probability_exponent
    outputs = 2. ** bits
    return sqrt(2. * outputs * -log1p(-probability))
So if you're generating UUIDs with 20 bytes (160 bits) of random data, how sure can you be that there won't be any collisions? Let's suppose you want there to be a probability of less than one in a quintillion (10^-18) that a collision will occur:
>>> birthday(-18,160)
1709679290002018.5
This means that after generating about 1.7 quadrillion UUIDs with 20 bytes of random data each, there is only a one in a quintillion chance that two of these UUIDs will be the same.
Basically, 20 bytes is perfectly adequate.
crypto.randomBytes is safe enough for most applications. If you want it to be completely secure, use a length of 16. With a length of 16, there will likely never be a collision within the next century. And it is definitely not a good idea to check an entire database for duplicates, because the odds of a collision are so low that the performance cost outweighs the benefit.
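For comparison, if you are working in Python rather than with crypto.randomBytes, the standard library's secrets module gives the same kind of cryptographically secure random identifier (a minimal sketch; the 20-byte length simply matches the question, it is not tied to any UUID standard):

import secrets

def random_id(num_bytes=20):
    # 20 bytes = 160 bits of randomness, rendered as 40 hex characters.
    return secrets.token_hex(num_bytes)

print(random_id())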
I'm trying to design a programming exercise on suffix arrays. I learned the O(n*log(n)^2) construction algorithm and then started playing with random input strings of varying length in order to find out when the naive approach becomes too slow, i.e. I wanted to choose a string length such that people would need to implement the "advanced" algorithm.
Surprisingly, I found that the naive algorithm (an O(n*log(n)) comparison sort over all suffixes) is not as slow as O(n^2 * log(n)) suggests. After thinking a bit, I realized that comparing suffixes of a randomly generated string is not O(n) amortized: we usually compare only the first few characters before hitting a difference and returning from the comparison function. This of course depends on the size of the alphabet, but it does not depend much on the length of the suffixes.
A simple implementation in PHP processes a 50000-character string in 2 seconds (despite the slowness of a scripting language). If it really behaved like O(n^2), we would expect it to take at least several minutes (at about 1e7 operations per second and ~1e9 operations in total).
So I gather that even if the complexity is O(n^2 * log(n)), the constant factor is a very small fraction of 1, really something close to 0. Or should such a bound only be stated as a worst case?
But what is the amortized time complexity of the naive approach? I'm a bit unsure how to assess it.
You seem to be confusing amortized and expected complexity. In this case you are talking about expected complexity. And yes, the stated complexity is computed assuming that a suffix comparison takes O(n). That is the worst case for a suffix comparison; for randomly generated input you will only compare a constant number of characters in most cases. Thus O(n^2*log(n)) is the worst-case complexity.
One more note: on a modern computer you can perform a few billion elementary instructions per second, so it is quite possible to execute on the order of 50000^2 operations in 2 seconds. The correct way to benchmark the complexity of an algorithm is to measure the time it takes for inputs of size N, 2N, 4N, ... (as far as you can go) and then fit a function that describes the growth you observe.
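Here is a minimal Python sketch of that doubling experiment (my own illustration, not from the answer), using the naive construction with a character-by-character suffix comparison; absolute timings will of course differ from the PHP numbers above.

from functools import cmp_to_key
import random
import string
import time

def compare_suffixes(s, i, j):
    # Walk both suffixes until they differ; on a random string this
    # usually stops after only a few characters.
    n = len(s)
    while i < n and j < n:
        if s[i] != s[j]:
            return -1 if s[i] < s[j] else 1
        i += 1
        j += 1
    return (i < n) - (j < n)   # the suffix that runs out first sorts first

def naive_suffix_array(s):
    # O(n*log(n)) comparisons, each O(n) only in the worst case.
    return sorted(range(len(s)), key=cmp_to_key(lambda i, j: compare_suffixes(s, i, j)))

for n in (12500, 25000, 50000, 100000):
    s = "".join(random.choice(string.ascii_lowercase) for _ in range(n))
    start = time.perf_counter()
    naive_suffix_array(s)
    print(n, round(time.perf_counter() - start, 2), "seconds")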
A random string should be incompressible.
pi = "31415..."
pi.size # => 10000
XZ.compress(pi).size # => 4540
A random hex string also gets significantly compressed. A random byte string, however, does not get compressed.
The string of pi only contains the bytes 48 through 57. With a prefix code on the integers, this string can be heavily compressed. Essentially, I'm wasting space by representing my 10 different characters in full bytes (or 16, in the case of the hex string). Is this what's going on?
Can someone explain to me what the underlying method is, or point me to some sources?
It's a matter of information density. Compression is about removing redundant information.
In the string "314159", each character occupies 8 bits, and can therefore have any of 2^8 or 256 distinct values, but only 10 of those values are actually used. Even a painfully naive compression scheme could represent the same information using 4 bits per digit; this is known as Binary Coded Decimal. More sophisticated compression schemes can do better than that (a decimal digit is effectively log2(10), or about 3.32, bits), but at the expense of storing some extra information that allows for decompression.
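To make the BCD point concrete, here is a small Python sketch (mine, purely illustrative; real compressors do not work this way) that packs a digit string two digits per byte:

def pack_bcd(digits):
    # Two decimal digits per byte: 4 bits each instead of 8.
    if len(digits) % 2:
        digits += "0"   # pad to an even number of digits
    return bytes((int(a) << 4) | int(b) for a, b in zip(digits[::2], digits[1::2]))

digits = "31415926535897932384626433832795"
print(len(digits), "characters ->", len(pack_bcd(digits)), "bytes")   # 32 -> 16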
In a random hexadecimal string, each 8-bit character has 4 meaningful bits, so compression by nearly 50% should be possible. The longer the string, the closer you can get to 50%. If you know in advance that the string contains only hexadecimal digits, you can compress it by exactly 50%, but of course that loses the ability to compress anything else.
In a random byte string, there is no opportunity for compression; you need the entire 8 bits per character to represent each value. If it's truly random, attempting to compress it will probably expand it slightly, since some additional information is needed to indicate that the output is compressed data.
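A quick experiment along these lines, using Python's zlib instead of XZ (my substitution; any general-purpose compressor shows the same pattern):

import os
import random
import zlib

random.seed(0)
decimal_text = "".join(random.choice("0123456789") for _ in range(10000)).encode()
hex_text = "".join(random.choice("0123456789abcdef") for _ in range(10000)).encode()
random_bytes = os.urandom(10000)

for name, data in (("decimal", decimal_text), ("hex", hex_text), ("random bytes", random_bytes)):
    print(name, len(data), "->", len(zlib.compress(data, 9)))
# The digit and hex strings compress to roughly half their size;
# the random bytes do not shrink (and typically grow by a few bytes).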
Explaining the details of how compression works is beyond both the scope of this answer and my expertise.
In addition to Keith Thompson's excellent answer, there's another point that's relevant to LZMA (which is the compression algorithm that the XZ format uses). The number pi does not consist of a single repeating string of digits, but neither is it completely random. It does contain substrings of digits which are repeated within the larger sequence. LZMA can detect these and store only a single copy of the repeated substring, reducing the size of the compressed data.
Given a set of 125,000 strings, table size of 250,000 (so load factor .5), and also given that these strings never change, what is a good process for finding a better hash function?
Strings are 1-59 characters long, contain 72 unique characters (typical ascii values), average length and median length is 7 characters.
Approaches tried so far (the hash is always eventually taken mod the table size; the number in parentheses is the maximum probes per search):
(suggested by someone) md5 with linear probing (48)
Python built-in hash (max 40 probes per search)
Custom hash with quadratic probing (25)
Polynomial with prime coefficient, double hash with different prime coefficient, search primes 1-1000 for optimal pair (13)
Run the previous approach 5 probes deep, then generate an array of size 256 containing the largest contiguous blocks of free space left in the table, and use those mod 256 with linear probing (11)
Cuckoo hashing with three independent hash functions, but I haven't found any combination of hash functions that avoids infinite loops
Given that the load factor is .5, is there some theoretical limit on how well the hash function can work? Can it ever be perfect without a very massive additional lookup table?
I have read that minimal perfect hashing requires ~1.6 bits/key, and that current best results are ~2.5 bits/key. But this is for a minimal table (table size = number of keys). Surely in my situation we can get very close to perfect, if not perfect, with quite a small lookup table?
Speed of hash function is immaterial in this case by the way.
Have you thought about using two independent hash functions? Variants of cuckoo hashing can build hash tables with surprisingly high load factors using only two hash functions.
Unmodified cuckoo hashing (each item hashes to exactly one of its two locations) attains a load factor of .5 with constant probability. If you modify it to use buckets of size two (so each item hashes to one of two buckets, so one of four locations, and you evict the oldest element of a bucket), I believe you can get load factors of around 0.8 or 0.9 without unreasonably long worst-case insertion times.
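For concreteness, here is a minimal Python sketch of unmodified cuckoo hashing with two hash functions (my own illustration; the salted built-in hash just stands in for whatever pair of string hashes you end up choosing):

import random

class CuckooTable:
    def __init__(self, size, max_kicks=500):
        self.size = size
        self.max_kicks = max_kicks
        self.slots = [None] * size
        # Two "independent" hash functions obtained by salting Python's hash.
        self.salts = (random.getrandbits(64), random.getrandbits(64))

    def _positions(self, key):
        return [hash((salt, key)) % self.size for salt in self.salts]

    def lookup(self, key):
        # A key can only ever live in one of its two positions.
        return any(self.slots[p] == key for p in self._positions(key))

    def insert(self, key):
        if self.lookup(key):
            return True
        pos = self._positions(key)[0]
        for _ in range(self.max_kicks):
            if self.slots[pos] is None:
                self.slots[pos] = key
                return True
            # Evict the current occupant and move it to its other position.
            key, self.slots[pos] = self.slots[pos], key
            a, b = self._positions(key)
            pos = b if pos == a else a
        return False   # give up; in practice you would rehash with new salts

With 125,000 keys in a table of 250,000 slots this sits exactly at the 0.5 load factor, so occasional insertion failures (and rehashes) are to be expected.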
In your question as posed, there are 250000^125000 possible mappings from strings to table cells. 250000*249999*...*125001 of them are injective ("perfect hash functions"). Approximate the latter number using Stirling's formula; taking the difference of the logs of these two numbers, you see that a randomly chosen function will be a perfect hash with probability about 2^(-55000). Meaning that (with astonishingly high probability) there exists a table of about 55 kilobits that specifies a perfect hash function, and that there isn't anything substantially smaller. (Finding this table is another matter. Also, note that this information-theoretic approach assumes that no probing whatsoever is done.)
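You can check the 2^(-55000) figure numerically with a Stirling-style computation in log space (my own sketch, using Python's lgamma):

from math import lgamma, log, log2

keys, cells = 125_000, 250_000

# log2 of the number of injective mappings: cells * (cells - 1) * ... * (cells - keys + 1)
log2_injective = (lgamma(cells + 1) - lgamma(cells - keys + 1)) / log(2)

# log2 of the total number of mappings: cells ** keys
log2_total = keys * log2(cells)

print(log2_injective - log2_total)   # about -55000, i.e. probability ~ 2^(-55000)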
I am working on an audio fingerprinting system and have gone through some papers and research recently and this page in particular: c# AudioFingerprinting and Locality Sensitive Hashing
I have now got a series of fingerprints for every 32ms of audio. What I want to do is hash these individual fingerprints (and not a sequence of them together) using LSH or some other similarity preserving method. From what I have understood about LSH, it works on multidimensional vectors and produces binary strings which can then be compared in the Hamming space.
My problem here is that the fingerprints that I have are not multidimensional. They are just single long integers. How do I hash these using LSH? Is there any method to hash (in a similarity preserving manner) single dimensional scalars?
I'm late in replying, but here is the thing: it was quite simple indeed, though I don't know how I missed it.
LSH uses random projection vectors to project a vector or a scalar into a space of a different dimension while preserving similarity. There is a good answer here: https://stackoverflow.com/a/12967538/858467
So all I had to do was create a random projection matrix of order [n x 1] and multiply it with the scalar [1 x 1], or with a vector of scalars [1 x m], to get the projections [n x 1] or [n x m]. Thresholding the result to get binary vectors seems to do it.
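Here is a small numpy sketch of that projection-and-threshold idea (my own, so treat the details as assumptions). One caveat: for a scalar input, thresholding w·x at zero makes every bit depend only on the sign of x, so this sketch adds random offsets before thresholding, which the steps above do not mention:

import numpy as np

rng = np.random.default_rng(0)
n_bits = 16                                      # length of the binary code (my choice)

w = rng.normal(size=(n_bits, 1))                 # random projection matrix, [n x 1]
b = rng.uniform(-10.0, 10.0, size=(n_bits, 1))   # random offsets over the expected data range

def lsh_code(scalars):
    # Project a row of scalars [1 x m] to [n x m] and threshold to get binary codes.
    x = np.atleast_2d(scalars).astype(float)
    return (w @ x + b > 0).astype(np.uint8)

codes = lsh_code([3.10, 3.11, 9.75])
# Nearby scalars should end up with a small Hamming distance between their codes.
print(np.count_nonzero(codes[:, 0] != codes[:, 1]), "bits differ: 3.10 vs 3.11")
print(np.count_nonzero(codes[:, 0] != codes[:, 2]), "bits differ: 3.10 vs 9.75")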
Although I believe this is the correct way to do it (I have done it the same way previously too), I can't seem to get good binary vectors with this as of now. I will probably post another question when I get more depth into the problem.