Shrink string encoding algorithm

How do we shrink/encode a 20-letter string down to 6 letters? I found a few algorithms that address data compression, such as RLE, arithmetic coding, and universal codes, but none of them guarantees 6 letters.
The original string can contain the characters A-Z (upper case), 0-9, and a dash.

If your goal is to losslessly compress or hash a random input string of 20 characters (each character drawn from [A-Z], [0-9] or -) to an output string of 6 characters, it's theoretically impossible.
In information theory, given a discrete random variable X with possible values {x1, ..., xn}, the Shannon entropy H(X) is defined as:

H(X) = -sum_{i=1}^{n} p(xi) * log2 p(xi)

where p(xi) is the probability that X = xi. In your case, the input is 20 characters, each one of 37 possibilities, so X ranges over n = 37^20 possible strings. Supposing the 37 characters are all equally likely (i.e. the input string is random), then p(xi) = 1/37^20, so the Shannon entropy of the input is:

H(X) = -sum_{i=1}^{37^20} (1/37^20) * log2(1/37^20) = 20 * log2(37) ≈ 104.2 bits

A char on a common computer holds 8 bits, so 6 chars can hold 48 bits. There's no way to hold about 104 bits of information in 6 chars; you need at least ceil(104.2 / 8) = 14 such chars to hold it instead.
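To check those numbers, here is a quick sketch of the arithmetic in Python (my own illustration, not part of the original answer):

    import math

    alphabet = 37        # A-Z, 0-9, and a dash
    length = 20
    bits_in = length * math.log2(alphabet)  # entropy of a uniform random input
    bits_out = 6 * 8                        # capacity of six 8-bit chars

    print(f"input entropy  : {bits_in:.1f} bits")        # ~104.2 bits
    print(f"output capacity: {bits_out} bits")           # 48 bits
    print(f"chars needed   : {math.ceil(bits_in / 8)}")  # 14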
If you do allow loss and want to hash the 20 chars down to 6 chars, then you are trying to hash 37^20 values onto 128^6 keys. It can be done, but you will get plenty of hash collisions.
In your case, supposing the hash is as uniform as possible (otherwise it would be worse), each input value would on average share its hash key with about 5.26 × 10^18 other input values (37^20 / 128^6 ≈ 5.26 × 10^18). By a birthday attack, we could expect to find a collision within approximately sqrt(128^6) = 2^21, i.e. about two million, trials. That could be done in less than 10 seconds on a common laptop, so I don't think this would be a safe hash.
However, if you insist on doing that, you might want to read Hash function algorithms, which lists a lot of algorithms for you to choose from. Good luck!
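If collisions are acceptable, a minimal Python sketch of such a lossy mapping (my own illustration; the use of SHA-256 and the mapping back into the question's 37-character alphabet are my assumptions):

    import hashlib
    import string

    # The 37-character alphabet from the question: A-Z, 0-9, and a dash.
    ALPHABET = string.ascii_uppercase + string.digits + "-"

    def hash6(s: str) -> str:
        """Lossily map any string to 6 characters of the 37-char alphabet.

        Collisions are unavoidable: 37^20 inputs cannot map injectively
        onto 37^6 (about 2.6 billion) outputs.
        """
        digest = hashlib.sha256(s.encode("utf-8")).digest()
        n = int.from_bytes(digest, "big")
        out = []
        for _ in range(6):
            n, r = divmod(n, len(ALPHABET))
            out.append(ALPHABET[r])
        return "".join(out)

    print(hash6("ABCDEFGHIJ0123456789"))  # a 6-character code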

Related

base64: why does the encoding end with ==?

I thought I knew the base64 encoding, but I often see text encoded like this: iVBORw0KGgoAAAANSUhEUgAAAXwAAAG4C ... YPQr/w8B0CBr+DAkGQAAAABJRU5ErkJggg==. What I mean is, it ends with a double =. Why does it append a second padding character, if one is enough to fill out the remaining bits of the encoded text?
I found the answer: the length of base64-encoded data must be divisible by 4, and the remainder is filled with = characters. That's because of a mismatch in bit widths: modern systems use 8-bit bytes, but base64 uses 6-bit characters. The lowest common multiple is 24 bits, i.e. 3 bytes or 4 base64 characters, so any shortfall in the final group is filled with =.
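The padding is easy to see in a quick Python experiment (my own sketch):

    import base64

    # One, two, and three input bytes show how padding fills out
    # each 4-character (24-bit) base64 group.
    for data in (b"A", b"AB", b"ABC"):
        print(len(data), base64.b64encode(data))
    # 1 b'QQ=='   (8 bits used  -> two '=' pads)
    # 2 b'QUI='   (16 bits used -> one '=' pad)
    # 3 b'QUJD'   (24 bits used -> no padding)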

LZ77 compression algorithm

I'm trying to figure out how to prove that the Lempel-Ziv 77 compression algorithm really gives optimal compression.
I found the following info:
So how well does the Lempel-Ziv algorithm work? In these notes, we'll calculate two quantities. First, how well it works in the worst case, and second, how well it works in the random case where each letter of the message is chosen independently from a probability distribution in which the ith letter appears with probability pi. In both cases, the compression is asymptotically optimal. That is, in the worst case, the length of the encoded string of bits is n + o(n). Since there is no way to compress all length-n strings to fewer than n bits, this can be counted as asymptotically optimal. In the second case, the source is compressed to length

H(p1, p2, ..., pα) * n + o(n) = n * sum_{i=1}^{α} (-pi * log2 pi) + o(n)

which is to first order the Shannon bound.
What is meant here?
And why is there no way to compress all length-n strings to fewer than n bits?
Thanks all.
There are 2^n different length-n bit strings. In order to decompress them, the compression algorithm must map them all to different compressed versions: if two different n-long strings compressed to the same sequence, there would be no way to tell which of them to decompress to. If all were compressed to strings of length k < n, there would be only 2^k < 2^n different compressed strings, so by the pigeonhole principle there would have to be some cases where two different strings compressed to the same value.
Note that there is no practical scheme that is guaranteed optimal in all circumstances. If I know that a long, apparently random sequence is the output of a stream cipher whose key I also know, I can describe it by giving only the design of the cipher and the key; but it might take a compression algorithm a great deal of time to work out that the sequence can be compressed hugely in this way.
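The counting argument can be checked empirically. Here is a small Python sketch of my own that verifies injectivity (and the lack of shrinkage) over every binary string of a small length n:

    import itertools
    import zlib

    n = 12  # length of each input string
    inputs = ["".join(bits) for bits in itertools.product("01", repeat=n)]
    outputs = [zlib.compress(s.encode()) for s in inputs]

    # Lossless compression must be injective: all 2^12 outputs are distinct.
    assert len(set(outputs)) == len(inputs)

    # And no output is shorter than n bits, consistent with the pigeonhole
    # bound (here they are much longer, due to header overhead).
    assert all(len(out) * 8 >= n for out in outputs)
    print("all", len(inputs), "outputs distinct and at least", n, "bits long")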

Why can a textual representation of pi be compressed?

A random string should be incompressible.
pi = "31415..."
pi.size # => 10000
XZ.compress(pi).size # => 4540
A random hex string also gets significantly compressed. A random byte string, however, does not get compressed.
The string of pi only contains the bytes 48 through 57. With a prefix code on the integers, this string can be heavily compressed. Essentially, I'm wasting space by representing my 9 different characters in bytes (or 16, in the case of the hex string). Is this what's going on?
Can someone explain to me what the underlying method is, or point me to some sources?
It's a matter of information density. Compression is about removing redundant information.
In the string "314159", each character occupies 8 bits, and can therefore have any of 2^8 = 256 distinct values, but only 10 of those values are actually used. Even a painfully naive compression scheme could represent the same information using 4 bits per digit; this is known as Binary Coded Decimal. More sophisticated compression schemes can do better than that (a decimal digit is effectively log2(10), or about 3.32, bits), but at the expense of storing some extra information that allows for decompression.
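A minimal sketch of that Binary Coded Decimal idea in Python (my own illustration, not from the answer):

    def bcd_pack(digits: str) -> bytes:
        """Pack a decimal string at 4 bits per digit, two digits per byte."""
        if len(digits) % 2:
            digits += "0"  # pad to an even count of digits
        return bytes(
            (int(digits[i]) << 4) | int(digits[i + 1])
            for i in range(0, len(digits), 2)
        )

    pi_digits = "31415926535897932384"
    print(len(pi_digits), "->", len(bcd_pack(pi_digits)), "bytes")  # 20 -> 10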
In a random hexadecimal string, each 8-bit character has 4 meaningful bits, so compression by nearly 50% should be possible. The longer the string, the closer you can get to 50%. If you know in advance that the string contains only hexadecimal digits, you can compress it by exactly 50%, but of course that loses the ability to compress anything else.
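For the hexadecimal case, the exact 50% is a one-liner (a quick sketch):

    hex_string = "deadbeefcafebabe" * 8       # 128 hex characters
    packed = bytes.fromhex(hex_string)        # 4 bits per character
    print(len(hex_string), "->", len(packed))  # 128 -> 64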
In a random byte string, there is no opportunity for compression; you need the entire 8 bits per character to represent each value. If it's truly random, attempting to compress it will probably expand it slightly, since some additional information is needed to indicate that the output is compressed data.
Explaining the details of how compression works is beyond both the scope of this answer and my expertise.
In addition to Keith Thompson's excellent answer, there's another point that's relevant to LZMA (which is the compression algorithm that the XZ format uses). The number pi does not consist of a single repeating string of digits, but neither is it completely random. It does contain substrings of digits which are repeated within the larger sequence. LZMA can detect these and store only a single copy of the repeated substring, reducing the size of the compressed data.
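Both effects can be reproduced with Python's lzma module, which implements the same LZMA algorithm the XZ format uses (random digits stand in for the actual expansion of pi here):

    import lzma
    import os
    import random

    random.seed(0)
    digits = "".join(random.choice("0123456789") for _ in range(10_000))
    raw = os.urandom(10_000)  # truly random bytes, for comparison

    print(len(lzma.compress(digits.encode())))  # far below 10,000 bytes
    print(len(lzma.compress(raw)))              # slightly above 10,000 bytes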

Compression using ASCII: trying to figure out how many bits are needed to store the following efficiently

I am trying to learn the basics of compression using only ASCII.
Suppose I am sending an email consisting of strings of lower-case letters. If the file has n characters, each stored as an 8-bit extended ASCII code, then we need 8n bits.
But according to the guiding principle of compression, we discard unimportant information.
Using that idea, we don't need all the ASCII codes to encode strings of lowercase letters: they use only 26 characters. We can make our own code with 5-bit codewords (2^5 = 32 > 26), encode the file using this scheme, and then decode the email once received (a sketch of this scheme appears below).
The size has decreased by 8n - 5n = 3n bits, i.e. a 37.5% reduction.
But what if the email is formed from lower-case letters (26), upper-case letters, and m extra characters, and they have to be stored efficiently?
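A minimal Python sketch of the 5-bit scheme described in the question (my own illustration, not part of the original post):

    def pack5(text: str) -> bytes:
        """Pack lowercase text at 5 bits per letter instead of 8."""
        bits = 0
        for ch in text:
            bits = (bits << 5) | (ord(ch) - ord("a"))  # 0..25 fits in 5 bits
        nbits = 5 * len(text)
        # Decoding would also need the original length, omitted for brevity.
        return bits.to_bytes((nbits + 7) // 8, "big")

    msg = "compressionusingascii"
    print(len(msg), "bytes ->", len(pack5(msg)), "bytes")  # 21 -> 14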
If you have n symbols of equal probability, then it is possible to code each symbol using log2(n) bits. This is true even if log2(n) is fractional, using arithmetic or range coding. If you limit yourself to Huffman coding (a whole number of bits per codeword), you can get close to log2(n), still with a fractional number of bits per symbol on average.
For example, you can encode ten symbols (e.g. decimal digits) in very close to 3.322 bits per symbol with arithmetic coding. With Huffman coding, you can code six of the symbols with three bits and four of the symbols with four bits, for an average of 3.4 bits per symbol.
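A quick check of those numbers (my own sketch):

    import math

    # Ten equiprobable symbols, e.g. decimal digits.
    arithmetic_limit = math.log2(10)        # about 3.322 bits per symbol
    # A valid Huffman code for them: six 3-bit codewords and four 4-bit
    # codewords (Kraft sum: 6/8 + 4/16 = 1).
    huffman_average = (6 * 3 + 4 * 4) / 10  # 3.4 bits per symbol
    print(arithmetic_limit, huffman_average)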
The use of shift-up and shift-down operations can be beneficial since in English text you expect to have strings of lower case characters with occasional upper case characters. Now you are getting into both higher order models and unequal frequency distributions.

What length password equals 256 bits of entropy

I'm using encryption on my entire HDD (AES-256) and I'm wondering what length of password I would need so that the password is also worth 256 bits. As we all know, the password is usually the weak link in encryption, so I think this is a good thing to know. The password will be made up of letters (capital and small), numbers and punctuation, and will be random. Thanks.
If the password is truly random (aka non-memorizable), then with the characters described, you are getting about 6 bits of randomness per 8-bit byte of password. Therefore, you need about 256 / 6 ≈ 43 characters in the password to contain about 256 bits of randomness. If the password is memorable, you need many more characters to attain the 256 bits of randomness. Running English text has less than 4 bits of randomness per byte.
You might do better to take a long pass-phrase and generate a 256-bit hash of that (SHA-256, perhaps). Your pass-phrase might be a miniature essay - maybe 80-128 characters long; more would not hurt.
If you're using only letters and numbers, then you've got a total of 26 × 2 + 10 = 62 possible values per character. That's close to 64, so you have just under 6 bits of entropy per character.
If you want 256 bits, then you need about 43 characters from your character set.
Further reading: http://en.wikipedia.org/wiki/Password_strength
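A short Python sketch of both calculations, plus generating such a password with the secrets module (the 94-character printable set is my assumption for "letters, numbers and punctuation"):

    import math
    import secrets
    import string

    charset = string.ascii_letters + string.digits + string.punctuation  # 94
    bits_per_char = math.log2(len(charset))  # about 6.55 bits per character
    length = math.ceil(256 / bits_per_char)  # 40 chars at the exact rate;
                                             # the rounder 6 bits/char gives 43
    password = "".join(secrets.choice(charset) for _ in range(length))
    print(f"{bits_per_char:.2f} bits/char, {length} chars: {password}")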
