DEFLATE: dynamic block structure

I am trying to construct an explicit example of a dynamic block. Please let me know if this is wrong.
Considering this example of lit/len alphabet:
A(0), B(0), C(1), D(0), E(2), F(2), G(2), H(2)
and the rest of symbols having zero code lengths.
The sequence(SQ) of code lengths would be 0...,0,0,1,0,2,2,2,2,...0.
Then we have to compress it further with run-length encoding. So we have to calculate number of repetitions and either use flag 16 to copy the previous code length, or 17 or 18 to repeat code length 0 (using extra bits).
My problem is this. After sending the header information and the sequence of code-length code lengths in the right order 16,17,18,..., the next sequence of information would be something like:
18, some extra bits value,1,0,2,16, some extra bits value 0,18, some extra bits value. (Probably there would be another 18 flag since the maximum repeat count is 138.)
Then we have the same thing with the distance alphabet and finally the inputs data encoded with the canonical Huffman, and extra bits if necessary.
Is it necessary to send the code lengths of 0? If so, why?
If yes, why is it necessary to have hclit and hcdist and not only hclen, knowing that the lengths of the sequences are 286 for lit/len and 30 for distances?
If not what would be the real solution?
Another problem:
In this case we have code length 2 with repetitions (3) extra bits value of 0.
Is this last number also included in the code length tree construction?
If yes I can't understand how: flag 18 has next a maximum possible extra bits value of 127 (1111111) representing 138 repetitions and it couldn't be included into the alphabet symbols of 0-18.
P.S When I say extra bits in this case I mean the factor that it is used to know how many repetitions of the previous length are used.
More precisely 0 - 15 we have 0 bit factor repetition, for 16,17,18 we have 2,3,7 bits repetitions factor. The value of those bits is what I mean with extra bits value.
I think I'm missing something about what Huffman codes are generated by the Huffman code-length alphabet.

First off, your example code is invalid, since it oversubscribes the available bit patterns. 2,2,2,2 would use all of the bit patterns, since there are only four possible two-bit patterns. You can't have one more code, of any length. Possible valid sets of code lengths for five symbols are 1,2,3,4,4 or 2,2,2,3,3.
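The oversubscription point can be checked mechanically: a set of code lengths is usable only if the sum of 2^-length over the nonzero lengths does not exceed 1 (the Kraft inequality). A minimal Python sketch, where the function name kraft_sum is only an illustrative choice:

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Sum 2**-length over the nonzero code lengths (0 means the symbol is unused)."""
    return sum(Fraction(1, 2 ** n) for n in lengths if n > 0)

for lengths in ([1, 2, 2, 2, 2], [1, 2, 3, 4, 4], [2, 2, 2, 3, 3]):
    total = kraft_sum(lengths)
    if total > 1:
        status = "oversubscribed"
    elif total == 1:
        status = "complete"
    else:
        status = "incomplete"
    print(lengths, total, status)
```

The lengths from the question (1,2,2,2,2) sum to 3/2, which is why no prefix code with those lengths exists, while both suggested alternatives sum to exactly 1.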
To answer your questions in order:
You need to send the leading zeros, but you do not need to send the trailing zeros. The HLIT and HDIST counts determine how many lengths of each code type are in the header, where any after those are taken to be zero. You need to send the zeros, since the code lengths are associated with the corresponding symbol by their position in the list.
It saves space in the header to have the HLIT and HDIST counts, so that you don't need to provide lengths for all 316 codes in every header.
I don't understand this question, but I guess it doesn't apply.
If I understand your question, extra bits have nothing to do with the descriptions of the Huffman codes in the headers. The extra bits are implied by the symbol. In any case, a repeated length is encoded with code 16, not code 18. So the four twos would be encoded as 2, 16(0), where the (0) represents two extra bits that are zeros.
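To make the role of symbols 16, 17 and 18 concrete, here is a small Python sketch that expands an already Huffman-decoded list of code-length symbols back into code lengths. It assumes each symbol comes paired with the value of its extra bits (0 for symbols 0-15); the function name and the pair representation are illustrative, not taken from any particular library.

```python
def expand_code_lengths(symbols):
    """Expand (symbol, extra_bits_value) pairs into a flat list of code lengths."""
    lengths = []
    for symbol, extra in symbols:
        if 0 <= symbol <= 15:
            lengths.append(symbol)                        # a literal code length
        elif symbol == 16:
            if not lengths:
                raise ValueError("symbol 16 has no previous length to copy")
            lengths.extend([lengths[-1]] * (3 + extra))   # 2 extra bits: copy 3-6 times
        elif symbol == 17:
            lengths.extend([0] * (3 + extra))             # 3 extra bits: 3-10 zeros
        elif symbol == 18:
            lengths.extend([0] * (11 + extra))            # 7 extra bits: 11-138 zeros
        else:
            raise ValueError("code-length symbols are limited to 0-18")
    return lengths

print(expand_code_lengths([(2, 0), (16, 0)]))   # [2, 2, 2, 2]
print(len(expand_code_lengths([(18, 127)])))    # 138
```

Only the symbols 0-18 get Huffman codes in the code-length code; the repeat counts travel as raw extra bits after the symbol, which is why a value such as 127 never has to be a member of that alphabet.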

Related

Encoding binary strings into arbitrary alphabets

If you have a set of binary strings that are limited to some normally-small size such as 256 or up to 512 bytes like some of the hashing algorithms, then if you want to encode those bits of 1's and 0's into say hex (a 16-character alphabet), then you take the whole string at once into memory and convert it into hex. At least that's what I think it means.
I don't have this question fully formulated, but what I'm wondering is if you can convert an arbitrarily long binary string into some alphabet, without needing to read the whole string into memory. The reason this isn't fully formed question is because I'm not exactly sure if you typically do read the whole string into memory to create the encoded version.
So if you have something like this:
1011101010011011011101010011011011101010011011110011011110110110111101001100101010010100100000010111101110101001101101110101001101101110101001101111001101111011011011110100110010101001010010000001011110111010100110110111010100110110111010100110111100111011101010011011011101010011011011101010100101010010100100000010111101110101001101101110101001101101111010011011110011011110110110111101001100101010010100100000010111101110101001101101101101101101101111010100110110111010100110110111010100110111100110111101101101111010011001010100101001000000101111011101010011011011101010011011011101010011011110011011110110110111101001100 ... 10^50 longer
Something like the whole genetic code or a million billion times that, it would be too large to read into memory and too slow to wait to dynamically create an encoding of it into hex if you have to stream the whole thing through memory before you can figure out the final encoding.
So I'm wondering three things:
If you do have to read something fully in order to encode it into some other alphabet.
If you do, then why that is the case.
If you don't, then how it works.
The reason I'm asking is because looking at a string like 1010101, if I were to encode it as hex there are a few ways:
One character at a time, so it would essentially stay 1010101 unless the alphabet was {a, b} then it would be abababa. This is the best case because you don't have to read anything more than 1 character into memory to figure out the encoding. But it limits you to a 2-character alphabet. (Anything more than 2 character alphabets and I start getting confused)
By turning it into an integer, then converting that into a hex value. But this would require reading the whole value to compute the final (big)integer size. So that's where I get confused.
I feel like the third way (3) would be to read partial chunks of the input bits somehow, like 1010 then 010, but that would not work if the encoding was integers because 1010 010 = A 2 in hex, but 2 = 10 not 2 = 010. So it's like you would need to break it by having a 1 at the beginning of each chunk. But then what if you wanted to have each chunk no longer than 10 hex characters, but you have a long string of 1000 0's, then you need some other trick perhaps like having the encoded hex value tell you how many preceding zeroes you have, etc. So it seems like it gets complicated, wondering if there are already some systems established that have figured out how to do this. Hence the above questions.
For an example, say I wanted to encode the above binary string into an 8-bit alphabet, so like ASCII. Then I might have aBc?D4*&((!.... But then to deserialize this into the bits is one part, and to serialize the bits into this is another (these characters aren't the actual characters mapped to the above bit example).
But then what if you wanted to have each chunk no longer than 10 hex characters, but you have a long string of 1000 0's, then you need some other trick perhaps like having the encoded hex value tell you how many preceding zeroes you have, etc. So it seems like it gets complicated, wondering if there are already some systems established that have figured out how to do this
Yes you're way over-complicating it. To start simple, consider bit strings whose length is by definition a multiple of 4. They can be represented in hexadecimal by just grouping the bits up by 4 and remapping that to hexadecimal digits:
raw: 11011110101011011011111011101111
group: 1101 1110 1010 1101 1011 1110 1110 1111
remap: D E A D B E E F
So 11011110101011011011111011101111 -> DEADBEEF. That all the nibbles had their top bit set was a coincidence resulting from choosing an example that way. By definition the input is divided up into groups of four, and every hexadecimal digit is later decoded to a group of four bits, including leading zeroes if applicable. This is all that you need for typical hash codes which have a multiple of 4 bits.
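As a sketch of how this works incrementally, the following Python generator consumes the bit string in arbitrary chunks and emits hexadecimal digits as soon as four bits are available, so at most three leftover bits are held between chunks. The name bits_to_hex and the '0'/'1' string representation are assumptions made for illustration.

```python
HEX_DIGITS = "0123456789ABCDEF"

def bits_to_hex(bit_chunks):
    """Yield one hex digit per 4 input bits; input is an iterable of '0'/'1' strings."""
    pending = ""
    for chunk in bit_chunks:
        pending += chunk
        usable = len(pending) - len(pending) % 4
        for i in range(0, usable, 4):
            yield HEX_DIGITS[int(pending[i:i + 4], 2)]
        pending = pending[usable:]          # keep at most 3 leftover bits
    if pending:
        raise ValueError("bit length is not a multiple of 4")

# The chunk boundaries do not have to line up with the groups of four.
chunks = ["1101111", "010101101", "1011111011101111"]
print("".join(bits_to_hex(chunks)))         # DEADBEEF
```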
The problems start when we want to encode bit strings that are of variable length and not necessarily a multiple of 4 long. Then there will have to be some padding somewhere, and the decoder needs to know how much padding there was (and where, but the location is a convention that you choose). This is why your example seemed so ambiguous: it is. Extra information needs to be added to tell the decoder how many bits to discard.
For example, leaving aside the mechanism that transmits the number of padding bits, we could encode 1010101 as A5 or AA or 55 (and more!) depending on the location we choose for the padding. Whichever convention we choose, the decoder needs to know that there is 1 bit of padding. To put that back in terms of bits, 1010101 could be encoded as any of these:
x101 0101
101x 0101
1010 x101
1010 101x
Where x marks the bit which is inserted in the encoder and discarded in the decoder. The value of that bit doesn't actually matter because it is discarded, so D5 is also a fine encoding and so on.
All of the choices of where to put the padding still enable the bit string to be encoded incrementally, without storing the whole bit string in memory, though putting the padding in the first hexadecimal digit requires knowing the length of the bit string up front.
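Here is a minimal Python sketch of one such convention: padding goes at the end, and the number of padding bits is returned alongside the hex digits. How that count reaches the decoder is left open, as above; the function names are hypothetical.

```python
HEX_DIGITS = "0123456789ABCDEF"

def encode_padded(bits):
    """Pad a '0'/'1' string with zeros at the end and return (hex_digits, pad_count)."""
    pad = (-len(bits)) % 4
    padded = bits + "0" * pad
    digits = "".join(HEX_DIGITS[int(padded[i:i + 4], 2)]
                     for i in range(0, len(padded), 4))
    return digits, pad

def decode_padded(digits, pad):
    """Turn hex digits back into bits and drop the trailing padding bits."""
    bits = "".join(format(int(d, 16), "04b") for d in digits)
    return bits[:len(bits) - pad]

print(encode_padded("1010101"))     # ('AA', 1)
print(decode_padded("AA", 1))       # 1010101
```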
If you are asking this in the context of Huffman coding, you wouldn't want to calculate the length of the bit string in advance, so the padding has to go at the end. Often an extra symbol is added to the alphabet that signals the end of the stream, which usually makes it unnecessary to explicitly store how many padding bits there are (there might be any number of them, but as they appear after the STOP symbol, the decoder automatically disregards them).

Base64 encoding and equal sign at the end, instead of A (base64 value of number 0)

According to wikipedia:
When the number of bytes to encode is not divisible by three (that is,
if there are only one or two bytes of input for the last 24-bit
block), then the following action is performed:
Add extra bytes with value zero so there are three bytes, and perform
the conversion to base64.
However, if we got an extra \0 character at the end, the last 6 bits of the input have a value of 0. And the number 0 must be base64-codified as A. The character = doesn't even belong to the base64 encoding table.
I know that those extra null characters don't belong to the original binary string, so we use a different character (=) to avoid confusion, but anyway, the Wikipedia article and a thousand other sites don't say that. They say that the newly constructed string must be base64-encoded (a sentence which strictly implies the use of the transformation table).
Are all of these sites wrong?
Any sequence of four characters chosen from the main base64 set will represent precisely three octets' worth of data. Consequently, if the total length of the file to be encoded is not a multiple of three, it will be necessary to either:
Allow the encoded file to have a length which is not a multiple of 4.
Allow the encoded file to have characters outside the main set of 64.
If the former approach were used, then concatenating of files whose length
was not a multiple of three would be likely to yield a file that might
appear valid but would contain bogus information. For example, a file
with length 32 would expand to ten groups of four base64 characters plus
three more for the final pair of octets (total 43). Concatenating another
file with length 32 would yield a total of 86 characters which might look
valid, but information from the second half would not decode correctly.
Using the latter approach, concatenation of files whose length was not a
multiple of three would yield a result that could be unambiguously parsed
or, at worst, recognized as invalid (the base64 Standard does not regard
as valid a file that contains "=" anywhere but at the end, but one could
write a decoder that could process such files unambiguously). In any case,
having such a file be regarded as invalid would be better than having a file
which appeared valid but which produces incorrect data when decoded.
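The difference between padding with "=" and actually encoding the added zero bytes can be seen with Python's standard base64 module. This is only an illustration of the ambiguity discussed above:

```python
import base64

one_byte = b"a"

# One input byte: two real base64 characters, then two "=" markers.
print(base64.b64encode(one_byte))                 # b'YQ=='

# The same byte followed by two genuine zero bytes encodes with "A" characters.
print(base64.b64encode(one_byte + b"\x00\x00"))   # b'YQAA'

# Decoding shows why the two must not be confused.
print(base64.b64decode("YQ=="))                   # b'a'
print(base64.b64decode("YQAA"))                   # b'a\x00\x00'
```

If the final zero groups were written as A, the decoder could not tell a one-byte input from a three-byte input that happens to end in two zero bytes.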

How to uniquely identify a set of strings using an integer

Here my problem statement:
I have a set of strings that match a regular expression. let's say it matches [A-Z][0-9]{3} (i.e. 1 letter and 3 digits).
I can have any number of strings between 1 and 30. For example I could have:
{A123}
{A123, B456}
{Z789, D752, E147, ..., Q665}
...
I need to generate an integer (actually I can use 256 bits) that would be unique for any set of strings regardless of the number of elements (although the number of elements could be used to generate the integer)
What sort of algorithm could I use?
My first idea would be to convert my strings to numbers and then do operations (I thought of hash functions) on them, but I am not sure what formula would give me good results.
Any suggestion?
You have 2^333 possible input sets ((26 * 10^3) choose 30).
This means you would need a 333 bit wide integer to represent all possibilities. You only have a maximum of 256 bits, so there will be collisions.
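The bit count can be checked directly in Python (math.comb needs Python 3.8 or later). This sketch counts only the sets of exactly 30 elements, as the estimate above does:

```python
import math

# Number of ways to choose 30 distinct strings out of 26 * 10^3 candidates.
subsets_of_30 = math.comb(26 * 10**3, 30)

# Bits needed to give every such subset its own integer.
bits_needed = subsets_of_30.bit_length()
print(bits_needed)
print(bits_needed > 256)    # True: 256 bits cannot cover every possible set
```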
This is a typical application for a hash function. There are hashes for various purposes, so it's important to select the right type:
A simple hash function for use in bucket based data structures (dictionaries) must be fast. Collisions are not only tolerated but wanted. The hash's size (in bits) usually is small. Due to collisions this type of hash is not suited for your purpose.
A checksum tries to avoid collisions and is reasonably fast. If it's large enough this might be enough for your case.
Cryptographic hashes have the characteristic that it's not possible (or very hard) to find a collision (even when both input and hash are known). Also they are not invertible (from the hash it's not possible to find the input). These are usually computationally expensive and overkill for your use case.
Hashes to uniquely identify arbitrary inputs, like CityHash and SpookyHash are designed for fast hashing and collision free identification.
SpookyHash seems like a good candidate for your use case. It's 128 bits wide, which means that you need 2^64 differing inputs to get a 50% chance of a single collision.
It's also fast: three bytes per cycle is orders of magnitude faster than md5 or sha1. SpookyHash is available in the public domain (see link above).
To apply any hash on your use case you could convert the items in your list to numbers, but it seems easier to just feed them as strings. You have to settle for an encoding in this case (ASCII would do).
I'm usually using UTF8 or so, when I18N is an issue. Then it's sometimes important to care for canonicalization. But this does not apply to your simple use case.
A hash is not going to work, since it could produce collisions. Every significant input bit must be mapped to an output bit.
For the letter, you have 90 - 65 + 1 = 26 different values, so you can use 5 bits to represent the letter.
The 3-digit number has 1000 different values, so you need 10 bits for this.
If you combine these bits, you have a unique mapping from the input to a 15-bit number.
This approach is simple, but it could waste some bits. If the output must be as short as possible, you could map as follows:
output = (L - 'A')*1000 + N
where L is the letter value, 'A' is the value of the letter A, N is the 3-digit number. Then you can use as few bits as are necessary to represent the complete range of output, whose maximum is 26*1000 - 1 = 25999. Here it is 15 bits again, so the simple approach does not waste space.
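A minimal Python sketch of this mapping and its inverse, assuming the strings always match [A-Z][0-9]{3}; the function names are illustrative:

```python
def encode_element(text):
    """Map a string like 'A123' to (L - 'A') * 1000 + N, a number below 26000."""
    if len(text) != 4 or not ("A" <= text[0] <= "Z") or not text[1:].isdigit():
        raise ValueError("expected one uppercase letter followed by three digits")
    return (ord(text[0]) - ord("A")) * 1000 + int(text[1:])

def decode_element(code):
    """Invert encode_element."""
    letter_index, number = divmod(code, 1000)
    return chr(ord("A") + letter_index) + format(number, "03d")

print(encode_element("A123"))                 # 123
print(encode_element("Z999"))                 # 25999
print(encode_element("Z999").bit_length())    # 15
print(decode_element(1456))                   # B456
```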
If there are fewer output bits than input bits, a hash function is needed. I would strongly recommend to map the strings to binary data like above, and use a simple function to map the input to the output, for this reason:
A general-purpose hash function can not differentiate the input bits, because it knows nothing about their meaning.
For 256 output bits, after hashing 5.7e38 values, the chance of a collision is 75%. Source: Birthday Attack.
5.7e38 seems huge, but it corresponds to only 129 bits (2^129 = 6.8e38). In this case it means that there is a chance of over 75% that there is a pair of strings with 9 (129/15 = 8.6) elements that collide.
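The 75% figure can be reproduced with the usual birthday approximation p ≈ 1 - exp(-n^2 / (2 * 2^bits)). A small Python sketch, with a function name chosen only for illustration:

```python
import math

def collision_probability(num_values, hash_bits):
    """Approximate chance of at least one collision among num_values random hashes."""
    space = 2.0 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-(num_values ** 2) / (2.0 * space))

print(round(collision_probability(5.7e38, 256), 2))   # 0.75
print(collision_probability(1e9, 256))                # vanishingly small
```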
On the other hand, if you use a very simple mapping function like:
truncate the input to 256 bits (use the first 17 elements of 15 bits each)
make a 256 bit xor value of all the 15-bit elements
you can guarantee there is no collision between any two strings with at most 17 elements.
The hash functions which are optimized for generating unique IDs likely perform better than a general-purpose hash as compared here, but I would doubt that they can guarantee collision-free hashing of all 256-bit values.
Conclusion: If most of the input strings have less than 17 elements, I would prefer this to a hash.
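One way to realise the collision-free case for up to 17 elements is sketched below in Python. It sorts the 15-bit codes so the result does not depend on element order, and it prepends a single marker bit so that sets of different sizes cannot produce the same integer. The marker bit is my own addition, not part of the description above; 1 + 17*15 = 256 bits in the largest case.

```python
def encode_element(text):
    """Map a string like 'A123' to a 15-bit number: (L - 'A') * 1000 + N."""
    return (ord(text[0]) - ord("A")) * 1000 + int(text[1:])

def pack_set(strings):
    """Pack up to 17 distinct strings into one integer of at most 256 bits."""
    codes = sorted({encode_element(s) for s in strings})
    if len(codes) > 17:
        raise ValueError("more than 17 elements do not fit into 256 bits")
    packed = 1                        # marker bit (an assumption of this sketch)
    for code in codes:
        packed = (packed << 15) | code
    return packed

print(pack_set(["A123", "B456"]) == pack_set(["B456", "A123"]))   # True
print(pack_set(["A000", "A001"]) != pack_set(["A001"]))           # True
```

For sets with more than 17 elements this sketch simply refuses; at that point a hash, with its collision risk, is the only option within 256 bits.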

Security: longer keys versus more available characters

I apologize if this has been answered before, but I was not able to find anything. This question was inspired by a comment on another security-related question here on SO:
How to generate a random, long salt for use in hashing?
The specific comment is as follows (sixth comment of accepted answer):
...Second, and more importantly, this will only return hexadecimal
characters - i.e. 0-9 and A-F. It will never return a letter higher
than an F. You're reducing your output to just 16 possible characters
when there could be - and almost certainly are - many other valid
characters.
– AgentConundrum Oct 14 '12 at 17:19
This got me thinking. Say I had some arbitrary series of bytes, with each byte being randomly distributed over 2^(8). Let this key be A. Now suppose I transformed A into its hexadecimal string representation, key B (ex. 0xde 0xad 0xbe 0xef => "d e a d b e e f").
Some things are readily apparent:
len(B) = 2 len(A)
The symbols in B are limited to 2^(4) discrete values while the symbols in A range over 2^(8)
A and B represent the same 'quantities', just using different encoding.
My suspicion is that, in this example, the two keys will end up being equally as secure (otherwise every password cracking tool would just convert one representation to another for quicker attacks). External to this contrived example, however, I suspect there is an important security moral to take away from this; especially when selecting a source of randomness.
So, in short, which is more desirable from a security stand point: longer keys or keys whose values cover more discrete symbols?
I am really interested in the theory behind this, so an extra bonus gold star (or at least my undying admiration) to anyone who can also provide the math / proof behind their conclusion.
If the number of different symbols usable in your password is x, and the length is y, then the number of different possible passwords (and therefore the strength against brute-force attacks) is x ** y. So you want to maximize x ** y. Both adding to x and adding to y will do that. Which one gives the greater total depends on the actual numbers involved and what your practical limits are.
But generally, increasing x gives only polynomial growth while adding to y gives exponential growth. So in the long run, length wins.
Let's start with a binary string of length 8. The possible combinations are all values between 00000000 and 11111111. This gives us a keyspace of 2^8, or 256 possible keys. Now let's look at option A:
A: Adding one additional bit.
We now have a 9-bit string, so the possible values are between 000000000 and 111111111, which gives us a keyspace size of 2^9, or 512 keys. We also have option B, however.
B: Adding an additional value to the keyspace (NOT the keyspace size!):
Now let's pretend we have a trinary system, where the accepted numbers are 0, 1, and 2. Still assuming a string of length 8, we have 3^8, or 6561 keys...clearly much higher.
However! Trinary does not exist!
Let's look at your example. Please be aware I will be clarifying some of it, which you may have been confused about. Begin with a 4-BYTE (or 32-bit) bitstring:
11011110 10101101 10111110 11101111 (this is, btw, the bitstring equivalent to 0xDEADBEEF)
Since our possible values for each digit are 0 or 1, the base of our exponent is 2. Since there are 32 bits, we have 2^32 as the strength of this key. Now let's look at your second key, DEADBEEF. Each "digit" can be a value from 0-9, or A-F. This gives us 16 values. We have 8 "digits", so our exponent is 16^8...which also equals 2^32! So those keys are equal in strength (also, because they are the same thing).
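The equality is easy to confirm in Python. This sketch only re-encodes the example key and compares the two keyspace sizes:

```python
key_a = bytes([0xDE, 0xAD, 0xBE, 0xEF])   # 4 symbols with 256 possible values each
key_b = key_a.hex()                       # 8 symbols with 16 possible values each

print(key_b)                                              # deadbeef
print(256 ** len(key_a) == 16 ** len(key_b) == 2 ** 32)   # True
print(bytes.fromhex(key_b) == key_a)                      # True: same key, other notation
```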
But we're talking about REAL passwords, not just those silly little binary things. Consider an alphabetical password with only lowercase letters of length 8: we have 26 possible characters, and 8 of them, so the strength is 26^8, or 208.8 billion (takes about a minute to brute force). Adding one character to the length yields 26^9, or 5.4 trillion combinations: 20 minutes or so.
Let's go back to our 8-char string, but add a character: the space character. now we have 27^8, which is 282 billion....FAR LESS than adding an additional character!
The proper solution, of course, is to do both: for instance, 27^9 is 7.6 trillion combinations, or about half an hour of cracking. An 8-character password using upper case, lower case, numbers, special symbols, and the space character would take around 20 days to crack....still not nearly strong enough. Add another character, and it's 5 years.
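The keyspace figures quoted above can be recomputed with a few lines of Python. The cracking times are not reproduced here because they depend on an assumed guessing rate:

```python
import math

def keyspace(symbols, length):
    """Number of distinct keys: symbols ** length."""
    return symbols ** length

def entropy_bits(symbols, length):
    """The same quantity expressed in bits: length * log2(symbols)."""
    return length * math.log2(symbols)

for symbols, length in [(26, 8), (26, 9), (27, 8), (27, 9)]:
    print(symbols, length, keyspace(symbols, length),
          round(entropy_bits(symbols, length), 1))

# One extra character beats one extra symbol in the alphabet.
print(keyspace(26, 9) > keyspace(27, 8))    # True
```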
As a reference, I usually make my passwords upwards of 16 characters, and they have at least one Cap, one space, one number, and one special character. Such a password at 16 characters would take several (hundred) trillion years to brute force.

Encoding name strings into an unique number

I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a lastname. I need to encode these names into a number that uniquely represents the names. The encoding should be one-one, that is a name should be associated with only one number, and a number should be associated with only one name.
What is a smart way of encoding this? I know it is easy to tag each letter of the name according to its position in the alphabet (a -> 1, b -> 2, and so on), so a name like Deepa would become 455161, but then I cannot tell whether the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the encoding should be such that the number of digits in the output numeral for any name should have fixed number of digits, i.e., it should be independent of the length. Is this possible?
Thanks
Abhishek S
To get the same width numbers, can't you just zero-pad on the left?
Some options:
1. Sort them. Count them. The 10th name is number 10.
2. Treat each character as a digit in a base 26 (case insensitive, no digits) or 52 (case significant, no digits) or 36 (case insensitive with digits) or 62 (case sensitive with digits) number. Compute the value in an int. E.g., for a name of "abc", you'd have 0 * 26^2 + 1 * 26^1 + 2 * 26^0. Sometimes Chinese names may use digits to indicate tonality.
3. Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
4. This one's mostly suggested in fun: use Goedel numbering :). So "abc" would be 2^0 * 3^1 * 5^2 - it's a product of powers of primes. Factoring the number gives you back the characters. The numbers could get quite large though.
5. Convert to ASCII, if you aren't already using it. Then treat each ordinal of a character as a digit in a base-256 numbering system. So "abc" is 97*256^2 + 98*256^1 + 99*256^0.
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work. #1 and #3 would have problems. #5 is probably the most future-proofed, though you may find you need unicode at some point.
I believe you could do unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.
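The base-256 option can be sketched in Python with int.from_bytes and int.to_bytes. The function names are illustrative, and the sketch assumes ASCII names that do not begin with a NUL byte (a leading zero byte would be lost in the round trip):

```python
def name_to_int(name):
    """Interpret the ASCII bytes of the name as one big-endian base-256 number."""
    return int.from_bytes(name.encode("ascii"), "big")

def int_to_name(number):
    """Invert name_to_int."""
    size = (number.bit_length() + 7) // 8
    return number.to_bytes(size, "big").decode("ascii")

print(name_to_int("abc"))                  # 6382179 = 97*256^2 + 98*256 + 99
print(int_to_name(name_to_int("Deepa")))   # Deepa
```

The mapping is one-to-one, but the number grows with the length of the name, so a fixed number of digits is only possible if the name length is bounded.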
What you are trying to do there is actually hashing (at least if you have a fixed number of digits). There are some good hashing algorithms with few collisions. Try out sha1 for example, that one is well tested and available for modern languages (see http://en.wikipedia.org/wiki/Sha1) -- it seems to be good enough for git, so it might work for you.
There is of course a small possibility for identical hash values for two different names, but that's always the case with hashing and can be taken care of. With sha1 and such you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really want unique ids for sure, you will need to do something like NealB suggested, create IDs yourself and connect names and IDs in a Database (you could create them randomly and check for collisions or increment them, starting at 0000000000001 or so).
(improved answer after giving it some thought and reading the first comments)
You can use the BigInteger for encoding arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes());
And for getting the string back use:
String str = new String(bi.toByteArray());
I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
    score = 0
    depth = 1
    for char in value:
        score += (ord(char)) * depth
        depth /= 256.
    return score
If you are unfamiliar with Python, here's what it does.
The score is initially 0 and the depth is set to 1
For every character add the ord value * the depth
The ord function returns the Unicode code point of each character (0-255 for ASCII and Latin-1 characters)
Then it's multiplied by the 'depth'.
Finally the depth is divided by 256.
Essentially, the way that it works is that the initial characters add more to the score while later characters contribute less and less. If you need an integer, multiply the end score by 2**64. Otherwise you will have a decimal value between 0-256. This encoding scheme works for binary data as well as there are only 256 possible values in a byte/char.
This method works great for smaller string values, however, for longer strings you will notice that the decimal value requires more precision than a regular double (64-bit) can provide. In Java, you can use the 'BigDecimal' and in Python use the 'decimal' module for added precision. A bonus to using this method is that the values returned are in sorted order so they can be searched 'efficiently'.
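The precision limit can be demonstrated in Python. The sketch below uses fractions.Fraction for exact arithmetic instead of the decimal module mentioned above (a substitution made here because Fraction needs no precision setting); the function names are illustrative.

```python
from fractions import Fraction

def float_score(value):
    """The scheme above with ordinary 64-bit floats."""
    score, depth = 0.0, 1.0
    for char in value:
        score += ord(char) * depth
        depth /= 256.0
    return score

def exact_score(value):
    """The same scheme with exact rational arithmetic."""
    score, depth = Fraction(0), Fraction(1)
    for char in value:
        score += ord(char) * depth
        depth /= 256
    return score

a, b = "aaaaaaaaaab", "aaaaaaaaaac"
print(float_score(a) == float_score(b))    # True: the 11th character is lost
print(exact_score(a) < exact_score(b))     # True: exact values stay distinct and ordered
```

With floats, two names that share a long enough prefix receive the same score, so the mapping is no longer one-to-one.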
Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.
You can translate it, if every character (plus blank, at least) will occupy a position.
Therefore ABC, which is 1,2,3 has to be translated to
1*(2*26+1)² + 2*(53) + 3
This way, you could encode arbitrary strings, but if the length of the input isn't limited (and how should it?), you aren't guaranteed to have an upper limit for the length.
