Encoding binary strings into arbitrary alphabets

If you have a set of binary strings limited to some normally small size, say 256 or 512 bits like the output of common hashing algorithms, and you want to encode those 1's and 0's into, say, hex (a 16-character alphabet), then you take the whole string into memory at once and convert it to hex. At least that's what I think happens.
I don't have this question fully formulated, but what I'm wondering is whether you can convert an arbitrarily long binary string into some alphabet without needing to read the whole string into memory. The reason this isn't a fully formed question is that I'm not exactly sure whether you typically do read the whole string into memory to create the encoded version.
So if you have something like this:
1011101010011011011101010011011011101010011011110011011110110110111101001100101010010100100000010111101110101001101101110101001101101110101001101111001101111011011011110100110010101001010010000001011110111010100110110111010100110110111010100110111100111011101010011011011101010011011011101010100101010010100100000010111101110101001101101110101001101101111010011011110011011110110110111101001100101010010100100000010111101110101001101101101101101101101111010100110110111010100110110111010100110111100110111101101101111010011001010100101001000000101111011101010011011011101010011011011101010011011110011011110110110111101001100 ... 10^50 longer
Something the size of the whole genetic code, or a million billion times that, would be too large to read into memory, and too slow to encode into hex if you had to stream the whole thing through memory before you could determine the final encoding.
So I'm wondering three things:
If you do have to read something fully in order to encode it into some other alphabet.
If you do, then why that is the case.
If you don't, then how it works.
The reason I'm asking is because looking at a string like 1010101, if I were to encode it as hex there are a few ways:
One character at a time, so it would essentially stay 1010101; or, if the alphabet were {a, b}, it would become abababa. This is the best case, because you never need to read more than one character into memory to figure out the encoding, but it limits you to a 2-character alphabet. (Anything larger than a 2-character alphabet and I start getting confused.)
By turning the whole thing into an integer, then converting that into a hex value. But this requires reading the entire string before you know the final (big) integer. That's where I get confused.
I feel like a third way would be to read partial chunks of the input bits, like 1010 then 010, but that doesn't work if the encoding goes through integers: 1010 010 becomes A 2 in hex, and 2 decodes back to 10, not 010. So it seems you would need every chunk to start with a 1. But then what if you wanted each chunk to be no longer than 10 hex characters and you hit a run of 1000 zeroes? You would need some other trick, perhaps having the encoded hex value tell you how many preceding zeroes there are, and so on. It gets complicated quickly, so I'm wondering whether there are established systems that have already figured this out. Hence the questions above.
For example, say I wanted to encode the above binary string into an 8-bit alphabet, something like ASCII. Then I might get aBc?D4*&((!... Serializing the bits into those characters is one part, and deserializing them back into bits is another (these aren't the actual characters that the bit example above maps to).

But then what if you wanted each chunk to be no longer than 10 hex characters and you hit a run of 1000 zeroes? You would need some other trick, perhaps having the encoded hex value tell you how many preceding zeroes there are, and so on. It gets complicated quickly, so I'm wondering whether there are established systems that have already figured this out.
Yes, you're way over-complicating it. To start simple, consider bit strings whose length is by definition a multiple of 4. They can be represented in hexadecimal by simply grouping the bits into fours and remapping each group to a hexadecimal digit:
raw: 11011110101011011011111011101111
group: 1101 1110 1010 1101 1011 1110 1110 1111
remap: D E A D B E E F
So 11011110101011011011111011101111 -> DEADBEEF. That all the nibbles happened to have their top bit set is just a coincidence of the chosen example. By definition the input is divided into groups of four, and every hexadecimal digit later decodes back to a group of four bits, including leading zeroes where applicable. This is all you need for typical hash codes, whose lengths are a multiple of 4 bits.
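As a sketch of how that grouping works as a stream (my own illustration, not part of the original answer), you can read the bits four at a time and emit one hex digit per group, so only the current group is ever held in memory:

// Minimal sketch: stream bits and emit one hex digit per group of 4,
// assuming the total length is a multiple of 4 (padding is discussed below).
fn encode_hex_streaming(bits: impl Iterator<Item = bool>) -> String {
    let mut out = String::new();
    let mut nibble = 0u8;
    let mut count = 0;
    for bit in bits {
        nibble = (nibble << 1) | (bit as u8);
        count += 1;
        if count == 4 {
            out.push(char::from_digit(nibble as u32, 16).unwrap().to_ascii_uppercase());
            nibble = 0;
            count = 0;
        }
    }
    out
}

// encode_hex_streaming("11011110101011011011111011101111".chars().map(|c| c == '1'))
// produces "DEADBEEF", no matter how long the input stream is.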
The problems start when we want to encode bit strings of variable length that are not necessarily a multiple of 4 bits long. Then there has to be some padding somewhere, and the decoder needs to know how much padding there was (and where, but the location is just a convention you choose). This is why your example seemed so ambiguous: it is. Extra information needs to be added to tell the decoder how many bits to discard.
For example, leaving aside the mechanism that transmits the number of padding bits, we could encode 1010101 as 55 or A5 or AA (and more!) depending on where we choose to put the padding. Whichever convention we choose, the decoder needs to know that there is 1 bit of padding. To put that back in terms of bits, 1010101 could be encoded as any of these:
x101 0101
101x 0101
1010 x101
1010 101x
Where x marks the bit which is inserted by the encoder and discarded by the decoder. The value of that bit doesn't actually matter because it is discarded, so D5 is also a fine encoding, and so on.
All of the choices of where to put the padding still enable the bit string to be encoded incrementally, without storing the whole bit string in memory, though putting the padding in the first hexadecimal digit requires knowing the length of the bit string up front.
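Going back to the explicit-padding approach, the decoder side could look like this hedged sketch (my own illustration, assuming the padding sits at the end and its length is transmitted separately):

// Sketch: turn hex back into bits, then drop the reported number of
// padding bits from the end (the convention assumed here).
fn decode_hex_padded(hex: &str, padding: usize) -> Vec<bool> {
    let mut bits = Vec::new();
    for c in hex.chars() {
        let nibble = c.to_digit(16).expect("not a hex digit");
        for i in (0..4).rev() {
            bits.push((nibble >> i) & 1 == 1);
        }
    }
    bits.truncate(bits.len() - padding); // assumes padding <= bits.len()
    bits
}

// decode_hex_padded("AA", 1) yields the seven bits 1,0,1,0,1,0,1 again,
// matching the "1010 101x" choice above.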
If you are asking this in the context of Huffman coding, you wouldn't want to calculate the length of the bit string in advance, so the padding has to go at the end. Often an extra symbol is added to the alphabet that signals the end of the stream, which usually makes it unnecessary to explicitly store how many padding bits there are (there might be any number of them, but since they appear after the STOP symbol, the decoder automatically disregards them).

Related

DEFLATE: dynamic block structure

I am trying to construct an explicit example of a dynamic block. Please let me know if this is wrong.
Considering this example of lit/len alphabet:
A(0), B(0), C(1), D(0), E(2), F(2), G(2), H(2)
and the rest of symbols having zero code lengths.
The sequence (SQ) of code lengths would be 0, ..., 0, 0, 1, 0, 2, 2, 2, 2, 0, ..., 0.
Then we have to compress it further with run-length encoding, so we have to calculate the number of repetitions and either use flag 16 to copy the previous code length, or 17 or 18 to repeat code length 0 (using extra bits).
My problem is this. After sending the header information and the sequence of code-length code lengths in the right order 16,17,18,..., the next sequence of information would be something like:
18, some extra-bits value, 1, 0, 2, 16, extra-bits value 0, 18, some extra-bits value. (Probably there would be another 18 flag, since the maximum repeat count is 138.)
Then we have the same thing for the distance alphabet, and finally the input data encoded with the canonical Huffman codes, plus extra bits where necessary.
Is it necessary to send the code lengths of 0? If so, why?
If yes, why is it necessary to have HLIT and HDIST and not only HCLEN, knowing that the lengths of the sequences are 286 for lit/len and 30 for distances?
If not what would be the real solution?
Another problem:
In this case we have code length 2 followed by a repeat of 3 (flag 16 with an extra-bits value of 0).
Is this last number also included in the code-length tree construction?
If yes, I can't understand how: flag 18 can be followed by an extra-bits value as large as 127 (1111111), representing 138 repetitions, and that value couldn't be included among the alphabet symbols 0-18.
P.S. When I say extra bits here, I mean the bits used to tell how many repetitions of the previous length are meant.
More precisely, for symbols 0-15 there are no extra bits, and for 16, 17 and 18 there are 2, 3 and 7 extra bits respectively. The value of those bits is what I mean by extra-bits value.
I think I'm missing something about which Huffman codes are generated from the code-length alphabet.
First off, your example code is invalid, since it oversubscribes the available bit patterns. 2,2,2,2 would use all of the bit patterns, since there are only four possible two-bit patterns, so you can't have one more code of any length. Possible valid sets of code lengths for five symbols are 1,2,3,4,4 or 2,2,2,3,3.
To answer your questions in order:
You need to send the leading zeros, but you do not need to send the trailing zeros. The HLIT and HDIST counts determine how many lengths of each code type are in the header, where any after those are taken to be zero. You need to send the zeros, since the code lengths are associated with the corresponding symbol by their position in the list.
It saves space in the header to have the HLIT and HDIST counts, so that you don't need to provide lengths for all 316 codes in every header.
I don't understand this question, but I guess it doesn't apply.
If I understand your question, extra bits have nothing to do with the descriptions of the Huffman codes in the headers. The extra bits are implied by the symbol. In any case, a repeated length is encoded with code 16, not code 18. So the four twos would be encoded as 2, 16(0), where the (0) represents two extra bits that are zeros.
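As a hedged sketch of that run-length step (my own simplified illustration in RFC 1951 terms, not from the answer, and ignoring that the example code itself is oversubscribed): symbols 0-15 are literal lengths, 16 copies the previous length 3-6 times (2 extra bits), 17 repeats zero 3-10 times (3 extra bits), and 18 repeats zero 11-138 times (7 extra bits).

// Simplified sketch: run-length encode a list of code lengths into
// (symbol, extra-bits value) pairs using codes 16, 17 and 18.
fn rle_code_lengths(lengths: &[u8]) -> Vec<(u8, u8)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < lengths.len() {
        let len = lengths[i];
        let mut run = 1;
        while i + run < lengths.len() && lengths[i + run] == len {
            run += 1;
        }
        if len == 0 && run >= 3 {
            let n = run.min(138);
            if n <= 10 {
                out.push((17, (n - 3) as u8)); // 3-10 zeros
            } else {
                out.push((18, (n - 11) as u8)); // 11-138 zeros
            }
            i += n;
        } else if len != 0 && run >= 4 {
            out.push((len, 0)); // send the length itself once...
            let n = (run - 1).min(6); // ...then copy it 3-6 more times with code 16
            out.push((16, (n - 3) as u8));
            i += 1 + n;
        } else {
            out.push((len, 0)); // short runs are sent literally
            i += 1;
        }
    }
    out
}

// For the visible part of the example, lengths 0,0,1,0,2,2,2,2, this yields
// (0,0),(0,0),(1,0),(0,0),(2,0),(16,0): short zero runs stay literal, and the
// four 2s become "2" followed by code 16 repeating it three more times.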

How can I combine nom parsers to get a more bit-oriented interface to the data?

I'm working on decoding AIS messages in Rust using nom.
AIS messages are made up of a bit vector; the various fields in each message are an arbitrary number of bits long, and they don't always align on byte boundaries.
This bit vector is then ASCII encoded, and embedded in an NMEA sentence.
From http://catb.org/gpsd/AIVDM.html:
The data payload is an ASCII-encoded bit vector. Each character represents six bits of data. To recover the six bits, subtract 48 from the ASCII character value; if the result is greater than 40 subtract 8. According to [IEC-PAS], the valid ASCII characters for this encoding begin with "0" (64) and end with "w" (87); however, the intermediate range "X" (88) to "_" (95) is not used.
Example
!AIVDM,1,1,,A,D03Ovk1T1N>5N8ffqMhNfp0,0*68 is the NMEA sentence
D03Ovk1T1N>5N8ffqMhNfp0 is the encoded AIS data
010100000000000011011111111110110011000001100100000001011110001110000101011110001000101110101110111001011101110000011110101110111000000000 is the decoded AIS data as a bit vector
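Not an answer, but a minimal sketch (my own, with hypothetical names) of the manual de-armoring step the spec describes, which is what problem 1 below would like to push into nom:

// Each payload character carries six bits: subtract 48 from its ASCII value,
// and if the result is greater than 40 subtract 8 more. Assumes valid input.
fn ais_char_to_6bits(c: u8) -> u8 {
    let mut v = c - 48;
    if v > 40 {
        v -= 8;
    }
    v & 0x3F
}

fn payload_to_bits(payload: &str) -> Vec<bool> {
    payload
        .bytes()
        .flat_map(|c| {
            let v = ais_char_to_6bits(c);
            (0..6).rev().map(move |i| (v >> i) & 1 == 1)
        })
        .collect()
}

// payload_to_bits("D03Ovk1T1N>5N8ffqMhNfp0") yields the 138-bit vector shown above.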
Problems
I list these together because I think they may be related...
1. Decoding ASCII to bit vector
I can do this manually, by iterating over the characters, subtracting the appropriate values, and building up a byte array with a lot of bit-shifting, and so on. That's fine, but it seems like I should be able to do this inside nom, and chain it with the actual AIS bit parser, eliminating the interim byte array.
2. Reading arbitrary number of bits
It's possible to read, say, 3 bits from a byte array in nom. But, each call to bits! seems to consume a full byte at once (if reading into a u8).
For example:
named!(take_3_bits<u8>, bits!(take_bits!(u8, 3)));
will read 3 bits into a u8. But if I run take_3_bits twice, I'll have consumed 16 bits of my stream.
I can combine reads:
named!(get_field_1_and_2<(u8, u8)>, bits!(pair!(take_bits!(u8, 2), take_bits!(u8, 3))));
Calling get_field_1_and_2 will get me a (u8, u8) tuple, where the first item contains the first 2 bits, and the second item contains the next 3 bits, but nom will then still advance a full byte after that read.
I can use peek to prevent nom's read pointer from advancing, and then manage it manually, but again, that seems like unnecessary extra work.

How does UTF16 encode characters?

EDIT
Since it seems I'm not going to get an answer to the general question, I'll restrict it to one detail: is my understanding of the following correct?
That surrogates work as follows:
If the first pair of bytes is not between D800 and DBFF, there will not be a second pair.
If it is between D800 and DBFF: a) there will be a second pair, and b) the second pair will be in the range DC00 to DFFF.
There is no single-pair UTF16 character with a value between D800 and DBFF.
There is no single-pair UTF16 character with a value between DC00 and DFFF.
Is this right?
Original question
I've tried reading about UTF16 but I can't seem to understand it. What are "planes" and "surrogates" etc.? Is a "plane" the first 5 bits of the first byte? If so, then why not 32 planes since we're using those 5 bits anyway? And what are surrogates? Which bits do they correspond to?
I do understand that UTF16 is a way to encode Unicode characters, and that it sometimes encodes characters using 16 bits, and sometimes 32 bits, no more no less. I assume that there is some list of values for the first 2 bytes (which are the most significant ones?) which indicates that a second 2 bytes will be present.
But instead of me going on about what I don't understand, perhaps someone can make some order in this?
Yes on all four.
To clarify, the term "pair" in UTF-16 refers to two UTF-16 code units, the first in the range D800-DBFF, the second in DC00-DFFF.
A code unit is 16 bits (2 bytes), typically written as an unsigned integer in hexadecimal (0x000A). The order of the bytes (0x00 0x0A or 0x0A 0x00) is specified by the author or indicated with a BOM (0xFEFF) at the beginning of the file or stream. (The BOM is encoded with the same algorithm as the text but is not part of the text. Once the byte order is determined and the bytes are reordered to the native ordering of the system, it is typically discarded.)
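To make the arithmetic concrete, here is a small sketch (my own illustration, not part of the answer) of how a code point above U+FFFF is split into such a pair, which is also why D800-DFFF never appear as standalone code units:

// Encode one Unicode scalar value as UTF-16 code units.
fn to_utf16_units(cp: u32) -> Vec<u16> {
    if cp < 0x10000 {
        // BMP code point: a single code unit. Values D800-DFFF never occur
        // here because that range is reserved for surrogates.
        vec![cp as u16]
    } else {
        let v = cp - 0x10000;           // 20 bits remain
        let high = 0xD800 + (v >> 10);  // top 10 bits    -> D800-DBFF (high surrogate)
        let low = 0xDC00 + (v & 0x3FF); // bottom 10 bits -> DC00-DFFF (low surrogate)
        vec![high as u16, low as u16]
    }
}

// to_utf16_units(0x1F600) == vec![0xD83D, 0xDE00]; to_utf16_units(0x0041) == vec![0x0041].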

Compress bytes into a readable string (no null or endofline)

I'm searching for the most appropriate encoding or method to compress bytes into characters that can be read with a ReadLine-like command, one that only recognizes readable characters and terminates on an end-of-line character. There is probably a common practice for achieving this, but I don't know a lot about encodings.
Currently, I'm outputting bytes as a string of hex, so I need 2 bytes to represent 1 byte. It works well, but it is slow. For example, a byte with the value 255 is represented as 'FF'.
I'm sure the output could be 3 or 4 times smaller, though there's a limit since I'm outputting MP3 data, but I don't know how. Should I just zip my string, or would there be too much overhead?
Will ASCII85 contain random null bytes or end-of-line characters, or am I safe with it?
Don't zip MP3 files; that will not gain much (or anything at all).
I'm a bit disappointed that you did not read up on Ascii85 before asking, as I think the Wikipedia article explains fairly clearly that it uses only printable ASCII characters, so no line endings or null bytes. It is efficient, and the conversion is also fairly simple and quick: split your data into 4-byte integers, then convert each into just five Ascii85 digits by repeatedly dividing the integer by 85 and taking the ASCII value of the remainder + 33.
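As a sketch of that per-group conversion (my own illustration; real Ascii85 also defines special handling for a short final group and a "z" shorthand for all-zero groups, which are omitted here):

// Pack 4 bytes into a 32-bit integer, then peel off five base-85 digits,
// adding 33 to each so they land in the printable range '!'..'u'.
fn ascii85_group(bytes: [u8; 4]) -> [u8; 5] {
    let mut n = u32::from_be_bytes(bytes);
    let mut out = [0u8; 5];
    for i in (0..5).rev() {
        out[i] = (n % 85) as u8 + 33;
        n /= 85;
    }
    out
}

// ascii85_group(*b"Man ") == *b"9jqo^", the classic Wikipedia example:
// four bytes become five printable characters, none of which are NUL or newline.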
You can also consider using Base64 or UUEncode. These are fairly popular (they are used in email attachments, for example), so you will find many libraries that produce them. But they are less efficient.

Encoding a 5 character string into a unique and repeatable 32bit Integer

I've not given this much thought yet, so it might turn out to be a silly question.
How can I take a unique 5-character ASCII string and convert it into a unique and reproducible (i.e. it needs to be the same every time) 32-bit integer?
Any ideas?
Assuming it is in fact ASCII (i.e., no characters with ordinal values greater than 127), you have five characters of 7 bits, or 35 bits of information. There is no way to generate a 32-bit code from 35 bits that is guaranteed to be unique; you're missing three bits, so each code will also represent 7 other valid ASCII strings. However, you can make it very, very unlikely that you will ever see a collision by being careful in how you calculate the code so that input strings that are very similar have very different codes. I see another answer has suggested CRC-32. You could also use a hash function such as MD5 or SHA-1 and use only the first 32 bits; this is probably best because hash functions are specifically designed for this purpose.
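As an illustration of that idea (my own sketch, using a simple non-cryptographic FNV-1a hash as a stand-in for the MD5/SHA-1 truncation mentioned above):

// 32-bit FNV-1a: similar inputs map to very different codes, but like any
// mapping from 35 bits down to 32 it cannot guarantee uniqueness.
fn fnv1a_32(s: &str) -> u32 {
    let mut h: u32 = 0x811C_9DC5; // FNV offset basis
    for b in s.bytes() {
        h ^= b as u32;
        h = h.wrapping_mul(0x0100_0193); // FNV prime
    }
    h
}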
If you can further constrain the values of the input string (say, only alphanumeric, no lowercase, no control characters, or something of the sort), you can probably eliminate that extra data and generate guaranteed unique 32-bit codes for each string.
If they're guaranteed to be alphanumeric only and case-insensitive ([A-Z] and [0-9]), you can treat the string as a base-36 number.
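A minimal sketch of that base-36 treatment (my own illustration, assuming a 5-character, case-insensitive alphanumeric string; 36^5 = 60,466,176 fits comfortably in 32 bits):

// Map 0-9 and A-Z to digit values 0-35 and accumulate; the result is
// unique and reversible for 5-character alphanumeric strings.
fn base36_code(s: &str) -> Option<u32> {
    let mut n: u32 = 0;
    for c in s.chars() {
        let d = c.to_ascii_uppercase().to_digit(36)?; // 0-9, then A-Z -> 10-35
        n = n * 36 + d;
    }
    Some(n)
}

// base36_code("AB3F9") == Some(17_313_813).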
If all five characters will belong to a set of 84 or fewer distinct characters, then you can squish five of them into a longword. Convert each character into a value 0..83, then
intvalue = (((char4*84 + char3)*84 + char2)*84 + char1)*84 + char0;  /* fits in an unsigned 32-bit integer, since 84^5 < 2^32 */
char0 = intvalue % 84;
char1 = (intvalue / 84) % 84;
char2 = (intvalue / (84*84)) % 84;
char3 = (intvalue / (84*84L*84)) % 84;
char4 = (intvalue / (84*84L*84*84L)) % 84;
BTW, I wonder if anyone uses base-84 encoding as a standard; on many platforms it could be easier to handle than base-64, and the results would be more compact.
If you need to handle extended ASCII you are out of luck, as you would need 5 full chars, which is 40 bits. Even with non-extended chars (top bit not used), you are still out of luck, as you are trying to encode 35 bits of ASCII data into a 32-bit integer.
Extended ASCII goes from 0-255, which takes 8 bits per character. In 32 bits you have room for 4 of those, not 5. So, to make it short and sweet, you can't do this.
Even if you are willing to ignore the high-order values (128-255) and use only ASCII characters 0-127, i.e. 7 bits per character, you are still 3 bits short (7*5 = 35, and you only have 32 available).
One way is to treat the 5 characters as numerals in base N, where N is the number of characters in your alphabet (the set of allowed characters). From there on, it's just simple base conversion.
Given that you have 32 bits available and 5 characters to store, your alphabet can have at most floor(2^(32/5)) = 84 characters (84^5 is just under 2^32, while 85^5 exceeds it).
Assuming you only include basic ASCII, not extended ASCII (>127), you have 7 bits of information in a single character, so that's a bit of a problem - there are too many possibilities to create unique values for every string. However, the first 32 characters, as well as the last character, are control characters, and if you exclude those, you're down to 95 characters.
You still have to cut 11 characters, though. Wikipedia has a nice chart of the characters in ASCII which you can use to determine which characters you need.
