I have strings (millions of them) of 2,475 characters in size each. These strings are consisting of 0 & 1. I am converting each string to ASCII and back, so 8 initial chars become 1. This give me a much shorter length of 310 chars. But as this length is still big enough I have tried some additional compression of the already shortened string. I have used Huffman Encoding/Decoding with not so important results. I have also tried an RLE approach with better results (encoding between 205 to 212) chars over the already existing strings. But here is my problem! As I do not know the strings beforehand I am looking for a compression/decompression algorithm that produces fixed length output. Does something like that exist? I have searched also about Endless compression but without finding any solid suggestions/algorithms. Any idea will be welcomed.
If the only thing you know about the strings is that they each consist of 2475 characters and that every character is 0 or 1, then there is no fixed-length compression scheme that does better than 2475 bits (310 bytes, with 5 bits ignored). It's simple to prove that no such compression scheme can exist, since there are 22475 possible strings, and they all need to have different codes (if the compression is to be reversible). However, the shortest bit sequences which has 22475 different possible values is 2475 bits long. QED.
Of course, if some 2475-character sequences are not possible, then you can compress more by not reserving any compressed value for illegal sequences. However, in order to create an appropriate compression algorithm, you need to know what sequences are impossible, and customize the compression algorithm accordingly. So there is no general purpose algorithm.
General purpose compression algorithms do not have fixed length output because they stochastically compress certain strings to varying degrees, while other strings are compressed negatively (that is, expanded). The assumption is that all strings have some sort of internal pattern, typically a repetition pattern, and the compression can take advantage of repetitions to reduce length. To compensate, a non-repeating string will end up being expanded.
Related
If you have a set of binary strings that are limited to some normally-small size such as 256 or up to 512 bytes like some of the hashing algorithms, then if you want to encode those bits of 1's and 0's into say hex (a 16-character alphabet), then you take the whole string at once into memory and convert it into hex. At least that's what I think it means.
I don't have this question fully formulated, but what I'm wondering is if you can convert an arbitrarily long binary string into some alphabet, without needing to read the whole string into memory. The reason this isn't fully formed question is because I'm not exactly sure if you typically do read the whole string into memory to create the encoded version.
So if you have something like this:
1011101010011011011101010011011011101010011011110011011110110110111101001100101010010100100000010111101110101001101101110101001101101110101001101111001101111011011011110100110010101001010010000001011110111010100110110111010100110110111010100110111100111011101010011011011101010011011011101010100101010010100100000010111101110101001101101110101001101101111010011011110011011110110110111101001100101010010100100000010111101110101001101101101101101101101111010100110110111010100110110111010100110111100110111101101101111010011001010100101001000000101111011101010011011011101010011011011101010011011110011011110110110111101001100 ... 10^50 longer
Something like the whole genetic code or a million billion times that, it would be too large to read into memory and too slow to wait to dynamically create an encoding of it into hex if you have to stream the whole thing through memory before you can figure out the final encoding.
So I'm wondering three things:
If you do have to read something fully in order to encode it into some other alphabet.
If you do, then why that is the case.
If you don't, then how it works.
The reason I'm asking is because looking at a string like 1010101, if I were to encode it as hex there are a few ways:
One character at a time, so it would essentially stay 1010101 unless the alphabet was {a, b} then it would be abababa. This is the best case because you don't have to read anything more than 1 character into memory to figure out the encoding. But it limits you to a 2-character alphabet. (Anything more than 2 character alphabets and I start getting confused)
By turning it into an integer, then converting that into a hex value. But this would require reading the whole value to compute the final (big)integer size. So that's where I get confused.
I feel like the third way (3) would be to read partial chunks of the input bytes somehow, like 1010 then010, but that would not work if the encoding was integers because 1010 010 = A 2 in hex, but 2 = 10 not 2 = 010. So it's like you would need to break it by having a 1 at the beginning of each chunk. But then what if you wanted to have each chunk no longer than 10 hex characters, but you have a long string of 1000 0's, then you need some other trick perhaps like having the encoded hex value tell you how many preceding zeroes you have, etc. So it seems like it gets complicated, wondering if there are already some systems established that have figured out how to do this. Hence the above questions.
For an example, say I wanted to encode the above binary string into an 8-bit alphabet, so like ASCII. Then I might have aBc?D4*&((!.... But then to deserialize this into the bits is one part, and to serialize the bits into this is another (these characters aren't the actual characters mapped to the above bit example).
But then what if you wanted to have each chunk no longer than 10 hex characters, but you have a long string of 1000 0's, then you need some other trick perhaps like having the encoded hex value tell you how many preceding zeroes you have, etc. So it seems like it gets complicated, wondering if there are already some systems established that have figured out how to do this
Yes you're way over-complicating it. To start simple, consider bit strings whose length is by definition a multiple of 4. They can be represented in hexadecimal by just grouping the bits up by 4 and remapping that to hexadecimal digits:
raw: 11011110101011011011111011101111
group: 1101 1110 1010 1101 1011 1110 1110 1111
remap: D E A D B E E F
So 11011110101011011011111011101111 -> DEADBEEF. That all the nibbles had their top bit set was a coincidence resulting from choosing an example that way. By definition the input is divided up into groups of four, and every hexadecimal digit is later decoded to a group of four bits, including leading zeroes if applicable. This is all that you need for typical hash codes which have a multiple of 4 bits.
The problems start when we want encode bit strings that are of variable length and not necessarily a multiple of 4 long, then there will have to be some padding somewhere, and the decoder needs to know how much padding there was (and where, but the location is a convention that you choose). This is why your example seemed so ambiguous: it is. Extra information needs to be added to tell the decoder how many bits to discard.
For example, leaving aside the mechanism that transmits the number of padding bits, we could encode 1010101 as A5 or AA or 5A (and more!) depending on the location we choose for the padding, whichever convention we choose the decoder needs to know that there is 1 bit of padding. To put that back in terms of bits, 1010101 could be encoded as any of these:
x101 0101
101x 0101
1010 x101
1010 101x
Where x marks the bit which is inserted in the encoder and discarded in the decoder. The value of that bit doesn't actually matter because it is discarded, so DA is also a fine encoding and so on.
All of the choices of where to put the padding still enable the bit string to be encoded incrementally, without storing the whole bit string in memory, though putting the padding in the first hexadecimal digit requires knowing the length of the bit string up front.
If you are asking this in the context of Huffman coding, you wouldn't want to calculate the length of the bit string in advance so the padding has to go at the end. Often an extra symbol is added to the alphabet that signals the end of the stream, which usually makes it unnecessary to explicitly store how much padding bits there are (there might be any number of them, but as they appear after the STOP symbol, the decoder automatically disregards them).
Here my problem statement:
I have a set of strings that match a regular expression. let's say it matches [A-Z][0-9]{3} (i.e. 1 letter and 3 digits).
I can have any number of strings between 1 and 30. For example I could have:
{A123}
{A123, B456}
{Z789, D752, E147, ..., Q665}
...
I need to generate an integer (actually I can use 256 bits) that would be unique for any set of strings regardless of the number of elements (although the number of elements could be used to generate the integer)
What sort of algorithm could I use?
My first idea would be to convert my strings to number and then do operations (I thought of hash functions) on them but I am not sure what formula would be give me could results.
Any suggestion?
You have 2^333 possible input sets ((26 * 10^3) choose 30).
This means you would need a 333 bit wide integer to represent all possibilities. You only have a maximum of 256 bits, so there will be collisions.
This is a typical application for a hash function. There are hashes for various purposes, so it's important to select the right type:
A simple hash function for use in bucket based data structures (dictionaries) must be fast. Collisions are not only tolerated but wanted. The hash's size (in bits) usually is small. Due to collisions this type of hash is not suited for your purpose.
A checksum tries to avoid collisions and is reasonably fast. If it's large enough this might be enough for your case.
Cryptographic hashes have the characteristic that it's not possible (or very hard) to find a collision (even when both input and hash are known). Also they are not invertible (from the hash it's not possible to find the input). These are usually computationally expensive and overkill for your use case.
Hashes to uniquely identify arbitrary inputs, like CityHash and SpookyHash are designed for fast hashing and collision free identification.
SpookyHash seems like a good candidate for your use case. It's 128 bits wide, which means that you need 2^64 differing inputs to get a 50% chance of a single collision.
It's also fast: three bytes per cycle is orders of magnitude faster than md5 or sha1. SpookyHash is available in the public domain (see link above).
To apply any hash on your use case you could convert the items in your list to numbers, but it seems easier to just feed them as strings. You have to settle for an encoding in this case (ASCII would do).
I'm usually using UTF8 or so, when I18N is an issue. Then it's sometimes important to care for canonicalization. But this does not apply to your simple use case.
A hash is not going to work, since it could produce collisions. Every significant input bit must be mapped to an output bit.
For the letter, you have 90 - 65 = 25 different values, so you can use 5 bits to represent the letter.
The 3-digit number has 1000 different values, so you need 10 bits for this.
If you combine these bits, you have a unique mapping from the input to a 15-bit number.
This approach is simple, but it could wastes some bits. If the output must be as short as possible, you could map as follows:
output = (L - 'A')*1000 + N
where L is the letter value, 'A' is the value of the letter A, N is the 3-digit number. Then you can use as few bits as are necessary to represent the complete range of output, which is 25*1000 - 1 = 24999. Here it is 15 bits again, so the simple approach does not waste space.
If there are fewer output bits than input bits, a hash function is needed. I would strongly recommend to map the strings to binary data like above, and use a simple function to map the input to the output, for this reason:
A general-purpose hash function can not differentiate the input bits, because it knows nothing about their meaning.
For 256 output bits, after hashing 5.7e38 values, the chance of a collision is 75%. Source: Birthday Attack.
5.7e38 seems huge, but it corresponds to only 129 bits (2^129 = 6.8e38). In this case it means that there is a chance of over 75% that there is a pair of strings with 9 (129/15 = 8.6) elements that collide.
On the other hand, if you use a very simple mapping function like:
truncate the input to 256 bits (use the first 17 elements of 15 bits each)
make a 256 bit xor value of all the 15-bit elements
you can guaratee there is no collision between any two strings with at most 17 elements.
The hash functions wich are optimized for generating unique IDs likely perform better than a general-purpose hash as compared here, but I would doubt that they can guarantee collision-free hashing of all 256-bit values.
Conclusion: If most of the input strings have less than 17 elements, I would prefer this to a hash.
I'm searching for the most appropriated encoding or method to compress bytes into character that can be read with a ReadLine-like command that only recognizes readable char and terminates on end of line char. There is probably a common practice to achieve it, but I don't know a lot about encoding.
Currently, I'm outputing bytes as a string of hex, so I need 2 bytes to represent 1 byte. It works well, but it is slow. Ex: byte with a value 255 is represented as 'FF'.
I'm sure it could be 3 or 4 times smaller, though there's a limit since I'm outputing MP3 data, but I don't know how. Should I just ZIP my string or there would be too much overhead on it?
Will ASCII85 contains random null bytes and EndOfLine or I'm safe with it?
Don't zip mp3 files, that will not gain much (or anything at all).
I'm a bit disappointed that you did not read up on Ascii85 before asking as I think the Wikipedia article explains fairly clearly that it uses only printable ASCII characters; so, no line endings or null bytes. It is efficient and the conversion is also fairly simple and quick - split your data to 4-byte ints; you will convert these to just five Ascii85 digits by repeatedly dividing the int value by 85 and taking ASCII value of the modulo + 33.
You can also consider using Base64 or UUEncode. These are fairly popular (e.g. used in email attachments) so you will find many libraries preparing these. But they are less efficient.
I know that I can encode numbers to a base like 65 to decrease the size of the character display (even if the number is smaller in binary).
However, is there a way to encode UTF-8 text to another base with more characters than our standard 26 letter English alphabet? In other words, Instead of requiring 4 "characters" for the word "four" - I can create a representation or hash using only, maybe 2 (i.e. "6$")?
I believe the point of Base64 is you can easily convert any binary data into "human readable" letters and numbers. It makes it easy to transcribe arbitrary data to newsgroups or transmit them over text based protocols.
If you want to further "compress" this data, you need to figure out how many characters you want to allow. There's only so many combinations of 8 bits. The most efficient would be to use all of them, in which case why just not use gzip?
Your question seems related to Order-0 entropy coding :
http://en.wikipedia.org/wiki/Entropy_encoding
The most famous algorithm is this family is Huffman coding :
http://en.wikipedia.org/wiki/Huffman_coding
Huffman will not only tells you that only 64 characters are used and therefore only 6 bits per characters are necessary : it will also make a difference between frequent characters, such as (space), and rare ones, such as (;). It will then create a code in which frequent characters use less bits than rarer ones, resulting in better compression (typically 4.5bits per character on English texts).
Huffman coding is an all-around compression technique, used as part of many compression algorithms, including zip.
You can find a demo program which only applies one pass of Huffman compression here (Huff0), it will help you determine how much can be gained by using this technique for your sample inputs :
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
I'd like to do some kind of "search and replace" algorithm which will, in an efficient manner if possible, identify a substring of a string which occurs more than once and replace all occurrences of that substring with a token.
For example, given a string "AbcAdAefgAbijkAblmnAbAb", notice that "A" recurs, so reduce in pass one to "#1bc#1d#1efg#1bijk#1blmn#1b#1b" where #_ is an indexed pattern (we note the patterns in an indexed table), then notice that "#1b" recurs so reduce to "#2c#1d#1efg#2ijk#2lmn#2#2". No more patterns occur in the string so we're done.
I have found some information on "longest common subsequences" and compression algorithms, but nothing that seems to do this. They either are for comparing two string or for getting some kind of storage-optimal result.
My objective, on the other hand, is to reduce the genome to its "words" instead of "letters". ie, instead of gatcatcgatc I want to see 2c1c2c. I could do some regex afterwards to find things like "#42*#42"; it would be cool to see recurring brackets in dna.
If I could just find that online I would skip doing it myself but I can't see this question answered before in terms I could uncover. To anyone who can point me in the right direction many thanks.
The byte pair encoding does something pretty close to what you want.
Rather than searching directly for the longest repeated string (top-down),
each pass of byte pair encoding searches for repeated byte pairs (bottom-up).
But eventually it discovers the longest repeated string(*).
gatcatcgatc
1=at g1c1cg1c
2=atc g22g2
3=gatc 2=atc 323
As you can see, it has found the longest repeated string "gatc".
(*) byte pair encoding either eventually finds the longest repeated string,
or else it stops early after making (2^8 - uniquechars(source) ) substitutions.
I suspect it may be possible to tweak byte pair encoding so that the early-stop condition is relaxed a little -- perhaps (2^9 - uniquechars(source) ) or 2^12 or 2^16.
Even if that hurts compression performance, perhaps it will give interesting results for applications like yours.
Wikipedia: byte pair encoding
Stack Overflow: optimizing byte-pair encoding