Compress bytes into a readable string (no null or end-of-line)

I'm looking for the most appropriate encoding or method to compress bytes into characters that can be read with a ReadLine-like command, which only recognizes readable characters and terminates on an end-of-line character. There is probably a common practice for this, but I don't know much about encoding.
Currently, I'm outputting bytes as a hex string, so I need 2 characters to represent 1 byte. It works well, but it is slow. Ex: a byte with the value 255 is represented as 'FF'.
I'm sure it could be 3 or 4 times smaller, though there's a limit since I'm outputting MP3 data, but I don't know how. Should I just ZIP my string, or would there be too much overhead?
Will ASCII85 output contain random null bytes and end-of-line characters, or am I safe with it?

Don't zip mp3 files, that will not gain much (or anything at all).
I'm a bit disappointed that you did not read up on Ascii85 before asking, as I think the Wikipedia article explains fairly clearly that it uses only printable ASCII characters; so, no line endings or null bytes. It is efficient, and the conversion is also fairly simple and quick: split your data into 4-byte ints, then convert each one to five Ascii85 digits by repeatedly dividing the int value by 85 and taking the ASCII character whose code is the remainder + 33.
You can also consider using Base64 or UUEncode. These are fairly popular (e.g. used in email attachments), so you will find many libraries implementing them. But they are less efficient: both use 4 characters for every 3 bytes, against 5 characters for every 4 bytes with Ascii85.
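The division scheme described above can be sketched in Python as follows. This is only an illustration, not a full encoder: `a85_group` and `encoded_sizes` are hypothetical helper names, the sample bytes are a made-up stand-in for MP3 data, and the group encoder handles neither a short final group nor the `z` shortcut that Python's `base64.a85encode` uses for all-zero groups (which is still a printable character).

```python
import base64

def a85_group(four_bytes: bytes) -> str:
    """Encode exactly 4 bytes as 5 Ascii85 characters (no 'z' shortcut, no padding)."""
    if len(four_bytes) != 4:
        raise ValueError("expected exactly 4 bytes")
    value = int.from_bytes(four_bytes, "big")  # the 4-byte int from the answer
    digits = []
    for _ in range(5):
        value, remainder = divmod(value, 85)   # repeated division by 85
        digits.append(chr(remainder + 33))     # remainder + 33 is a printable ASCII code
    return "".join(reversed(digits))           # most significant digit goes first

def encoded_sizes(data: bytes) -> dict:
    """Return the text length produced by hex, Base64 and Ascii85 for the same bytes."""
    return {
        "hex": len(data.hex()),
        "base64": len(base64.b64encode(data)),
        "ascii85": len(base64.a85encode(data)),
    }

sample = bytes(range(1, 201))  # hypothetical stand-in for a chunk of MP3 data
print(a85_group(b"Man "), base64.a85encode(b"Man ").decode("ascii"))
print(encoded_sizes(sample))
```

For real use the library call is probably the better choice; the hand-written group encoder is only there to show where the five characters come from.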

Related

How to get pieces value from .torrent file

I am trying to build a .torrent file interpreter. The problem is that I can't seem to understand how to go about interpreting the pieces value. I am aware that the pieces key contains a concatenation of the SHA-1 hashes for each piece and that a SHA-1 hash is 20 bytes long. As a result, the length of the value should be a multiple of 20 bytes. However, after counting the bytes from the pieces value as a string or in hexadecimal form, it still does not satisfy this. How should I interpret the pieces key?
Using bencode and bdecode, the pieces value is easy to get. I think you should first read the BEP for more details. You can also look at this and use it as an example.
From looking at a real torrent file, I found that the SHA-1 hashes had to be taken from its hexadecimal string format, but I previously thought that it was wrong because the byte length of the hash was not a multiple of 20. It turns out I forgot to add a leading 0 to hexadecimal values that were only 1 character long (e.g. a had to be changed to 0a).
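A minimal sketch of that fix in Python, assuming the pieces value has already been bdecoded into a raw byte string. The function names `unpadded_hex`, `padded_hex` and `split_pieces` are made up for this illustration.

```python
def unpadded_hex(data: bytes) -> str:
    """The buggy conversion: bytes below 0x10 come out as one hex character."""
    return "".join(format(b, "x") for b in data)

def padded_hex(data: bytes) -> str:
    """The fixed conversion: every byte is always two hex characters."""
    return "".join(format(b, "02x") for b in data)

def split_pieces(pieces: bytes) -> list:
    """Split the raw pieces byte string into 40-character hex SHA-1 digests."""
    if len(pieces) % 20 != 0:
        raise ValueError("pieces length must be a multiple of 20 bytes")
    return [padded_hex(pieces[i:i + 20]) for i in range(0, len(pieces), 20)]

print(unpadded_hex(b"\x0a\xff"), padded_hex(b"\x0a\xff"))
```

Counting in raw bytes (20 per hash) avoids the problem entirely; the hex form (40 characters per hash) is only needed for display.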

Encoding binary strings into arbitrary alphabets

If you have a set of binary strings that are limited to some normally-small size such as 256 or up to 512 bytes like some of the hashing algorithms, then if you want to encode those bits of 1's and 0's into say hex (a 16-character alphabet), then you take the whole string at once into memory and convert it into hex. At least that's what I think it means.
I don't have this question fully formulated, but what I'm wondering is if you can convert an arbitrarily long binary string into some alphabet, without needing to read the whole string into memory. The reason this isn't fully formed question is because I'm not exactly sure if you typically do read the whole string into memory to create the encoded version.
So if you have something like this:
1011101010011011011101010011011011101010011011110011011110110110111101001100101010010100100000010111101110101001101101110101001101101110101001101111001101111011011011110100110010101001010010000001011110111010100110110111010100110110111010100110111100111011101010011011011101010011011011101010100101010010100100000010111101110101001101101110101001101101111010011011110011011110110110111101001100101010010100100000010111101110101001101101101101101101101111010100110110111010100110110111010100110111100110111101101101111010011001010100101001000000101111011101010011011011101010011011011101010011011110011011110110110111101001100 ... 10^50 longer
Something like the whole genetic code or a million billion times that, it would be too large to read into memory and too slow to wait to dynamically create an encoding of it into hex if you have to stream the whole thing through memory before you can figure out the final encoding.
So I'm wondering three things:
If you do have to read something fully in order to encode it into some other alphabet.
If you do, then why that is the case.
If you don't, then how it works.
The reason I'm asking is because looking at a string like 1010101, if I were to encode it as hex there are a few ways:
One character at a time, so it would essentially stay 1010101 unless the alphabet was {a, b} then it would be abababa. This is the best case because you don't have to read anything more than 1 character into memory to figure out the encoding. But it limits you to a 2-character alphabet. (Anything more than 2 character alphabets and I start getting confused)
By turning it into an integer, then converting that into a hex value. But this would require reading the whole value to compute the final (big)integer size. So that's where I get confused.
I feel like the third way (3) would be to read partial chunks of the input bytes somehow, like 1010 then 010, but that would not work if the encoding was integers because 1010 010 = A 2 in hex, but 2 = 10 not 2 = 010. So it's like you would need to break it by having a 1 at the beginning of each chunk. But then what if you wanted to have each chunk no longer than 10 hex characters, but you have a long string of 1000 0's, then you need some other trick perhaps like having the encoded hex value tell you how many preceding zeroes you have, etc. So it seems like it gets complicated, wondering if there are already some systems established that have figured out how to do this. Hence the above questions.
For an example, say I wanted to encode the above binary string into an 8-bit alphabet, so like ASCII. Then I might have aBc?D4*&((!.... But then to deserialize this into the bits is one part, and to serialize the bits into this is another (these characters aren't the actual characters mapped to the above bit example).
But then what if you wanted to have each chunk no longer than 10 hex characters, but you have a long string of 1000 0's, then you need some other trick perhaps like having the encoded hex value tell you how many preceding zeroes you have, etc. So it seems like it gets complicated, wondering if there are already some systems established that have figured out how to do this
Yes, you're way over-complicating it. To start simple, consider bit strings whose length is by definition a multiple of 4. They can be represented in hexadecimal by just grouping the bits up by 4 and remapping that to hexadecimal digits:
raw:   11011110101011011011111011101111
group: 1101 1110 1010 1101 1011 1110 1110 1111
remap: D    E    A    D    B    E    E    F
So 11011110101011011011111011101111 -> DEADBEEF. That all the nibbles had their top bit set was a coincidence resulting from choosing an example that way. By definition the input is divided up into groups of four, and every hexadecimal digit is later decoded to a group of four bits, including leading zeroes if applicable. This is all that you need for typical hash codes which have a multiple of 4 bits.
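As a rough sketch of how this works on a stream, the generator below (the name `bits_to_hex` is made up here) keeps at most four bits in memory at a time and emits one hexadecimal digit per group. It assumes the input is an iterable of '0' and '1' characters whose total count is a multiple of 4.

```python
from typing import Iterable, Iterator

HEX_DIGITS = "0123456789ABCDEF"

def bits_to_hex(bits: Iterable[str]) -> Iterator[str]:
    """Yield one hex digit per 4 input bits, buffering at most 4 bits."""
    group = ""
    for bit in bits:
        if bit not in ("0", "1"):
            raise ValueError("input must contain only '0' and '1'")
        group += bit
        if len(group) == 4:
            yield HEX_DIGITS[int(group, 2)]  # 4 bits always map to exactly 1 digit
            group = ""
    if group:
        raise ValueError("bit count is not a multiple of 4")

print("".join(bits_to_hex("11011110101011011011111011101111")))
print("".join(bits_to_hex("00100000")))  # leading zeroes in a group are kept
```

Because each digit decodes back to exactly four bits, a group such as 0010 round-trips as 2 without losing its leading zeroes.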
The problems start when we want to encode bit strings that are of variable length and not necessarily a multiple of 4 bits long. Then there will have to be some padding somewhere, and the decoder needs to know how much padding there was (and where, but the location is a convention that you choose). This is why your example seemed so ambiguous: it is. Extra information needs to be added to tell the decoder how many bits to discard.
For example, leaving aside the mechanism that transmits the number of padding bits, we could encode 1010101 as A5 or AA or 55 (and more!) depending on the location we choose for the padding. Whichever convention we choose, the decoder needs to know that there is 1 bit of padding. To put that back in terms of bits, 1010101 could be encoded as any of these:
x101 0101
101x 0101
1010 x101
1010 101x
Where x marks the bit which is inserted in the encoder and discarded in the decoder. The value of that bit doesn't actually matter because it is discarded, so D5 is also a fine encoding, and so on.
All of the choices of where to put the padding still enable the bit string to be encoded incrementally, without storing the whole bit string in memory, though putting the padding in the first hexadecimal digit requires knowing the length of the bit string up front.
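One possible convention, sketched in Python: pad with zero bits at the end and hand the pad count to the decoder separately. The names `encode_padded` and `decode_padded` are hypothetical, and how the pad count is actually transmitted is left open, as in the text above. The sketch slices a whole string for brevity; the same rule works on a stream, since the padding is only decided once the input ends.

```python
HEX_DIGITS = "0123456789ABCDEF"

def encode_padded(bits: str):
    """Return (hex_digits, pad), where pad is the number of zero bits appended."""
    pad = (-len(bits)) % 4                 # bits needed to reach a multiple of 4
    padded = bits + "0" * pad              # convention chosen here: pad at the end
    digits = "".join(
        HEX_DIGITS[int(padded[i:i + 4], 2)] for i in range(0, len(padded), 4)
    )
    return digits, pad

def decode_padded(digits: str, pad: int) -> str:
    """Expand each hex digit to 4 bits, then discard the trailing pad bits."""
    bits = "".join(format(int(d, 16), "04b") for d in digits)
    return bits[:len(bits) - pad]

encoded, pad = encode_padded("1010101")
print(encoded, pad, decode_padded(encoded, pad))
```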
If you are asking this in the context of Huffman coding, you wouldn't want to calculate the length of the bit string in advance, so the padding has to go at the end. Often an extra symbol is added to the alphabet that signals the end of the stream, which usually makes it unnecessary to explicitly store how many padding bits there are (there might be any number of them, but as they appear after the STOP symbol, the decoder automatically disregards them).

File in Base64 string occupies more space than original file

I'm in this kind of... Problem... I'm adding to my program the resources by encoding in base64 string the files (images, videos and audio) and adding them to a String. What I do is to read the file and then, convert the bytes to a Base64 string and write it to a txt file, but the txt file occupies slightly MORE space than the original file. Also this happens when I add the string to my program code. The compiled executable occupies a lot of space. Ex:
An MP3 file occupies 2.3 MB
The Base64 string in a txt file occupies 3.19 MB
Any solution or way to optimize the space of base64 string?
P.S. This is just something I'm trying to do for fun. Do not comment below "WHY" or the reason "FOR WHAT" I want this. The answer is: just for fun.
That's inherent to Base64.
Base64 uses 4 octets to encode 3 octets, because it's a reasonably efficient way of encoding arbitrary binary data using just those bytes that mean something printable in ASCII and also avoid many characters that are special in many contexts. It's more compact than, say, hexadecimal strings (2 octets to encode each octet), but always larger than raw binary. Its value is only in contexts where raw binary won't work, so the extra size is worth it.
(Strictly it's 4 characters to encode 3 octets, so if that was then encoded in UTF-16 or UTF-32 it could be 8 or 16 octets per 3 encoded).
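The 4-for-3 ratio can be checked with a few lines of Python. This is a small sketch; `base64_length` is a made-up helper name, and it counts only the encoded characters, without any line breaks an encoder might insert.

```python
import base64

def base64_length(n_bytes: int) -> int:
    """Characters needed by padded Base64 for n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)  # every started group of 3 bytes costs 4 characters

for n in (1, 2, 3, 1000):
    actual = len(base64.b64encode(bytes(n)))
    print(n, base64_length(n), actual, 2 * n)  # last column: hex string length
```

So the encoded text is about one third larger than the file, which is roughly the kind of growth reported in the question, and still well below the doubling that a hex string would give.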

Encode string to another base with more characters?

I know that I can encode numbers to a base like 65 to decrease the size of the character display (even if the number is smaller in binary).
However, is there a way to encode UTF-8 text to another base with more characters than our standard 26 letter English alphabet? In other words, Instead of requiring 4 "characters" for the word "four" - I can create a representation or hash using only, maybe 2 (i.e. "6$")?
I believe the point of Base64 is you can easily convert any binary data into "human readable" letters and numbers. It makes it easy to transcribe arbitrary data to newsgroups or transmit them over text based protocols.
If you want to further "compress" this data, you need to figure out how many characters you want to allow. There are only so many combinations of 8 bits. The most efficient would be to use all of them, in which case why not just use gzip?
Your question seems related to order-0 entropy coding:
http://en.wikipedia.org/wiki/Entropy_encoding
The most famous algorithm in this family is Huffman coding:
http://en.wikipedia.org/wiki/Huffman_coding
Huffman will not only tell you that only 64 characters are used and therefore only 6 bits per character are necessary: it will also make a difference between frequent characters, such as (space), and rare ones, such as (;). It will then create a code in which frequent characters use fewer bits than rarer ones, resulting in better compression (typically 4.5 bits per character on English texts).
Huffman coding is an all-around compression technique, used as part of many compression algorithms, including zip.
You can find a demo program which only applies one pass of Huffman compression here (Huff0), it will help you determine how much can be gained by using this technique for your sample inputs :
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
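To get a feel for the numbers without the demo program, here is a minimal Python sketch that only computes Huffman code lengths and the resulting average bits per character. It is not the Huff0 implementation, the function names are made up, and it ignores the cost of storing the code table.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Return the Huffman code length in bits for each distinct character."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): 1}  # a lone symbol still needs 1 bit
    # Heap entries: (weight, unique tie breaker, {character: depth so far}).
    heap = [(w, i, {ch: 0}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # the two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def average_bits(text: str) -> float:
    """Average number of bits per character under the Huffman code for text."""
    lengths = huffman_code_lengths(text)
    freq = Counter(text)
    return sum(freq[ch] * lengths[ch] for ch in freq) / len(text)

print(huffman_code_lengths("aaaabbc"))
print(average_bits("aaaabbc"))
```

The frequent character gets the short code, which is where the gain over a fixed 6 or 8 bits per character comes from.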

Does a string's length equal the byte size?

Exactly that: Does a string's length equal the byte size? Does it depend on the language?
I think it is, but I just want to make sure.
Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.
As the answer is no, that's all I need to know.
Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi shortstring) has an extra byte for the length. And Unicode strings have more than one byte per character.
With Unicode it depends on the encoding. It could be 2 or 4 bytes per character, or even a mix of anywhere from 1 to 4 bytes.
It entirely depends on the platform and representation.
For example, in .NET a string takes two bytes in memory per UTF-16 code unit. Characters in the range U+10000 to U+10FFFF need a surrogate pair, that is, two UTF-16 code units for one full Unicode character. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.
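A quick way to see the difference between character count and byte count is to encode one string several ways. The sketch below uses Python rather than .NET, and the sample string and the `byte_lengths` helper are made up for illustration.

```python
def byte_lengths(text: str) -> dict:
    """Return the encoded size in bytes of text under a few encodings (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

sample_text = "na\u00efve \U0001F600"  # 'naive' with a diaeresis, a space, one emoji

print(len(sample_text))                        # number of Unicode code points
print(byte_lengths(sample_text))               # bytes per encoding
print(len(sample_text.encode("utf-16-le")) // 2)  # number of UTF-16 code units
```

The last figure is the count of UTF-16 code units, which is what .NET's String.Length reports; it is larger than the code point count here because the emoji needs a surrogate pair.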
It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.
Not always, it depends on the encoding.
There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)
Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)
Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.
Unicode strings in various languages (those that store UTF-16) use two bytes per char, or four for characters outside the Basic Multilingual Plane.
Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.
In particular, it is important to know how this affects substring operations. If you don't have one byte per character, does s[n] get the nth byte or the nth char? Getting the nth char takes time that grows with n, instead of the constant time you get with one byte per character.
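To illustrate why the nth character is not a constant-time lookup in a variable-width encoding, here is a sketch that finds it in raw UTF-8 bytes by scanning from the start. The function name `nth_char_utf8` is hypothetical, and the input is assumed to be valid UTF-8.

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of UTF-8 data by scanning from the start."""
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a new character starts
            count += 1
            if count == n:
                start = i                # first byte of the wanted character
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return data[start:].decode("utf-8")  # wanted character was the last one

encoded = "a\u00ef\U0001F600z".encode("utf-8")
print(encoded[2])                 # s[n] on bytes: just the byte at offset 2
print(nth_char_utf8(encoded, 2))  # the third character needs a scan
```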
