Is a base64-encoded file smaller than a straight hexdump? - base64

I was wondering whether base64 provides any compression compared to a straight hex dump, that is, turning every byte into two characters from the range [0-9a-f].

Yes, it does, because base64 has more characters to work with: 64 instead of hexdump's 16, so each output character carries 6 bits instead of 4. This is one of the purposes of base64.
The Wikipedia article shows you the gain: If the binary data is n bytes, the base64 data is 4*ceil(n/3) bytes. (Compared to 2*n bytes for the hexdump.)
So, instead of a 100% overhead, you get roughly a 33% overhead.
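A quick way to see the difference is to encode the same bytes both ways; here is a minimal Node.js sketch (the 100-byte buffer is just an arbitrary example):

// Compare hex and base64 sizes for the same binary data (Node.js).
const crypto = require("crypto");

const data = crypto.randomBytes(100);   // n = 100 bytes of binary data
const hex = data.toString("hex");       // 2 * n characters
const b64 = data.toString("base64");    // 4 * ceil(n / 3) characters

console.log(hex.length);  // 200
console.log(b64.length);  // 136 (4 * ceil(100 / 3) = 4 * 34)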

Related

Reduce length of decimal variable (algorithm)

I have a string of decimal digits like:
965854242113548732659745896523654789653244879653245794444524
length: 60 characters
I want to send it to a function, but first I want to reduce its length as much as possible. How can I do that?
I'm thinking about converting it to base 34; that would be 1RG7EEWTN7NW60EWIWMASEWWMEOSWC2SS8482WQE, which is 40 characters long. Can I reduce it more in some way?
Your number fits into 70 bits - for such a small payload compression seems nonsensical. Assuming that the server API supports arbitrary binary data, I would simply encode the value in binary and prefix it with the number of bytes needed.
1 byte length information - for 854657986453156789675, the example you gave initially, this would be 9
9 bytes of binary payload
→ 10 bytes of data transferred for your example.
Your example in hex:
09 2e 54 c3 1e 81 cf 05 fd ab
With the length given in bytes, this of course supports only decimals up to 255 bytes in length, but I suppose this is sufficient. If your transport protocol has a built-in concept of packet length, you could even skip the initial length byte.
Important: ensure that all sides use the same endianness. As you are transmitting your data over the network, network byte order (big endian) would be natural.
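A minimal Node.js sketch of that scheme (the function name is just illustrative), using BigInt to get the big-endian bytes and prefixing the one-byte length:

function encodeDecimal(decimalString) {
  let n = BigInt(decimalString);
  const bytes = [];
  while (n > 0n) {                     // peel off the least significant byte
    bytes.unshift(Number(n & 0xffn));  // unshift keeps big-endian (network) order
    n >>= 8n;
  }
  return Buffer.from([bytes.length, ...bytes]);  // 1 length byte + binary payload
}

console.log(encodeDecimal("854657986453156789675").toString("hex"));
// 092e54c31e81cf05fdab -> 10 bytes, matching the hex above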
If you want to transmit very large numbers, keep in mind that you can use any compression algorithm you like on the binary representation of your data. However, your payload must be significantly larger in order to make compression feasible - for example, using zLib compression for the above 9 byte payload results in an 18 byte payload due to the overhead for the zLib datastructures.
If (and only if) you cannot use arbitrary bytes for your payload, you can encode your data (possibly after compression). Most modern libraries have built-in support for Base64, so this would be a natural way of representing the data.

How many characters should I reserve in my database for storing SHA512 hash?

Column: pwdhash
Type: char
Many web pages give me the bit size, not a character size.
Should I use a binary field instead?
64 bytes when stored in BLOB.
128 characters when stored as hex.
~88 characters when stored as Base64.
Yes, you should use a binary field.
A hash is not a string.
For readability reasons, these are sometimes stored as hexadecimal digits (two digits per byte). So a 512-bit hash is 64 bytes, which as hex would need a char(128) field; a binary field would only need 64 bytes.
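A quick Node.js check of those three sizes:

const crypto = require("crypto");

const digest = crypto.createHash("sha512").update("password").digest();
console.log(digest.length);                     // 64  (raw bytes -> a 64-byte binary column)
console.log(digest.toString("hex").length);     // 128 (hex -> char(128))
console.log(digest.toString("base64").length);  // 88  (base64, including '==' padding)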

How many bytes of memory is a tweet?

140 characters. How much memory would it take up?
I'm trying to calculate how many tweets my EC2 Large instance MongoDB can hold.
Twitter uses UTF-8 encoded messages.
UTF-8 code points can be up to four octets long, making the maximum message size 140 x 4 = 560 8-bit bytes.
This is, of course, just for the raw messages, excluding storage overhead, indexing and other storage-related padding.
Edit: Twitter successfully let me post the message:
™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™
Yes, that's 140 trademark symbols, which are three octets each in UTF-8.
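A small check of that arithmetic with Node.js's Buffer.byteLength (the repeat simply rebuilds the 140-character message):

const tweet = "\u2122".repeat(140);                       // 140 trademark signs (™, U+2122)
console.log(Buffer.byteLength(tweet, "utf8"));            // 420 = 140 x 3 octets
console.log(Buffer.byteLength("a".repeat(140), "utf8"));  // 140 for a plain ASCII tweet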
Back in September, an engineer at Twitter gave a presentation that suggested it's about 200 bytes per tweet.
Of course you still have to account for overhead for your own metadata and the database itself, but 200 bytes/record is probably a good place to start.
Typically it's two bytes per character if you're storing Unicode as UTF-16, so that would mean 280 bytes max per tweet.
Probably 284 bytes in memory (4-byte length prefix + length * 2). Inside the DB I cannot say, but probably around 280 if the DB stores two bytes per character, plus some bytes of overhead for metadata etc.
Potentially of interest:
http://mehack.com/map-of-a-twitter-status-object
Anatomy of a Twitter Status Object
Also more about Twitter character encoding:
http://dev.twitter.com/pages/counting_characters
It's technically stored as UTF-8, and in reality the slide deck from a Twitter engineer here http://www.slideshare.net/raffikrikorian/twitter-by-the-numbers gives the real stat:
140 characters, ~200 bytes

Efficient binary-to-string formatting (like base64, but for UTF8/UTF16)?

I have many bunches of binary data, ranging from 16 to 4096 bytes, which need to be stored to a database and which should be easily comparable as a unit (e.g. two bunches of data match only if the lengths match and all bytes match). Strings are nice for that, but converting binary data blindly to a string is apt to cause problems due to character encoding/reinterpretation issues.
Base64 was a common method for storing binary data in strings in an era when 7-bit ASCII was the norm; its 33% space penalty was a little annoying, but not horrible. Unfortunately, if one is using UTF-16, the space penalty is 166% (8 bytes to store 3), which seems pretty icky.
Is there any common storage method for storing binary data in a valid Unicode string which will allow better efficiency in UTF-16 (and hopefully not be too horrible in UTF-8)? A base-32768 coding would store 240 bits in sixteen characters, which would take 32 bytes of UTF-16 or 48 bytes of UTF-8. By comparison, base64 coding would use 40 characters, which would take 80 bytes of UTF-16 or 40 bytes of UTF-8. An approach which was designed to take the same space in UTF-8 or UTF-16 might store 48 bits in three characters that would take eight bytes in either UTF-8 or UTF-16, thus storing 240 bits in 40 bytes of either UTF-8 or UTF-16.
Are there any standards for anything like that?
Base32768 does exactly what you wanted. Sorry it took five years to exist.
Usage (this is JavaScript, although porting the base32768 module to another programming language is eminently practical):
var base32768 = require("base32768");
var buf = Buffer.from("d41d8cd98f00b204e9800998ecf842", "hex"); // 15 bytes
var str = base32768.encode(buf);
console.log(str); // "迎裶垠⢀䳬Ɇ垙鸂", 8 code points
var buf2 = base32768.decode(str);
console.log(buf.equals(buf2)); // true
Base32768 selects 32,768 characters from the Basic Multilingual Plane. Each character takes 2 bytes when represented as UTF-16 or 3 bytes when represented as UTF-8, giving exactly the efficiency characteristics you describe: 240 bits can be stored in 16 characters i.e. 32 bytes of UTF-16 or 48 bytes of UTF-8. (Except for the occasional padding character, analogous to the = padding seen in Base64.)
This is done by dicing the input bytes (i.e. 8-bit unsigned numbers) into 15-bit unsigned numbers and assigning each resulting 15-bit number to one of the 32,768 characters.
Note that the characters chosen are also "safe" - no whitespace, control characters, combining diacritics or susceptibility to normalization corruption.
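As an illustration of that dicing step only (this is not the real Base32768 alphabet or padding scheme, just a sketch of regrouping 8-bit bytes into 15-bit numbers):

function to15BitGroups(buf) {
  const groups = [];
  let acc = 0, bits = 0;
  for (const byte of buf) {
    acc = (acc << 8) | byte;               // append 8 more bits
    bits += 8;
    if (bits >= 15) {                      // a full 15-bit group is available
      bits -= 15;
      groups.push((acc >> bits) & 0x7fff);
      acc &= (1 << bits) - 1;              // keep only the leftover bits
    }
  }
  if (bits > 0) {
    groups.push((acc << (15 - bits)) & 0x7fff);  // left-align trailing bits (real codecs pad here)
  }
  return groups;                           // each value 0..32767 maps to one output character
}

console.log(to15BitGroups(Buffer.from("d41d8cd98f00b204e9800998ecf842", "hex")).length);
// 8 -> 15 bytes = 120 bits yield exactly 8 groups, matching the 8 code points above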

How do I pre-determine the length of the resultant cipher text produced in an encryption operation?

I have an application which stores some information in an encrypted state, both on file and in a database. How can I calculate what the length of the resultant cipher text will be based on the plain text input?
The encryption operation consists of using the .NET RijndaelManaged class/algorithm and then a conversion to a Base64 string prior to storage.
What I want to be able to do is to know beforehand how long the encrypted string will be for a given input so that I can limit the length of the input accordingly in relation to the storage space available for its encrypted form (if that makes sense!).
Thanks
Rijndael's output is the same size as the input, rounded up to the next closest multiple of the block size (usually 128 bits, aka 16 bytes). Base64 expands its input to its output by 4/3 -- it takes 4 bytes of output to represent each 3 bytes of input.
So if you have, for example, an input of 70 bytes, the encrypting step will produce 80 bytes of output (the closest multiple of 16 above 70), and Base64 will turn that into 108 characters (80 rounds up to 81, the next multiple of 3, and 81/3 x 4 = 108).
The encrypted text will be the first cipher-block-size multiple bigger than your text; check your algorithm's BlockSize property. Pure Base64 encoding increases the output by about a third, but this can vary if you also need to URL-escape (percent-encode) certain Base64 symbols (like '+' and '/').
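Putting the two steps together, here is a small sketch assuming the RijndaelManaged defaults (128-bit block size and PKCS7 padding, which always adds 1-16 padding bytes, so a plaintext that is already a multiple of 16 still grows by a full block):

function encryptedBase64Length(plainBytes, blockSize = 16) {
  const padded = (Math.floor(plainBytes / blockSize) + 1) * blockSize;  // PKCS7 always pads
  return Math.ceil(padded / 3) * 4;                                     // base64 expansion
}

console.log(encryptedBase64Length(70));  // 80 cipher bytes -> 108 base64 characters
console.log(encryptedBase64Length(80));  // 96 cipher bytes -> 128 base64 characters

Note this counts only the ciphertext; if you also store an IV alongside it before Base64-encoding, add its size to the byte count first.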
