A random string should be incompressible.
pi = "31415..."
pi.size # => 10000
XZ.compress(pi).size # => 4540
A random hex string also gets significantly compressed. A random byte string, however, does not get compressed.
The string of pi only contains the bytes 48 through 57. With a prefix code on the integers, this string can be heavily compressed. Essentially, I'm wasting space by representing my 10 different characters in whole bytes (or 16 characters, in the case of the hex string). Is this what's going on?
Can someone explain to me what the underlying method is, or point me to some sources?
It's a matter of information density. Compression is about removing redundant information.
In the string "314159", each character occupies 8 bits, and can therefore have any of 2^8 or 256 distinct values, but only 10 of those values are actually used. Even a painfully naive compression scheme could represent the same information using 4 bits per digit; this is known as Binary Coded Decimal. More sophisticated compression schemes can do better than that (a decimal digit effectively carries log2(10), or about 3.32, bits), but at the expense of storing some extra information that allows for decompression.
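As a quick sketch of that naive 4-bits-per-digit idea, here is plain BCD packing in Python (illustrative only; this is not what xz actually does):

```python
def bcd_pack(digits: str) -> bytes:
    """Pack decimal digits two per byte, 4 bits each (pad odd length with 0xF)."""
    nibbles = [int(d) for d in digits]
    if len(nibbles) % 2:
        nibbles.append(0xF)  # padding marker, not a valid digit
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))

def bcd_unpack(data: bytes) -> str:
    out = []
    for b in data:
        for nib in (b >> 4, b & 0xF):
            if nib <= 9:  # skip the padding marker
                out.append(str(nib))
    return "".join(out)

packed = bcd_pack("314159")
print(len(packed))  # 6 digits now fit in 3 bytes instead of 6
```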
In a random hexadecimal string, each 8-bit character has 4 meaningful bits, so compression by nearly 50% should be possible. The longer the string, the closer you can get to 50%. If you know in advance that the string contains only hexadecimal digits, you can compress it by exactly 50%, but of course that loses the ability to compress anything else.
In a random byte string, there is no opportunity for compression; you need the entire 8 bits per character to represent each value. If it's truly random, attempting to compress it will probably expand it slightly, since some additional information is needed to indicate that the output is compressed data.
Explaining the details of how compression works is beyond both the scope of this answer and my expertise.
In addition to Keith Thompson's excellent answer, there's another point that's relevant to LZMA (which is the compression algorithm that the XZ format uses). The number pi does not consist of a single repeating string of digits, but neither is it completely random. It does contain substrings of digits which are repeated within the larger sequence. LZMA can detect these and store only a single copy of the repeated substring, reducing the size of the compressed data.
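For reference, the behaviour described in the question can be reproduced with Python's standard lzma module (the xz format uses the same LZMA algorithm). The exact sizes will vary, but the pattern holds:

```python
import lzma
import os
import random

random.seed(42)
digits = "".join(random.choice("0123456789") for _ in range(10000)).encode()
hexstr = "".join(random.choice("0123456789abcdef") for _ in range(10000)).encode()
randbytes = os.urandom(10000)

print(len(lzma.compress(digits)))     # well under 10000 bytes
print(len(lzma.compress(hexstr)))     # roughly half of 10000
print(len(lzma.compress(randbytes)))  # slightly LARGER than 10000
```

The random byte string actually grows a little, since the xz container format adds headers around data it cannot compress.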
Related
I thought I knew the base64 encoding, but I often see text encoded like this: iVBORw0KGgoAAAANSUhEUgAAAXwAAAG4C ... YPQr/w8B0CBr+DAkGQAAAABJRU5ErkJggg==. I mean, it ends with a double =. Why does it append a second padding character, if a single = would be enough to fill out the encoded text?
I found the answer. The length of base64-encoded data must be divisible by 4, and any shortfall is filled with = characters. This comes from a mismatch of unit sizes: modern systems use 8-bit bytes, but base64 uses 6-bit characters, so the lowest common multiple is 24 bits, i.e. 3 bytes or 4 base64 characters. Whatever is missing to complete a group of 4 is filled with =.
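The padding rule is easy to see directly; a quick Python sketch:

```python
import base64

# 1 input byte  -> 4 output chars, two '=' pads
# 2 input bytes -> 4 output chars, one '=' pad
# 3 input bytes -> 4 output chars, no padding needed
for payload in (b"M", b"Ma", b"Man"):
    print(len(payload), base64.b64encode(payload))
```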
This is probably a very silly question, but after searching for a while, I couldn't find a straight answer.
If a source filter (such as the LAV Audio codec) is processing a 24-bit integral audio stream, how are individual audio samples delivered to the graph?
(for simplicity, let's consider a monophonic stream)
Are they stored individually on a 32-bit integer with the most-significant bits unused, or are they stored in a packed form, with the least significant bits of the next sample occupying the spare, most-significant bits of the current sample?
The format is similar to 16-bit PCM: the values are signed integers, little endian.
With 24-bit audio you normally define the format with the help of WAVEFORMATEXTENSIBLE structure, as opposed to WAVEFORMATEX (well, the latter is also possible in terms of being accepted by certain filters, but in general you are expected to use the former).
The structure has two values: number of bits per sample and number of valid bits per sample. So it's possible to have the 24-bit data represented as 24-bit values, and also as 24-bit meaningful bits of 32-bit values. The payload data should match the format.
There is no mix of bits of different samples within a byte:
However, wBitsPerSample is the container size and must be a multiple of 8, whereas wValidBitsPerSample can be any value not exceeding the container size. For example, if the format uses 20-bit samples, wBitsPerSample must be at least 24, but wValidBitsPerSample is 20.
To the best of my knowledge, it's typical to have just 24-bit values, that is, three bytes per PCM sample.
Non-PCM formats might define different packing and use the "unused" bits more efficiently, so that, for example, two samples of 20-bit audio consume 5 bytes.
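A sketch of what tightly packed 24-bit little-endian signed samples look like, in Python (this illustrates the byte layout described above, not any particular DirectShow API):

```python
def pack24(samples):
    """Pack signed 24-bit samples as 3 bytes each, little-endian, no padding."""
    out = bytearray()
    for s in samples:
        assert -(1 << 23) <= s < (1 << 23), "out of 24-bit range"
        out += (s & 0xFFFFFF).to_bytes(3, "little")
    return bytes(out)

def unpack24(data):
    """Inverse of pack24: sign-extend each 3-byte group back to an int."""
    samples = []
    for i in range(0, len(data), 3):
        v = int.from_bytes(data[i:i + 3], "little")
        if v & 0x800000:      # top bit set -> negative value
            v -= 1 << 24
        samples.append(v)
    return samples

frames = [0, 1, -1, 8388607, -8388608]
print(len(pack24(frames)))  # 5 samples * 3 bytes = 15 bytes
```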
I watched a video on YouTube about bits. After watching it, I am confused about how much space a number, string or character actually takes. What I understood from the video is:
1= 00000001 // 1 bit
3= 00000011 // 2 bits
511 = 111111111 // 9 bits
4294967295= 11111111111111111111111111111111 //32 bit
1.5 = ? // ?
I just want to know whether the statements given above (except the one with the decimal point) are correct, or whether every number, string or character takes 8 bytes. I am using a 64-bit operating system.
And what is the binary representation of a decimal value such as 1.5?
If I understand correctly, you're asking how many bits/bytes are used to represent a given number or character. I'll try to cover the common cases:
Integer (whole number) values
Since most systems use 8 bits per byte, integer numbers are usually represented as a multiple of 8 bits:
8 bits (1 byte) is typical for the C char datatype.
16 bits (2 bytes) is typical for int or short values.
32 bits (4 bytes) is typical for int or long values.
Each successive bit is used to represent a value twice the size of the previous one, so the first bit represents one, the second bit represents two, the third represents four, and so on. If a bit is set to 1, the value it represents is added to the "total" value of the number as a whole, so the 4-bit value 1101 (8, 4, 2, 1) is 8+4+1 = 13.
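In Python terms, that place-value sum looks like this:

```python
bits = "1101"
# add 2**i for every position i (counting from the right) whose bit is 1
total = sum(2 ** i for i, b in enumerate(reversed(bits)) if b == "1")
print(total)         # 13
print(int(bits, 2))  # same result via the built-in base-2 parser
```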
Note that the zeroes are still counted as bits, even for numbers such as 3, because they're still necessary to represent the number. For example:
00000011 represents a decimal value of 3 as an 8-bit binary number.
00000111 represents a decimal value of 7 as an 8-bit binary number.
The zero in the first number is used to distinguish it from the second, even if it's not "set" as 1.
An "unsigned" 8-bit variable can represent 2^8 (256) values, in the range 0 to 255 inclusive. "Signed" values (i.e. numbers which can be negative) are often described as using a single bit to indicate whether the value is positive (0) or negative (1), which would give an effective range of 2^7 (-127 to +127) either way, but since there's not much point in having two different ways to represent zero (-0 and +0), two's complement is commonly used to allow a slightly greater storage range: -128 to +127.
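A quick way to see two's complement at work (illustrative Python, reinterpreting an 8-bit pattern as signed):

```python
def to_signed8(byte):
    """Interpret an 8-bit pattern as a two's complement signed value."""
    return byte - 256 if byte & 0x80 else byte

print(to_signed8(0b00000001))  # 1
print(to_signed8(0b01111111))  # 127
print(to_signed8(0b11111111))  # -1
print(to_signed8(0b10000000))  # -128 (the extra value vs. sign-magnitude)
```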
Decimal (fractional) values
Numbers such as 1.5 are usually represented as IEEE floating point values. A 32-bit IEEE floating point value uses 32 bits like a typical int value, but will use those bits differently. I'd suggest reading the Wikipedia article if you're interested in the technical details of how it works - I hope that you like mathematics.
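For example, the 32 bits of 1.5 can be inspected like this (Python sketch using the standard struct module):

```python
import struct

# Reinterpret the 4 bytes of a 32-bit float as an unsigned integer
bits = struct.unpack(">I", struct.pack(">f", 1.5))[0]
print(f"{bits:032b}")
# 0 01111111 10000000000000000000000
# ^ sign     ^ mantissa: implicit leading 1, plus 0.5
#   ^ exponent field 127 = the bias, i.e. an exponent of 2^0
```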
Alternatively, non-integer numbers may be represented using a fixed point format; this was a fairly common occurrence in the early days of DOS gaming, before FPUs became a standard feature of desktop machines, and fixed point arithmetic is still used today in some situations, such as embedded systems.
Text
Simple ASCII or Latin-1 text is usually represented as a series of 8-bit bytes - in other words it's a series of integers, with each numeric value representing a single character code. For example, an 8-bit value of 00100000 (32) represents the ASCII space (' ') character.
Alternative 8-bit encodings (such as JIS X 0201) map those 2^8 number values to different visible characters, whilst yet other encodings may use 16-bit or 32-bit values for each character instead.
Unicode character sets (such as the 8-bit UTF-8 or 16-bit UTF-16) are more complicated; a single UTF-16 character might be represented as a single 16-bit value or a pair of 16-bit values, whilst UTF-8 characters can be anywhere from one 8-bit byte to four 8-bit bytes!
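You can see the variable UTF-8 widths directly in Python:

```python
# One character each from the 1-, 2-, 3- and 4-byte UTF-8 ranges
for ch in ("A", "é", "€", "𝄞"):
    print(ch, len(ch.encode("utf-8")), "byte(s) in UTF-8")
```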
Endian-ness
You should also be aware that values spanning more than a single 8-bit byte are typically byte-ordered in one of two ways: little endian, or big endian.
Little Endian: A 16-bit value of 511 (0x01FF) would be represented as 11111111 00000001 (i.e. the least-significant byte comes first).
Big Endian: A 16-bit value of 511 (0x01FF) would be represented as 00000001 11111111 (i.e. the most-significant byte comes first).
You may also hear of mixed-endian, middle-endian, or bi-endian representations - see the Wikipedia article for further information.
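Python's struct module shows the two byte orders directly (using the 16-bit value 511 = 0x01FF):

```python
import struct

little = struct.pack("<H", 511)  # least-significant byte first
big    = struct.pack(">H", 511)  # most-significant byte first
print(little.hex())  # ff01
print(big.hex())     # 01ff
```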
There is a difference between a bit and a byte.
1 bit is a single digit 0 or 1 in base 2.
1 byte = 8 bits.
Yes, the statements you gave are correct.
The binary representation of 1.5 is 1.1 (or, padded, 001.100). However, that is just how we write binary on paper; the way a computer stores numbers is different and depends on the compiler and platform. C, for example, uses the IEEE 754 format. Look it up to learn more.
Your OS being 64-bit means your CPU architecture is 64-bit.
a bit is one binary digit, e.g. 0 or 1.
a byte is eight bits, or two hex digits
A Nibble is half a byte or 1 hex digit
Words get a bit more complex. Originally a word was the number of bytes required to cover the range of addresses available in memory, and there have been many hybrid architectures and memory schemes. Nowadays a word is usually two bytes, a double word four bytes, and so on.
In computing, everything comes down to binary, a combination of 0s and 1s. Characters, decimal numbers etc. are representations.
So the character '0' in 7-bit (or 8-bit) ASCII is 00110000 in binary, 30 in hex, and 48 as a decimal (base 10) number. It's only '0' if you choose to 'see' it as a single-byte character.
Representing numbers with decimal points is even more varied. There are many accepted ways of doing that, but they are conventions not rules.
Have a look at 1s and 2s complement, Gray code, BCD, floating-point representation and such to get more of an idea.
How do we shrink/encode a 20-letter string to 6 letters? I found a few algorithms that address data compression, like RLE, arithmetic coding and universal codes, but none of them guarantees 6 letters.
The original string can contain the characters A-Z (upper case), 0-9 and a dash.
If your goal is to losslessly compress or hash a random input string of 20 characters (each character being one of [A-Z], [0-9] or -) to an output string of 6 characters, it's theoretically impossible.
In information theory, given a discrete random variable X = {x1, ..., xn}, the Shannon entropy H(X) is defined as:

H(X) = -sum(i=1..n) p(xi) * log2 p(xi)

where p(xi) is the probability of X = xi. In your case, the input is one of 37^20 possible strings (20 characters, 37 possibilities each), so X = {x1, ..., xn} with n = 37^20. Supposing the 37 characters all have the same probability (i.e. the input string is random), then p(xi) = 1/37^20, so the Shannon entropy of the input is:

H(X) = -sum (1/37^20) * log2(1/37^20) = 20 * log2(37) ≈ 104.2 bits

A char on a common computer holds 8 bits, so 6 chars can hold 48 bits. There's no way to hold about 104 bits of information in 6 chars; you need at least 14 chars (14 × 8 = 112 bits) to hold it.
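The arithmetic, as a quick check in Python:

```python
import math

# 20 characters, each with 37 equally likely values
entropy_bits = 20 * math.log2(37)
print(round(entropy_bits, 1))       # ~104.2 bits of information
print(math.ceil(entropy_bits / 8))  # minimum number of 8-bit chars: 14
```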
If you do allow loss and have to hash the 20 chars into 6 chars, then you are trying to hash 37^20 values into 128^6 keys. It can be done, but you will get plenty of hash collisions.
In your case, even supposing you hash with perfect uniformity (otherwise it would be worse), each hash key is shared on average by about 37^20 / 128^6 ≈ 5.26 × 10^18 input values. By the birthday bound, you could expect to find a collision within roughly sqrt(128^6) ≈ 2 million trials, which takes a common laptop seconds at most. So I don't think this would be a safe hash.
However, if you insist on doing that, you might want to read up on hash function algorithms; there are plenty listed for you to choose from. Good luck!
I am trying to learn the basics of compression using only ASCII.
Suppose I am sending an email consisting of strings of lower-case letters. If the file has n characters, each stored as an 8-bit extended ASCII code, then we need 8n bits.
But according to the guiding principle of compression, we discard the unimportant information.
So, using that, we don't need all the ASCII codes to encode strings of lowercase letters: they use only 26 characters. We can make our own code with 5-bit codewords (2^5 = 32 > 26), encode the file using this coding scheme, and then decode the email once received.
The size has decreased by 8n - 5n = 3n, i.e. a 37.5% reduction.
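A minimal sketch of that 5-bit scheme in Python (assuming the input really is only a-z; the padding convention is my own choice for illustration):

```python
def encode5(text: str) -> bytes:
    """Pack lowercase letters at 5 bits each into a byte string."""
    bits = 0
    nbits = 0
    out = bytearray()
    for ch in text:
        code = ord(ch) - ord("a")  # 0..25
        assert 0 <= code <= 25, "only a-z supported"
        bits = (bits << 5) | code
        nbits += 5
        while nbits >= 8:          # emit every complete byte
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                      # pad the final partial byte with zeros
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

packed = encode5("compression")
print(len("compression"), "chars ->", len(packed), "bytes")  # 11 -> 7
```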
But what if the email contains lower-case letters (26), upper-case letters, and m extra characters, and they all have to be stored efficiently?
If you have n symbols of equal probability, then it is possible to code each symbol using log2(n) bits. This is true even if log2(n) is fractional, using arithmetic or range coding. If you limit it to Huffman (fixed number of bits per symbol) coding, you can get close to log2(n), with still a fractional number of bits per symbol on average.
For example, you can encode ten symbols (e.g. decimal digits) in very close to 3.322 bits per symbol with arithmetic coding. With Huffman coding, you can code six of the symbols with three bits and four of the symbols with four bits, for an average of 3.4 bits per symbol.
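The averages mentioned can be verified quickly:

```python
import math

# Theoretical minimum for ten equally likely symbols
print(round(math.log2(10), 3))      # ~3.322 bits per symbol

# Huffman code with six 3-bit codewords and four 4-bit codewords
avg_huffman = (6 * 3 + 4 * 4) / 10
print(avg_huffman)                  # 3.4 bits per symbol on average
```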
The use of shift-up and shift-down operations can be beneficial since in English text you expect to have strings of lower case characters with occasional upper case characters. Now you are getting into both higher order models and unequal frequency distributions.