How to calculate integer based on byte sequence - python-3.x

So, I'm trying to understand the math involved when trying to translate hexadecimal escape sequences into integers.
So if I have the string "Ã", when I do "Ã".encode('utf-8') I get a byte string like this: "\xc3". ord("Ã") is 195. The math is 16*12+3, which is 195. Things make sense.
But if I have the character "é", then the UTF-8-encoded hex escape sequence is "\xc3\xa9" - and ord("é") is 233. How is this calculation performed? (a9 on its own is 169, so it's clearly not addition.)
Similarly with 'Ĭ'.encode('utf-8'): this yields b'\xc4\xac', and ord('Ĭ') is 300.
Can anyone explain the math involved here?

UTF-8 was designed according to a few general design principles / constraints. It is important to understand these design principles in order to understand why the encoding algorithm of UTF-8 is what it is.
Backwards-compatibility with ASCII: every ASCII character should have the same encoding in ASCII and UTF-8.
Detectability of non-ASCII characters: no octet that would be a valid encoding of an ASCII character should appear in the multi-octet encoding sequence of a non-ASCII character.
Length encoding: the length of a multi-octet encoding sequence should be encoded in the first octet, so that we know before reading the entire multi-octet encoding sequence how long it will be. Also, it is easy for a human to determine the length of the multi-octet encoding sequence.
Fallback / auto detection: text that is in one of the popular 8-bit encodings (e.g. ISO8859-15, Windows-1252) is highly unlikely to contain sequences that are valid UTF-8 multi-octet encoding sequences, therefore such encodings can be easily detected and vice versa.
Self-synchronizing: you can start decoding anywhere in the middle of a UTF-8 stream, and it will take you at most until the next ASCII character or start of the next multi-octet encoding sequence to be able to start decoding valid characters. If you can navigate backwards in the stream, it will take backing up at most 3 octets to find a valid start point.
Sorting order: sorting UTF-8 streams by octets will automatically yield a sort order by codepoints without having to decode the stream.
The way UTF-8 encoding works is like this:
Any ASCII character is encoded the same way as in ASCII, as a single octet starting with a 0 bit.
Any non-ASCII character is encoded as a multi-octet sequence.
The first octet of the multi-octet encoding sequence starts with the bit pattern 110, 1110, or 11110, where the number of 1 bits denotes the length of the multi-octet sequence, i.e. a multi-octet sequence starting with the octet 1110xxxx is 3 octets long.
Any further octet that is part of a multi-octet sequence starts with the bit pattern 10.
The Unicode code point is encoded into the non-fixed bits of the multi-octet encoding sequence.
Here is an example: A has the Unicode code point U+0041. Since it is an ASCII character, it will simply be encoded the same way as in ASCII, i.e. as binary 01000001.
The Euro sign € has the Unicode code point U+20AC. Since it is not an ASCII character, it needs to be encoded as a multi-octet encoding sequence. Hexadecimal 0x20AC in binary is 10000010101100, so it requires 14 bits to represent.
A two-octet sequence looks like this: 110xxxxx 10xxxxxx, so it gives us only 11 bits. Therefore, we need a three-octet sequence, which looks like this: 1110xxxx 10xxxxxx 10xxxxxx. This gives us 16 bits, which is more than we need. The zero-extended binary representation of the code point now simply gets packed into the xes:
11100010 10000010 10101100
^^^^00xx ^^xxxxxx ^^xxxxxx
The hexadecimal representation of this bitstring is 0xE2 0x82 0xAC.
Note: it would be possible to encode this also as a four-octet sequence, by zero-extending the code point even further. This is called an overlong encoding and is not allowed by the UTF-8 specification. Encodings must be as short as possible.
There is an encoding called Modified UTF-8 which encodes ASCII NUL not as ASCII but as an overlong multi-octet sequence. That way, a MUTF-8 string can contain ASCII NUL characters without ever containing a 0x00 null octet and can thus be processed by environments which expect strings to be null-terminated.
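To make the bit-packing concrete, here is a minimal Python sketch (the helper name utf8_encode_manual is made up for this illustration, and it deliberately skips the validity checks a real encoder performs, such as rejecting surrogates and overlong forms):
def utf8_encode_manual(codepoint: int) -> bytes:
    # Pack a Unicode code point into UTF-8 octets as described above.
    if codepoint < 0x80:        # ASCII: a single octet 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:       # two octets: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0b111111)])
    if codepoint < 0x10000:     # three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0b111111),
                      0b10000000 | (codepoint & 0b111111)])
    # four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (codepoint >> 18),
                  0b10000000 | ((codepoint >> 12) & 0b111111),
                  0b10000000 | ((codepoint >> 6) & 0b111111),
                  0b10000000 | (codepoint & 0b111111)])

print(utf8_encode_manual(ord('€')).hex())   # e282ac
print('€'.encode('utf-8').hex())            # e282ac - the built-in encoder agrees
Running it on the Euro sign reproduces the 0xE2 0x82 0xAC sequence derived above.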

From the doc:
ord(c)
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr().
What ord returns is the Unicode code point of the character - roughly, a number letting you identify the character among the large number of characters known in Unicode.
When you encode your character with UTF-8, you represent it by a sequence of bytes, which is not directly related to the Unicode code point. There can be some coincidences, mainly for ASCII characters that get represented with a sequence of one byte, but this will fail for all more 'exotic' characters.
Have a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and the wikipedia page about UTF-8.
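A quick interactive check (my own example, not from the original answer) makes the distinction visible:
# The code point and the UTF-8 bytes coincide only for ASCII characters.
for ch in ('a', 'é', '€'):
    print(ch, hex(ord(ch)), ch.encode('utf-8').hex())
# a 0x61 61       -> same value, one byte
# é 0xe9 c3a9     -> code point 0xE9, but two UTF-8 bytes
# € 0x20ac e282ac -> code point 0x20AC, but three UTF-8 bytes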

The Unicode code point of "é" is 0xE9 (the same value as its Latin-1/extended-ASCII code), which is 233 in decimal.
Sample code for your convenience:
for n in range(256):
    print(n, hex(n), chr(n))

So, I thought I'd just wrap this one up and post the answers to the math issues I didn't comprehend before receiving a ton of wisdom from SO.
The first question regarded "é", which yields "\xc3\xa9" when encoded with UTF-8 and where ord("é") returns 233. Clearly 233 is not the sum of 195 (the decimal representation of c3) and 169 (ditto for a9). So what's going on?
"é" has the corresponding Unicode code point U+00E9. The decimal value of the hex e9 is 233. So that's what ord("é") is all about.
So how does this end up as "\xc3\xa9"?
As Jörg W Mittag explained and demonstrated, in UTF-8 all non-ASCII characters are "encoded as a multi-octet sequence".
The binary representation of 233 is 11101001. As the character is non-ASCII, this needs to be packed into a two-octet sequence, which according to Jörg will follow this pattern:
110xxxxx 10xxxxxx (110 and 10 are fixed, leaving room for five bits in the first octet and six bits in the second - 11 in total).
So the 8-bit binary representation of 233 is fitted into this pattern, replacing the x-parts. Since there are 11 bits available and we only need 8, we pad the 8 bits with 3 leading zeros (i.e. 00011101001).
^^^00011 ^^101001 (000 followed by our 8-bit representation of 233)
11000011 10101001 (binary representation of 233 inserted into a two-octet sequence)
11000011 equals the hex c3, and 10101001 equals a9 - which, in other words, matches the original sequence "\xc3\xa9".
A similar walkthrough for the character "Ĭ":
'Ĭ'.encode('utf-8') yields b'\xc4\xac'. And ord('Ĭ') is 300.
So again, the Unicode code point for this character is U+012C, which has the decimal value of 300 ((1*16*16)+(2*16*1)+(12*1)) - so that's the ord part.
The binary representation of 300 is 9 bits, 100101100. So once more there's a need for a two-octet sequence of the pattern 110xxxxx 10xxxxxx. And again we pad it with a couple of zeros to reach 11 bits (00100101100).
^^^00100 ^^101100 (00 followed by our 9-bit representation of 300)
11000100 10101100 (binary representation of 300 inserted into a two-octet sequence).
11000100 corresponds to c4 in hex, 10101100 to ac - in other words, b'\xc4\xac'.
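Both walkthroughs can be checked in a couple of lines of Python (the helper name two_octet_utf8 is made up for this check):
def two_octet_utf8(codepoint: int) -> bytes:
    # Pack an 11-bit code point into the pattern 110xxxxx 10xxxxxx.
    assert 0x80 <= codepoint < 0x800
    return bytes([0b11000000 | (codepoint >> 6),        # top 5 bits
                  0b10000000 | (codepoint & 0b111111)])  # low 6 bits

for ch in ('é', 'Ĭ'):
    print(ch, ord(ch), two_octet_utf8(ord(ch)), ch.encode('utf-8'))
# é 233 b'\xc3\xa9' b'\xc3\xa9'
# Ĭ 300 b'\xc4\xac' b'\xc4\xac'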
Thank you everyone for helping out on this. I learned a lot.

Related

base64: why does the encoding end with ==?

I thought I knew the base64 encoding, but I often see text encoded like this: iVBORw0KGgoAAAANSUhEUgAAAXwAAAG4C ... YPQr/w8B0CBr+DAkGQAAAABJRU5ErkJggg==. I mean, it ends with a double =. Why does it append a second padding character, if 8 bits would be enough to fill out the remaining bits of the encoded text?
I found the answer. The length of base64-encoded data must be divisible by 4; the remainder of the length divided by 4 is filled with = characters. That's because of a mismatch in bit widths: modern systems use 8-bit bytes, but base64 uses 6-bit characters. The lowest common multiple is 24 bits, i.e. 3 bytes or 4 base64 characters. The shortfall is filled with =.
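A short Python illustration of that rule, using only the standard library:
import base64

# Every 3 input bytes become 4 base64 characters; any shortfall is filled with '='.
for data in (b'a', b'ab', b'abc', b'abcd'):
    print(len(data), base64.b64encode(data))
# 1 b'YQ=='      (2 characters of data + '==')
# 2 b'YWI='      (3 characters of data + '=')
# 3 b'YWJj'      (4 characters, no padding needed)
# 4 b'YWJjZA=='  (starts a new group of 4, padded again)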

Does the method `crypto.randomBytes()` in Node.js generate more bytes than specified in the parameter `size`?

I have Node.js v6.3.1. Why does the following code generate two characters instead of one?
crypto.randomBytes( 1 ).toString('hex')
Does one byte encode to two characters? Could this be a bug? (docs for randomBytes())
One byte is expressed in hexadecimal encoding as two characters, each in the range 0-9a-f (upper or lower case). Each character represents 4 bits.
Hexadecimal (hex) is generally used to represent binary data because some or most 8-bit values, depending on the character encoding, cannot be represented as printable characters. As an example, the byte with the bits 00000111 is represented as 07 in hex; it is the bell character, so it can't print.
See Hexadecimal.
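The same behaviour is easy to reproduce in Python (used here because the rest of the thread is Python; os.urandom stands in for crypto.randomBytes):
import os

one_byte = os.urandom(1)          # one random byte...
print(len(one_byte))              # 1
print(one_byte.hex())             # ...but two hex characters, e.g. '9f'
print(bytes([0b00000111]).hex())  # '07' - the unprintable bell character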

What is bits in computer science

I watched a video on YouTube about bits. After watching it, I am confused about how much space a number, string or character actually takes. What I have understood from the video is:
1= 00000001 // 1 bit
3= 00000011 // 2 bits
511 = 111111111 // 9bits
4294967295= 11111111111111111111111111111111 //32 bit
1.5 = ? // ?
I just want to know whether the statements given above (except the one with the decimal point) are correct, or whether every number, string or character takes 8 bytes. I am using a 64-bit operating system.
And what is the binary code of a decimal value?
If I understand correctly, you're asking how many bits/bytes are used to represent a given number or character. I'll try to cover the common cases:
Integer (whole number) values
Since most systems use 8-bits per byte, integer numbers are usually represented as a multiple of 8 bits:
8 bits (1 byte) is typical for the C char datatype.
16 bits (2 bytes) is typical for int or short values.
32 bits (4 bytes) is typical for int or long values.
Each successive bit is used to represent a value twice the size of the previous one, so the first bit represents one, the second bit represents two, the third represents four, and so on. If a bit is set to 1, the value it represents is added to the "total" value of the number as a whole, so the 4-bit value 1101 (8, 4, 2, 1) is 8+4+1 = 13.
Note that the zeroes are still counted as bits, even for numbers such as 3, because they're still necessary to represent the number. For example:
00000011 represents a decimal value of 3 as an 8-bit binary number.
00000111 represents a decimal value of 7 as an 8-bit binary number.
The zero in the first number is used to distinguish it from the second, even if it's not "set" as 1.
An "unsigned" 8-bit variable can represent 2^8 (256) values, in the range 0 to 255 inclusive. "Signed" values (i.e. numbers which can be negative) are often described as using a single bit to indicate whether the value is positive (0) or negative (1), which would give an effective range of 2^7 (-127 to +127) either way, but since there's not much point in having two different ways to represent zero (-0 and +0), two's complement is commonly used to allow a slightly greater storage range: -128 to +127.
Decimal (fractional) values
Numbers such as 1.5 are usually represented as IEEE floating point values. A 32-bit IEEE floating point value uses 32 bits like a typical int value, but will use those bits differently. I'd suggest reading the Wikipedia article if you're interested in the technical details of how it works - I hope that you like mathematics.
Alternatively, non-integer numbers may be represented using a fixed point format; this was a fairly common occurrence in the early days of DOS gaming, before FPUs became a standard feature of desktop machines, and fixed point arithmetic is still used today in some situations, such as embedded systems.
Text
Simple ASCII or Latin-1 text is usually represented as a series of 8-bit bytes - in other words it's a series of integers, with each numeric value representing a single character code. For example, an 8-bit value of 00100000 (32) represents the ASCII space (' ') character.
Alternative 8-bit encodings (such as JIS X 0201) map those 2^8 number values to different visible characters, whilst yet other encodings may use 16-bit or 32-bit values for each character instead.
Unicode character sets (such as the 8-bit UTF-8 or 16-bit UTF-16) are more complicated; a single UTF-16 character might be represented as a single 16-bit value or a pair of 16-bit values, whilst UTF-8 characters can be anywhere from one 8-bit byte to four 8-bit bytes!
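For example, in Python (my own illustration):
# Bytes needed per character in UTF-8 versus UTF-16.
for ch in ('A', 'é', '€', '😀'):
    print(ch, len(ch.encode('utf-8')), len(ch.encode('utf-16-le')))
# A 1 2   -> one UTF-8 byte; one 16-bit UTF-16 code unit (2 bytes)
# é 2 2
# € 3 2
# 😀 4 4  -> four UTF-8 bytes; a UTF-16 surrogate pair (two code units, 4 bytes)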
Endian-ness
You should also be aware that values spanning more than a single 8-bit byte are typically byte-ordered in one of two ways: little endian, or big endian.
Little Endian: A 16-bit value of 511 (0x01FF) would be stored as the byte sequence 11111111 00000001 (i.e. the least-significant byte comes first).
Big Endian: A 16-bit value of 511 (0x01FF) would be stored as the byte sequence 00000001 11111111 (i.e. the most-significant byte comes first).
You may also hear of mixed-endian, middle-endian, or bi-endian representations - see the Wikipedia article for further information.
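A short Python check of the two byte orders for the 16-bit value 511 (0x01FF):
import sys

print((511).to_bytes(2, 'little').hex())  # 'ff01' - least-significant byte first
print((511).to_bytes(2, 'big').hex())     # '01ff' - most-significant byte first
print(sys.byteorder)                      # byte order of the machine running this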
There is a difference between a bit and a byte.
1 bit is a single digit 0 or 1 in base 2.
1 byte = 8 bits.
Yes, the statements you gave are correct.
The binary code of 1.5 would be 001.100. However, this is just how we interpret binary on paper; the way a computer stores numbers is different and depends on the compiler and platform. For example, C uses the IEEE 754 format. Look it up to learn more.
Your OS being 64-bit means your CPU architecture is 64-bit.
A bit is one binary digit, i.e. 0 or 1.
A byte is eight bits, or two hex digits.
A nibble is half a byte, or one hex digit.
Words get a bit more complex. Originally a word was the number of bytes required to cover the range of addresses available in memory, but there have been a lot of hybrid architectures and memory schemes. A word is usually two bytes, a double word four bytes, and so on.
In computing, everything comes down to binary, a combination of 0s and 1s. Characters, decimal numbers, etc. are representations.
So the character '0' in 7-bit (or 8-bit) ASCII is 00110000 in binary, 30 in hex, and 48 as a decimal (base 10) number. It's only '0' if you choose to 'see' it as a single-byte character.
Representing numbers with decimal points is even more varied. There are many accepted ways of doing that, but they are conventions, not rules.
Have a look at one's and two's complement, Gray code, BCD, floating-point representation and such to get more of an idea.
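The '0' example above is easy to verify in Python:
ch = '0'
print(ord(ch))                  # 48 (decimal)
print(hex(ord(ch)))             # 0x30 (hex)
print(format(ord(ch), '08b'))   # 00110000 (binary)
print(chr(48))                  # '0' again, if you choose to 'see' it as a character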

how Base64 API understands end of String while decoding

Sometimes I am unable to provide the entire string, but the Base64 API can decode a truncated string as well. How does Base64 understand the end of the string?
How does Base64 understand the end of the string?
You haven't said which Base64 API you're using, but typically they require that the string you provide is a multiple of 4 characters in length. Each 4 characters in a base64 string corresponds to 3 bytes.
If the overall binary data is not a multiple of 3 bytes, the final 4 characters contain padding of the = character to indicate the desired length. See the Padding section in the Wikipedia Base64 article for more details.
In Base64 each character represents one of 64 values: a 6-bit value. But bytes are 8-bit values, so base64-encoded data must somehow work out to a multiple of both 6 and 8 bits.
Well, one 6-bit character is obviously not going to fill one byte, and two 6-bit characters (12 bits) do not exactly fill two bytes. Three 6-bit characters (18 bits) fill a little more than two bytes, but not three. However, four 6-bit characters (24 bits) fill exactly three 8-bit bytes.
So a base64 string must be a multiple of 4 characters in length in order to fill a whole number of 8-bit bytes with data. This means you can split base64-encoded data at any multiple of four characters and it will still work. But if you split the data at any other position, it will probably not work.
This also holds for the end of the data. For example, if I have only five 8-bit bytes to encode (40 bits) but the base64 string must then be two groups of four 6-bit characters (48 bits), then I am 8 bits of data short. For those remaining (partial) 6-bit characters, the = character is used to indicate that no further data follows.
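Here is a small Python demonstration of why truncating only works at multiples of four characters (the input string is just an arbitrary example):
import base64

encoded = base64.b64encode(b'Hello, base64 world!')
print(encoded)                          # b'SGVsbG8sIGJhc2U2NCB3b3JsZCE='

# Cutting at a multiple of 4 characters still decodes cleanly;
# you simply lose the trailing bytes of the original data.
print(base64.b64decode(encoded[:8]))    # b'Hello,'
print(base64.b64decode(encoded[:12]))   # b'Hello, ba'

# Cutting anywhere else leaves an incomplete group of 4 and the decoder complains.
try:
    base64.b64decode(encoded[:10])
except Exception as exc:
    print('decode failed:', exc)        # binascii.Error ('Incorrect padding')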

Compression using Ascii, trying to figure out how many bits to store the following efficiently

I am trying to learn the basics of compression using only ASCII.
Suppose I am sending an email consisting of strings of lower-case letters. If the file has n characters, each stored as an 8-bit extended ASCII code, then we need 8n bits.
But according to Guiding principle of compression: we discard the unimportant information.
So using that principle, we don't need all the ASCII codes to encode strings of lowercase letters: they use only 26 characters. We can make our own code with only 5-bit codewords (2^5 = 32 > 26), code the file using this coding scheme and then decode the email once received.
The size has decreased by 8n - 5n = 3n, i.e. a 37.5% reduction.
But what if the email was formed with lower-case letters (26), upper-case letters, and m extra characters, and they all have to be stored efficiently?
If you have n symbols of equal probability, then it is possible to code each symbol using log2(n) bits. This is true even if log2(n) is fractional, using arithmetic or range coding. If you limit it to Huffman (fixed number of bits per symbol) coding, you can get close to log2(n), with still a fractional number of bits per symbol on average.
For example, you can encode ten symbols (e.g. decimal digits) in very close to 3.322 bits per symbol with arithmetic coding. With Huffman coding, you can code six of the symbols with three bits and four of the symbols with four bits, for an average of 3.4 bits per symbol.
The use of shift-up and shift-down operations can be beneficial since in English text you expect to have strings of lower case characters with occasional upper case characters. Now you are getting into both higher order models and unequal frequency distributions.
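A quick sanity check of those numbers in Python (m = 10 below is just an arbitrary example for the extra characters mentioned in the question):
import math

print(math.ceil(math.log2(26)))      # 5  - fixed 5-bit codes cover 26 lower-case letters
print(math.log2(10))                 # 3.3219... bits per decimal digit with arithmetic coding
print((6 * 3 + 4 * 4) / 10)          # 3.4 - the Huffman average described above

m = 10                               # extra characters beyond the 52 letters (example value)
print(math.ceil(math.log2(52 + m)))  # 6  - fixed-length bits needed for 52 + m symbols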
