Printing strings and characters as hexadecimal in Go

Why do Cyrillic strings in hexadecimal format differ from Cyrillic chars in hexadecimal format?
str := "Э"
fmt.Printf("%x\n", str)
// result: d0ad

str := 'Э'
fmt.Printf("%x\n", str)
// result: 42d

Printing a string with the %x verb prints the hex representation of its bytes, while printing a rune with %x prints the hex representation of the integer it holds (rune is an alias for int32), i.e. its Unicode code point.
Strings in Go hold the UTF-8 encoded byte sequence of the text. In UTF-8, characters (runes) with a code point greater than 127 have a multi-byte representation.
The rune Э has a multi-byte representation in UTF-8 (the bytes [208, 173], i.e. 0xd0 0xad), and that is not the same as the representation of the 32-bit integer 1069 = 0x42d; integers are represented in memory using two's complement, not UTF-8.
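A minimal sketch putting both side by side (the extra conversions are just for illustration):
package main

import "fmt"

func main() {
    s := "Э" // a string holds the UTF-8 bytes of the text
    r := 'Э' // a rune is an int32 holding the Unicode code point 1069 (0x42d)

    fmt.Printf("%x\n", s)          // d0ad: hex of the UTF-8 bytes
    fmt.Printf("% x\n", []byte(s)) // d0 ad: the same bytes, one by one
    fmt.Printf("%x\n", r)          // 42d: hex of the code point
    fmt.Println(int32(r))          // 1069
}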
Recommended blog post: Strings, bytes, runes and characters in Go

Related

How to convert a byte to the correct decimal representation?

I have a function that checks a bit in a specific index of a string (converted to byte representation):
fn check_bit(s: String) -> bool {
    let bytes = s.as_bytes(); // converts the string to bytes
    let byte = bytes[0]; // pick the first byte
    // byte here seems to be in decimal representation
    byte | 0xf7 == 0xff // therefore this returns an incorrect value
}
After printing and doing arithmetic operations on the byte variable, I noticed that byte is a decimal number. For example, for the ASCII value of char b (0x98), byte stores 98 as decimal, which results in an incorrect value when doing bitwise operations on it. How can I convert this decimal value to the correct hex, binary or decimal representation? (For 0x98 I am expecting to get the decimal value of 152.)
noticed that byte is a decimal number
An integer has no intrinsic base; it is just a bit pattern in memory. You can write it out in a string as decimal, binary, octal, etc., but the value in memory is the same.
In other words, integers are not stored in memory as strings.
For example, for ascii value of char b (0x98), byte simply stores 98 as decimal
ASCII b is not 0x98, it is 98, which is 0x62:
assert_eq!(98, 0x62);
assert_eq!(98, "b".as_bytes()[0]);
assert_eq!(98, 'b' as i32);
How can I convert this decimal value to correct hex, binary or decimal representation?
Such a conversion does not make sense because, as explained above, integers are not stored as strings.
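To illustrate the point in Go (the language used elsewhere on this page), a minimal sketch: the byte holds one bit pattern, and only the formatting verb chooses the base it is printed in:
package main

import "fmt"

func main() {
    b := byte(98) // the ASCII value of 'b', which is 0x62

    // The same value printed in decimal, hex, octal and binary; the bit pattern never changes.
    fmt.Printf("%d %#x %#o %08b\n", b, b, b, b) // 98 0x62 0142 01100010

    // The bitwise check from the question, applied to the correct value:
    fmt.Println(b|0xf7 == 0xff) // false, because 0x62 | 0xf7 is 0xf7
}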

What are length prefixed strings and what do they look like when encoded in 8-bit binary?

Here is the problem:
Pascal uses length prefixed strings, where the length of a string is encoded in 8-bit binary and stored
before the string. Give the bit string for “BYE!”, encoded in 8-bit ASCII, as it would be encoded in
Pascal.
I understand how the string "BYE!" would be encoded in 8-bit ASCII, but I don't understand how this is supposed to look with the length of the string encoded and stored before the string. I also know how to find the decimal equivalent values for each of the characters in the string, but I'm not sure if that is necessary to answer the question.
The string "BYE!" encoded in ASCII is: 'B' = 01000010, 'Y' = 01011001, 'E' = 01000101, '!' = 00100001.
The decimal equivalent for the string "BYE!" is: 'B' = 66, 'Y' = 89, 'E' = 69, '!' = 33.
The length of the string is 4 characters.
In 8-bit binary, the number 4 is represented as 00000100.
Therefore, in Pascal, the encoded string is: 00000100 01000010 01011001 01000101 00100001
Note that the 8-bit binary number used for the length prefix is different from the 8-bit ASCII encoding used for the actual characters of the string.
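A minimal Go sketch (Go being the language used elsewhere on this page) that builds the Pascal-style encoding and prints each byte as 8 binary digits:
package main

import "fmt"

func main() {
    s := "BYE!"

    // Pascal-style: one length byte followed by the ASCII bytes of the string.
    encoded := append([]byte{byte(len(s))}, s...)

    for _, b := range encoded {
        fmt.Printf("%08b ", b) // each byte as 8 binary digits
    }
    fmt.Println()
    // Output: 00000100 01000010 01011001 01000101 00100001
}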

How to get single backslash instead of double backslash with encode("unicode-escape")?

Get the Unicode code point of the character Ä.
Python 3 version.
>>> str="Ä"
>>> str.encode("unicode-escape")
b'\\xc4'
How do I get the single-backslash form b'\xc4' instead of b'\\xc4' as my output?
It's not entirely clear to me what you want, so I'll give you a few options.
Get the (Unicode) code point of a character as an integer:
>>> ord('Ä')
196
Display the integer in hex notation:
>>> hex(ord('Ä'))
'0xc4'
or with string formatting:
>>> '{:X}'.format(ord('Ä'))
'C4'
However, you talk about backslashes and show the bytestring b'\xc4'.
This is the Latin-1 encoding of 'Ä' (all characters with a Unicode codepoint below 256 can be encoded with Latin-1, and their byte value equals the Unicode codepoint).
>>> 'Ä'.encode('latin-1')
b'\xc4'
This is a bytestring of length 1.
It is displayed in a way in which you could type this character, i.e. using an escape sequence with backslash-x and a two-digit hex number.
The "unicode-escape" codec produces these four ASCII characters (\, x, c, 4), not as a str but as a bytes object (because str.encode() returns bytes by definition).
To get a backslash in a str/bytes literal, you need to type two backslashes, so the representation form also uses two backslashes:
>>> 'Ä'.encode('unicode-escape')
b'\\xc4'
The "unicode-escape" codec is very Python-specific and I don't see a lot of applications; maybe if you want to write your own pickle protocol or parse fragments of Python source code.

What determines the position of a character when looping through UTF-8 strings?

I am reading the section on for statements in the Effective Go documentation and came across this example:
for pos, char := range "日本\x80語" {
    fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
The output is:
Character U+65E5 '日', at position: 0
Character U+672C '本', at position: 3
Character U+FFFD '�', at position: 6
Character U+8A9E '語', at position: 7
What I don't understand is why the positions are 0, 3, 6, and 7. This tells me the first and second characters are 3 bytes long and the 'replacement rune' (U+FFFD) is 1 byte long, which I accept and understand. However, I thought rune was of type int32 and therefore would be 4 bytes each, not three.
Why are the positions in a range different to the total amount of memory each value should be consuming?
String values in Go are stored as read-only byte slices ([]byte), where the bytes are the UTF-8 encoded bytes of the (runes of the) string. UTF-8 is a variable-length encoding; different Unicode code points may be encoded using different numbers of bytes. For example, values in the range 0..127 are encoded as a single byte (whose value is the Unicode code point itself), but values greater than 127 use more than one byte. The unicode/utf8 package contains UTF-8 related utility functions and constants; for example, utf8.UTFMax is the maximum number of bytes a valid Unicode code point may "occupy" in UTF-8 encoding (which is 4).
One thing to note here: not all byte sequences are valid UTF-8 sequences. A string may hold any byte sequence, even one that is an invalid UTF-8 sequence. For example, the string value "\xff" is an invalid UTF-8 byte sequence; for details, see How do I represent an Optional String in Go?
The for range construct, when applied to a string value, iterates over the runes of the string:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
The for range construct may produce 1 or 2 iteration values. When using 2, like in your example:
for pos, char := range "日本\x80語" {
    fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
For each iteration, pos will be the byte index of the rune / character, and char will be the rune of the string. As you can see in the quote above, when an invalid UTF-8 sequence is encountered, char will be 0xFFFD (the Unicode replacement character), and the for range construct (the iteration) will advance a single byte only.
To sum it up: the position is always the byte index of the rune of the current iteration (or more specifically: the byte index of the first byte of the UTF-8 encoded sequence of the rune of the current iteration), but if an invalid UTF-8 sequence is encountered, the position (index) will only be incremented by 1 in the next iteration.
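The same stepping can be reproduced by hand with the unicode/utf8 package; a minimal sketch that prints the positions above along with the width of each step:
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "日本\x80語"
    for i := 0; i < len(s); {
        r, size := utf8.DecodeRuneInString(s[i:]) // size is 1 for an invalid byte
        fmt.Printf("Character %#U, at position: %d (width %d bytes)\n", r, i, size)
        i += size
    }
}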
A must-read blog post if you want to know more about the topic:
The Go Blog: Strings, bytes, runes and characters in Go
A rune is a code point, and a code point is just an integer. You could even use an int64 to store it if you wanted to. (But Unicode has only 1,114,112 code points, so int32 is big enough; no wonder rune is an alias of int32 in Go.)
Different encoding schemes encode code points in different ways. E.g. a CJK character is usually encoded as 3 bytes in UTF-8 and as 2 bytes in UTF-16.
String literals in Go are UTF-8.
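A quick sketch checking those sizes with the standard unicode/utf8 and unicode/utf16 packages:
package main

import (
    "fmt"
    "unicode/utf16"
    "unicode/utf8"
)

func main() {
    r := '日' // U+65E5, a CJK character

    fmt.Println(utf8.RuneLen(r))              // 3: bytes needed in UTF-8
    fmt.Println(len(utf16.Encode([]rune{r}))) // 1: 16-bit code units in UTF-16, i.e. 2 bytes
    fmt.Println(len("日本\x80語"))            // 10: total byte length of the string from the example above
}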

Raw byte values vs Unicode text?

I am a beginner in python and came across a chapter, which read :
In Python 3.X, the normal str string handles Unicode text (including ASCII, which is just a simple kind of Unicode); a distinct bytes string type represents raw byte values (including media and encoded text);
I understand what Unicode text is, but what values are the raw bytes?
Raw bytes can be anything you want them to be. A single byte is limited to 0-255 (hexadecimal 00-FF), so a sequence of more than one byte has to be interpreted by a program to mean something.
Given the byte string b'\x41\x42\x43\x44', this could be a little-endian integer:
>>> raw = b'\x41\x42\x43\x44'
>>> int.from_bytes(raw,'little')
1145258561
>>> hex(int.from_bytes(raw,'little'))
'0x44434241'
Or a big-endian integer:
>>> hex(int.from_bytes(raw,'big'))
'0x41424344'
Or a UTF-8-encoded Unicode string:
>>> raw.decode('utf8')
'ABCD'
Or two little-endian 16-bit unsigned integers (the '<' in the struct format makes the byte order explicitly little-endian):
>>> import struct
>>> struct.unpack('<HH',raw)
(16961, 17475)
>>> list(map(hex,struct.unpack('<HH',raw)))
['0x4241', '0x4443']
It's just data. It's up to a program to decide what the data means.
Byte strings can be transmitted across a TCP socket or read from and written to a file. Unicode text cannot; it must be encoded to bytes first.
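The same point in Go, for comparison with the Go questions above; a minimal sketch that reads the same four bytes in several ways (encoding/binary is part of the standard library):
package main

import (
    "encoding/binary"
    "fmt"
)

func main() {
    raw := []byte{0x41, 0x42, 0x43, 0x44}

    fmt.Printf("%#x\n", binary.LittleEndian.Uint32(raw)) // 0x44434241
    fmt.Printf("%#x\n", binary.BigEndian.Uint32(raw))    // 0x41424344
    fmt.Println(string(raw))                             // ABCD, read as UTF-8 text
    fmt.Println(binary.LittleEndian.Uint16(raw[0:2]),
        binary.LittleEndian.Uint16(raw[2:4])) // 16961 17475: two little-endian uint16s
}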
