I was running some fuzzing on my code and it found a bug. I have reduced it down to the following code snippet and I cannot see what is wrong.
Given the string
s := string("\xc0")
The len(s) function returns 1. However, if you loop through the string the first rune is length 3.
for _, r := range s {
fmt.Println("len of rune:", utf8.RuneLen(r)) // Will print 3
}
My assumptions are:
len(string) is returning the number of bytes in the string
utf8.RuneLen(r) is returning the number of bytes in the rune
I assume I am misunderstanding something, but how can the length of a string be less than the length of one of its runes?
Playground here: https://go.dev/play/p/SH3ZI2IZyrL
The explanation is simple: your input is not a valid UTF-8 encoded string.
fmt.Println(utf8.ValidString(s))
This outputs: false.
The for range over a string ranges over its runes, but if an invalid UTF-8 sequence is encountered, the Unicode replacement character 0xFFFD is set for r. Spec: For statements:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
This applies to your case: you get 0xfffd for r which has 3 bytes using UTF-8 encoding.
If you go with a valid string holding a rune of \xc0:
s = string([]rune{'\xc0'})
Then output is:
len of s: 2
runes in s: 1
len of rune: 2
UTF-8 bytes of s: [195 128]
Hexa UTF-8 bytes of s: c3 80
Try it on the Go Playground.
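As a rough sketch of the two cases above, the snippet below compares the invalid string "\xc0" with the valid string built from the rune 0xC0. The helper name describe is made up for this illustration; it uses utf8.DecodeRuneInString, which reports the replacement character and a width of 1 for an invalid sequence, much like the range clause does.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe (a hypothetical helper) reports the byte length of s, whether s is
// valid UTF-8, the first decoded rune, and how many bytes that rune consumed.
func describe(s string) (byteLen int, valid bool, first rune, width int) {
	first, width = utf8.DecodeRuneInString(s)
	return len(s), utf8.ValidString(s), first, width
}

func main() {
	for _, s := range []string{"\xc0", string([]rune{'\xc0'})} {
		n, ok, r, w := describe(s)
		// RuneLen is the size r would need if encoded as UTF-8,
		// which is not necessarily the number of bytes consumed from s.
		fmt.Printf("%q: len=%d valid=%t first=%U consumed=%d RuneLen=%d\n",
			s, n, ok, r, w, utf8.RuneLen(r))
	}
}
```

For the invalid input, one byte is consumed but the rune handed back is U+FFFD, whose own UTF-8 encoding needs 3 bytes. That is where the apparent contradiction comes from.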
Here I am checking the type of each element of string s using the index s[k] and the value v, but I get different outputs. Using the index I get the type uint8, but with the range value I get int32.
func main() {
    s := "AaBbCcXxYyZz"
    for k, v := range s {
        fmt.Printf("%v\t%T\t%s\n", s[k], s[k], string(s[k]))
        fmt.Printf("%v\t%T\t%s\n", v, v, string(v))
    }
}
The loop for k,v := range s {} iterates over Unicode code points. In Golang they are called runes and are represented as 32-bit signed integers:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
Golang specification
The indexing s[k] returns the byte in the internal representation of the string.
The difference is easy to see for multibyte alphabets, such as Chinese. Try iterating over the string "給祭断情試紀脱答条証行日稿" (it is a meaningless lorem ipsum phrase in Chinese):
s[0]: 231 uint8 ç
:32102 int32 給
s[3]: 231 uint8 ç
:31085 int32 祭
s[6]: 230 uint8 æ
:26029 int32 断
See the step between the values of k? It is because the UTF-8 encoding of each of those Chinese characters occupies 3 bytes.
Full example: https://go.dev/play/p/-44NZMojcgq
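To make the step between indices visible, here is a small sketch. The helper runeStarts is a name invented for this example; it simply collects the index values that the range clause produces, and main prints the types of s[k] and v side by side.

```go
package main

import "fmt"

// runeStarts (a hypothetical helper) returns the byte index at which each
// rune of s begins, i.e. the index value that "for k := range s" yields.
func runeStarts(s string) []int {
	var starts []int
	for k := range s {
		starts = append(starts, k)
	}
	return starts
}

func main() {
	s := "給祭断"
	fmt.Println(runeStarts(s)) // byte offsets, not rune counts
	for k, v := range s {
		// s[k] is a single byte (uint8); v is a whole code point (int32).
		fmt.Printf("s[%d]: %T %#x\tv: %T %U %c\n", k, s[k], s[k], v, v, v)
	}
}
```

Each of these three characters takes 3 bytes in UTF-8, so the offsets advance in steps of 3 while the loop runs only once per character.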
I have the byte representation of a character in a string, let's say the character is 'H', which has the byte value of 72. My string is therefore "72".
How do I go about converting this string ("72") into its corresponding character value ('H') based on the byte value (72) represented in my string using python 3.6?
Pseudocode:
str = "72"
print(decode_as_byte_value(str))
Expected result:
H
ord('H')
chr(72)
It's as simple as that. Remember that chr() only takes an int and ord() only takes a str, so for the string "72" you need chr(int("72")).
Please do not use this community for such syntax-based questions.
Still, your answer is:
# Get the ASCII number of a character
number = ord(char)
# Get the character given by an ASCII number
char = chr(number)
If this answers your question, please mark this response as accepted.
I am reading the section on for statements in the Effective Go documentation and came across this example:
for pos, char := range "日本\x80語" {
fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
The output is:
Character U+65E5 '日', at position: 0
Character U+672C '本', at position: 3
Character U+FFFD '�', at position: 6
Character U+8A9E '語', at position: 7
What I don't understand is why the positions are 0, 3, 6, and 7. This tells me the first and second characters are 3 bytes long and the 'replacement rune' (U+FFFD) is 1 byte long, which I accept and understand. However, I thought rune was of int32 type and therefore would be 4 bytes each, not three.
Why are the positions in a range different to the total amount of memory each value should be consuming?
string values in Go are stored as read-only byte slices ([]byte), where the bytes are the UTF-8 encoded bytes of the (runes of the) string. UTF-8 is a variable-length encoding: different Unicode code points may be encoded using a different number of bytes. For example, values in the range 0..127 are encoded as a single byte (whose value is the Unicode code point itself), but values greater than 127 use more than 1 byte. The unicode/utf8 package contains UTF-8 related utility functions and constants; for example, utf8.UTFMax reports the maximum number of bytes a valid Unicode code point may "occupy" in UTF-8 encoding (which is 4).
One thing to note here: not all possible byte sequences are valid UTF-8 sequences. A string may be any byte sequence, even one that is an invalid UTF-8 sequence. For example, the string value "\xff" is an invalid UTF-8 byte sequence. For details, see How do I represent an Optional String in Go?
The for range construct, when applied to a string value, iterates over the runes of the string:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
The for range construct may produce 1 or 2 iteration values. When using 2, like in your example:
for pos, char := range "日本\x80語" {
fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
For each iteration, pos will be the byte index of the rune / character, and char will be the rune of the string. As you can see in the quote above, when an invalid UTF-8 sequence is encountered, char will be 0xFFFD (the Unicode replacement character), and the for range construct (the iteration) will advance a single byte only.
To sum it up: The position is always the byte index of the rune of the current iteration (or more specifically: the byte index of the first byte of the UTF-8 encoded sequence of the rune of the current iteration), but if an invalid UTF-8 sequence is encountered, the position (index) will only be incremented by 1 in the next iteration.
A must-read blog post if you want to know more about the topic:
The Go Blog: Strings, bytes, runes and characters in Go
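The stepping described above can be reproduced by hand. The sketch below decodes the string one rune at a time with utf8.DecodeRuneInString and records where each step starts and how many bytes it consumes. The function name positions is an assumption made for this example, and the loop is meant to mirror the range clause, not to show how the compiler implements it.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// positions (a hypothetical helper) walks s one decoded rune at a time and
// returns the byte index of each step and the number of bytes it consumed.
func positions(s string) (idx, widths []int) {
	for i := 0; i < len(s); {
		// For an invalid sequence this returns (utf8.RuneError, 1),
		// so the walk advances by exactly one byte.
		_, w := utf8.DecodeRuneInString(s[i:])
		idx = append(idx, i)
		widths = append(widths, w)
		i += w
	}
	return idx, widths
}

func main() {
	idx, widths := positions("日本\x80語")
	fmt.Println(idx)    // byte index at the start of each step
	fmt.Println(widths) // bytes consumed by each step
}
```

The widths are a property of the UTF-8 encoding in the string, not of the rune type: the decoded value is always an int32, however many bytes were read to produce it.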
A rune is a code point, and a code point is just an integer. You could even use an int64 to store it if you wanted to. (But Unicode has only 1,114,112 code points, so int32 is the right choice. No wonder rune is an alias for int32 in Golang.)
Different encoding schemes encode code points in different ways. E.g. a CJK character is usually encoded as 3 bytes in UTF-8, and as 2 bytes in UTF-16.
String literals in Golang are UTF-8 encoded (Go source files are UTF-8), although escapes such as \x80 can insert arbitrary bytes.
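To put numbers on the difference between a rune in memory and its encoded forms, here is a minimal sketch. The helper sizes is invented for this illustration; it relies on unsafe.Sizeof, utf8.RuneLen and utf16.Encode from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
	"unsafe"
)

// sizes (a hypothetical helper) reports how many bytes the code point r takes
// as a rune value in memory, when encoded as UTF-8, and when encoded as UTF-16.
func sizes(r rune) (inMemory, asUTF8, asUTF16 int) {
	inMemory = int(unsafe.Sizeof(r))           // rune is int32, so always 4
	asUTF8 = utf8.RuneLen(r)                   // 1 to 4, or -1 if r cannot be encoded
	asUTF16 = 2 * len(utf16.Encode([]rune{r})) // 2 bytes per 16-bit code unit
	return inMemory, asUTF8, asUTF16
}

func main() {
	for _, r := range "a語😀" {
		m, u8, u16 := sizes(r)
		fmt.Printf("%c: rune=%d UTF-8=%d UTF-16=%d\n", r, m, u8, u16)
	}
}
```

The range positions in the question follow the UTF-8 column, because that is how the string is stored. The 4-byte size only applies to the rune variable the loop assigns to.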
The charCodeAt() method in JavaScript returns the numeric Unicode value of the character at the given index, e.g.
"s".charCodeAt(0) // returns 115
How would I go about getting the numeric Unicode value of the same string/letter in Go?
The character type in Go is rune which is an alias for int32 so it is already a number, just print it.
You still need a way to get the character at the specified position. Simplest way is to convert the string to a []rune which you can index. To convert a string to runes, simply use the type conversion []rune("some string"):
fmt.Println([]rune("s")[0])
Prints:
115
If you want it printed as a character, use the %c format string:
fmt.Println([]rune("absdef")[2]) // Also prints 115
fmt.Printf("%c", []rune("absdef")[2]) // Prints s
Also note that the for range on a string iterates over the runes of the string, so you can also use that. It is more efficient than converting the whole string to []rune:
i := 0
for _, r := range "absdef" {
    if i == 2 {
        fmt.Println(r)
        break
    }
    i++
}
Note that the counter i must be a distinct counter; it cannot be the loop iteration variable, because for range returns the byte position and not the rune index (the two differ if the string contains characters that are multi-byte in the UTF-8 representation).
Wrapping it into a function:
func charCodeAt(s string, n int) rune {
    i := 0
    for _, r := range s {
        if i == n {
            return r
        }
        i++
    }
    return 0
}
Try these on the Go Playground.
Also note that strings in Go are stored in memory as a []byte which is the UTF-8 encoded byte sequence of the text (read the blog post Strings, bytes, runes and characters in Go for more info). If you have guarantees that the string uses only characters whose code is less than 128 (that is, ASCII), you can simply work with bytes. Indexing a string in Go indexes its bytes, so for example "s"[0] is the byte value of 's', which is 115.
fmt.Println("s"[0]) // Prints 115
fmt.Println("absdef"[2]) // Prints 115
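Byte indexing only matches rune indexing while every character before the index is ASCII. The sketch below shows the two diverging on a string with one two-byte character. The helper runeAt is a hypothetical variant of the function above that also reports whether the index was in range.

```go
package main

import "fmt"

// runeAt (a hypothetical helper) returns the n-th code point of s, counting
// runes rather than bytes, and false if s has no rune at that position.
func runeAt(s string, n int) (rune, bool) {
	i := 0
	for _, r := range s {
		if i == n {
			return r, true
		}
		i++
	}
	return 0, false
}

func main() {
	s := "héllo" // 'é' occupies two bytes in UTF-8
	r, ok := runeAt(s, 1)
	fmt.Println(s[1]) // the first byte of the encoding of 'é'
	fmt.Println(r, ok) // the code point of 'é'
}
```

Returning a second boolean is a design choice for this sketch: it lets callers tell a missing position apart from a genuine U+0000 in the string.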
Internally, a string in Golang is a sequence of 8-bit bytes. For ASCII characters, each byte is the character's ASCII value.
str := "abc"
byteValue := str[0]
intValue := int(byteValue)
fmt.Println(byteValue) // 97
fmt.Println(intValue)  // 97
Does Golang do any conversion or somehow try to interpret the bytes when casting a byte slice to a string? I've just tried with a byte slice containing a null byte, and it looks like it still keeps the string as it is.
var test []byte
test = append(test, 'a')
test = append(test, 'b')
test = append(test, 0)
test = append(test, 'd')
s := string(test)      // convert the byte slice to a string
fmt.Println(s[2] == 0) // OK: the null byte is still there
But what about strings with invalid Unicode code points or invalid UTF-8 encoding? Could the casting fail or the data be corrupted?
The Go Programming Language Specification
String types
A string type represents the set of string values. A string value is a
(possibly empty) sequence of bytes.
Conversions
Conversions to and from a string type
Converting a slice of bytes to a string type yields a string whose
successive bytes are the elements of the slice.
string([]byte{'h', 'e', 'l', 'l', '\xc3', '\xb8'}) // "hellø"
string([]byte{}) // ""
string([]byte(nil)) // ""
type MyBytes []byte
string(MyBytes{'h', 'e', 'l', 'l', '\xc3', '\xb8'}) // "hellø"
Converting a value of a string type to a slice of bytes type yields a
slice whose successive elements are the bytes of the string.
[]byte("hellø") // []byte{'h', 'e', 'l', 'l', '\xc3', '\xb8'}
[]byte("") // []byte{}
MyBytes("hellø") // []byte{'h', 'e', 'l', 'l', '\xc3', '\xb8'}
A string value is a (possibly empty) sequence of bytes. A string value may or may not represent Unicode characters encoded in UTF-8. There is no interpretation of the bytes during the conversion from byte slice to string nor from string to byte slice. Therefore, the bytes will not be changed and the conversions will not fail.
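A quick way to check the "no interpretation" claim is to round-trip some awkward byte slices. This is only a sketch: roundTrip is a name made up for the example, and the inputs were chosen to include a null byte and sequences that are not valid UTF-8.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// roundTrip (a hypothetical helper) converts b to a string and back and
// reports whether the bytes came through unchanged.
func roundTrip(b []byte) bool {
	s := string(b)
	return bytes.Equal(b, []byte(s))
}

func main() {
	inputs := [][]byte{
		{'a', 'b', 0, 'd'}, // embedded null byte
		{0x80},             // lone continuation byte: invalid UTF-8
		{0xff, 0xfe, 'x'},  // bytes that never occur in valid UTF-8
	}
	for _, b := range inputs {
		fmt.Printf("%v valid=%t unchanged=%t\n", b, utf8.Valid(b), roundTrip(b))
	}
}
```

Validity only matters to code that decodes the bytes as text, such as the range clause or the unicode/utf8 functions. The conversions themselves copy bytes.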
No, the casting can't fail. Here's an example showing this (run in the Go Playground):
b := []byte{0x80}
s := string(b)
fmt.Println(s)
fmt.Println([]byte(s))
for _, c := range s {
fmt.Println(c)
}
This prints:
�
[128]
65533
Note that ranging over invalid UTF-8 is well defined according to the Go spec:
For a string value, the "range" clause iterates over the Unicode code
points in the string starting at byte index 0. On successive
iterations, the index value will be the index of the first byte of
successive UTF-8-encoded code points in the string, and the second
value, of type rune, will be the value of the corresponding code
point. If the iteration encounters an invalid UTF-8 sequence, the
second value will be 0xFFFD, the Unicode replacement character, and
the next iteration will advance a single byte in the string.