Why is Go adding bytes to my string? - string

When I add a single byte to my string at 0x80 or above, golang will add 0xc2 before my byte.
I think this has something to do with utf8 runes. Either way, how do I just add 0x80 to the end of my string?
Example:
var s string = ""
len(s) // this will be 0
s += string(0x80)
len(s) // this will be 2, string is now bytes 0xc2 0x80

From the specification:
Converting a signed or unsigned integer value to a string type yields a string containing the UTF-8 representation of the integer.
The expression string(0x80) evaluates to a string with the UTF-8 representation of 0x80, not a string containing the single byte 0x80. The UTF-8 representation of 0x80 is 0xc2 0x80.
Use the \x hex escape to specify the byte 0x80 in a string:
s += "\x80"
You can create a string from an arbitrary sequence of bytes using the string([]byte) conversion.
s += string([]byte{0x80})
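A minimal sketch comparing the conversions side by side. The helper names `appendCodePoint` and `appendRawByte` are made up for illustration; only the conversions themselves come from the answer above.

```go
package main

import "fmt"

// appendCodePoint appends the UTF-8 encoding of the code point r.
// This is the behavior the question ran into with string(0x80).
func appendCodePoint(s string, r rune) string {
	return s + string(r)
}

// appendRawByte appends exactly one byte, whatever its value.
func appendRawByte(s string, b byte) string {
	return s + string([]byte{b})
}

func main() {
	fmt.Printf("% x\n", appendCodePoint("", 0x80)) // c2 80
	fmt.Printf("% x\n", appendRawByte("", 0x80))   // 80
	fmt.Printf("% x\n", "\x80")                    // 80
}
```

The `% x` verb prints the bytes of a string in hex, separated by spaces, which makes the extra 0xc2 easy to spot.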

I haven't found a way to avoid adding that character, if I use string(0x80) to convert the byte. However, I did find that if I change the whole string to a slice of bytes, then add the byte, then switch back to a string, I can get the correct byte order in the string.
Example:
bytearray := []byte(some_string)
bytearray = append(bytearray, 0x80)
some_string = string(bytearray)
Kind of a silly workaround; if anyone finds a better method, please post it.
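One possible alternative to the round trip through a byte slice is `strings.Builder`, whose `WriteByte` method appends a single raw byte without UTF-8 encoding it. This is only a sketch; the helper name `appendBytes` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// appendBytes returns s followed by the given raw bytes, unmodified.
func appendBytes(s string, extra ...byte) string {
	var sb strings.Builder
	sb.Grow(len(s) + len(extra))
	sb.WriteString(s)
	for _, b := range extra {
		sb.WriteByte(b) // writes the byte as-is, no UTF-8 encoding
	}
	return sb.String()
}

func main() {
	s := appendBytes("abc", 0x80, 0xff)
	fmt.Printf("%d % x\n", len(s), s) // 5 61 62 63 80 ff
}
```

This mostly pays off when many bytes are appended in a loop; for a single byte, `s += "\x80"` is simpler.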

Related

Printing strings and characters as hexadecimal in Go

Why cyrillic strings in hexadecimal format differ from cyrillic chars in hexadecimal format?
str := "Э"
fmt.Printf("%x\n", str)
//result d0ad
str := 'Э'
fmt.Printf("%x\n", str)
//result 42d
Printing the hexadecimal representation of a string prints the hex representation of its bytes, and printing the hexadecimal representation of a rune prints the hex representation of the number it is an alias to (rune is an alias to int32).
And strings in Go hold the UTF-8 encoded byte sequence of the text. In UTF-8 representation characters (runes) having a numeric code > 127 have multi-byte representation.
The rune Э has multi-byte representation in UTF-8 (being [208, 173]), and it is not the same as the multi-byte representation of the 32-bit integer 1069 = 0x42d. Integers are represented using two's complement in memory.
Recommended blog post: Strings, bytes, runes and characters in Go
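The difference can be seen by printing both views of the same character. A small sketch; the helper names `hexOfBytes` and `hexOfCodePoint` are invented for this example.

```go
package main

import "fmt"

// hexOfBytes formats the UTF-8 bytes of s in hex.
func hexOfBytes(s string) string { return fmt.Sprintf("%x", s) }

// hexOfCodePoint formats the numeric value of the code point r in hex.
func hexOfCodePoint(r rune) string { return fmt.Sprintf("%x", r) }

func main() {
	fmt.Println(hexOfBytes("Э"))     // d0ad
	fmt.Println(hexOfCodePoint('Э')) // 42d
	fmt.Println([]byte("Э"))         // [208 173]
	fmt.Println([]rune("Э"))         // [1069]
}
```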

How to convert a byte to the correct decimal representation?

I have a function that checks a bit in a specific index of a string (converted to byte representation):
fn check_bit(s: String) -> bool {
let bytes = s.as_bytes(); // converts string to bytes
let byte = bytes[0]; // pick first byte
// byte here seems to be in decimal representation
byte | 0xf7 == 0xff // therefore this returns incorrect value
}
After printing and doing arithmetic operations on the byte variable, I noticed that byte is a decimal number. For example, for the ASCII value of char b (0x98), byte stores 98 as decimal, which results in an incorrect value when doing bitwise operations on it. How can I convert this decimal value to the correct hex, binary or decimal representation? (For 0x98 I am expecting to get a decimal value of 152.)
noticed that byte is a decimal number
An integer has no intrinsic base; it is just a bit pattern in memory. You can write it out in a string as decimal, binary, octal, etc., but the value in memory is the same.
In other words, integers are not stored in memory as strings.
For example, for ascii value of char b (0x98), byte simply stores 98 as decimal
ASCII b is not 0x98, it is 98, which is 0x62:
assert_eq!(98, 0x62);
assert_eq!(98, "b".as_bytes()[0]);
assert_eq!(98, 'b' as i32);
How can I convert this decimal value to correct hex, binary or decimal representation?
Such a conversion does not make sense because integers are not stored as strings as explained above.

Use python string as byte

I have the byte representation of a character in a string, let's say the character is 'H', which has the byte value of 72. My string is therefore "72".
How do I go about converting this string ("72") into its corresponding character value ('H') based on the byte value (72) represented in my string using python 3.6?
Pseudocode:
str = "72"
print(decode_as_byte_value(str))
Expected result:
H
ord('H')
chr(72)
It's as simple as that. Remember that chr() only takes an int and ord() only takes a str, so convert the string to an int first: chr(int("72")).
Please do not use this community for such syntax-based questions.
Still, your answer is:
# Get the ASCII number of a character
number = ord(char)
# Get the character given by an ASCII number
char = chr(number)
If this answers your question, please mark this response as accepted.

What determines the position of a character when looping through UTF-8 strings?

I am reading the section on for statements in the Effective Go documentation and came across this example:
for pos, char := range "日本\x80語" {
fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
The output is:
Character U+65E5 '日', at position: 0
Character U+672C '本', at position: 3
Character U+FFFD '�', at position: 6
Character U+8A9E '語', at position: 7
What I don't understand is why the positions are 0, 3, 6, and 7. This tells me the first and second character is 3 bytes long and the 'replacement rune' (U+FFFD) is 1 byte long, which I accept and understand. However, I thought rune was of int32 type and therefore would be 4 bytes each, not three.
Why are the positions in a range different to the total amount of memory each value should be consuming?
string values in Go are stored as read only byte slices ([]byte), where the bytes are the UTF-8 encoded bytes of the (runes of the) string. UTF-8 is a variable-length encoding, different Unicode code points may be encoded using different number of bytes. For example values in the range 0..127 are encoded as a single byte (whose value is the unicode codepoint itself), but values greater than 127 use more than 1 byte. The unicode/utf8 package contains UTF-8 related utility functions and constants, for example utf8.UTFMax reports the maximum number of bytes a valid Unicode codepoint may "occupy" in UTF-8 encoding (which is 4).
One thing to note here: not all possible byte sequences are valid UTF-8 sequences. A string may be any byte sequence, even those that are invalid UTF-8 sequences. For example the string value "\xff" represents an invalid UTF-8 byte sequence, for details, see How do I represent an Optional String in Go?
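A short sketch of the variable-length encoding and the validity check, using `utf8.RuneLen` and `utf8.ValidString` from the unicode/utf8 package. The helper name `encodedLengths` is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLengths returns, for each rune of s, how many bytes
// its UTF-8 encoding occupies.
func encodedLengths(s string) []int {
	var lens []int
	for _, r := range s {
		lens = append(lens, utf8.RuneLen(r))
	}
	return lens
}

func main() {
	fmt.Println(encodedLengths("Aé日😀"))     // [1 2 3 4]
	fmt.Println(utf8.UTFMax)                // 4
	fmt.Println(utf8.ValidString("日本語"))    // true
	fmt.Println(utf8.ValidString("\xff"))   // false
}
```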
The for range construct –when applied on a string value– iterates over the runes of the string:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
The for range construct may produce 1 or 2 iteration values. When using 2, like in your example:
for pos, char := range "日本\x80語" {
fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
For each iteration, pos will be the byte index of the rune / character, and char will be the rune itself. As the quote above says, when the iteration encounters an invalid UTF-8 sequence, char will be 0xFFFD (the Unicode replacement character), and the for range construct will advance by a single byte only.
To sum it up: the position is always the byte index of the rune of the current iteration (or more specifically: the byte index of the first byte of the UTF-8 encoded sequence of that rune), but if an invalid UTF-8 sequence is encountered, the position (index) is incremented by only 1 in the next iteration.
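The positions can be reproduced by decoding the string by hand. This sketch uses `utf8.DecodeRuneInString`, which returns the rune and the number of bytes it consumed (and reports a width of 1 for an invalid byte); the helper names `manualStarts` and `rangeStarts` are invented here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// manualStarts decodes s one rune at a time and records where each begins.
func manualStarts(s string) []int {
	var starts []int
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		starts = append(starts, i)
		i += size // 1 to 4 bytes; 1 for an invalid byte
	}
	return starts
}

// rangeStarts records the index reported by each for range iteration.
func rangeStarts(s string) []int {
	var starts []int
	for pos := range s {
		starts = append(starts, pos)
	}
	return starts
}

func main() {
	s := "日本\x80語"
	fmt.Println(len(s))          // 10
	fmt.Println(manualStarts(s)) // [0 3 6 7]
	fmt.Println(rangeStarts(s))  // [0 3 6 7]
}
```

The 4-byte size of a rune variable only applies to the decoded value held in char; the string itself stores the shorter UTF-8 encoding, and pos indexes into that.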
A must-read blog post if you want to know more about the topic:
The Go Blog: Strings, bytes, runes and characters in Go
A rune is a code point, and a code point is just an integer. You could even use an int64 to store it if you wanted to. (But Unicode only has 1,114,112 code points, so int32 is large enough. No wonder rune is an alias of int32 in Go.)
Different encoding schemes encode code points in different ways. For example, a CJK character is usually encoded as 3 bytes in UTF-8, and as 2 bytes in UTF-16.
Go source code is UTF-8, so a string literal holds the UTF-8 encoding of its text (plus any raw bytes written with escapes such as \x80).
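To illustrate how the same code points take different amounts of space in different encodings, here is a rough sketch using the standard unicode/utf16 package. The helper names `utf8Bytes` and `utf16Bytes` are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf8Bytes returns the number of bytes s occupies as a Go string (UTF-8).
func utf8Bytes(s string) int { return len(s) }

// utf16Bytes returns the number of bytes the same text needs in UTF-16:
// two bytes per 16-bit code unit.
func utf16Bytes(s string) int { return 2 * len(utf16.Encode([]rune(s))) }

func main() {
	for _, s := range []string{"a", "語", "😀"} {
		fmt.Printf("%q: UTF-8 %d bytes, UTF-16 %d bytes\n",
			s, utf8Bytes(s), utf16Bytes(s))
	}
}
```

For these three inputs the sizes are 1 and 2, 3 and 2, and 4 and 4 bytes respectively, while each is a single rune once decoded.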

How to get the ASCII value of a string

Suppose there is a string:
String str="Hello";
How can I get the ASCII value of the above-mentioned string?
Given your comment, it sounds like all you need is:
char[] chars = str.ToCharArray();
Array.Sort(chars);
A char value in .NET is actually a UTF-16 code unit, but for all ASCII characters, the UTF-16 code unit value is the same as the ASCII value anyway.
You can create a new string from the array like this:
string sortedText = new string(chars);
Console.WriteLine(sortedText);
As it happens, "Hello" is already in ascending ASCII order...
byte[] asciiBytes = Encoding.ASCII.GetBytes(str);
You now have an array holding the ASCII value of each character in the string.

Resources