How can I get the Unicode value of a character in go? - string

I try to get the unicode value of a string character in Go as an Int value.
I do this:
value = strconv.Itoa(int(([]byte(char))[0]))
where char contains a string with one character.
That works for many cases. It doesn't work for umlauts like ä, ö, ü, Ä, Ö, Ü.
E.g. Ä results in 65, which is the same as for A.
How can I do that?
Supplement: I had two problems. The first was solved with any of the answers below. The second was a bit more tricky. My input was not Go normalized UTF-8 code, e.g. umlauts were represented by two characters instead of one. As ANisus said the solution is found in the package golang.org/x/text/unicode/norm. The line above is now two lines:
rune, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(char)))
value = strconv.Itoa(int(rune))
Any hints to make this shorter welcome ...

Strings are utf8 encoded, so to decode a character from a string to get the rune (unicode code point), you can use the unicode/utf8 package.
Example:
package main
import (
"fmt"
"unicode/utf8"
)
func main() {
str := "AÅÄÖ"
for len(str) > 0 {
r, size := utf8.DecodeRuneInString(str)
fmt.Printf("%d %v\n", r, size)
str = str[size:]
}
}
Result:
65 1
197 2
196 2
214 2
Edit: (To clarify Michael's supplement)
A character such as Ä may be created using different unicode code points:
Precomposed: Ä (U+00C4)
Using combining diaeresis: A (U+0041) + ¨ (U+0308)
In order to get the precomposed form, one can use the normalization package, golang.org/x/text/unicode/norm. The NFC (Canonical Decomposition,
followed by Canonical Composition) form will turn U+0041 + U+0308 into U+00C4:
c := "\u0041\u0308"
r, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(c)))
fmt.Printf("%+q", r) // '\u00c4'

The "character" type in Go is the rune which is an alias for int32, see also Rune literals. A rune is an integer value identifying a Unicode code point.
In Go strings are represented and stored as the UTF-8 encoded byte sequence of the text. The range form of the for loop iterates over the runes of the text:
s := "äöüÄÖÜ世界"
for _, r := range s {
fmt.Printf("%c - %d\n", r, r)
}
Output:
ä - 228
ö - 246
ü - 252
Ä - 196
Ö - 214
Ü - 220
世 - 19990
界 - 30028
Try it on the Go Playground.
Read this blog article if you want to know more about the topic:
Strings, bytes, runes and characters in Go

you can use the unicode/utf8 package
rune,_:=utf8.DecodeRuneInString("Ä")
fmt.Println(rune)

Related

Writing Bytes to strings.builder prints nothing

I am learning go and am unsure why this piece of code prints nothing
package main
import (
"strings"
)
func main(){
var sb strings.Builder
sb.WriteByte(byte(127))
println(sb.String())
}
I would expect it to print 127
You are appending a byte to the string's buffer, not the characters "127".
Since Go strings are UTF-8, any number <=127 will be the same character as that number in ASCII. As you can see in this ASCII chart, 127 will get you the "delete" character. Since "delete" is a non-printable character, println doesn't output anything.
Here's an example of doing the same thing from your question, but using a printable character. 90 for "Z". You can see that it does print out Z.
If you want to append the characters "127" you can use sb.WriteString("127") or sb.Write([]byte("127")). If you want to append the string representation of a byte, you might want to look at using fmt.Sprintf.
Note: I'm not an expert on character encoding so apologies if the terminology in this answer is incorrect.

Output UUID in Go as a short string

Is there a built in way, or reasonably standard package that allows you to convert a standard UUID into a short string that would enable shorter URL's?
I.e. taking advantage of using a larger range of characters such as [A-Za-z0-9] to output a shorter string.
I know we can use base64 to encode the bytes, as follows, but I'm after something that creates a string that looks like a "word", i.e. no + and /:
id = base64.StdEncoding.EncodeToString(myUuid.Bytes())
A universally unique identifier (UUID) is a 128-bit value, which is 16 bytes. For human-readable display, many systems use a canonical format using hexadecimal text with inserted hyphen characters, for example:
123e4567-e89b-12d3-a456-426655440000
This has length 16*2 + 4 = 36. You may choose to omit the hypens which gives you:
fmt.Printf("%x\n", uuid)
fmt.Println(hex.EncodeToString(uuid))
// Output: 32 chars
123e4567e89b12d3a456426655440000
123e4567e89b12d3a456426655440000
You may choose to use base32 encoding (which encodes 5 bits with 1 symbol in contrast to hex encoding which encodes 4 bits with 1 symbol):
fmt.Println(base32.StdEncoding.EncodeToString(uuid))
// Output: 26 chars
CI7EKZ7ITMJNHJCWIJTFKRAAAA======
Trim the trailing = signs when transmitting, so this will always be 26 chars. Note that you have to append "======" prior to decode the string using base32.StdEncoding.DecodeString().
If this is still too long for you, you may use base64 encoding (which encodes 6 bits with 1 symbol):
fmt.Println(base64.RawURLEncoding.EncodeToString(uuid))
// Output: 22 chars
Ej5FZ-ibEtOkVkJmVUQAAA
Note that base64.RawURLEncoding produces a base64 string (without padding) which is safe for URL inclusion, because the 2 extra chars in the symbol table (beyond [0-9a-zA-Z]) are - and _, both which are safe to be included in URLs.
Unfortunately for you, the base64 string may contain 2 extra chars beyond [0-9a-zA-Z]. So read on.
Interpreted, escaped string
If you are alien to these 2 extra characters, you may choose to turn your base64 string into an interpreted, escaped string similar to the interpreted string literals in Go. For example if you want to insert a backslash in an interpreted string literal, you have to double it because backslash is a special character indicating a sequence, e.g.:
fmt.Println("One backspace: \\") // Output: "One backspace: \"
We may choose to do something similar to this. We have to designate a special character: be it 9.
Reasoning: base64.RawURLEncoding uses the charset: A..Za..z0..9-_, so 9 represents the highest code with alphanumeric character (61 decimal = 111101b). See advantage below.
So whenever the base64 string contains a 9, replace it with 99. And whenever the base64 string contains the extra characters, use a sequence instead of them:
9 => 99
- => 90
_ => 91
This is a simple replacement table which can be captured by a value of strings.Replacer:
var escaper = strings.NewReplacer("9", "99", "-", "90", "_", "91")
And using it:
fmt.Println(escaper.Replace(base64.RawURLEncoding.EncodeToString(uuid)))
// Output:
Ej5FZ90ibEtOkVkJmVUQAAA
This will slightly increase the length as sometimes a sequence of 2 chars will be used instead of 1 char, but the gain will be that only [0-9a-zA-Z] chars will be used, as you wanted. The average length will be less than 1 additional character: 23 chars. Fair trade.
Logic: For simplicity let's assume all possible uuids have equal probability (uuid is not completely random, so this is not the case, but let's set this aside as this is just an estimation). Last base64 symbol will never be a replaceable char (that's why we chose the special char to be 9 instead of like A), 21 chars may turn into a replaceable sequence. The chance for one being replaceable: 3 / 64 = 0.047, so on average this means 21*3/64 = 0.98 sequences which turn 1 char into a 2-char sequence, so this is equal to the number of extra characters.
To decode, use an inverse decoding table captured by the following strings.Replacer:
var unescaper = strings.NewReplacer("99", "9", "90", "-", "91", "_")
Example code to decode an escaped base64 string:
fmt.Println("Verify decoding:")
s := escaper.Replace(base64.RawURLEncoding.EncodeToString(uuid))
dec, err := base64.RawURLEncoding.DecodeString(unescaper.Replace(s))
fmt.Printf("%x, %v\n", dec, err)
Output:
123e4567e89b12d3a456426655440000, <nil>
Try all the examples on the Go Playground.
As suggested here, If you want just a fairly random string to use as slug, better to not bother with UUID at all.
You can simply use go's native math/rand library to make random strings of desired length:
import (
"math/rand"
"encoding/hex"
)
b := make([]byte, 4) //equals 8 characters
rand.Read(b)
s := hex.EncodeToString(b)
Another option is math/big. While base64 has a constant output of 22
characters, math/big can get down to 2 characters, depending on the input:
package main
import (
"encoding/base64"
"fmt"
"math/big"
)
type uuid [16]byte
func (id uuid) encode() string {
return new(big.Int).SetBytes(id[:]).Text(62)
}
func main() {
var id uuid
for n := len(id); n > 0; n-- {
id[n - 1] = 0xFF
s := base64.RawURLEncoding.EncodeToString(id[:])
t := id.encode()
fmt.Printf("%v %v\n", s, t)
}
}
Result:
AAAAAAAAAAAAAAAAAAAA_w 47
AAAAAAAAAAAAAAAAAAD__w h31
AAAAAAAAAAAAAAAAAP___w 18owf
AAAAAAAAAAAAAAAA_____w 4GFfc3
AAAAAAAAAAAAAAD______w jmaiJOv
AAAAAAAAAAAAAP_______w 1hVwxnaA7
AAAAAAAAAAAA_________w 5k1wlNFHb1
AAAAAAAAAAD__________w lYGhA16ahyf
AAAAAAAAAP___________w 1sKyAAIxssts3
AAAAAAAA_____________w 62IeP5BU9vzBSv
AAAAAAD______________w oXcFcXavRgn2p67
AAAAAP_______________w 1F2si9ujpxVB7VDj1
AAAA_________________w 6Rs8OXba9u5PiJYiAf
AAD__________________w skIcqom5Vag3PnOYJI3
AP___________________w 1SZwviYzes2mjOamuMJWv
_____________________w 7N42dgm5tFLK9N8MT7fHC7
https://golang.org/pkg/math/big

Go lang's equivalent of charCode() method of JavaScript

The charCodeAt() method in JavaScript returns the numeric Unicode value of the character at the given index, e.g.
"s".charCodeAt(0) // returns 115
How would I go by to get the numeric unicode value of the the same string/letter in Go?
The character type in Go is rune which is an alias for int32 so it is already a number, just print it.
You still need a way to get the character at the specified position. Simplest way is to convert the string to a []rune which you can index. To convert a string to runes, simply use the type conversion []rune("some string"):
fmt.Println([]rune("s")[0])
Prints:
115
If you want it printed as a character, use the %c format string:
fmt.Println([]rune("absdef")[2]) // Also prints 115
fmt.Printf("%c", []rune("absdef")[2]) // Prints s
Also note that the for range on a string iterates over the runes of the string, so you can also use that. It is more efficient than converting the whole string to []rune:
i := 0
for _, r := range "absdef" {
if i == 2 {
fmt.Println(r)
break
}
i++
}
Note that the counter i must be a distinct counter, it cannot be the loop iteration variable, as the for range returns the byte position and not the rune index (which will be different if the string contains multi-byte characters in the UTF-8 representation).
Wrapping it into a function:
func charCodeAt(s string, n int) rune {
i := 0
for _, r := range s {
if i == n {
return r
}
i++
}
return 0
}
Try these on the Go Playground.
Also note that strings in Go are stored in memory as a []byte which is the UTF-8 encoded byte sequence of the text (read the blog post Strings, bytes, runes and characters in Go for more info). If you have guarantees that the string uses characters whose code is less than 127, you can simply work with bytes. That is indexing a string in Go indexes its bytes, so for example "s"[0] is the byte value of 's' which is 115.
fmt.Println("s"[0]) // Prints 115
fmt.Println("absdef"[2]) // Prints 115
Internally string is a 8 bit byte array in golang. So every byte will represent the ascii value.
str:="abc"
byteValue := str[0]
intValue := int(byteValue)
fmt.Println(byteValue)//97
fmt.Println(intValue)//97

How to convert strings to array of byte and back

4I must write strings to a binary MIDI file. The standard requires one to know the length of the string in bytes. As I want to write for mobile as well I cannot use AnsiString, which was a good way to ensure that the string was a one-byte string. That simplified things. I tested the following code:
TByte = array of Byte;
function TForm3.convertSB (arg: string): TByte;
var
i: Int32;
begin
Label1.Text := (SizeOf (Char));
for i := Low (arg) to High (arg) do
begin
label1.Text := label1.Text + ' ' + IntToStr (Ord (arg [i]));
end;
end; // convert SB //
convertSB ('MThd');
It returns 2 77 84 104 100 (as label text) in Windows as well as Android. Does this mean that Delphi treats strings by default as UTF-8? This would greatly simplify things but I couldn't find it in the help. And what is the best way to convert this to an array of bytes? Read each character and test whether it is 1, 2 or 4 bytes and allocate this space in the array? For converting back to a character: just read the array of bytes until a byte is encountered < 128?
Delphi strings are encoded internally as UTF-16. There was a big clue in the fact that SizeOf(Char) is 2.
The reason that all your characters had ordinal in the ASCII range is that UTF-16 extends ASCII in the sense that characters 0 to 127, in the ASCII range, have the same ordinal value in UTF-16. And all your characters are ASCII characters.
That said, you do not need to worry about the internal storage. You simply convert between string and byte array using the TEncoding class. For instance, to convert to UTF-8 you write:
bytes := TEncoding.UTF8.GetBytes(str);
And in the opposite direction:
str := TEncoding.UTF8.GetString(bytes);
The class supports many other encodings, as described in the documentation. It's not clear from the question which encoding you are need to use. Hopefully you can work the rest out from here.

Indexing string as chars

The elements of strings have type byte and may be accessed using the
usual indexing operations.
How can I get element of string as char ?
"some"[1] -> "o"
The simplest solution is to convert it to an array of runes :
var runes = []rune("someString")
Note that when you iterate on a string, you don't need the conversion. See this example from Effective Go :
for pos, char := range "日本語" {
fmt.Printf("character %c starts at byte position %d\n", char, pos)
}
This prints
character 日 starts at byte position 0
character 本 starts at byte position 3
character 語 starts at byte position 6
Go strings are usually, but not necessarily, UTF-8 encoded. In the case they are Unicode strings, the term "char[acter]" is pretty complex and there is no generall/unique bijection of runes (code points) and Unicode characters.
Anyway one can easily work with code points (runes) in a slice and use indexes into it using a conversion:
package main
import "fmt"
func main() {
utf8 := "Hello, 世界"
runes := []rune(utf8)
fmt.Printf("utf8:% 02x\nrunes: %#v\n", []byte(utf8), runes)
}
Also here: http://play.golang.org/p/qWVSA-n93o
Note: Often the desire to access Unicode "characters" by index is a design mistake. Most of textual data is processed sequentially.
Another option is the package utf8string:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString("🧡💛💚💙💜")
t := s.At(2)
println(t == '💚')
}
https://pkg.go.dev/golang.org/x/exp/utf8string

Resources