How to get a single Unicode character from a string

I wonder how I can get a Unicode character from a string. For example, if the string is "你好", how can I get the first character "你"?
From another place I get one way:
var str = "你好"
runes := []rune(str)
fmt.Println(string(runes[0]))
It does work.
But I still have some questions:
Is there another way to do it?
Why in Go does str[0] not get a Unicode character from a string, but it gets byte data?

First, you may want to read https://blog.golang.org/strings
It will answer part of your questions.
A string in Go can contain arbitrary bytes. When you write str[i], the result is a byte, and the index is always a byte offset.
Most of the time, strings are encoded in UTF-8 though. You have multiple ways to deal with UTF-8 encoding in a string.
For instance, you can use the for...range statement to iterate on a string rune by rune.
var first rune
for _,c := range str {
first = c
break
}
// first now contains the first rune of the string
You can also leverage the unicode/utf8 package. For instance:
r, size := utf8.DecodeRuneInString(str)
// r contains the first rune of the string
// size is the size of the rune in bytes
If the string is encoded in UTF-8, there is no direct way to access the nth rune of the string, because the size of the runes (in bytes) is not constant. If you need this feature, you can easily write your own helper function to do it (with for...range, or with the unicode/utf8 package).
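Such a helper could look like the following minimal sketch built on for...range. The name nthRune and its (rune, bool) return shape are illustrative choices, not something from the standard library.

```go
package main

import "fmt"

// nthRune returns the n-th rune (0-based) of s and true,
// or 0 and false if s contains fewer than n+1 runes.
// nthRune is a hypothetical helper name used for illustration.
func nthRune(s string, n int) (rune, bool) {
	if n < 0 {
		return 0, false
	}
	i := 0 // rune counter; the range index itself is a byte offset
	for _, r := range s {
		if i == n {
			return r, true
		}
		i++
	}
	return 0, false
}

func main() {
	r, ok := nthRune("你好", 1)
	fmt.Println(string(r), ok) // 好 true
}
```

The cost is linear in n, because every preceding rune has to be decoded to find where the n-th one starts.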

You can use the utf8string package:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString("ÄÅàâäåçèéêëìîïü")
// example 1
r := s.At(1)
println(r == 'Å')
// example 2
t := s.Slice(1, 3)
println(t == "Åà")
}
https://pkg.go.dev/golang.org/x/exp/utf8string

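You can do this:
func main() {
str := "cat"
var s rune
for i, c := range str {
if i == 2 {
s = c
}
}
fmt.Println(string(s)) // t
}
s is now equal to 't'. Note that i is a byte offset, not a rune index, so this only selects the third character when all preceding characters are single-byte (ASCII).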

Related

Replace a character in a string in golang

I am trying to replace a specific position character from an array of strings. Here is what my code looks like:
package main
import (
"fmt"
)
func main() {
str := []string{"test","testing"}
str[0][2] = 'y'
fmt.Println(str)
}
Now, running this gives me the error:
cannot assign to str[0][2]
Any idea how to do this? I have tried using strings.Replace, but AFAIK it will replace all occurrences of the given character, while I want to replace only the character at that specific position. Any help is appreciated. TIA.
Strings in Go are immutable, you can't change their content. To change the value of a string variable, you have to assign a new string value.
An easy way is to first convert the string to a byte or rune slice, do the change and convert back:
s := []byte(str[0])
s[2] = 'y'
str[0] = string(s)
fmt.Println(str)
This will output (try it on the Go Playground):
[teyt testing]
Note: I converted the string to byte slice, because this is what happens when you index a string: it indexes its bytes. A string stores the UTF-8 byte sequence of the text, which may not necessarily map bytes to characters one-to-one.
If you need to replace the character at index 2 (the 3rd character) rather than the byte at index 2, use []rune instead:
s := []rune(str[0])
s[2] = 'y'
str[0] = string(s)
fmt.Println(str)
In this example it doesn't matter though, but in general it may.
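To see a case where it does matter, here is a small sketch using a string with one multi-byte character. The helper names replaceByte and replaceRune are made up for this illustration.

```go
package main

import "fmt"

// replaceByte overwrites the byte at byte index i of s.
// Hypothetical helper for illustration; panics if i is out of range.
func replaceByte(s string, i int, b byte) string {
	bs := []byte(s)
	bs[i] = b
	return string(bs)
}

// replaceRune overwrites the rune at rune index i of s.
// Hypothetical helper for illustration; panics if i is out of range.
func replaceRune(s string, i int, r rune) string {
	rs := []rune(s)
	rs[i] = r
	return string(rs)
}

func main() {
	s := "h\u00e9llo" // "héllo": 'é' occupies two bytes in UTF-8

	// Byte index 2 is the second byte of 'é', so the result is invalid UTF-8.
	fmt.Printf("%q\n", replaceByte(s, 2, 'y')) // "h\xc3yllo"

	// Rune index 2 is the first 'l'.
	fmt.Printf("%q\n", replaceRune(s, 2, 'y')) // "héylo"
}
```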
Also note that strings.Replace() does not (necessarily) replace all occurrences:
func Replace(s, old, new string, n int) string
The parameter n sets the maximum number of replacements to perform (a negative n means no limit). So the following also works (try it on the Go Playground):
str[0] = strings.Replace(str[0], "s", "y", 1)
Yet another solution could be to slice the string up to the character to be replaced and again starting from the character after it, and concatenate the pieces around the new character (try this one on the Go Playground):
str[0] = str[0][:2] + "y" + str[0][3:]
Care must be taken here too: the slice indices are byte indices, not character (rune) indices.
See related question: Immutable string and pointer address
Here's a function that will do that for you. It takes care of converting the string that you want to modify into a []rune, and then back out to string.
If your intention is to replace bytes rather than runes, you can:
copy this function's code, rename it from runeSub to byteSub
change the r rune parameter to b byte
Also available on repl.it
package main
import "fmt"
// runeSub - given an array of strings (ss), replace the
// (ri)th rune (character) in the (si)th string
// of (ss), with the rune (r)
//
// ss - the array of strings
// si - the index of the string in ss that you want to modify
// ri - the index of the rune in ss[si] that you want to replace
// r - the rune you want to insert
//
// NOTE: this function has no panic protection from things like
// out-of-bound index values
func runeSub(ss []string, si, ri int, r rune) {
rr := []rune(ss[si])
rr[ri] = r
ss[si] = string(rr)
}
func main() {
ss := []string{"test","testing"}
runeSub(ss, 0, 2, 'y')
fmt.Println(ss)
}

How to create a string of arbitrary length

I want to create a dummy string of a given length to do a performance test. For example I want to first test with 1 KB of string and then may be 10 KB of string etc. I don't care which character (or rune?) it gets filled with. I understand that a string in Go is backed by byte array. So, I want the final string to be backed by a byte array of size equivalent of 1 KB (if I give 1024 as the argument).
For example, I tried the brute force code below:
...
oneKBPayload := createPayload(1024, 'A')
...
//I don't mind even if the char argument is removed and 'A' is used for example
func createPayload(len int, char rune) string {
payload := make([]byte, len)
for i := 0; i < len; i++ {
payload = append(payload, byte(char))
}
return string(payload[:])
}
and it produced a result of (for 10 length)
"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000AAAAAAAAAA"
I realize that it has something to do with the encoding. But how to fix this so that I create any string which is backed by a byte array of the given length so that when I write it over the network, I generate the intended payload.
Your createPayload() creates a byte slice with the given length, which is filled with zeros by default (zero value). You then append len more bytes to this slice, so the result is twice the length you intended: len zero bytes followed by len copies of 'A'. That is why you see the zeros followed by the 'A' characters when printed. (Note also that byte(char) keeps only the low 8 bits of the rune, so it only yields the intended character for runes below 128, i.e. ASCII.)
If you change it to:
payload := make([]byte, 0, len)
Then the result will be what you want.
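Put together, the corrected function might look like this sketch. The parameters are renamed to n and b here (so that the builtin len is not shadowed, and so that the fill value is a byte rather than a rune); those names are my own choice.

```go
package main

import "fmt"

// createPayload returns a string backed by exactly n bytes, each equal to b.
func createPayload(n int, b byte) string {
	// Length 0, capacity n: append fills the slice from the start
	// without reallocating.
	payload := make([]byte, 0, n)
	for i := 0; i < n; i++ {
		payload = append(payload, b)
	}
	return string(payload)
}

func main() {
	s := createPayload(10, 'A')
	fmt.Println(len(s), s) // 10 AAAAAAAAAA
}
```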
But easier would be to simply use strings.Repeat() which repeats a given string value n times. Repeat a one-rune (or more specifically a one-byte) string value n times, and you get what you want:
s := strings.Repeat("A", 10)
fmt.Println(len(s), s)
This will output (try it on the Go Playground):
10 AAAAAAAAAA
If you don't care about the content of the string only about its length, then simply convert a byte slice like this:
s := string(make([]byte, 1024))
fmt.Println(len(s))
Or alternatively like this:
s2 := string([]byte{1023: 0})
fmt.Println(len(s2))
Both print 1024. Try them on the Go Playground.
If you do care about the content and you already have a byte slice allocated, this is how you can efficiently fill it: Is there analog of memset in go?
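I have not reproduced the linked answer here. One common approach, sketched below under that caveat, is to set the first element and then repeatedly copy the already-filled prefix onto the rest, doubling the filled region on each pass. The function name fill is hypothetical.

```go
package main

import "fmt"

// fill sets every element of buf to b.
// It writes one byte, then doubles the filled prefix with copy on each
// pass, so only O(log n) copy calls are needed.
func fill(buf []byte, b byte) {
	if len(buf) == 0 {
		return
	}
	buf[0] = b
	for filled := 1; filled < len(buf); filled *= 2 {
		// copy stops at the shorter of the two slices, so the last
		// pass cannot run past the end of buf.
		copy(buf[filled:], buf[:filled])
	}
}

func main() {
	buf := make([]byte, 1024)
	fill(buf, 'A')
	s := string(buf)
	fmt.Println(len(s), s[:8]) // 1024 AAAAAAAA
}
```

For the simple case of building a new slice rather than filling an existing one, bytes.Repeat from the standard library does the job in a single call.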

Easy way to get a sub-string/sub-slice of up to N characters/elements in Go

In Python I can slice a string to get a sub-string of up to N characters and if the string is too short it will simply return the rest of the string, e.g.
"mystring"[:100] # Returns "mystring"
What's the easiest way to do the same in Go? Trying the same thing panics:
"mystring"[:100] // panic: runtime error: slice bounds out of range
Of course, I can write it all manually:
func Substring(s string, startIndex int, count int) string {
maxCount := len(s) - startIndex
if count > maxCount {
count = maxCount
}
return s[startIndex : startIndex+count]
}
fmt.Println(Substring("mystring", 0, n))
But that's rather a lot of work for something so simple and (I would have thought) common. What's more, I don't know how to generalise this function to slices of other types, since Go doesn't support generics. I'm hoping there is a better way. Even math.Min() doesn't easily work here, because it expects and returns float64.
Note that while a function remains the recommended solution (even if it has to be implemented for slices with different type), it wouldn't work well with string.
fmt.Println(Substring("世界mystring", 0, 5)) would actually print 世�� instead of 世界mys.
See "Code points, characters, and runes": a character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.
And in Go, a "code point" is a rune (as seen here).
Using rune would be more robust (again, in case of strings)
func SubstringRunes(s string, startIndex int, count int) string {
runes := []rune(s)
length := len(runes)
maxCount := length - startIndex
if count > maxCount {
count = maxCount
}
return string(runes[startIndex : startIndex+count])
}
See it in action in this playground.
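The question predates type parameters. Since Go 1.18 the "up to N elements" helper can be written once for slices of any element type. The sketch below assumes Go 1.18 or later, and the name firstN is made up for illustration.

```go
package main

import "fmt"

// firstN returns at most the first n elements of s.
// It never panics: n is clamped to the range [0, len(s)].
// Requires Go 1.18+ for type parameters.
func firstN[T any](s []T, n int) []T {
	if n < 0 {
		n = 0
	}
	if n > len(s) {
		n = len(s)
	}
	return s[:n]
}

func main() {
	fmt.Println(firstN([]int{1, 2, 3}, 100)) // [1 2 3]

	// For strings, convert to []rune first so that multi-byte
	// characters are not cut in half.
	fmt.Println(string(firstN([]rune("世界mystring"), 5))) // 世界mys
}
```

The result shares its backing array with the input slice, just like an ordinary slice expression.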

Go lang's equivalent of charCode() method of JavaScript

The charCodeAt() method in JavaScript returns the numeric Unicode value of the character at the given index, e.g.
"s".charCodeAt(0) // returns 115
How would I go about getting the numeric Unicode value of the same string/letter in Go?
The character type in Go is rune which is an alias for int32 so it is already a number, just print it.
You still need a way to get the character at the specified position. Simplest way is to convert the string to a []rune which you can index. To convert a string to runes, simply use the type conversion []rune("some string"):
fmt.Println([]rune("s")[0])
Prints:
115
If you want it printed as a character, use the %c format string:
fmt.Println([]rune("absdef")[2]) // Also prints 115
fmt.Printf("%c", []rune("absdef")[2]) // Prints s
Also note that the for range on a string iterates over the runes of the string, so you can also use that. It is more efficient than converting the whole string to []rune:
i := 0
for _, r := range "absdef" {
if i == 2 {
fmt.Println(r)
break
}
i++
}
Note that the counter i must be a distinct counter, it cannot be the loop iteration variable, as the for range returns the byte position and not the rune index (which will be different if the string contains multi-byte characters in the UTF-8 representation).
Wrapping it into a function:
func charCodeAt(s string, n int) rune {
i := 0
for _, r := range s {
if i == n {
return r
}
i++
}
return 0
}
Try these on the Go Playground.
Also note that strings in Go are stored in memory as a read-only sequence of bytes, which is normally the UTF-8 encoding of the text (read the blog post Strings, bytes, runes and characters in Go for more info). If you have guarantees that the string uses only characters whose code is less than 128 (ASCII), you can simply work with bytes. Indexing a string in Go indexes its bytes, so for example "s"[0] is the byte value of 's', which is 115.
fmt.Println("s"[0]) // Prints 115
fmt.Println("absdef"[2]) // Prints 115
Internally, a string in Go is a sequence of 8-bit bytes. For ASCII-only strings, each byte is the ASCII value of the corresponding character.
str:="abc"
byteValue := str[0]
intValue := int(byteValue)
fmt.Println(byteValue)//97
fmt.Println(intValue)//97

Indexing string as chars

The elements of strings have type byte and may be accessed using the
usual indexing operations.
How can I get an element of a string as a char?
"some"[1] -> "o"
The simplest solution is to convert it to a slice of runes:
var runes = []rune("someString")
Note that when you iterate on a string, you don't need the conversion. See this example from Effective Go :
for pos, char := range "日本語" {
fmt.Printf("character %c starts at byte position %d\n", char, pos)
}
This prints
character 日 starts at byte position 0
character 本 starts at byte position 3
character 語 starts at byte position 6
Go strings are usually, but not necessarily, UTF-8 encoded. When they hold Unicode text, the term "char[acter]" is fairly complex, and there is no general, unique bijection between runes (code points) and Unicode characters.
Anyway one can easily work with code points (runes) in a slice and use indexes into it using a conversion:
package main
import "fmt"
func main() {
utf8 := "Hello, 世界"
runes := []rune(utf8)
fmt.Printf("utf8:% 02x\nrunes: %#v\n", []byte(utf8), runes)
}
Also here: http://play.golang.org/p/qWVSA-n93o
Note: Often the desire to access Unicode "characters" by index is a design mistake. Most textual data is processed sequentially.
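To illustrate the earlier point that runes and user-perceived characters do not map one-to-one, here is a small sketch comparing two encodings of "é". The helper name counts is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts reports the number of runes (code points) and bytes in s.
// Hypothetical helper for illustration.
func counts(s string) (runes, bytes int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	precomposed := "\u00e9" // é as the single code point U+00E9
	combining := "e\u0301"  // 'e' followed by U+0301 COMBINING ACUTE ACCENT

	fmt.Println(counts(precomposed))      // 1 2
	fmt.Println(counts(combining))        // 2 3
	fmt.Println(precomposed == combining) // false
}
```

Both strings render as the same character, yet indexing into []rune(combining) yields 'e' and the accent as separate elements.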
Another option is the package utf8string:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString("🧡💛💚💙💜")
t := s.At(2)
println(t == '💚')
}
https://pkg.go.dev/golang.org/x/exp/utf8string
