Safetly write bytes as UTF-8 characters using strings.Builder?

Safetly write bytes as UTF-8 characters using strings.Builder? - string

I'm trying to do an exercise where I have to reverse some strings. I heard that Go strings.Builder is the fastest way to create strings at the moment, so I did the following:
func String(toReverse string) string {
var reversedString strings.Builder
for i := len(toReverse) - 1; i >= 0; i-- {
reversedString.WriteByte(toReverse[i])
}
return reversedString.String()
}
The problem is that this doesn't work with multibyte test cases like this:
Hello, 世界
becomes
"\u008c\u0095ç\u0096¸ä ,olleH"
Thanks.

While it is possible - and the default - to access a string's individual bytes by index, a Go string is also convertible to an array of runes, which represent individual Unicode code points.
First of all you need to cast your string into a []rune, and iterate on that:
func String(toReverse string) string {
var reversedString strings.Builder
runes := []rune(toReverse)
for i := len(runes) - 1; i >= 0; i-- {
reversedString.WriteRune(runes[i])
}
return reversedString.String()
}
See https://play.golang.org/p/WYn_MGAGw_x for a live demo.

Related

How can I iterate over each 2 consecutive characters in a string in go?

I have a string like this:
package main
import "fmt"
func main() {
some := "p1k4"
for i, j := range some {
fmt.Println()
}
}
I want take each two consecutive characters in the string and print them. the output should like p1, 1k, k4, 4p.
I have tried it and still having trouble finding the answer, how should I write the code in go and get the output I want?

Go stores strings in memory as their UTF-8 encoded byte sequence. This maps ASCII charactes one-to-one in bytes, but characters outside of that range map to multiple bytes.
So I would advise to use the for range loop over a string, which ranges over the runes (characters) of the string, properly decoding multi-byte runes. This has the advantage that it does not require allocation (unlike converting the string to []rune). You may also print the pairs using fmt.Printf("%c%c", char1, char2), which also will not require allocation (unlike converting runes back to string and concatenating them).
To learn more about strings, characters and runes in Go, read blog post: Strings, bytes, runes and characters in Go
Since the loop only returns the "current" rune in the iteration (but not the previous or the next rune), use another variable to store the previous (and first) runes so you have access to them when printing.
Let's write a function that prints the pairs as you want:
func printPairs(s string) {
var first, prev rune
for i, r := range s {
if i == 0 {
first, prev = r, r
continue
}
fmt.Printf("%c%c, ", prev, r)
prev = r
}
// Print last pair: prev is the last rune
fmt.Printf("%c%c\n", prev, first)
}
Testing it with your input and with another string that has multi-byte runes:
printPairs("p1k4")
printPairs("Go-世界")
Output will be (try it on the Go Playground):
p1, 1k, k4, 4p
Go, o-, -世, 世界, 界G

package main
import (
"fmt"
)
func main() {
str := "12345"
for i := 0; i < len(str); i++ {
fmt.Println(string(str[i]) + string(str[(i+1)%len(str)]))
}
}

This is a simple for loop over your string with the first character appended at the back:
package main
import "fmt"
func main() {
some := "p1k4"
ns := some + string(some[0])
for i := 0; i < len(ns)-1; i++ {
fmt.Println(ns[i:i+2])
}
}

Splitting a rune correctly in golang

I'm wondering if there is an easy way, such as well known functions to handle code points/runes, to take a chunk out of the middle of a rune slice without messing it up or if it's all needs to coded ourselves to get down to something equal to or less than a maximum number of bytes.
Specifically, what I am looking to do is pass a string to a function, convert it to runes so that I can respect code points and if the slice is longer than some maximum bytes, remove enough runes from the center of the runes to get the bytes down to what's necessary.
This is simple math if the strings are just single byte characters and be handled something like:
func shortenStringIDToMaxLength(in string, maxLen int) string {
if len(in) > maxLen {
excess := len(in) - maxLen
start := maxLen/2 - excess/2
return in[:start] + in[start+excess:]
}
return in
}
but in a variable character width byte string it's either going to be a fair bit more coding looping through or there will be nice functions to make this easy. Does anyone have a code sample of how to best handle such a thing with runes?
The idea here is that the DB field the string will go into has a fixed maximum length in bytes, not code points so there needs to be some algorithm from runes to maximum bytes. The reason for taking the characters from the the middle of the string is just the needs of this particular program.
Thanks!
EDIT:
Once I found out that the range operator respected runes on strings this became easy to do with just strings which I found because of the great answers below. I shouldn't have to worry about the string being a well formed UTF format in this case but if I do I now know about the UTF module, thanks!
Here's what I ended up with:
package main
import (
"fmt"
)
func ShortenStringIDToMaxLength(in string, maxLen int) string {
if maxLen < 1 {
// Panic/log whatever is your error system of choice.
}
bytes := len(in)
if bytes > maxLen {
excess := bytes - maxLen
lPos := bytes/2 - excess/2
lastPos := 0
for pos, _ := range in {
if pos > lPos {
lPos = lastPos
break
}
lastPos = pos
}
rPos := lPos + excess
for pos, _ := range in[lPos:] {
if pos >= excess {
rPos = pos
break
}
}
return in[:lPos] + in[lPos+rPos:]
}
return in
}
func main() {
out := ShortenStringIDToMaxLength(`123456789 123456789`, 5)
fmt.Println(out, len(out))
}
https://play.golang.org/p/YLGlj_17A-j

Here is an adaptation of your algorithm, which removes incomplete runes from the beginning of your prefix and the end of your suffix :
func TrimLastIncompleteRune(s string) string {
l := len(s)
for i := 1; i <= l; i++ {
suff := s[l-i : l]
// repeatedly try to decode a rune from the last bytes in string
r, cnt := utf8.DecodeRuneInString(suff)
if r == utf8.RuneError {
continue
}
// if success : return the substring which contains
// this succesfully decoded rune
lgth := l - i + cnt
return s[:lgth]
}
return ""
}
func TrimFirstIncompleteRune(s string) string {
// repeatedly try to decode a rune from the beginning
for i := 0; i < len(s); i++ {
if r, _ := utf8.DecodeRuneInString(s[i:]); r != utf8.RuneError {
// if success : return
return s[i:]
}
}
return ""
}
func shortenStringIDToMaxLength(in string, maxLen int) string {
if len(in) > maxLen {
firstHalf := maxLen / 2
secondHalf := len(in) - (maxLen - firstHalf)
prefix := TrimLastIncompleteRune(in[:firstHalf])
suffix := TrimFirstIncompleteRune(in[secondHalf:])
return prefix + suffix
}
return in
}
link on play.golang.org
This algorithm only tries to drop more bytes from the selected prefix and suffix.
If it turns out that you need to drop 3 bytes from the suffix to have a valid rune, for example, it does not try to see if it can add 3 more bytes to the prefix, to have an end result closer to maxLen bytes.

You can use simple arithmetic to find start and end such that the string s[:start] + s[end:] is shorter than your byte limit. But you need to make sure that start and end are both the first byte of any utf-8 sequence to keep the sequence valid.
UTF-8 has the property that any given byte is the first byte of a sequence as long as its top two bits aren't 10.
So you can write code something like this (playground: https://play.golang.org/p/xk_Yo_1wTYc)
package main
import (
"fmt"
)
func truncString(s string, maxLen int) string {
if len(s) <= maxLen {
return s
}
start := (maxLen + 1) / 2
for start > 0 && s[start]>>6 == 0b10 {
start--
}
end := len(s) - (maxLen - start)
for end < len(s) && s[end]>>6 == 0b10 {
end++
}
return s[:start] + s[end:]
}
func main() {
fmt.Println(truncString("this is a test", 5))
fmt.Println(truncString("日本語", 7))
}
This code has the desirable property that it takes O(maxLen) time, no matter how long the input string (assuming it's valid utf-8).

Golang converting from rune to string

I have the following code, it is supposed to cast a rune into a string and print it. However, I am getting undefined characters when it is printed. I am unable to figure out where the bug is:
package main
import (
"fmt"
"strconv"
"strings"
"text/scanner"
)
func main() {
var b scanner.Scanner
const a = `a`
b.Init(strings.NewReader(a))
c := b.Scan()
fmt.Println(strconv.QuoteRune(c))
}

That's because you used Scanner.Scan() to read a rune but it does something else. Scanner.Scan() can be used to read tokens or runes of special tokens controlled by the Scanner.Mode bitmask, and it returns special constants form the text/scanner package, not the read rune itself.
To read a single rune use Scanner.Next() instead:
c := b.Next()
fmt.Println(c, string(c), strconv.QuoteRune(c))
Output:
97 a 'a'
If you just want to convert a single rune to string, use a simple type conversion. rune is alias for int32, and converting integer numbers to string:
Converting a signed or unsigned integer value to a string type yields a string containing the UTF-8 representation of the integer.
So:
r := rune('a')
fmt.Println(r, string(r))
Outputs:
97 a
Also to loop over the runes of a string value, you can simply use the for ... range construct:
for i, r := range "abc" {
fmt.Printf("%d - %c (%v)\n", i, r, r)
}
Output:
0 - a (97)
1 - b (98)
2 - c (99)
Or you can simply convert a string value to []rune:
fmt.Println([]rune("abc")) // Output: [97 98 99]
There is also utf8.DecodeRuneInString().
Try the examples on the Go Playground.
Note:
Your original code (using Scanner.Scan()) works like this:
You called Scanner.Init() which sets the Mode (b.Mode) to scanner.GoTokens.
Calling Scanner.Scan() on the input (from "a") returns scanner.Ident because "a" is a valid Go identifier:
c := b.Scan()
if c == scanner.Ident {
fmt.Println("Identifier:", b.TokenText())
}
// Output: "Identifier: a"

I know I'm a bit late to the party but here's a []rune to string function:
func runesToString(runes []rune) (outString string) {
// don't need index so _
for _, v := range runes {
outString += string(v)
}
return
}
yes, there is a named return but I think it's ok in this case as it reduces the number of lines and the function is only short

This simple code works in converting a rune to a string
s := fmt.Sprintf("%c", rune)

Since I came to this question searching for rune and string and char, thought this may help newbies like me
// str := "aഐbc"
// testString(str)
func testString(oneString string){
//string to byte slice - No sweat -just type cast it
// As string IS A byte slice
var twoByteArr []byte = []byte(oneString)
// string to rune Slices - No sweat
// string IS A slice of runes
var threeRuneSlice []rune = []rune(oneString)
// Hmm! String seems to have a dual personality it is both a slice of bytes and
// a slice of runes - yeah - read on
// A rune slice can be convered to string -
// No sweat - as string == rune slice
var thrirdString string = string(threeRuneSlice)
// There is a catch here and that is in printing "characters", using for loop and range
fmt.Println("Chars in oneString")
for i,r := range oneString {
fmt.Printf(" %d %v %c ",i,r,r) //you may not get index 0,1,2,3 here
// since the range runs specially over strings https://blog.golang.org/strings
}
fmt.Println("\nChars in threeRuneSlice")
for i,r := range threeRuneSlice {
fmt.Printf(" %d %v %c ",i,r,r) // i = 0,1,2,4 , perfect!!
// as runes are made up of 4 bytes (rune is int32 and byte in unint8
// and a set of bytes is used to represent a rune which is used to
// represent UTF characters == the REAL CHARECTER
}
fmt.Println("\nValues in oneString ")
for j := 0; j < len(oneString); j++ {
fmt.Printf(" %d %v ",j,oneString[j]) // No you cannot get charecters if you iterate through string in this way
// as you are going over bytes here - not runes
}
fmt.Println("\nValues in twoByteArr")
for j := 0; j < len(twoByteArr); j++ {
fmt.Printf(" %d=%v ",j,twoByteArr[j]) // == same as above
}
fmt.Printf("\none - %s, two %s, three %s\n",oneString,twoByteArr,thrirdString)
}
And some more pointless demo https://play.golang.org/p/tagRBVG8k7V
adapted from https://groups.google.com/g/golang-nuts/c/84GCvDBhpbg/m/Tt6089MPFQAJ
to show that the 'characters' are encoded with one to up to 4 bytes depending on the unicode code point

Provide simple examples to understand how to do it quickly.
// rune => string
fmt.Printf("%c\n", 65) // A
fmt.Println(string(rune(0x1F60A))) // 😊
fmt.Println(string([]rune{0x1F468, 0x200D, 0x1F9B0})) // 👨‍🦰
// string => rune
fmt.Println(strconv.FormatUint(uint64([]rune("😊")[0]), 16)) // 1f60a
fmt.Printf("%U\n", '😊') // U+1F60A
fmt.Printf("%U %U %U\n", '👨', '‍', '🦰') // U+1F468 U+200D U+1F9B0
go playground

Convert string to binary in Go

How do you convert a string to its binary representation in Go?
Example:
Input: "A"
Output: "01000001"
In my testing, fmt.Sprintf("%b", 75) only works on integers.

Cast the 1-character string to a byte in order to get its numerical representation.
s := "A"
st := fmt.Sprintf("%08b", byte(s[0]))
fmt.Println(st)
Output: "01000001"
(Bear in mind code "%b" (without number in between) causes leading zeros in output to be dropped.)

You have to iterate over the runes of the string:
func toBinaryRunes(s string) string {
var buffer bytes.Buffer
for _, runeValue := range s {
fmt.Fprintf(&buffer, "%b", runeValue)
}
return fmt.Sprintf("%s", buffer.Bytes())
}
Or over the bytes:
func toBinaryBytes(s string) string {
var buffer bytes.Buffer
for i := 0; i < len(s); i++ {
fmt.Fprintf(&buffer, "%b", s[i])
}
return fmt.Sprintf("%s", buffer.Bytes())
}
Live playground:
http://play.golang.org/p/MXZ1Y17xWa

How can I assign a new char into a string in Go?

I'm trying to alter an existing string in Go but I keep getting this error "cannot assign to new_str[i]"
package main
import "fmt"
func ToUpper(str string) string {
new_str := str
for i:=0; i<len(str); i++{
if str[i]>='a' && str[i]<='z'{
chr:=uint8(rune(str[i])-'a'+'A')
new_str[i]=chr
}
}
return new_str
}
func main() {
fmt.Println(ToUpper("cdsrgGDH7865fxgh"))
}
This is my code, I wish to change lowercase to uppercase but I cant alter the string. Why? How can I alter it?
P.S I wish to use ONLY the fmt package!
Thanks in advance.

You can't... they are immutable. From the Golang Language Specification:
Strings are immutable: once created, it is impossible to change the contents of a string.
You can however, cast it to a []byte slice and alter that:
func ToUpper(str string) string {
new_str := []byte(str)
for i := 0; i < len(str); i++ {
if str[i] >= 'a' && str[i] <= 'z' {
chr := uint8(rune(str[i]) - 'a' + 'A')
new_str[i] = chr
}
}
return string(new_str)
}
Working sample: http://play.golang.org/p/uZ_Gui7cYl

Use range and avoid unnecessary conversions and allocations. Strings are immutable. For example,
package main
import "fmt"
func ToUpper(s string) string {
var b []byte
for i, c := range s {
if c >= 'a' && c <= 'z' {
if b == nil {
b = []byte(s)
}
b[i] = byte('A' + rune(c) - 'a')
}
}
if b == nil {
return s
}
return string(b)
}
func main() {
fmt.Println(ToUpper("cdsrgGDH7865fxgh"))
}
Output:
CDSRGGDH7865FXGH

In Go strings are immutable. Here is one very bad way of doing what you want (playground)
package main
import "fmt"
func ToUpper(str string) string {
new_str := ""
for i := 0; i < len(str); i++ {
chr := str[i]
if chr >= 'a' && chr <= 'z' {
chr = chr - 'a' + 'A'
}
new_str += string(chr)
}
return new_str
}
func main() {
fmt.Println(ToUpper("cdsrgGDH7865fxgh"))
}
This is bad because
you are treating your string as characters - what if it is UTF-8? Using range str is the way to go
appending to strings is slow - lots of allocations - a bytes.Buffer would be a good idea
there is a very good library routine to do this already strings.ToUpper
It is worth exploring the line new_str += string(chr) a bit more. Strings are immutable, so what this does is make a new string with the chr on the end, it doesn't extend the old string. This is wildly inefficient for long strings as the allocated memory will tend to the square of the string length.
Next time just use strings.ToUpper!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Safetly write bytes as UTF-8 characters using strings.Builder? - string

Related

How can I iterate over each 2 consecutive characters in a string in go?

Splitting a rune correctly in golang

Golang converting from rune to string

Convert string to binary in Go

How can I assign a new char into a string in Go?

Categories

Resources