Overhead of converting from []byte to string and vice-versa

I always seem to be converting strings to []byte to string again over and over. Is there a lot of overhead with this? Is there a better way?
For example, here is a function that accepts a UTF-8 string, normalizes it, removes accents, then converts special characters to their ASCII equivalents:
var transliterations = map[rune]string{'Æ':"AE",'Ð':"D",'Ł':"L",'Ø':"OE",'Þ':"Th",'ß':"ss",'æ':"ae",'ð':"d",'ł':"l",'ø':"oe",'þ':"th",'Œ':"OE",'œ':"oe"}
func RemoveAccents(s string) string {
b := make([]byte, len(s))
t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
_, _, e := t.Transform(b, []byte(s), true)
if e != nil { panic(e) }
r := string(b)
var f bytes.Buffer
for _, c := range r {
temp := rune(c)
if val, ok := transliterations[temp]; ok {
f.WriteString(val)
} else {
f.WriteRune(temp)
}
}
return f.String()
}
So I'm starting with a string because that's what I get, then I'm converting it to a byte array, then back to a string, then to a byte array again, then back to a string again. Surely this is unnecessary but I can't figure out how to not do this..? And does it really have a lot of overhead or do I not have to worry about slowing things down with excessive conversions?
(Also if anyone has the time I've not yet figured out how bytes.Buffer actually works, would it not be better to initialize a buffer of 2x the size of the string, which is the maximum output size of the return value?)

In Go, strings are immutable so any change creates a new string. As a general rule, convert from a string to a byte or rune slice once and convert back to a string once. To avoid reallocations, for small and transient allocations, over-allocate to provide a safety margin if you don't know the exact number.
For example,
package main
import (
"bytes"
"fmt"
"unicode"
"unicode/utf8"
// The go.text packages now live under golang.org/x/text.
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
)
var isMn = func(r rune) bool {
return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}
var transliterations = map[rune]string{
'Æ': "AE", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th",
'ß': "ss", 'æ': "ae", 'ð': "d", 'ł': "l", 'ø': "oe",
'þ': "th", 'Œ': "OE", 'œ': "oe",
}
func RemoveAccents(b []byte) ([]byte, error) {
mnBuf := make([]byte, len(b)*125/100)
t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
n, _, err := t.Transform(mnBuf, b, true)
if err != nil {
return nil, err
}
mnBuf = mnBuf[:n]
tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*125/100))
for i, w := 0, 0; i < len(mnBuf); i += w {
r, width := utf8.DecodeRune(mnBuf[i:])
if s, ok := transliterations[r]; ok {
tlBuf.WriteString(s)
} else {
tlBuf.WriteRune(r)
}
w = width
}
return tlBuf.Bytes(), nil
}
func main() {
in := "test stringß"
fmt.Println(in)
inBytes := []byte(in)
outBytes, err := RemoveAccents(inBytes)
if err != nil {
fmt.Println(err)
}
out := string(outBytes)
fmt.Println(out)
}
Output:
test stringß
test stringss

There is no answer to this question. If these conversions are a performance bottleneck in your application you should fix them. If not: Not.
Did you profile your application under realistic load and RemoveAccents is the bottleneck? No? So why bother?
Really: I assume one could do better (in the sense of less garbage, fewer iterations, and fewer conversions), e.g. by chaining in some "TransliterationTransformer". But I doubt it would be worth the hassle.

There is a small overhead with converting a string to a byte slice (not an array, that's a different type). Namely allocating the space for the byte slice.
Strings are their own type and are an interpretation of a sequence of bytes. But not every sequence of bytes is a useful string. Strings are also immutable. If you look at the strings package, you will see that strings are sliced a lot.
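To make the cost concrete, here is a minimal sketch (the helper name roundTrip is made up for illustration) showing that a []byte conversion copies the data: changing the slice leaves the original string untouched, which is why each conversion generally means an allocation plus a copy.

```go
package main

import "fmt"

// roundTrip is a hypothetical helper: it converts s to a byte slice,
// mutates the slice, and converts it back to a string.
func roundTrip(s string) (original, modified string) {
	b := []byte(s) // copies the bytes of s into a new slice
	if len(b) > 0 {
		b[0] = 'X' // only the copy changes; s is immutable
	}
	return s, string(b) // string(b) copies again
}

func main() {
	o, m := roundTrip("hello")
	fmt.Println(o, m) // hello Xello
}
```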
In your example you can omit the second conversion back to string. You can also range over a byte slice.
As with every question about performance: you will probably need to measure. Is the allocation of byte slices really your bottleneck?
You can initialize your bytes.Buffer like so:
f := bytes.NewBuffer(make([]byte, 0, len(s)*2))
where you have a size of 0 and a capacity of 2x the size of your string. If you can estimate the size of your buffer, it is probably good to do that. It will save you a few reallocations of the underlying byte slices.
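The same pre-sizing idea can be sketched with strings.Builder (assuming Go 1.10 or later). This is only an illustration of the transliteration step: the function name transliterate and the shortened map are assumptions, and the normalization step from the question is left out. Ranging over the string directly avoids the intermediate []byte and string conversions.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative subset of the table from the question.
var transliterations = map[rune]string{
	'Æ': "AE", 'ß': "ss", 'ø': "oe", 'þ': "th",
}

// transliterate is a hypothetical helper. It ranges over the string
// rune by rune and writes into a builder that was sized up front.
func transliterate(s string) string {
	var sb strings.Builder
	sb.Grow(len(s) * 2) // the 2x estimate from the question; only a capacity hint
	for _, r := range s {
		if val, ok := transliterations[r]; ok {
			sb.WriteString(val)
		} else {
			sb.WriteRune(r)
		}
	}
	return sb.String() // no extra copy of the accumulated bytes
}

func main() {
	fmt.Println(transliterate("test stringß")) // test stringss
}
```

If the estimate turns out too small, the builder still grows on its own, so the hint only affects how many reallocations happen.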

Related

Transforming Go's PutUint16 to Python

I want to get the equivalent of the Go code given below in Python:
func Make(op Opcode, operands ...int) []byte {
def, ok := definitions[op]
if !ok {
return []byte{}
}
instructionLen := 1
for _, w := range def.OperandWidths {
instructionLen += w
}
instruction := make([]byte, instructionLen)
instruction[0] = byte(op)
offset := 1
for i, o := range operands {
width := def.OperandWidths[i]
switch width {
case 2:
binary.BigEndian.PutUint16(instruction[offset:], uint16(o))
case 1:
instruction[offset] = byte(o)
}
offset += width
}
return instruction
}
func ReadOperands(def *Definition, ins Instructions) ([]int, int) {
operands := make([]int, len(def.OperandWidths))
offset := 0
for i, width := range def.OperandWidths {
switch width {
case 2:
operands[i] = int(ReadUint16(ins[offset:]))
case 1:
operands[i] = int(ReadUint8(ins[offset:]))
}
offset += width
}
return operands, offset
}
op above is any of:
type Opcode byte
const (
OpConstant Opcode = iota
OpAdd
OpPop
OpSub
OpMul
OpDiv
)
The code above comes from the book Writing a Compiler in Go and can be found here
I am not exactly sure about what is going on here with byte transformations and packing but in order to understand it better I am writing the whole thing in Python. Can someone help me translate those two functions in Python?
You can use the to_bytes method of integers. o.to_bytes(2, byteorder='big') will give the same effect as PutUint16. Likewise int.from_bytes can be used for reading. There is also struct.pack which handles similar things in a format-string kind of way.
Instead of building the buffer and writing into offsets, as done in the Go code, it makes more sense simply to use + to append to a bytes which begins empty.
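Since part of the question is what the byte packing actually does, here is a small Go sketch that writes the two bytes by hand and compares them with binary.BigEndian.PutUint16. The helper names putUint16BE and readUint16BE are made up for this illustration; the shifts are what a to_bytes(2, byteorder='big') call would have to reproduce.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putUint16BE (hypothetical helper) stores v in b[0] and b[1],
// most significant byte first, which is what "big endian" means.
func putUint16BE(b []byte, v uint16) {
	b[0] = byte(v >> 8) // high byte
	b[1] = byte(v)      // low byte
}

// readUint16BE reverses putUint16BE.
func readUint16BE(b []byte) uint16 {
	return uint16(b[0])<<8 | uint16(b[1])
}

func main() {
	manual := make([]byte, 2)
	putUint16BE(manual, 65534)

	std := make([]byte, 2)
	binary.BigEndian.PutUint16(std, 65534)

	fmt.Println(manual, std, readUint16BE(manual)) // [255 254] [255 254] 65534
}
```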

Golang Incrementing numbers in strings (using runes)

I have a string mixed with characters and numerals, but I want to increment the last character, which happens to be a number. Here is what I have. It works, but once I reach 10 the rune goes blank, since 10 decimal is zero. Is there a better way to do this?
package main
import (
"fmt"
)
func main() {
str := "version-1.1.0-8"
rStr := []rune(str)
last := rStr[len(rStr)-1]
rStr[len(rStr)-1] = last + 1
}
So this works: str := "version-1.1.0-8" gives version-1.1.0-9
But str := "version-1.1.0-9" gives version-1.1.0-
I understand why it is happening, but I don't know how to fix it.
Your intention is to increment the number represented by the last rune, so you should do that: parse out that number, increment it as a number, and "re-encode" it into string.
You can't operate on a single rune, because once the number reaches 10 it can only be represented using 2 runes. Another issue is that if the last number is 19, incrementing it needs to alter the previous rune (rather than adding a new one).
Parsing the numbers and re-encoding though is much easier than one might think.
You can take advantage of the fmt package's fmt.Sscanf() and fmt.Sprintf() functions. Parsing and re-encoding is just a single function call.
Let's wrap this functionality into a function:
const format = "version-%d.%d.%d-%d"
func incLast(s string) (string, error) {
var a, b, c, d int
if _, err := fmt.Sscanf(s, format, &a, &b, &c, &d); err != nil {
return "", err
}
d++
return fmt.Sprintf(format, a, b, c, d), nil
}
Testing it:
s := "version-1.1.0-8"
for i := 0; i < 13; i++ {
var err error
if s, err = incLast(s); err != nil {
panic(err)
}
fmt.Println(s)
}
Output (try it on the Go Playground):
version-1.1.0-9
version-1.1.0-10
version-1.1.0-11
version-1.1.0-12
version-1.1.0-13
version-1.1.0-14
version-1.1.0-15
version-1.1.0-16
version-1.1.0-17
version-1.1.0-18
version-1.1.0-19
version-1.1.0-20
version-1.1.0-21
Another option would be to parse and re-encode just the last part, and not the complete version text. This is what it would look like:
func incLast2(s string) (string, error) {
i := strings.LastIndexByte(s, '-')
if i < 0 {
return "", fmt.Errorf("invalid input")
}
d, err := strconv.Atoi(s[i+1:])
if err != nil {
return "", err
}
d++
return s[:i+1] + strconv.Itoa(d), nil
}
Testing and output is the same. Try this one on the Go Playground.

golang: optimal sorting and joining strings

This short method in go's source code has a comment which implies that it's not allocating memory in an optimal way.
... could do better allocation-wise here ...
This is the source code for the Join method.
What exactly is inefficiently allocated here? I don't see a way around allocating the source string slice and the destination byte slice. The source being the slice of keys. The destination being the slice of bytes.
The code referenced by the comment is memory efficient as written. Any allocations are in strings.Join which is written to minimize memory allocations.
I suspect that the comment was accidentally copied and pasted from this code in the net/http package:
// TODO: could do better allocation-wise here, but trailers are rare,
// so being lazy for now.
if _, err := io.WriteString(w, "Trailer: "+strings.Join(keys, ",")+"\r\n"); err != nil {
return err
}
This snippet has the following possible allocations:
[]byte created in strings.Join for constructing the result
string conversion result returned by strings.Join
string result for expression "Trailer: "+strings.Join(keys, ",")+"\r\n"
The []byte conversion result used in io.WriteString
A more memory efficient approach is to allocate a single []byte for the data to be written.
n := len("Trailer: ") + len("\r\n")
for _, s := range keys {
n += len(s) + 1
}
p := make([]byte, 0, n-1) // subtract 1 for len(keys) - 1 commas
p = append(p, "Trailer: "...)
for i, s := range keys {
if i > 0 {
p = append(p, ',')
}
p = append(p, s...)
}
p = append(p, "\r\n"...)
w.Write(p)

Golang converting from rune to string

I have the following code, it is supposed to cast a rune into a string and print it. However, I am getting undefined characters when it is printed. I am unable to figure out where the bug is:
package main
import (
"fmt"
"strconv"
"strings"
"text/scanner"
)
func main() {
var b scanner.Scanner
const a = `a`
b.Init(strings.NewReader(a))
c := b.Scan()
fmt.Println(strconv.QuoteRune(c))
}
That's because you used Scanner.Scan() to read a rune but it does something else. Scanner.Scan() can be used to read tokens or runes of special tokens controlled by the Scanner.Mode bitmask, and it returns special constants from the text/scanner package, not the read rune itself.
To read a single rune use Scanner.Next() instead:
c := b.Next()
fmt.Println(c, string(c), strconv.QuoteRune(c))
Output:
97 a 'a'
If you just want to convert a single rune to string, use a simple type conversion. rune is an alias for int32, and the spec says this about converting integer numbers to string:
Converting a signed or unsigned integer value to a string type yields a string containing the UTF-8 representation of the integer.
So:
r := rune('a')
fmt.Println(r, string(r))
Outputs:
97 a
Also to loop over the runes of a string value, you can simply use the for ... range construct:
for i, r := range "abc" {
fmt.Printf("%d - %c (%v)\n", i, r, r)
}
Output:
0 - a (97)
1 - b (98)
2 - c (99)
Or you can simply convert a string value to []rune:
fmt.Println([]rune("abc")) // Output: [97 98 99]
There is also utf8.DecodeRuneInString().
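As a minimal sketch of that last option, utf8.DecodeRuneInString returns the first rune of a string together with the number of bytes it occupies. The wrapper name firstRune is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune (hypothetical wrapper) returns the first rune of s and its
// encoded width in bytes. For an empty string it returns utf8.RuneError and 0.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	r, size := firstRune("世界")
	fmt.Printf("%c %U %d\n", r, r, size) // 世 U+4E16 3
}
```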
Try the examples on the Go Playground.
Note:
Your original code (using Scanner.Scan()) works like this:
You called Scanner.Init() which sets the Mode (b.Mode) to scanner.GoTokens.
Calling Scanner.Scan() on the input (from "a") returns scanner.Ident because "a" is a valid Go identifier:
c := b.Scan()
if c == scanner.Ident {
fmt.Println("Identifier:", b.TokenText())
}
// Output: "Identifier: a"
I know I'm a bit late to the party but here's a []rune to string function:
func runesToString(runes []rune) (outString string) {
// don't need index so _
for _, v := range runes {
outString += string(v)
}
return
}
yes, there is a named return but I think it's ok in this case as it reduces the number of lines and the function is only short
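For comparison, the built-in conversion string(runes) does the same job without a loop, and it avoids building a new string on every iteration. A minimal sketch, reusing the function name from above purely for illustration:

```go
package main

import "fmt"

// runesToString relies on the built-in conversion, which encodes every
// rune in the slice as UTF-8 and returns the concatenation.
func runesToString(runes []rune) string {
	return string(runes)
}

func main() {
	fmt.Println(runesToString([]rune{'a', 'b', 0x1F60A})) // ab😊
}
```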
This simple code works in converting a rune to a string
s := fmt.Sprintf("%c", r) // r is a value of type rune
Since I came to this question searching for rune and string and char, thought this may help newbies like me
// str := "aഐbc"
// testString(str)
func testString(oneString string) {
// string to byte slice: a plain conversion that copies the string's bytes
var twoByteArr []byte = []byte(oneString)
// string to rune slice: a conversion that decodes the UTF-8 bytes into code points
var threeRuneSlice []rune = []rune(oneString)
// A string is neither a []byte nor a []rune. It is a read-only sequence of
// bytes, usually UTF-8 text, which is why it converts to and from both.
// A rune slice converts back to a string by encoding each rune as UTF-8
var thirdString string = string(threeRuneSlice)
// There is a catch here, and that is in printing "characters" using a for loop and range
fmt.Println("Chars in oneString")
for i, r := range oneString {
fmt.Printf(" %d %v %c ", i, r, r) // i is a byte offset, so you may not get index 0,1,2,3 here
// since range decodes a string rune by rune https://blog.golang.org/strings
}
fmt.Println("\nChars in threeRuneSlice")
for i, r := range threeRuneSlice {
fmt.Printf(" %d %v %c ", i, r, r) // i = 0,1,2,3, one index per character
// a rune is an int32 holding one code point and a byte is a uint8;
// inside a string each code point is encoded as 1 to 4 bytes, and the
// code point is what represents the REAL CHARACTER
}
fmt.Println("\nValues in oneString ")
for j := 0; j < len(oneString); j++ {
fmt.Printf(" %d %v ", j, oneString[j]) // No, you cannot get characters if you iterate through a string this way
// as you are going over bytes here, not runes
}
fmt.Println("\nValues in twoByteArr")
for j := 0; j < len(twoByteArr); j++ {
fmt.Printf(" %d=%v ", j, twoByteArr[j]) // same as above
}
fmt.Printf("\none - %s, two %s, three %s\n", oneString, twoByteArr, thirdString)
}
And some more pointless demo https://play.golang.org/p/tagRBVG8k7V
adapted from https://groups.google.com/g/golang-nuts/c/84GCvDBhpbg/m/Tt6089MPFQAJ
to show that the 'characters' are encoded with one to up to 4 bytes depending on the unicode code point
Provide simple examples to understand how to do it quickly.
// rune => string
fmt.Printf("%c\n", 65) // A
fmt.Println(string(rune(0x1F60A))) // 😊
fmt.Println(string([]rune{0x1F468, 0x200D, 0x1F9B0})) // 👨‍🦰
// string => rune
fmt.Println(strconv.FormatUint(uint64([]rune("😊")[0]), 16)) // 1f60a
fmt.Printf("%U\n", '😊') // U+1F60A
fmt.Printf("%U %U %U\n", '👨', '‍', '🦰') // U+1F468 U+200D U+1F9B0
go playground

What is the fastest way to generate a long random string in Go?

Like [a-zA-Z0-9] string:
na1dopW129T0anN28udaZ
or hexadecimal string:
8c6f78ac23b4a7b8c0182d
By long I mean 2K and more characters.
This does about 200MBps on my box. There's obvious room for improvement.
type randomDataMaker struct {
src rand.Source
}
func (r *randomDataMaker) Read(p []byte) (n int, err error) {
for i := range p {
p[i] = byte(r.src.Int63() & 0xff)
}
return len(p), nil
}
You'd just use io.CopyN to produce the string you want. Obviously you could adjust the character set on the way in or whatever.
The nice thing about this model is that it's just an io.Reader so you can use it making anything.
Test is below:
func BenchmarkRandomDataMaker(b *testing.B) {
randomSrc := randomDataMaker{rand.NewSource(1028890720402726901)}
for i := 0; i < b.N; i++ {
b.SetBytes(int64(i))
_, err := io.CopyN(ioutil.Discard, &randomSrc, int64(i))
if err != nil {
b.Fatalf("Error copying at %v: %v", i, err)
}
}
}
On one core of my 2.2GHz i7:
BenchmarkRandomDataMaker 50000 246512 ns/op 202.83 MB/s
EDIT
Since I wrote the benchmark, I figured I'd do the obvious improvement thing (call out to the random source less frequently). With 1/8 the calls to rand, it runs about 4x faster, though it's a bit uglier:
New version:
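func (r *randomDataMaker) Read(p []byte) (n int, err error) {
if len(p) == 0 {
return 0, nil // nothing to fill; avoids indexing an empty slice below
}
todo := len(p)
offset := 0
for {
val := r.src.Int63() // 63 random bits, so the top bit of every eighth byte is always 0
for i := 0; i < 8; i++ {
p[offset] = byte(val & 0xff)
todo--
if todo == 0 {
return len(p), nil
}
offset++
val >>= 8
}
}
panic("unreachable")
}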
New benchmark:
BenchmarkRandomDataMaker 200000 251148 ns/op 796.34 MB/s
EDIT 2
Took out the masking in the cast to byte since it was redundant. Got a good deal faster:
BenchmarkRandomDataMaker 200000 231843 ns/op 862.64 MB/s
(this is so much easier than real work sigh)
EDIT 3
This came up in irc today, so I released a library. Also, my actual benchmark tool, while useful for relative speed, isn't sufficiently accurate in its reporting.
I created randbo that you can reuse to produce random streams wherever you may need them.
You can use the Go package uniuri to generate random strings (or view the source code to see how they're doing it). You'll want to use:
func NewLen(length int) string
NewLen returns a new random string of the provided length, consisting of standard characters.
Or, to specify the set of characters used:
func NewLenChars(length int, chars []byte) string
This is actually a little biased towards the first 8 characters in the set (since 256, the number of possible byte values, is not a multiple of len(alphanum)), but this will get you most of the way there.
import (
"crypto/rand"
)
func randString(n int) string {
const alphanum = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
var bytes = make([]byte, n)
rand.Read(bytes)
for i, b := range bytes {
bytes[i] = alphanum[b % byte(len(alphanum))]
}
return string(bytes)
}
If you want to generate a cryptographically secure random string, I recommend taking a look at this page. Here is a helper function that reads n random bytes from your OS's source of randomness and then base64-encodes them. Note that the resulting string is longer than n characters because of the base64 encoding.
package main
import(
"crypto/rand"
"encoding/base64"
"fmt"
)
func GenerateRandomBytes(n int) ([]byte, error) {
b := make([]byte, n)
_, err := rand.Read(b)
if err != nil {
return nil, err
}
return b, nil
}
func GenerateRandomString(s int) (string, error) {
b, err := GenerateRandomBytes(s)
return base64.URLEncoding.EncodeToString(b), err
}
func main() {
token, _ := GenerateRandomString(32)
fmt.Println(token)
}
Here is Evan Shaw's answer reworked without the bias towards the first 8 characters of the string. Note that it uses lots of expensive big.Int operations, so it probably isn't that quick! The result is cryptographically strong though.
It uses rand.Int to make an integer of exactly the right size len(alphanum) ** n, then does what is effectively a base conversion into base len(alphanum).
There is almost certainly a better algorithm for this which would involve keeping a much smaller remainder and adding random bytes to it as necessary. This would get rid of the expensive long integer arithmetic.
import (
"crypto/rand"
"fmt"
"math/big"
)
func randString(n int) string {
const alphanum = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
symbols := big.NewInt(int64(len(alphanum)))
states := big.NewInt(0)
states.Exp(symbols, big.NewInt(int64(n)), nil)
r, err := rand.Int(rand.Reader, states)
if err != nil {
panic(err)
}
var bytes = make([]byte, n)
r2 := big.NewInt(0)
symbol := big.NewInt(0)
for i := range bytes {
r2.DivMod(r, symbols, symbol)
r, r2 = r2, r
bytes[i] = alphanum[symbol.Int64()]
}
return string(bytes)
}
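One cheaper way to drop the bias is rejection sampling: throw away any random byte that falls in the uneven tail (248 and above for a 62-character set) and keep the rest. To be clear, this is a different technique from the remainder-carrying scheme described above, and the function name randStringUnbiased is made up for this sketch. It assumes n is not negative.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

const alphanum = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// randStringUnbiased (hypothetical helper) returns n characters drawn
// uniformly from alphanum, using rejection sampling on random bytes.
func randStringUnbiased(n int) (string, error) {
	// Largest multiple of len(alphanum) that fits in a byte: 248 = 4 * 62.
	const limit = 256 - 256%len(alphanum)
	out := make([]byte, 0, n)
	buf := make([]byte, n+n/4+16) // some slack to cover rejected bytes
	for len(out) < n {
		if _, err := rand.Read(buf); err != nil {
			return "", err
		}
		for _, b := range buf {
			if int(b) >= limit {
				continue // reject: keeping it would favor the first 8 characters
			}
			out = append(out, alphanum[int(b)%len(alphanum)])
			if len(out) == n {
				break
			}
		}
	}
	return string(out), nil
}

func main() {
	s, err := randStringUnbiased(24)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(s)
}
```

Only about 3% of the bytes are rejected (8 out of 256), so the outer loop rarely needs a second read.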

Resources