substrings and the Go garbage collector - string

When taking a substring of a string in Go, no new memory is allocated. Instead, the underlying representation of the substring contains a Data pointer that is an offset of the original string's Data pointer.
This means that if I have a large string and wish to keep track of a small substring, the garbage collector will be unable to free any of the large string until I release all references to the shorter substring.
Slices have a similar problem, but you can get around it by making a copy of the subslice using copy(). I am unaware of any similar copy operation for strings. What is the idiomatic and fastest way to make a "copy" of a substring?

For example,
package main
import (
"fmt"
"unsafe"
)
type String struct {
str *byte
len int
}
func main() {
str := "abc"
substr := string([]byte(str[1:]))
fmt.Println(str, substr)
fmt.Println(*(*String)(unsafe.Pointer(&str)), *(*String)(unsafe.Pointer(&substr)))
}
Output:
abc bc
{0x4c0640 3} {0xc21000c940 2}

I know this is an old question, but there are a couple ways you can do this without creating two copies of the data you want.
First is to create the []byte of the substring, then simply coerce it to a string using unsafe.Pointer. This works because the header for a []byte is the same as that for a string, except that the []byte has an extra Cap field at the end, so it just gets truncated.
package main
import (
"fmt"
"unsafe"
)
func main() {
str := "foobar"
byt := []byte(str[3:])
sub := *(*string)(unsafe.Pointer(&byt))
fmt.Println(str, sub)
}
The second way is to use reflect.StringHeader and reflect.SliceHeader to do a more explicit header transfer.
package main
import (
"fmt"
"unsafe"
"reflect"
)
func main() {
str := "foobar"
byt := []byte(str[3:])
bytPtr := (*reflect.SliceHeader)(unsafe.Pointer(&byt)).Data
strHdr := reflect.StringHeader{Data: bytPtr, Len: len(byt)}
sub := *(*string)(unsafe.Pointer(&strHdr))
fmt.Println(str, sub)
}

Related

Combining multiple maps that are stored on channel (Same key's values get summed.) in Go

My objective is to create a program that counts every unique word's occurrence in a text file in a parallellised fashion, all occurrences have to be presented in a single map.
What I do here is dividing the textfile into string and then to an array. That array is then divided into two slices of equal length and fed concurrently to the mapper function.
func WordCount(text string) (map[string]int) {
wg := new(sync.WaitGroup)
s := strings.Fields(newText)
freq := make(map[string]int,len(s))
channel := make(chan map[string]int,2)
wg.Add(1)
go mappers(s[0:(len(s)/2)], freq, channel,wg)
wg.Add(1)
go mappers(s[(len(s)/2):], freq, channel,wg)
wg.Wait()
actualMap := <-channel
return actualMap
func mappers(slice []string, occurrences map[string]int, ch chan map[string]int, wg *sync.WaitGroup) {
var l = sync.Mutex{}
for _, word := range slice {
l.Lock()
occurrences[word]++
l.Unlock()
}
ch <- occurrences
wg.Done()
}
The bottom line is, is that I get a huge multiline error that starts with
fatal error: concurrent map writes
When I run the code. Which I thought I guarded for through mutual exclusion
l.Lock()
occurrences[word]++
l.Unlock()
What am I doing wrong here? And furthermore. How can I combine all the maps in a channel? And with combine I mean same key's values get summed in the new map.
The main problem is that you use a separate lock in each goroutine. That doesn't do any help to serialize access to the map. The same lock has to be used in each goroutine.
And since you use the same map in each goroutine, you don't have to merge them, and you don't need a channel to deliver the result.
Even if you use the same mutex in each goroutine, since you use a single map, this probably won't help in performance, the goroutines will have to compete with each other for the map's lock.
You should create a separate map in each goroutine, use that to count locally, and then deliver the result map on the channel. This might give you a performance boost.
But then you don't need a lock, since each goroutine will have its own map which it can read/write without a mutex.
But then you'll do have to deliver the result on the channel, and then merge it.
And since goroutines deliver results on the channel, the waitgroup becomes unnecessary.
func WordCount(text string) map[string]int {
s := strings.Fields(text)
channel := make(chan map[string]int, 2)
go mappers(s[0:(len(s)/2)], channel)
go mappers(s[(len(s)/2):], channel)
total := map[string]int{}
for i := 0; i < 2; i++ {
m := <-channel
for k, v := range m {
total[k] += v
}
}
return total
}
func mappers(slice []string, ch chan map[string]int) {
occurrences := map[string]int{}
for _, word := range slice {
occurrences[word]++
}
ch <- occurrences
}
Example testing it:
fmt.Println(WordCount("aa ab cd cd de ef a x cd aa"))
Output (try it on the Go Playground):
map[a:1 aa:2 ab:1 cd:3 de:1 ef:1 x:1]
Also note that in theory this looks "good", but in practice you may still not achieve any performance boost, as the goroutines do too "little" work, and launching them and merging the results requires effort which may outweight the benefits.

Convert byte to string using reflect.StringHeader still allocates new memory?

I've got this small code snippet to test 2 ways of converting byte slice to string object, one function to allocate a new string object, another uses unsafe pointer arithmetic to construct string*, which doesn't allocate new memory:
package main
import (
"fmt"
"reflect"
"unsafe"
)
func byteToString(b []byte) string {
return string(b)
}
func byteToStringNoAlloc(b []byte) string {
if len(b) == 0 {
return ""
}
sh := reflect.StringHeader{uintptr(unsafe.Pointer(&b[0])), len(b)}
return *(*string)(unsafe.Pointer(&sh))
}
func main() {
b := []byte("hello")
fmt.Printf("1st element of slice: %v\n", &b[0])
str := byteToString(b)
sh := (*reflect.StringHeader)(unsafe.Pointer(&str))
fmt.Printf("New alloc: %v\n", sh)
toStr := byteToStringNoAlloc(b)
shNoAlloc := (*reflect.StringHeader)(unsafe.Pointer(&toStr))
fmt.Printf("No alloc: %v\n", shNoAlloc) // why different from &b[0]
}
I run this program under go 1.13:
1st element of slice: 0xc000076068
New alloc: &{824634204304 5}
No alloc: &{824634204264 5}
I exptect that the "1st element of slice" should print out the same address like "No alloc", but acturally they're very different. Where did I get wrong?
First of all, type conversions are calling a internal functions, for this case it's slicebytetostring.
https://golang.org/src/runtime/string.go?h=slicebytetostring#L75
It does copy of slice's content into new allocated memory.
In the second case you're creating a new header of the slice and cast it into string header the new unofficial holder of slice's content.
The problem of this is that garbage collector doesn't handle such kind of cases and resulting string header will be marked as a single structure which has no relations with the actual slice which holds the actual content, so, your resulting string would be valid only while the actual content holders are alive (don't count this string header itself).
So once garbage collector sweep the actual content, your string will still point to the same address but already freed memory, and you'll get the panic error or undefined behavior if you touch it.
By the way, there's no need to use reflect package and its headers because direct cast already creates new header as a result:
*(*string)(unsafe.Pointer(&byte_slice))

How to get substring index based on runes not bytes in Go?

I'm playing with Golang and found this problem. I can use the following code to get the index based on bytes:
strings.Index("您好你好", "你好")
What I got is 6 and it's based on bytes counting.
If we count on runes (characters), we should get 2 which is what I want. How can I get what I want?
Thanks.
You can use the utf8.RuneCountInString() method:
import (
"fmt"
"strings"
"unicode/utf8"
)
func main() {
input_string := "您好你好"
byte_index := strings.Index(input_string, "你好")
fmt.Println(utf8.RuneCountInString(input_string[:byte_index]))
}

Can a zero-length and zero-cap slice still point to an underlying array and prevent garbage collection?

Let's take the following scenario:
a := make([]int, 10000)
a = a[len(a):]
As we know from "Go Slices: Usage and Internals" there's a "possible gotcha" in downslicing. For any slice a if you do a[start:end] it still points to the original memory, so if you don't copy, a small downslice could potentially keep a very large array in memory for a long time.
However, this case is chosen to result in a slice that should not only have zero length, but zero capacity. A similar question could be asked for the construct a = a[0:0:0].
Does the current implementation still maintain a pointer to the underlying memory, preventing it from being garbage collected, or does it recognize that a slice with no len or cap could not possibly reference anything, and thus garbage collect the original backing array during the next GC pause (assuming no other references exist)?
Edit: Playing with reflect and unsafe on the Playground reveals that the pointer is non-zero:
func main() {
a := make([]int, 10000)
a = a[len(a):]
aHeader := *(*reflect.SliceHeader)((unsafe.Pointer(&a)))
fmt.Println(aHeader.Data)
a = make([]int, 0, 0)
aHeader = *(*reflect.SliceHeader)((unsafe.Pointer(&a)))
fmt.Println(aHeader.Data)
}
http://play.golang.org/p/L0tuzN4ULn
However, this doesn't necessarily answer the question because the second slice that NEVER had anything in it also has a non-zero pointer as the data field. Even so, the pointer could simply be uintptr(&a[len(a)-1]) + sizeof(int) which would be outside the block of backing memory and thus not trigger actual garbage collection, though this seems unlikely since that would prevent garbage collection of other things. The non-zero value could also conceivably just be Playground weirdness.
As seen in your example, re-slicing copies the slice header, including the data pointer to the new slice, so I put together a small test to try and force the runtime to reuse the memory if possible.
I'd like this to be more deterministic, but at least with go1.3 on x86_64, it shows that the memory used by the original array is eventually reused (it does not work in the playground in this form).
package main
import (
"fmt"
"unsafe"
)
func check(i uintptr) {
fmt.Printf("Value at %d: %d\n", i, *(*int64)(unsafe.Pointer(i)))
}
func garbage() string {
s := ""
for i := 0; i < 100000; i++ {
s += "x"
}
return s
}
func main() {
s := make([]int64, 100000)
s[0] = 42
p := uintptr(unsafe.Pointer(&s[0]))
check(p)
z := s[0:0:0]
s = nil
fmt.Println(z)
garbage()
check(p)
}

Format a Go string without printing?

Is there a simple way to format a string in Go without printing the string?
I can do:
bar := "bar"
fmt.Printf("foo: %s", bar)
But I want the formatted string returned rather than printed so I can manipulate it further.
I could also do something like:
s := "foo: " + bar
But this becomes difficult to read when the format string is complex, and cumbersome when one or many of the parts aren't strings and have to be converted first, like
i := 25
s := "foo: " + strconv.Itoa(i)
Is there a simpler way to do this?
Sprintf is what you are looking for.
Example
fmt.Sprintf("foo: %s", bar)
You can also see it in use in the Errors example as part of "A Tour of Go."
return fmt.Sprintf("at %v, %s", e.When, e.What)
1. Simple strings
For "simple" strings (typically what fits into a line) the simplest solution is using fmt.Sprintf() and friends (fmt.Sprint(), fmt.Sprintln()). These are analogous to the functions without the starter S letter, but these Sxxx() variants return the result as a string instead of printing them to the standard output.
For example:
s := fmt.Sprintf("Hi, my name is %s and I'm %d years old.", "Bob", 23)
The variable s will be initialized with the value:
Hi, my name is Bob and I'm 23 years old.
Tip: If you just want to concatenate values of different types, you may not automatically need to use Sprintf() (which requires a format string) as Sprint() does exactly this. See this example:
i := 23
s := fmt.Sprint("[age:", i, "]") // s will be "[age:23]"
For concatenating only strings, you may also use strings.Join() where you can specify a custom separator string (to be placed between the strings to join).
Try these on the Go Playground.
2. Complex strings (documents)
If the string you're trying to create is more complex (e.g. a multi-line email message), fmt.Sprintf() becomes less readable and less efficient (especially if you have to do this many times).
For this the standard library provides the packages text/template and html/template. These packages implement data-driven templates for generating textual output. html/template is for generating HTML output safe against code injection. It provides the same interface as package text/template and should be used instead of text/template whenever the output is HTML.
Using the template packages basically requires you to provide a static template in the form of a string value (which may be originating from a file in which case you only provide the file name) which may contain static text, and actions which are processed and executed when the engine processes the template and generates the output.
You may provide parameters which are included/substituted in the static template and which may control the output generation process. Typical form of such parameters are structs and map values which may be nested.
Example:
For example let's say you want to generate email messages that look like this:
Hi [name]!
Your account is ready, your user name is: [user-name]
You have the following roles assigned:
[role#1], [role#2], ... [role#n]
To generate email message bodies like this, you could use the following static template:
const emailTmpl = `Hi {{.Name}}!
Your account is ready, your user name is: {{.UserName}}
You have the following roles assigned:
{{range $i, $r := .Roles}}{{if $i}}, {{end}}{{.}}{{end}}
`
And provide data like this for executing it:
data := map[string]interface{}{
"Name": "Bob",
"UserName": "bob92",
"Roles": []string{"dbteam", "uiteam", "tester"},
}
Normally output of templates are written to an io.Writer, so if you want the result as a string, create and write to a bytes.Buffer (which implements io.Writer). Executing the template and getting the result as string:
t := template.Must(template.New("email").Parse(emailTmpl))
buf := &bytes.Buffer{}
if err := t.Execute(buf, data); err != nil {
panic(err)
}
s := buf.String()
This will result in the expected output:
Hi Bob!
Your account is ready, your user name is: bob92
You have the following roles assigned:
dbteam, uiteam, tester
Try it on the Go Playground.
Also note that since Go 1.10, a newer, faster, more specialized alternative is available to bytes.Buffer which is: strings.Builder. Usage is very similar:
builder := &strings.Builder{}
if err := t.Execute(builder, data); err != nil {
panic(err)
}
s := builder.String()
Try this one on the Go Playground.
Note: you may also display the result of a template execution if you provide os.Stdout as the target (which also implements io.Writer):
t := template.Must(template.New("email").Parse(emailTmpl))
if err := t.Execute(os.Stdout, data); err != nil {
panic(err)
}
This will write the result directly to os.Stdout. Try this on the Go Playground.
try using Sprintf(); it will not print the output but save it for future purpose.
check this out.
package main
import "fmt"
func main() {
address := "NYC"
fmt.Sprintf("I live in %v", address)
}
when you run this code, it will not output anything. But once you assigned the Sprintf() to a separate variable, it can be used for future purposes.
package main
import "fmt"
func main() {
address := "NYC"
fmt.Sprintf("I live in %v", address)
var city = fmt.Sprintf("lives in %v", address)
fmt.Println("Michael",city)
}
I've created go project for string formatting from template (it allow to format strings in C# or Python style) and by performance tests it is fater than fmt.Sprintf, you could find it here https://github.com/Wissance/stringFormatter
it works in following manner:
func TestStrFormat(t *testing.T) {
strFormatResult, err := Format("Hello i am {0}, my age is {1} and i am waiting for {2}, because i am {0}!",
"Michael Ushakov (Evillord666)", "34", "\"Great Success\"")
assert.Nil(t, err)
assert.Equal(t, "Hello i am Michael Ushakov (Evillord666), my age is 34 and i am waiting for \"Great Success\", because i am Michael Ushakov (Evillord666)!", strFormatResult)
strFormatResult, err = Format("We are wondering if these values would be replaced : {5}, {4}, {0}", "one", "two", "three")
assert.Nil(t, err)
assert.Equal(t, "We are wondering if these values would be replaced : {5}, {4}, one", strFormatResult)
strFormatResult, err = Format("No args ... : {0}, {1}, {2}")
assert.Nil(t, err)
assert.Equal(t, "No args ... : {0}, {1}, {2}", strFormatResult)
}
func TestStrFormatComplex(t *testing.T) {
strFormatResult, err := FormatComplex("Hello {user} what are you doing here {app} ?", map[string]string{"user":"vpupkin", "app":"mn_console"})
assert.Nil(t, err)
assert.Equal(t, "Hello vpupkin what are you doing here mn_console ?", strFormatResult)
}
In your case, you need to use Sprintf() for format string.
func Sprintf(format string, a ...interface{}) string
Sprintf formats according to a format specifier and returns the resulting string.
s := fmt.Sprintf("Good Morning, This is %s and I'm living here from last %d years ", "John", 20)
Your output will be :
Good Morning, This is John and I'm living here from last 20 years.
fmt.SprintF function returns a string and you can format the string the very same way you would have with fmt.PrintF
I came to this page specifically looking for a way to format an error string. So if someone needs help with the same, you want to use the fmt.Errorf() function.
The method signature is func Errorf(format string, a ...interface{}) error.
It returns the formatted string as a value that satisfies the error interface.
You can look up more details in the documentation - https://golang.org/pkg/fmt/#Errorf.
Instead of using template.New, you can just use the new builtin with
template.Template:
package main
import (
"strings"
"text/template"
)
func format(s string, v interface{}) string {
t, b := new(template.Template), new(strings.Builder)
template.Must(t.Parse(s)).Execute(b, v)
return b.String()
}
func main() {
bar := "bar"
println(format("foo: {{.}}", bar))
i := 25
println(format("foo: {{.}}", i))
}

Resources