Convert byte to string using reflect.StringHeader still allocates new memory?

I've got this small code snippet to test two ways of converting a byte slice to a string: one function allocates a new string, the other uses an unsafe pointer conversion to construct the string, which shouldn't allocate new memory:
package main
import (
    "fmt"
    "reflect"
    "unsafe"
)
func byteToString(b []byte) string {
    return string(b)
}
func byteToStringNoAlloc(b []byte) string {
    if len(b) == 0 {
        return ""
    }
    sh := reflect.StringHeader{uintptr(unsafe.Pointer(&b[0])), len(b)}
    return *(*string)(unsafe.Pointer(&sh))
}
func main() {
    b := []byte("hello")
    fmt.Printf("1st element of slice: %v\n", &b[0])
    str := byteToString(b)
    sh := (*reflect.StringHeader)(unsafe.Pointer(&str))
    fmt.Printf("New alloc: %v\n", sh)
    toStr := byteToStringNoAlloc(b)
    shNoAlloc := (*reflect.StringHeader)(unsafe.Pointer(&toStr))
    fmt.Printf("No alloc: %v\n", shNoAlloc) // why different from &b[0]
}
I ran this program under Go 1.13:
1st element of slice: 0xc000076068
New alloc: &{824634204304 5}
No alloc: &{824634204264 5}
I expect "1st element of slice" to print the same address as "No alloc", but actually they look very different. Where did I go wrong?

First of all, a type conversion calls an internal runtime function; in this case it's slicebytetostring.
https://golang.org/src/runtime/string.go?h=slicebytetostring#L75
It copies the slice's content into newly allocated memory.
In the second case you're creating a new header from the slice and casting it to a string header, which becomes a new, unofficial holder of the slice's content.
The problem is that the garbage collector doesn't handle this kind of case: the resulting string header is treated as a standalone structure with no relation to the slice that actually holds the content, so your resulting string is valid only while the actual content holders are alive (the string header itself doesn't count).
So once the garbage collector sweeps the actual content, your string will still point to the same address, but that memory is already freed, and you'll get a panic or undefined behavior if you touch it.
By the way, there's no need to use the reflect package and its headers, because a direct cast already creates a new header as a result:
*(*string)(unsafe.Pointer(&byte_slice))
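A side note on the printed values, offered as a quick check rather than as part of the explanation above: %v prints a *byte in hexadecimal, but it prints the uintptr Data field of a reflect.StringHeader in decimal, so the question's output shows addresses in two different bases. The sketch below converts the decimal values copied from that output to hex; hexOf is a helper name made up for this illustration, and a 64-bit address size is assumed.

```go
package main

import "fmt"

// hexOf formats an address-sized integer the way fmt prints pointers.
// (Hypothetical helper, only for this illustration.)
func hexOf(addr uint64) string {
	return fmt.Sprintf("%#x", addr)
}

func main() {
	// Decimal Data values copied from the question's output.
	fmt.Println("New alloc:", hexOf(824634204304))
	fmt.Println("No alloc: ", hexOf(824634204264))
}
```

The "No alloc" value comes out as 0xc000076068, the same address that was printed for &b[0], while the "New alloc" value is a different address.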

Related

How to use map[string]*string

I'm trying to use sarama (Admin mode) to create a topic.
Without ConfigEntries it works fine, but I need to define some configs.
I set up the topic config (this is where the error happens):
tConfigs := map[string]*string{
"cleanup.policy": "delete",
"delete.retention.ms": "36000000",
}
But then I get an error:
./main.go:99:28: cannot use "delete" (type string) as type *string in map value
./main.go:100:28: cannot use "36000000" (type string) as type *string in map value
I'm trying to use the admin mode like this:
err = admin.CreateTopic(t.Name, &sarama.TopicDetail{
NumPartitions: 1,
ReplicationFactor: 3,
ConfigEntries: tConfigs,
}, false)
Here is the line from the sarama module that defines CreateTopic()
https://github.com/Shopify/sarama/blob/master/admin.go#L18
Basically, I don't understand how a map of string pointers works :)

To initialize a map having string pointer value type with a composite literal, you have to use string pointer values. A string literal is not a pointer, it's just a string value.
An easy way to get a pointer to a string value is to take the address of a variable of string type, e.g.:
s1 := "delete"
s2 := "36000000"
tConfigs := map[string]*string{
"cleanup.policy": &s1,
"delete.retention.ms": &s2,
}
To make it convenient when used many times, create a helper function:
func strptr(s string) *string { return &s }
And using it:
tConfigs := map[string]*string{
"cleanup.policy": strptr("delete"),
"delete.retention.ms": strptr("36000000"),
}
Try the examples on the Go Playground.
See background and other options here: How do I do a literal *int64 in Go?
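If the same need comes up for other types (the linked *int64 question, for example), the helper can be written once with a type parameter. This is only a sketch: ptr is a hypothetical name, and it assumes Go 1.18 or later for generics.

```go
package main

import "fmt"

// ptr returns a pointer to a copy of v. Hypothetical helper name;
// requires Go 1.18+ for type parameters.
func ptr[T any](v T) *T { return &v }

func main() {
	tConfigs := map[string]*string{
		"cleanup.policy":      ptr("delete"),
		"delete.retention.ms": ptr("36000000"),
	}
	// Map iteration order is not specified, so the line order may vary.
	for k, v := range tConfigs {
		fmt.Println(k, "=", *v)
	}
}
```

Each call returns the address of its own copy of the argument, so the map values never alias each other.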

Garbage collection and correct usage of pointers in Go

I come from a Python/Ruby/JavaScript background. I understand how pointers work, however, I'm not completely sure how to leverage them in the following situation.
Let's pretend we have a fictitious web API that searches some image database and returns a JSON describing what's displayed in each image that was found:
[
{
"url": "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"description": "Ocean islands",
"tags": [
{"name":"ocean", "rank":1},
{"name":"water", "rank":2},
{"name":"blue", "rank":3},
{"name":"forest", "rank":4}
]
},
...
{
"url": "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg",
"description": "Bridge over river",
"tags": [
{"name":"bridge", "rank":1},
{"name":"river", "rank":2},
{"name":"water", "rank":3},
{"name":"forest", "rank":4}
]
}
]
My goal is to create a data structure in Go that will map each tag to a list of image URLs that would look like this:
{
"ocean": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"water": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"blue": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"forest":[
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"bridge": [
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"river":[
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
]
}
As you can see, each image URL can belong to multiple tags at the same time. If I have thousands of images and even more tags, this data structure can grow very large if image URL strings are copied by value for each tag. This is where I want to leverage pointers.
I can represent the JSON API response by two structs in Go, func searchImages() mimics the fake API:
package main
import "fmt"
type Image struct {
URL string
Description string
Tags []*Tag
}
type Tag struct {
Name string
Rank int
}
// this function mimics json.NewDecoder(resp.Body).Decode(&parsedJSON)
func searchImages() []*Image {
parsedJSON := []*Image{
&Image {
URL: "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
Description: "Ocean islands",
Tags: []*Tag{
&Tag{"ocean", 1},
&Tag{"water", 2},
&Tag{"blue", 3},
&Tag{"forest", 4},
},
},
&Image {
URL: "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg",
Description: "Bridge over river",
Tags: []*Tag{
&Tag{"bridge", 1},
&Tag{"river", 2},
&Tag{"water", 3},
&Tag{"forest", 4},
},
},
}
return parsedJSON
}
Now the less optimal mapping function that results in a very large in-memory data structure can look like this:
func main() {
result := searchImages()
tagToUrlMap := make(map[string][]string)
for _, image := range result {
for _, tag := range image.Tags {
// fmt.Println(image.URL, tag.Name)
tagToUrlMap[tag.Name] = append(tagToUrlMap[tag.Name], image.URL)
}
}
fmt.Println(tagToUrlMap)
}
I can modify it to use pointers to the Image struct URL field instead of copying it by value:
// Version 1
tagToUrlMap := make(map[string][]*string)
for _, image := range result {
for _, tag := range image.Tags {
// fmt.Println(image.URL, tag.Name)
tagToUrlMap[tag.Name] = append(tagToUrlMap[tag.Name], &image.URL)
}
}
It works and my first question is what happens to the result data structure after I build the mapping in this way? Will the Image URL string fields be left in memory somehow and the rest of the result will be garbage collected? Or will the result data structure stay in memory until the end of the program because something points to its members?
Another way to do this would be to copy the URL to an intermediate variable and use a pointer to it instead:
// Version 2
tagToUrlMap := make(map[string][]*string)
for _, image := range result {
    imageUrl := image.URL // copy the URL into a per-iteration local variable
    for _, tag := range image.Tags {
        // fmt.Println(image.URL, tag.Name)
        tagToUrlMap[tag.Name] = append(tagToUrlMap[tag.Name], &imageUrl)
    }
}
Is this better? Will the result data structure be garbage collected correctly?
Or perhaps I should use a pointer to string in the Image struct instead?
type Image struct {
URL *string
Description string
Tags []*Tag
}
Is there a better way to do this? I would also appreciate any resources on Go that describe various uses of pointers in depth. Thanks!
https://play.golang.org/p/VcKWUYLIpH7
UPDATE: I'm worried about optimal memory consumption and not generating unwanted garbage the most. My goal is to use the minimal amount of memory possible.

Foreword: I released the presented string pool in my github.com/icza/gox library, see stringsx.Pool.
First some background. string values in Go are represented by a small struct-like data structure reflect.StringHeader:
type StringHeader struct {
Data uintptr
Len int
}
So basically passing / copying a string value passes / copies this small struct value, which is 2 words only regardless of the length of the string. On 64-bit architectures, it's only 16 bytes, even if the string has a thousand characters.
So basically string values already act as pointers. Introducing another pointer like *string just complicates usage, and you won't save any noticeable amount of memory. For the sake of memory optimization, forget about using *string.
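The size claim is easy to check with unsafe.Sizeof. The sketch below is only illustrative; the helper names are made up, and the sizes cover the header only, not the bytes the string points to.

```go
package main

import (
	"fmt"
	"unsafe"
)

// wordSize is the size of one machine word on the current platform.
const wordSize = unsafe.Sizeof(uintptr(0))

// stringWords reports how many words a string value (its header) occupies.
func stringWords() uintptr {
	var s string
	return unsafe.Sizeof(s) / wordSize
}

// stringPtrWords reports how many words a *string occupies.
func stringPtrWords() uintptr {
	var p *string
	return unsafe.Sizeof(p) / wordSize
}

func main() {
	fmt.Println("string header words:", stringWords())
	fmt.Println("*string words:      ", stringPtrWords())
}
```

With the standard toolchain this reports 2 words for a string and 1 for a *string, so a *string saves at most one word per stored value while adding an indirection.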
It works and my first question is what happens to the result data structure after I build the mapping in this way? Will the Image URL string fields be left in memory somehow and the rest of the result will be garbage collected? Or will the result data structure stay in memory until the end of the program because something points to its members?
If you have a pointer value pointing to a field of a struct value, then the whole struct will be kept in memory; it can't be garbage collected. Note that although it could be possible to release the memory reserved for the other fields of the struct, the current Go runtime and garbage collector do not do so. So to achieve optimal memory usage, you should forget about storing addresses of struct fields (unless you also need the complete struct values, but still, storing field addresses and slice/array element addresses always requires care).
The reason for this is that memory for a struct value is allocated as a contiguous segment, so keeping only a single referenced field would strongly fragment the available / free memory, and would make optimal memory management even harder and less efficient. Defragmenting such areas would also require copying the referenced field's memory area, which would require "live-changing" pointer values (changing memory addresses).
So while using pointers to string values may save you a tiny amount of memory, the added complexity and additional indirections make it not worth it.
So what to do then?
"Optimal" solution
So the cleanest way is to keep using string values.
And there is one more optimization we didn't talk about earlier.
You get your results by unmarshaling a JSON API response. This means that if the same URL or tag value is included multiple times in the JSON response, different string values will be created for them.
What does this mean? If you have the same URL twice in the JSON response, after unmarshaling, you will have 2 distinct string values which will contain 2 different pointers pointing to 2 different allocated byte sequences (string content which otherwise will be the same). The encoding/json package does not do string interning.
Here's a little app that proves this:
var s []string
err := json.Unmarshal([]byte(`["abc", "abc", "abc"]`), &s)
if err != nil {
panic(err)
}
for i := range s {
hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
fmt.Println(hdr.Data)
}
Output of the above (try it on the Go Playground):
273760312
273760315
273760320
We see 3 different pointers. They could be the same, as string values are immutable.
The json package does not detect repeating string values because the detection adds memory and computational overhead, which is obviously something unwanted. But in our case we shoot for optimal memory usage, so an "initial", additional computation is worth the big memory gain.
So let's do our own string interning. How to do that?
After unmarshaling the JSON result, during building the tagToUrlMap map, let's keep track of string values we have come across, and if the subsequent string value has been seen earlier, just use that earlier value (its string descriptor).
Here's a very simple string interner implementation:
var cache = map[string]string{}
func interned(s string) string {
    if s2, ok := cache[s]; ok {
        return s2
    }
    // New string, store it
    cache[s] = s
    return s
}
Let's test this "interner" in the example code above:
var s []string
err := json.Unmarshal([]byte(`["abc", "abc", "abc"]`), &s)
if err != nil {
panic(err)
}
for i := range s {
hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
fmt.Println(hdr.Data, s[i])
}
for i := range s {
s[i] = interned(s[i])
}
for i := range s {
hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
fmt.Println(hdr.Data, s[i])
}
Output of the above (try it on the Go Playground):
273760312 abc
273760315 abc
273760320 abc
273760312 abc
273760312 abc
273760312 abc
Wonderful! As we can see, after using our interned() function, only a single instance of the "abc" string is used in our data structure (which is actually the first occurrence). This means all other instances (given no one else uses them) can be–and will be–properly garbage collected (by the garbage collector, some time in the future).
One thing to not forget here: the string interner uses a cache dictionary which stores all previously encountered string values. So to let those strings go, you should "clear" this cache map too, simplest done by assigning a nil value to it.
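One way to make that cleanup harder to forget is to keep the cache inside a value rather than in a package-level variable, so the cache becomes collectable together with the value that owns it. The following is a minimal sketch under that assumption; internPool and its methods are hypothetical names (not the API of stringsx.Pool), and it is not safe for concurrent use.

```go
package main

import "fmt"

// internPool is a hypothetical, minimal string interner whose cache
// lives inside a value instead of a package-level variable.
// It is not safe for concurrent use.
type internPool struct {
	cache map[string]string
}

// newInternPool returns an empty pool.
func newInternPool() *internPool {
	return &internPool{cache: make(map[string]string)}
}

// intern returns the first string seen with the same content as s.
func (p *internPool) intern(s string) string {
	if s2, ok := p.cache[s]; ok {
		return s2
	}
	// New string, store it.
	p.cache[s] = s
	return s
}

// size reports how many distinct strings have been stored.
func (p *internPool) size() int { return len(p.cache) }

func main() {
	pool := newInternPool()
	for _, s := range []string{"abc", "abc", "xyz"} {
		fmt.Println(pool.intern(s))
	}
	fmt.Println("distinct strings:", pool.size())
	// Once pool is unreachable, its cache can be collected without
	// an explicit "cache = nil".
}
```

The trade-off is that the pool has to be passed to whatever code builds the data structure, which is a little more plumbing than a global function.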
Without further ado, let's see our solution:
result := searchImages()
tagToUrlMap := make(map[string][]string)
for _, image := range result {
imageURL := interned(image.URL)
for _, tag := range image.Tags {
tagName := interned(tag.Name)
tagToUrlMap[tagName] = append(tagToUrlMap[tagName], imageURL)
}
}
// Clear the interner cache:
cache = nil
To verify the results:
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
if err := enc.Encode(tagToUrlMap); err != nil {
panic(err)
}
Output is (try it on the Go Playground):
{
"blue": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"bridge": [
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"forest": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"ocean": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"river": [
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"water": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
]
}
Further memory optimizations:
We used the builtin append() function to add new image URLs to tags. append() may (and usually does) allocate bigger slices than needed (thinking of future growth). After our "build" process, we may go through our tagToUrlMap map and "trim" those slices to the minimum needed.
This is how it could be done:
for tagName, urls := range tagToUrlMap {
if cap(urls) > len(urls) {
urls2 := make([]string, len(urls))
copy(urls2, urls)
tagToUrlMap[tagName] = urls2
}
}

Will the [...] be garbage collected correctly?
Yes.
You never need to worry that something will be collected which is still in use and you can rely on everything being collected once it is no longer used.
So the question about GC is never "Will it be collected correctly?" but "Do I generate unnecessary garbage?". Now this actual question depends less on the data structure than on the number of new objects created (on the heap). So this is a question about how the data structures are used and much less about the structure itself. Use benchmarks and run go test with -benchmem.
(High end performance might also consider how much work the GC has to do: Scanning pointers might take time. Forget that for now.)
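As a rough illustration of measuring rather than guessing, the sketch below counts heap allocations for one way of building a tag index. It is not a full go test benchmark: it uses testing.AllocsPerRun from an ordinary program, and buildIndex with its simplified input shape (URL to tags) is an assumption made for this example.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the result reachable so the compiler cannot optimize
// the allocations away.
var sink map[string][]string

// buildIndex maps each tag to the URLs carrying it, copying string values.
// The input shape (URL -> tags) is simplified for this sketch.
func buildIndex(images map[string][]string) map[string][]string {
	idx := make(map[string][]string)
	for url, tags := range images {
		for _, tag := range tags {
			idx[tag] = append(idx[tag], url)
		}
	}
	return idx
}

// allocsPerBuild reports the average number of heap allocations per call.
func allocsPerBuild(images map[string][]string) float64 {
	return testing.AllocsPerRun(100, func() {
		sink = buildIndex(images)
	})
}

func main() {
	images := map[string][]string{
		"https://example.com/a.jpg": {"ocean", "water", "blue", "forest"},
		"https://example.com/b.jpg": {"bridge", "river", "water", "forest"},
	}
	fmt.Println("allocations per build:", allocsPerBuild(images))
}
```

For real comparisons, a BenchmarkXxx function run with go test -bench . -benchmem additionally reports bytes allocated and time per operation.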
The other relevant question is about memory consumption. Copying a string copies just two words (a data pointer and a length), while copying a *string copies one word. So there is not much to save here by using *string.
So unfortunately there are no clear answers to the relevant questions (amount of garbage generated and total memory consumption). Don't overthink the problem, use what fits your purpose, measure and refactor.

Can a zero-length and zero-cap slice still point to an underlying array and prevent garbage collection?

Let's take the following scenario:
a := make([]int, 10000)
a = a[len(a):]
As we know from "Go Slices: Usage and Internals" there's a "possible gotcha" in downslicing. For any slice a if you do a[start:end] it still points to the original memory, so if you don't copy, a small downslice could potentially keep a very large array in memory for a long time.
However, this case is chosen to result in a slice that should not only have zero length, but zero capacity. A similar question could be asked for the construct a = a[0:0:0].
Does the current implementation still maintain a pointer to the underlying memory, preventing it from being garbage collected, or does it recognize that a slice with no len or cap could not possibly reference anything, and thus garbage collect the original backing array during the next GC pause (assuming no other references exist)?
Edit: Playing with reflect and unsafe on the Playground reveals that the pointer is non-zero:
func main() {
a := make([]int, 10000)
a = a[len(a):]
aHeader := *(*reflect.SliceHeader)((unsafe.Pointer(&a)))
fmt.Println(aHeader.Data)
a = make([]int, 0, 0)
aHeader = *(*reflect.SliceHeader)((unsafe.Pointer(&a)))
fmt.Println(aHeader.Data)
}
http://play.golang.org/p/L0tuzN4ULn
However, this doesn't necessarily answer the question because the second slice that NEVER had anything in it also has a non-zero pointer as the data field. Even so, the pointer could simply be uintptr(&a[len(a)-1]) + sizeof(int) which would be outside the block of backing memory and thus not trigger actual garbage collection, though this seems unlikely since that would prevent garbage collection of other things. The non-zero value could also conceivably just be Playground weirdness.

As seen in your example, re-slicing copies the slice header, including the data pointer to the new slice, so I put together a small test to try and force the runtime to reuse the memory if possible.
I'd like this to be more deterministic, but at least with go1.3 on x86_64, it shows that the memory used by the original array is eventually reused (it does not work in the playground in this form).
package main
import (
"fmt"
"unsafe"
)
func check(i uintptr) {
fmt.Printf("Value at %d: %d\n", i, *(*int64)(unsafe.Pointer(i)))
}
func garbage() string {
s := ""
for i := 0; i < 100000; i++ {
s += "x"
}
return s
}
func main() {
s := make([]int64, 100000)
s[0] = 42
p := uintptr(unsafe.Pointer(&s[0]))
check(p)
z := s[0:0:0]
s = nil
fmt.Println(z)
garbage()
check(p)
}

Using string pointer to send a string through windows messages

I am trying to understand how pointers to strings work. I have some code (not exactly the original) that was written by somebody who is no longer around, so I need to understand the idea behind this usage.
var
STR: string;
pStr: ^string;
begin
STR := 'Hello world';
New(pStr);
pStr^ := STR;
PostMessage(Handle, WM_USER+1, wParam(pStr), 0);
end;
Now I know for sure, that a message handler gets the message and the pointer contains the string, which can be worked with, but what happens 'under the hood' of those operations ?
I tried to make a small project. I thought that assigning a string to what the string pointer points to would increase the refcount of the original string and not copy it, but the refcount remained 1 and it seems the contents were copied.
So the question is: what happened? Calling New on a pointer allocates an empty string, right?
After the assignment I tried to look at the refcount/length of the string the pointer points to, like this: PChar(@pStr^[1])[-8], but it returned nonsense (14), and the length byte was also wrong.
An additional question: is it safe to use pointers in this way to pass a string through Windows messaging?

New(pStr) allocates a string on the heap and returns a pointer to it. Because string is a managed type, the string is default initialized, to the empty string. Since a string is implemented as a pointer, what you fundamentally have is a pointer to a pointer.
Your code is perfectly fine, so long as you only post the message to your own process. Since the payload of the message is a pointer, it only means something in the context of the virtual address space of your process. If you wanted to send to a different process you'd need an IPC mechanism.
Clearly in the code that pulls the message off the queue you need to dispose of the string. Something like this:
var
p: ^string;
str: string;
....
p := Pointer(wParam);
str := p^;
Dispose(p);
Your code to query the reference count and the length is just wrong. Here's how to do it correctly:
{$APPTYPE CONSOLE}
var
pStr: ^string;
p: PInteger;
begin
New(pStr);
pStr^ := 'Hello world';
p := PInteger(pStr^);
dec(p);
Writeln(p^); // length
dec(p);
Writeln(p^); // ref count
Readln;
end.
Output:
11
1

substrings and the Go garbage collector

When taking a substring of a string in Go, no new memory is allocated. Instead, the underlying representation of the substring contains a Data pointer that is an offset of the original string's Data pointer.
This means that if I have a large string and wish to keep track of a small substring, the garbage collector will be unable to free any of the large string until I release all references to the shorter substring.
Slices have a similar problem, but you can get around it by making a copy of the subslice using copy(). I am unaware of any similar copy operation for strings. What is the idiomatic and fastest way to make a "copy" of a substring?

For example,
package main
import (
"fmt"
"unsafe"
)
type String struct {
str *byte
len int
}
func main() {
str := "abc"
substr := string([]byte(str[1:]))
fmt.Println(str, substr)
fmt.Println(*(*String)(unsafe.Pointer(&str)), *(*String)(unsafe.Pointer(&substr)))
}
Output:
abc bc
{0x4c0640 3} {0xc21000c940 2}

I know this is an old question, but there are a couple ways you can do this without creating two copies of the data you want.
First is to create the []byte of the substring, then simply coerce it to a string using unsafe.Pointer. This works because the header for a []byte is the same as that for a string, except that the []byte has an extra Cap field at the end, so it just gets truncated.
package main
import (
"fmt"
"unsafe"
)
func main() {
str := "foobar"
byt := []byte(str[3:])
sub := *(*string)(unsafe.Pointer(&byt))
fmt.Println(str, sub)
}
The second way is to use reflect.StringHeader and reflect.SliceHeader to do a more explicit header transfer.
package main
import (
"fmt"
"unsafe"
"reflect"
)
func main() {
str := "foobar"
byt := []byte(str[3:])
bytPtr := (*reflect.SliceHeader)(unsafe.Pointer(&byt)).Data
strHdr := reflect.StringHeader{Data: bytPtr, Len: len(byt)}
sub := *(*string)(unsafe.Pointer(&strHdr))
fmt.Println(str, sub)
}
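On newer Go releases there is also a standard-library route that needs no unsafe code: strings.Clone (added in Go 1.18) returns a fresh copy of its argument. A minimal sketch, assuming Go 1.18 or later; detachedSubstring is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// detachedSubstring returns s[from:] backed by its own, freshly
// allocated memory, so it no longer references the memory of s.
// Hypothetical helper name; strings.Clone needs Go 1.18+.
func detachedSubstring(s string, from int) string {
	return strings.Clone(s[from:])
}

func main() {
	str := "foobar"
	sub := detachedSubstring(str, 3)
	fmt.Println(str, sub)
}
```

This makes one copy of the substring's bytes, and the result stays an ordinary immutable string.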
