Garbage collection and correct usage of pointers in Go - string

I come from a Python/Ruby/JavaScript background. I understand how pointers work, however, I'm not completely sure how to leverage them in the following situation.
Let's pretend we have a fictitious web API that searches some image database and returns a JSON describing what's displayed in each image that was found:
[
{
"url": "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"description": "Ocean islands",
"tags": [
{"name":"ocean", "rank":1},
{"name":"water", "rank":2},
{"name":"blue", "rank":3},
{"name":"forest", "rank":4}
]
},
...
{
"url": "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg",
"description": "Bridge over river",
"tags": [
{"name":"bridge", "rank":1},
{"name":"river", "rank":2},
{"name":"water", "rank":3},
{"name":"forest", "rank":4}
]
}
]
My goal is to create a data structure in Go that will map each tag to a list of image URLs that would look like this:
{
"ocean": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"water": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"blue": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"forest":[
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"bridge": [
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"river":[
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
]
}
As you can see, each image URL can belong to multiple tags at the same time. If I have thousands of images and even more tags, this data structure can grow very large if image URL strings are copied by value for each tag. This is where I want to leverage pointers.
I can represent the JSON API response by two structs in Go, func searchImages() mimics the fake API:
package main
import "fmt"
type Image struct {
URL string
Description string
Tags []*Tag
}
type Tag struct {
Name string
Rank int
}
// this function mimics json.NewDecoder(resp.Body).Decode(&parsedJSON)
func searchImages() []*Image {
parsedJSON := []*Image{
&Image {
URL: "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
Description: "Ocean islands",
Tags: []*Tag{
&Tag{"ocean", 1},
&Tag{"water", 2},
&Tag{"blue", 3},
&Tag{"forest", 4},
},
},
&Image {
URL: "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg",
Description: "Bridge over river",
Tags: []*Tag{
&Tag{"bridge", 1},
&Tag{"river", 2},
&Tag{"water", 3},
&Tag{"forest", 4},
},
},
}
return parsedJSON
}
Now the less optimal mapping function that results in a very large in-memory data structure can look like this:
func main() {
result := searchImages()
tagToUrlMap := make(map[string][]string)
for _, image := range result {
for _, tag := range image.Tags {
// fmt.Println(image.URL, tag.Name)
tagToUrlMap[tag.Name] = append(tagToUrlMap[tag.Name], image.URL)
}
}
fmt.Println(tagToUrlMap)
}
I can modify it to use pointers to the Image struct URL field instead of copying it by value:
// Version 1
tagToUrlMap := make(map[string][]*string)
for _, image := range result {
for _, tag := range image.Tags {
// fmt.Println(image.URL, tag.Name)
tagToUrlMap[tag.Name] = append(tagToUrlMap[tag.Name], &image.URL)
}
}
It works and my first question is what happens to the result data structure after I build the mapping in this way? Will the Image URL string fields be left in memory somehow and the rest of the result will be garbage collected? Or will the result data structure stay in memory until the end of the program because something points to its members?
Another way to do this would be to copy the URL to an intermediate variable and use a pointer to it instead:
// Version 2
tagToUrlMap := make(map[string][]*string)
for _, image := range result {
imageUrl = image.URL
for _, tag := range image.Tags {
// fmt.Println(image.URL, tag.Name)
tagToUrlMap[tag.Name] = append(tagToUrlMap[tag.Name], &imageUrl)
}
}
Is this better? Will the result data structure be garbage collected correctly?
Or perhaps I should use a pointer to string in the Image struct instead?
type Image struct {
URL *string
Description string
Tags []*Tag
}
Is there a better way to do this? I would also appreciate any resources on Go that describe various uses of pointers in depth. Thanks!
https://play.golang.org/p/VcKWUYLIpH7
UPDATE: I'm worried about optimal memory consumption and not generating unwanted garbage the most. My goal is to use the minimal amount of memory possible.

Foreword: I released the presented string pool in my github.com/icza/gox library, see stringsx.Pool.
First some background. string values in Go are represented by a small struct-like data structure reflect.StringHeader:
type StringHeader struct {
Data uintptr
Len int
}
So basically passing / copying a string value passes / copies this small struct value, which is 2 words only regardless of the length of the string. On 64-bit architectures, it's only 16 bytes, even if the string has a thousand characters.
So basically string values already act as pointers. Introducing another pointer like *string just complicates usage, and you won't really gain any noticable memory. For the sake of memory optimization, forget about using *string.
It works and my first question is what happens to the result data structure after I build the mapping in this way? Will the Image URL string fields be left in memory somehow and the rest of the result will be garbage collected? Or will the result data structure stay in memory until the end of the program because something points to its members?
If you have a pointer value pointing to a field of a struct value, then the whole struct will be kept in memory, it can't be garbage collected. Note that although it could be possible to release memory reserved for other fields of the struct, but the current Go runtime and garbage collector does not do so. So to achieve optimal memory usage, you should forget about storing addresses of struct fields (unless you also need the complete struct values, but still, storing field addresses and slice/array element addresses always requires care).
The reason for this is because memory for struct values are allocated as a contiguous segment, and so keeping only a single referenced field would strongly fragment the available / free memory, and would make optimal memory management even harder and less efficient. Defragmenting such areas would also require copying the referenced field's memory area, which would require "live-changing" pointer values (changing memory addresses).
So while using pointers to string values may save you some tiny memory, the added complexity and additional indirections make it unworthy.
So what to do then?
"Optimal" solution
So the cleanest way is to keep using string values.
And there is one more optimization we didn't talk about earlier.
You get your results by unmarshaling a JSON API response. This means that if the same URL or tag value is included multiple times in the JSON response, different string values will be created for them.
What does this mean? If you have the same URL twice in the JSON response, after unmarshaling, you will have 2 distinct string values which will contain 2 different pointers pointing to 2 different allocated byte sequences (string content which otherwise will be the same). The encoding/json package does not do string interning.
Here's a little app that proves this:
var s []string
err := json.Unmarshal([]byte(`["abc", "abc", "abc"]`), &s)
if err != nil {
panic(err)
}
for i := range s {
hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
fmt.Println(hdr.Data)
}
Output of the above (try it on the Go Playground):
273760312
273760315
273760320
We see 3 different pointers. They could be the same, as string values are immutable.
The json package does not detect repeating string values because the detection adds memory and computational overhead, which is obviously something unwanted. But in our case we shoot for optimal memory usage, so an "initial", additional computation does worth the big memory gain.
So let's do our own string interning. How to do that?
After unmarshaling the JSON result, during building the tagToUrlMap map, let's keep track of string values we have come across, and if the subsequent string value has been seen earlier, just use that earlier value (its string descriptor).
Here's a very simple string interner implementation:
var cache = map[string]string{}
func interned(s string) string {
if s2, ok := cache[s]; ok {
return s2
}
// New string, store it
cache[s] = s
return s
}
Let's test this "interner" in the example code above:
var s []string
err := json.Unmarshal([]byte(`["abc", "abc", "abc"]`), &s)
if err != nil {
panic(err)
}
for i := range s {
hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
fmt.Println(hdr.Data, s[i])
}
for i := range s {
s[i] = interned(s[i])
}
for i := range s {
hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
fmt.Println(hdr.Data, s[i])
}
Output of the above (try it on the Go Playground):
273760312 abc
273760315 abc
273760320 abc
273760312 abc
273760312 abc
273760312 abc
Wonderful! As we can see, after using our interned() function, only a single instance of the "abc" string is used in our data structure (which is actually the first occurrence). This means all other instances (given no one else uses them) can be–and will be–properly garbage collected (by the garbage collector, some time in the future).
One thing to not forget here: the string interner uses a cache dictionary which stores all previously encountered string values. So to let those strings go, you should "clear" this cache map too, simplest done by assigning a nil value to it.
Without further ado, let's see our solution:
result := searchImages()
tagToUrlMap := make(map[string][]string)
for _, image := range result {
imageURL := interned(image.URL)
for _, tag := range image.Tags {
tagName := interned(tag.Name)
tagToUrlMap[tagName] = append(tagToUrlMap[tagName], imageURL)
}
}
// Clear the interner cache:
cache = nil
To verify the results:
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
if err := enc.Encode(tagToUrlMap); err != nil {
panic(err)
}
Output is (try it on the Go Playground):
{
"blue": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"bridge": [
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"forest": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"ocean": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"river": [
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"water": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
]
}
Further memory optimizations:
We used the builtin append() function to add new image URLs to tags. append() may (and usually does) allocate bigger slices than needed (thinking of future growth). After our "build" process, we may go through our tagToUrlMap map and "trim" those slices to the minimum needed.
This is how it could be done:
for tagName, urls := range tagToUrlMap {
if cap(urls) > len(urls) {
urls2 := make([]string, len(urls))
copy(urls2, urls)
tagToUrlMap[tagName] = urls2
}
}

Will the [...] be garbage collected correctly?
Yes.
You never need to worry that something will be collected which is still in use and you can rely on everything being collected once it is no longer used.
So the question about GC is never "Will it be collected correctly?" but "Do I generate unnecessary garbage?". Now this actual question does not depend that much on the data structure than on the amount of neu objects created (on the heap). So this is a question about how the data structures are used and much less on the structure itself. Use benchmarks and run go test with -benchmem.
(High end performance might also consider how much work the GC has to do: Scanning pointers might take time. Forget that for now.)
The other relevant question is about memory consumption. Copying a string copies just three words while copying a *string copies one word. So there is not much to safe here by using *string.
So unfortunately there are no clear answers to the relevant questions (amount of garbage generated and total memory consumption). Don't overthink the problem, use what fits your purpose, measure and refactor.

Related

string vs integer as map key for memory utilization in golang?

I have a below read function which is called by multiple go routines to read s3 files and it populates two concurrent map as shown below.
During server startup, it calls read function below to populate two concurrent map.
And also periodically every 30 seconds, it calls read function again to read new s3 files and populate two concurrent map again with some new data.
So basically at a given state of time during the whole lifecycle of this app, both my concurrent map have some data and also periodically being updated too.
func (r *clientRepository) read(file string, bucket string) error {
var err error
//... read s3 file
for {
rows, err := pr.ReadByNumber(r.cfg.RowsToRead)
if err != nil {
return errs.Wrap(err)
}
if len(rows) <= 0 {
break
}
byteSlice, err := json.Marshal(rows)
if err != nil {
return errs.Wrap(err)
}
var productRows []ParquetData
err = json.Unmarshal(byteSlice, &productRows)
if err != nil {
return errs.Wrap(err)
}
for i := range productRows {
var flatProduct definitions.CustomerInfo
err = r.ConvertData(spn, &productRows[i], &flatProduct)
if err != nil {
return errs.Wrap(err)
}
// populate first concurrent map here
r.products.Set(strconv.FormatInt(flatProduct.ProductId, 10), &flatProduct)
for _, catalogId := range flatProduct.Catalogs {
strCatalogId := strconv.FormatInt(int64(catalogId), 10)
// upsert second concurrent map here
r.productCatalog.Upsert(strCatalogId, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
productID := newValue.(int64)
if valueInMap == nil {
return map[int64]struct{}{productID: {}}
}
oldIDs := valueInMap.(map[int64]struct{})
// value is irrelevant, no need to check if key exists
oldIDs[productID] = struct{}{}
return oldIDs
})
}
}
}
return nil
}
In above code flatProduct.ProductId and strCatalogId are integer but I am converting them into string bcoz concurrent map works with string only. And then I have below three functions which is used by my main application threads to get data from the concurrent map populated above.
func (r *clientRepository) GetProductMap() *cmap.ConcurrentMap {
return r.products
}
func (r *clientRepository) GetProductCatalogMap() *cmap.ConcurrentMap {
return r.productCatalog
}
func (r *clientRepository) GetProductData(pid string) *definitions.CustomerInfo {
pd, ok := r.products.Get(pid)
if ok {
return pd.(*definitions.CustomerInfo)
}
return nil
}
I have a use case where I need to populate map from multiple go routines and then read data from those maps from bunch of main application threads so it needs to be thread safe and it should be fast enough as well without much locking.
Problem Statement
I am dealing with lots of data like 30-40 GB worth of data from all these files which I am reading into memory. I am using concurrent map here which solves most of my concurrency issues but the key for the concurrent map is string and it doesn't have any implementation where key can be integer. In my case key is just a product id which can be int32 so is it worth it storing all those product id's as string in this concurrent map? I think string allocation takes more memory compare to storing all those keys as integer? At least it does in c/c++ so I am assuming it should be same case here in golang too.
Is there anything I can to improve here w.r.t map usage so that I can reduce memory utilization plus I don't lose performance as well while reading data from these maps from main threads?
I am using concurrent map from this repo which doesn't have implementation for key as integer.
Update
I am trying to use cmap_int in my code to try it out.
type clientRepo struct {
customers *cmap.ConcurrentMap
customersCatalog *cmap.ConcurrentMap
}
func NewClientRepository(logger log.Logger) (ClientRepository, error) {
// ....
customers := cmap.New[string]()
customersCatalog := cmap.New[string]()
r := &clientRepo{
customers: &customers,
customersCatalog: &customersCatalog,
}
// ....
return r, nil
}
But I am getting error as:
Cannot use '&products' (type *ConcurrentMap[V]) as the type *cmap.ConcurrentMap
What I need to change in my clientRepo struct so that it can work with new version of concurrent map which uses generics?
I don't know the implementation details of concurrent map in Go, but if it's using a string as a key I'm guessing that behind the scenes it's storing both the string and a hash of the string (which will be used for actual indexing operations).
That is going to be something of a memory hog, and there'll be nothing that can be done about that as concurrent map uses only strings for key.
If there were some sort of map that did use integers, it'd likely be using hashes of those integers anyway. A smooth hash distribution is a necessary feature for good and uniform lookup performance, in the event that key data itself is not uniformly distributed. It's almost like you need a very simple map implementation!
I'm wondering if a simple array would do, if your product ID's fit within 32bits (or can be munged to do so, or down to some other acceptable integer length). Yes, that way you'd have a large amount of memory allocated, possibly with large tracts unused. However, indexing is super-rapid, and the OS's virtual memory subsystem would ensure that areas of the array that you don't index aren't swapped in. Caveat - I'm thinking very much in terms of C and fixed-size objects here - less so Go - so this may be a bogus suggestion.
To persevere, so long as there's nothing about the array that implies initialisation-on-allocation (e.g. in C the array wouldn't get initialised by the compiler), allocation doesn't automatically mean it's all in memory, all at once, and only the most commonly used areas of the array will be in RAM courtesy of the OS's virtual memory subsystem.
EDIT
You could have a map of arrays, where each array covered a range of product Ids. This would be close to the same effect, trading off storage of hashes and strings against storage of null references. If product ids are clumped in some sort of structured way, this could work well.
Also, just a thought, and I'm showing a total lack of knowledge of Go here. Does Go store objects by reference? In which case wouldn't an array of objects actually be an array of references (so, fixed in size) and the actual objects allocated only as needed (ie a lot of the array is null references)? That doesn't sound good for my one big array suggestion...
The library you use is relatively simple and you may just replace all string into int32 (and modify the hashing function) and it will still work fine.
I ran a tiny (and not that rigorous) benchmark against the replaced version:
$ go test -bench=. -benchtime=10x -benchmem
goos: linux
goarch: amd64
pkg: maps
BenchmarkCMapAlloc-4 10 174272711 ns/op 49009948 B/op 33873 allocs/op
BenchmarkCMapAllocSS-4 10 369259624 ns/op 102535456 B/op 1082125 allocs/op
BenchmarkCMapUpdateAlloc-4 10 114794162 ns/op 0 B/op 0 allocs/op
BenchmarkCMapUpdateAllocSS-4 10 192165246 ns/op 16777216 B/op 1048576 allocs/op
BenchmarkCMap-4 10 1193068438 ns/op 5065 B/op 41 allocs/op
BenchmarkCMapSS-4 10 2195078437 ns/op 536874022 B/op 33554471 allocs/op
Benchmarks with a SS suffix is the original string version. So using integers as keys takes less memory and runs faster, as anyone would expect. The string version allocates about 50 bytes more each insertion. (This is not the actual memory usage though.)
Basically, a string in go is just a struct:
type stringStruct struct {
str unsafe.Pointer
len int
}
So on a 64-bit machine, it takes at least 8 bytes (pointer) + 8 bytes (length) + len(underlying bytes) bytes to store a string. Turning it into a int32 or int64 will definitely save memory. However, I assume that CustomerInfo and the catalog sets takes the most memory and I don't think there will be a great improvement.
(By the way, tuning the SHARD_COUNT in the library might also help a bit.)

Convert byte to string using reflect.StringHeader still allocates new memory?

I've got this small code snippet to test 2 ways of converting byte slice to string object, one function to allocate a new string object, another uses unsafe pointer arithmetic to construct string*, which doesn't allocate new memory:
package main
import (
"fmt"
"reflect"
"unsafe"
)
func byteToString(b []byte) string {
return string(b)
}
func byteToStringNoAlloc(b []byte) string {
if len(b) == 0 {
return ""
}
sh := reflect.StringHeader{uintptr(unsafe.Pointer(&b[0])), len(b)}
return *(*string)(unsafe.Pointer(&sh))
}
func main() {
b := []byte("hello")
fmt.Printf("1st element of slice: %v\n", &b[0])
str := byteToString(b)
sh := (*reflect.StringHeader)(unsafe.Pointer(&str))
fmt.Printf("New alloc: %v\n", sh)
toStr := byteToStringNoAlloc(b)
shNoAlloc := (*reflect.StringHeader)(unsafe.Pointer(&toStr))
fmt.Printf("No alloc: %v\n", shNoAlloc) // why different from &b[0]
}
I run this program under go 1.13:
1st element of slice: 0xc000076068
New alloc: &{824634204304 5}
No alloc: &{824634204264 5}
I exptect that the "1st element of slice" should print out the same address like "No alloc", but acturally they're very different. Where did I get wrong?
First of all, type conversions are calling a internal functions, for this case it's slicebytetostring.
https://golang.org/src/runtime/string.go?h=slicebytetostring#L75
It does copy of slice's content into new allocated memory.
In the second case you're creating a new header of the slice and cast it into string header the new unofficial holder of slice's content.
The problem of this is that garbage collector doesn't handle such kind of cases and resulting string header will be marked as a single structure which has no relations with the actual slice which holds the actual content, so, your resulting string would be valid only while the actual content holders are alive (don't count this string header itself).
So once garbage collector sweep the actual content, your string will still point to the same address but already freed memory, and you'll get the panic error or undefined behavior if you touch it.
By the way, there's no need to use reflect package and its headers because direct cast already creates new header as a result:
*(*string)(unsafe.Pointer(&byte_slice))

Swift 3 how to store a struct in a "Data" object

What is the "right" way to stuff an arbitrary, odd sized struct into a swift 3 Data object ?
I think that I have got there, but it seems horribly convoluted for what from prior experience was no than
dataObject.append(&structInstance, sizeof(structInstance))
My case is as follows:
The structure of interest:
public struct CutEntry {
var itemA : UInt64
var itemB : UInt32
}
I have an array of these things that I want to stuff into a data object, in a specific manner as the data object becomes a file which is eventually read by a different application on a different architecture.
The function to put them into a Data object
open func encodeCutsData() -> Data
{
var data = Data()
for entry in cutsArray
{
// bigendian stuff, as a var, just so the you can get the address
var entryCopy = CutEntry(itemA: entry.itemA.bigEndian, itemB: entry.itemB.bigEndian)
// step 1 get the address of the item as a UnsafePointer
let d2 = withUnsafePointer(to: &entryCopy) { return $0}
// step 2 cast it to a raw pointer
let d3 = UnsafeRawPointer(d2)
// step 3 create a temp data object
let d4 = Data(bytes:d3, count: MemoryLayout<CutEntry>.size )
// step 4 add the temp to main data object
data.append(d4)
}
return data
}
Earlier when we only had NSMutableData it was
let item = NSMutableData()
for entry in cutsArray
{
var entryCopy = CutEntry(cutPts: entry.cutPts.bigEndian, cutType: entry.cutType.bigEndian)
item.append(&entryCopy, length: MemoryLayout<CutEntry>.size)
}
I've spent a few hours searching for examples of manipulating struct and Data objects. I though that I was close when I found references to unsafebufferpointer. That blew up in my face when I discovered that "buffer" bit uses core memory alignment (which can be useful) and it was stuffing 16 bytes into the data object instead of the expected 12.
I am quite prepared to say that I have missed the blindingly obvious bit of RTFM somewhere. Can anyone offer a cleaner solution ? or has Swift really gone backwards here ?
If I could find a way of getting a pointer to the item as a UInt8 pointer that would remove a couple of lines, but that looks just a difficult.
With checking the reference of Data, I can find two things which may be useful for you:
init(bytes: UnsafeRawPointer, count: Int)
func append(Data)
You can write something like this:
var data = Data()
for entry in cutsArray {
var entryCopy = CutEntry(cutPts: entry.cutPts.bigEndian, cutType: entry.cutType.bigEndian)
data.append(Data(bytes: &entryCopy, count: MemoryLayout<CutEntry>.size))
}

Go: Thread-Safe Concurrency Issue with Sparse Array Read & Write

I'm writing a search engine in Go in which I have an inverted index of words to the corresponding results for each word. There is a set dictionary of words and so the words are already converted into a StemID, which is an integer starting from 0. This allows me to use a slice of pointers (i.e. a sparse array) to map each StemID to the structure which contains the results of that query. E.g. var StemID_to_Index []*resultStruct. If aardvark is 0 then the pointer to the resultStruct for aardvark is located at StemID_to_Index[0], which will be nil if the result for this word is currently not loaded.
There is not enough memory on the server to store all of this in memory, so the structure for each StemID will be saved as separate files and these can be loaded into the StemID_to_Index slice. If StemID_to_Index is currently nil for this StemID then the result is not cached and needs to be loaded, otherwise it's already loaded (cached) and so can be used directly. Each time a new result is loaded the memory usage is checked and if it's over the threshold then 2/3 of the loaded results are thrown away (StemID_to_Index is set to nil for these StemIDs and a garbage collection is forced.)
My problem is the concurrency. What is the fastest and most efficient way in which I can have multiple threads searching at the same time without having problems with different threads trying to read and write to the same place at the same time? I'm trying to avoid using mutexes on everything as that would slow down every single access attempt.
Do you think I would get away with loading the results from disk in the working thread and then delivering the pointer to this structure to an "updater" thread using channels, which then updates the nil value in the StemID_to_Index slice to the pointer of the loaded result? This would mean that two threads would never attempt to write at the same time, but what would happen if another thread tried to read from that exact index of StemID_to_Index while the "updater" thread was updating the pointer? It doesn't matter if a thread is given a nil pointer for a result which is currently being loaded, because it will just be loaded twice and while that is a waste of resources it would still deliver the same result and since that is unlikely to happen very often, it's forgiveable.
Additionally, how would the working thread which send the pointer to be updated to the "updater" thread know when the "updater" thread has finished updating the pointer in the slice? Should it just sleep and keep checking, or is there an easy way for the updater to send a message back to the specific thread which pushed to the channel?
UPDATE
I made a little test script to see what would happen if attempting to access a pointer at the same time as modifying it... it seems to always be OK. No errors. Am I missing something?
package main
import (
"fmt"
"sync"
)
type tester struct {
a uint
}
var things *tester
func updater() {
var a uint
for {
what := new(tester)
what.a = a
things = what
a++
}
}
func test() {
var t *tester
for {
t = things
if t != nil {
if t.a < 0 {
fmt.Println(`Error1`)
}
} else {
fmt.Println(`Error2`)
}
}
}
func main() {
var wg sync.WaitGroup
things = new(tester)
go test()
go test()
go test()
go test()
go test()
go test()
go updater()
go test()
go test()
go test()
go test()
go test()
wg.Add(1)
wg.Wait()
}
UPDATE 2
Taking this further, even if I read and write from multiple threads to the same variable at the same time... it makes no difference, still no errors:
From above:
func test() {
var a uint
var t *tester
for {
t = things
if t != nil {
if t.a < 0 {
fmt.Println(`Error1`)
}
} else {
fmt.Println(`Error2`)
}
what := new(tester)
what.a = a
things = what
a++
}
}
This implies I don't have to worry about concurrency at all... again: am I missing something here?
This sounds like a perfect use case for a memory mapped file:
package main
import (
"log"
"os"
"unsafe"
"github.com/edsrzf/mmap-go"
)
func main() {
// Open the backing file
f, err := os.OpenFile("example.txt", os.O_RDWR|os.O_CREATE, 0644)
if err != nil {
log.Fatalln(err)
}
defer f.Close()
// Set it's size
f.Truncate(1024)
// Memory map it
m, err := mmap.Map(f, mmap.RDWR, 0)
if err != nil {
log.Fatalln(err)
}
defer m.Unmap()
// m is a byte slice
copy(m, "Hello World")
m.Flush()
// here's how to use it with a pointer
type Coordinate struct{ X, Y int }
// first get the memory address as a *byte pointer and convert it to an unsafe
// pointer
ptr := unsafe.Pointer(&m[20])
// next convert it into a different pointer type
coord := (*Coordinate)(ptr)
// now you can use it directly
*coord = Coordinate{1, 2}
m.Flush()
// and vice-versa
log.Println(*(*Coordinate)(unsafe.Pointer(&m[20])))
}
The memory map can be larger than real memory and the operating system will handle all the messy details for you.
You will still need to make sure that separate goroutines never read/write to the same segment of memory at the same time.
My top answer would be to use elasticsearch with a client like elastigo.
If that's not an option, it would really help to know how much you care about race-y behavior. If you don't care, a write could happen right after a read finishes, the user finishing the read will get stale data. You can just have a queue of write and read operations and have multiple threads feed into that queue and one dispatcher issue the operations to the map one-at-a-time as they come it. In all other scenarios, you will need a mutex if there are multiple readers and writers. Maps aren't thread safe in go.
Honestly though, I would just add a mutex to make things simple for now and optimize by analyzing where your bottlenecks actually lie. It seems like you checking a threshold and then purging 2/3 of your cache is a bit arbitrary, and I wouldn't be surprised if you kill performance by doing something like that. Here's on situation where that would break down:
Requesters 1, 2, 3, and 4 are frequently accessing many of the same words on files A & B.
Requester 5, 6, 7 and 8 are frequently accessing many of the same words stored on files C & D.
Now when requests interleaved between these requesters and files happen in rapid succession, you may end up purging your 2/3 of your cache over and over again of results that may be requested shortly after. There are a couple other approaches:
Cache words that are frequently accessed at the same time on the same box and have multiple caching boxes.
Cache on a per-word basis with some sort of ranking of how popular that word is. If a new word is accessed from a file while the cache is full, see if other more popular words live in that file and purge less popular entries in the cache in hopes that those words will have a higher hit rate.
Both approaches 1 & 2.

Can a zero-length and zero-cap slice still point to an underlying array and prevent garbage collection?

Let's take the following scenario:
a := make([]int, 10000)
a = a[len(a):]
As we know from "Go Slices: Usage and Internals" there's a "possible gotcha" in downslicing. For any slice a if you do a[start:end] it still points to the original memory, so if you don't copy, a small downslice could potentially keep a very large array in memory for a long time.
However, this case is chosen to result in a slice that should not only have zero length, but zero capacity. A similar question could be asked for the construct a = a[0:0:0].
Does the current implementation still maintain a pointer to the underlying memory, preventing it from being garbage collected, or does it recognize that a slice with no len or cap could not possibly reference anything, and thus garbage collect the original backing array during the next GC pause (assuming no other references exist)?
Edit: Playing with reflect and unsafe on the Playground reveals that the pointer is non-zero:
func main() {
a := make([]int, 10000)
a = a[len(a):]
aHeader := *(*reflect.SliceHeader)((unsafe.Pointer(&a)))
fmt.Println(aHeader.Data)
a = make([]int, 0, 0)
aHeader = *(*reflect.SliceHeader)((unsafe.Pointer(&a)))
fmt.Println(aHeader.Data)
}
http://play.golang.org/p/L0tuzN4ULn
However, this doesn't necessarily answer the question because the second slice that NEVER had anything in it also has a non-zero pointer as the data field. Even so, the pointer could simply be uintptr(&a[len(a)-1]) + sizeof(int) which would be outside the block of backing memory and thus not trigger actual garbage collection, though this seems unlikely since that would prevent garbage collection of other things. The non-zero value could also conceivably just be Playground weirdness.
As seen in your example, re-slicing copies the slice header, including the data pointer to the new slice, so I put together a small test to try and force the runtime to reuse the memory if possible.
I'd like this to be more deterministic, but at least with go1.3 on x86_64, it shows that the memory used by the original array is eventually reused (it does not work in the playground in this form).
package main
import (
"fmt"
"unsafe"
)
func check(i uintptr) {
fmt.Printf("Value at %d: %d\n", i, *(*int64)(unsafe.Pointer(i)))
}
func garbage() string {
s := ""
for i := 0; i < 100000; i++ {
s += "x"
}
return s
}
func main() {
s := make([]int64, 100000)
s[0] = 42
p := uintptr(unsafe.Pointer(&s[0]))
check(p)
z := s[0:0:0]
s = nil
fmt.Println(z)
garbage()
check(p)
}

Resources