Go: Thread-Safe Concurrency Issue with Sparse Array Read & Write

I'm writing a search engine in Go in which I have an inverted index of words to the corresponding results for each word. There is a set dictionary of words and so the words are already converted into a StemID, which is an integer starting from 0. This allows me to use a slice of pointers (i.e. a sparse array) to map each StemID to the structure which contains the results of that query. E.g. var StemID_to_Index []*resultStruct. If aardvark is 0 then the pointer to the resultStruct for aardvark is located at StemID_to_Index[0], which will be nil if the result for this word is currently not loaded.
There is not enough memory on the server to hold everything at once, so the structure for each StemID is saved as a separate file and loaded into the StemID_to_Index slice on demand. If StemID_to_Index is nil for a StemID then the result is not cached and needs to be loaded; otherwise it is already loaded (cached) and can be used directly. Each time a new result is loaded, memory usage is checked, and if it is over a threshold then 2/3 of the loaded results are thrown away (StemID_to_Index is set back to nil for those StemIDs and a garbage collection is forced).
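For illustration, a rough sketch of the load-and-purge step I have in mind (loadResult, purgeTwoThirds and memoryThreshold are placeholder names, not real code) looks like this:

func getResult(stemID int) *resultStruct {
    if r := StemID_to_Index[stemID]; r != nil {
        return r // already cached
    }
    r := loadResult(stemID) // placeholder: reads the per-StemID file from disk
    StemID_to_Index[stemID] = r

    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    if m.Alloc > memoryThreshold { // placeholder threshold
        purgeTwoThirds() // sets 2/3 of the loaded entries back to nil
        runtime.GC()
    }
    return r
}

None of this is safe yet when called from multiple goroutines, which is exactly the problem below.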
My problem is the concurrency. What is the fastest and most efficient way in which I can have multiple threads searching at the same time without having problems with different threads trying to read and write to the same place at the same time? I'm trying to avoid using mutexes on everything as that would slow down every single access attempt.
Do you think I could get away with loading the results from disk in the working thread and then delivering the pointer to this structure to an "updater" thread using channels, which then updates the nil value in the StemID_to_Index slice to the pointer of the loaded result? This would mean that two threads would never attempt to write at the same time, but what would happen if another thread tried to read from that exact index of StemID_to_Index while the "updater" thread was updating the pointer? It doesn't matter if a thread is given a nil pointer for a result that is currently being loaded, because it will just be loaded twice; that wastes resources, but it still delivers the same result, and since it is unlikely to happen very often it is forgivable.
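For concreteness, the pattern I have in mind looks roughly like this (all names here are made up for illustration):

type update struct {
    stemID int
    result *resultStruct
}

var updates = make(chan update)

// the single goroutine that owns all writes to StemID_to_Index
func updater() {
    for u := range updates {
        StemID_to_Index[u.stemID] = u.result
    }
}

// worker: load the result from disk, then hand the pointer to the updater
// updates <- update{stemID: id, result: loadResult(id)}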
Additionally, how would the working thread which sent the pointer to the "updater" thread know when the "updater" thread has finished updating the pointer in the slice? Should it just sleep and keep checking, or is there an easy way for the updater to send a message back to the specific thread which pushed to the channel?
UPDATE
I made a little test script to see what would happen if attempting to access a pointer at the same time as modifying it... it seems to always be OK. No errors. Am I missing something?
package main

import (
    "fmt"
    "sync"
)

type tester struct {
    a uint
}

var things *tester

func updater() {
    var a uint
    for {
        what := new(tester)
        what.a = a
        things = what
        a++
    }
}

func test() {
    var t *tester
    for {
        t = things
        if t != nil {
            if t.a < 0 {
                fmt.Println(`Error1`)
            }
        } else {
            fmt.Println(`Error2`)
        }
    }
}

func main() {
    var wg sync.WaitGroup
    things = new(tester)
    go test()
    go test()
    go test()
    go test()
    go test()
    go test()
    go updater()
    go test()
    go test()
    go test()
    go test()
    go test()
    wg.Add(1)
    wg.Wait()
}
UPDATE 2
Taking this further, even if I read and write from multiple threads to the same variable at the same time... it makes no difference, still no errors:
From above:
func test() {
    var a uint
    var t *tester
    for {
        t = things
        if t != nil {
            if t.a < 0 {
                fmt.Println(`Error1`)
            }
        } else {
            fmt.Println(`Error2`)
        }
        what := new(tester)
        what.a = a
        things = what
        a++
    }
}
This implies I don't have to worry about concurrency at all... again: am I missing something here?
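One caveat when interpreting a test like this: data races do not necessarily produce visible errors or crashes, and Go only reports them when the race detector is explicitly enabled, for example:

go run -race main.go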

This sounds like a perfect use case for a memory mapped file:
package main

import (
    "log"
    "os"
    "unsafe"

    "github.com/edsrzf/mmap-go"
)

func main() {
    // Open the backing file
    f, err := os.OpenFile("example.txt", os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        log.Fatalln(err)
    }
    defer f.Close()

    // Set its size
    f.Truncate(1024)

    // Memory map it
    m, err := mmap.Map(f, mmap.RDWR, 0)
    if err != nil {
        log.Fatalln(err)
    }
    defer m.Unmap()

    // m is a byte slice
    copy(m, "Hello World")
    m.Flush()

    // here's how to use it with a pointer
    type Coordinate struct{ X, Y int }

    // first get the memory address as a *byte pointer and convert it to an
    // unsafe pointer
    ptr := unsafe.Pointer(&m[20])
    // next convert it into a different pointer type
    coord := (*Coordinate)(ptr)
    // now you can use it directly
    *coord = Coordinate{1, 2}
    m.Flush()
    // and vice-versa
    log.Println(*(*Coordinate)(unsafe.Pointer(&m[20])))
}
The memory map can be larger than real memory and the operating system will handle all the messy details for you.
You will still need to make sure that separate goroutines never read/write to the same segment of memory at the same time.
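A minimal sketch of one way to do that, assuming you carve the mapping into fixed-size segments and guard each with its own lock (segmentSize and the slicing scheme are my assumptions, not part of mmap-go):

const segmentSize = 4096 // assumed segment size

var segmentLocks = make([]sync.Mutex, 256) // one lock per segment, sized to the mapping

func withSegment(m mmap.MMap, i int, f func(seg []byte)) {
    segmentLocks[i].Lock()
    defer segmentLocks[i].Unlock()
    f(m[i*segmentSize : (i+1)*segmentSize])
}

Whether per-segment locks or a single mutex is better depends on how contended the mapping actually is.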

My top answer would be to use elasticsearch with a client like elastigo.
If that's not an option, it would really help to know how much you care about racy behaviour. If you don't care that a write could happen right after a read finishes (the user finishing the read just gets stale data), you can have a queue of write and read operations, have multiple threads feed into that queue, and have one dispatcher issue the operations to the map one at a time as they come in. In all other scenarios you will need a mutex if there are multiple readers and writers. Maps aren't thread safe in Go.
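A rough sketch of that dispatcher idea in Go (readReq, writeReq and the reply channel are illustrative names, not a real API):

type readReq struct {
    stemID int
    reply  chan *resultStruct
}

type writeReq struct {
    stemID int
    result *resultStruct
}

func dispatcher(reads <-chan readReq, writes <-chan writeReq) {
    for {
        select {
        case r := <-reads:
            r.reply <- StemID_to_Index[r.stemID]
        case w := <-writes:
            StemID_to_Index[w.stemID] = w.result
        }
    }
}

Only the dispatcher goroutine ever touches StemID_to_Index, so no mutex is needed, at the cost of funnelling every access through a single goroutine.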
Honestly though, I would just add a mutex to make things simple for now and optimize later by analyzing where your bottlenecks actually lie. Checking a threshold and then purging 2/3 of your cache seems a bit arbitrary, and I wouldn't be surprised if you kill performance by doing something like that. Here's one situation where that would break down:
Requesters 1, 2, 3, and 4 are frequently accessing many of the same words on files A & B.
Requesters 5, 6, 7 and 8 are frequently accessing many of the same words stored on files C & D.
Now when requests from these requesters for these files interleave in rapid succession, you may end up purging 2/3 of your cache over and over again, throwing away results that may be requested again shortly after. There are a couple of other approaches:
Cache words that are frequently accessed at the same time on the same box and have multiple caching boxes.
Cache on a per-word basis with some sort of ranking of how popular that word is. If a new word is accessed from a file while the cache is full, see if other more popular words live in that file and purge less popular entries in the cache in hopes that those words will have a higher hit rate.
Both approaches 1 & 2.

Related

Swift: How to interrupt a sort routine

I want to give users the ability to stop a sorting routine if it takes too long.
I tried to use DispatchWorkItem.cancel. However, this does not actually stop processes that have already started.
let myArray = [String]() // potentially 230M elements!
...
workItem = DispatchWorkItem {
    let result = myArray.sorted()
    DispatchQueue.main.async {
        print("done")
    }
}
// if process is too long, user clicks "Cancel" =>
workItem.cancel() // does not stop sort
How can I kill the workItem's thread?
I don't have access to the sorted routine, therefore I cannot insert tests to check if current thread is in "cancelled" status...
As you have deduced, one simply cannot kill a work item without periodically checking isCancelled and manually performing an early exit if it is set.
There are two options:
You can use sorted(by:), test isCancelled in the comparator, and throw an error if it is cancelled. That achieves the desired early exit.
That might look like:
func cancellableSort() -> DispatchWorkItem {
    var item: DispatchWorkItem!
    item = DispatchWorkItem() {
        let unsortedArray = (0..<10_000_000).shuffled()
        let sortedArray = try? unsortedArray.sorted { (obj1, obj2) -> Bool in
            if item.isCancelled {
                throw SortError.cancelled
            }
            return obj1 < obj2
        }
        // do something with your sorted array
        item = nil
    }
    DispatchQueue.global().async(execute: item)
    return item
}

Where

enum SortError: Error {
    case cancelled
}
Be forewarned that even in release builds, this can have a dramatic impact on performance. So you might want to benchmark this.
You could just write your own sort routine, inserting your own test of isCancelled inside the algorithm. This gives you more control over where precisely you perform the test (i.e., you might not want to do it for every comparison, but rather at some higher level loop within the algorithm, thereby minimizing the performance impact). And given the number of records, this gives you a chance to pick an algorithm best suited for your data set.
Obviously, when benchmarking these alternatives, make sure you test optimized/release builds so that your results are not skewed by your build settings.
As an aside, you might consider using Operation, too, as its handling of cancellation is more elegant, IMHO. Plus you can have a dedicated object for the sort operation, which is cleaner.

is a read or write operation on a pointer value atomic in golang? [duplicate]

Is assigning a pointer atomic in Go?
Do I need to assign a pointer in a lock? Suppose I just want to assign the pointer to nil, and would like other threads to be able to see it. I know in Java we can use volatile for this, but there is no volatile in Go.
The only things which are guaranteed to be atomic in Go are the operations in sync/atomic.
So if you want to be certain, you'll either need to take a lock, e.g. sync.Mutex, or use one of the atomic primitives. I don't recommend using the atomic primitives though, as you'll have to use them everywhere you use the pointer, and they are difficult to get right.
Using a mutex is perfectly good Go style - you could define functions to get and set the current pointer with locking very easily, e.g. something like:
import "sync"
var secretPointer *int
var pointerLock sync.Mutex
func CurrentPointer() *int {
pointerLock.Lock()
defer pointerLock.Unlock()
return secretPointer
}
func SetPointer(p *int) {
pointerLock.Lock()
secretPointer = p
pointerLock.Unlock()
}
These functions return a copy of the pointer to their clients which will stay constant even if the master pointer is changed. This may or may not be acceptable depending on how time critical your requirement is. It should be enough to avoid any undefined behaviour - the garbage collector will ensure that the pointers remain valid at all times even if the memory pointed to is no longer used by your program.
An alternative approach would be to only do the pointer access from one goroutine and use channels to command that goroutine into doing things. That would be considered more idiomatic Go, but may not suit your application exactly.
Update
Here is an example showing how to use atomic.StorePointer. It is rather ugly due to the use of unsafe.Pointer. However, unsafe.Pointer casts compile to nothing, so the runtime cost is small.
import (
    "fmt"
    "sync/atomic"
    "unsafe"
)

type Struct struct {
    p unsafe.Pointer // some pointer
}

func main() {
    data := 1
    info := Struct{p: unsafe.Pointer(&data)}
    fmt.Printf("info is %d\n", *(*int)(info.p))

    otherData := 2
    atomic.StorePointer(&info.p, unsafe.Pointer(&otherData))
    fmt.Printf("info is %d\n", *(*int)(info.p))
}
Since the spec doesn't specify, you should assume it is not. Even if it is currently atomic, it's possible that it could change without ever violating the spec.
In addition to Nick's answer, since Go 1.4 there is the atomic.Value type. Its Store(interface{}) and Load() interface{} methods take care of the unsafe.Pointer conversion.
Simple example:
package main

import (
    "sync/atomic"
)

type stats struct{}

type myType struct {
    stats atomic.Value
}

func main() {
    var t myType
    s := new(stats)
    t.stats.Store(s)
    s = t.stats.Load().(*stats)
}
Or a more extended example from the documentation on the Go playground.
Since Go 1.19, the atomic.Pointer type has been added to sync/atomic:
The sync/atomic package defines new atomic types Bool, Int32, Int64, Uint32, Uint64, Uintptr, and Pointer. These types hide the underlying values so that all accesses are forced to use the atomic APIs. Pointer also avoids the need to convert to unsafe.Pointer at call sites. Int64 and Uint64 are automatically aligned to 64-bit boundaries in structs and allocated data, even on 32-bit systems.
Sample
type ServerConn struct {
    Connection net.Conn
    ID         string
}

func ShowConnection(p *atomic.Pointer[ServerConn]) {
    for {
        time.Sleep(10 * time.Second)
        fmt.Println(p, p.Load())
    }
}

func main() {
    c := make(chan bool)
    p := atomic.Pointer[ServerConn]{}
    s := ServerConn{ID: "first_conn"}
    p.Store(&s)

    go ShowConnection(&p)
    go func() {
        for {
            time.Sleep(13 * time.Second)
            newConn := ServerConn{ID: "new_conn"}
            p.Swap(&newConn)
        }
    }()
    <-c
}
Please note that atomicity has nothing to do with "I just want to assign the pointer to nil, and would like other threads to be able to see it". The latter property is called visibility.
The answer to the former, as of right now, is yes: assigning (loading/storing) a pointer is atomic in Go. This follows from the updated Go memory model:
Otherwise, a read r of a memory location x that is not larger than a machine word must observe some write w such that r does not happen before w and there is no write w' such that w happens before w' and w' happens before r. That is, each read must observe a value written by a preceding or concurrent write.
Regarding visibility, the question does not have enough information to be answered concretely. If you merely want to know whether you can dereference the pointer safely, then a plain load/store would be enough. However, the most likely case is that you want to communicate some information based on the nullness of the pointer. That requires using sync/atomic, which provides synchronisation capabilities.
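For instance, with the atomic.Pointer type shown above (Go 1.19+), clearing the pointer in one goroutine and observing that from another is just a store and a load; this is only a sketch of the idea:

var conn atomic.Pointer[ServerConn]

// writer goroutine: signal "no connection" by storing nil
conn.Store(nil)

// reader goroutine: the nil store becomes visible to subsequent loads
if conn.Load() == nil {
    // handle the "no connection" case
}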

Does the following lock-free code exhibit a race-condition?

Within the Kubernetes Go repo on GitHub there is a lock-free implementation of a HighWaterMark data structure. This code relies on atomic operations to achieve thread-safe code that is free of data races.
// HighWaterMark is a thread-safe object for tracking the maximum value seen
// for some quantity.
type HighWaterMark int64

// Update returns true if and only if 'current' is the highest value ever seen.
func (hwm *HighWaterMark) Update(current int64) bool {
    for {
        old := atomic.LoadInt64((*int64)(hwm))
        if current <= old {
            return false
        }
        if atomic.CompareAndSwapInt64((*int64)(hwm), old, current) {
            return true
        }
    }
}
This code relies on the atomic.LoadInt64 and atomic.CompareAndSwapInt64 functions in the standard library to achieve data-race-free code... which I believe it does, but I think there is still a race condition.
If two competing threads (goroutines) are executing this code, there is an edge case where, after the atomic.LoadInt64 in the first goroutine, the second goroutine could have swapped in a higher value. The first goroutine still thinks its current int64 is larger than the old value it loaded, so a swap will happen. That swap would then effectively lower the stored value, because a stale old value was observed.
If another thread got in the middle, the CompareAndSwap would fail and the loop would start over.
Think of CompareAndSwap as

if actual == expected {
    actual = newval
}

but done atomically.
So this code says:

old = hwm                            // but done in a thread-safe, atomic read way
if old < current {
    set hwm to current if hwm == old // atomically compare and then set the value
}
when CAS (CompareAndSwap) fails, it returns false, causing the loop to start over until either:
a) "current" is not bigger than hwm
or
b) It successfully performs Load and then CompareAndSwap without another thread in the middle
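As a quick illustration (not from the Kubernetes repo) of why the retry loop is sufficient, concurrent updaters can only ever push the value upward, so the final value is deterministic:

var hwm HighWaterMark

var wg sync.WaitGroup
for i := int64(1); i <= 100; i++ {
    wg.Add(1)
    go func(v int64) {
        defer wg.Done()
        hwm.Update(v) // true only when v is a new maximum
    }(i)
}
wg.Wait()
fmt.Println(atomic.LoadInt64((*int64)(&hwm))) // always prints 100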

lock-free bounded MPMC ringbuffer failure

I've been banging my head against my attempt at a lock-free multiple-producer multiple-consumer ring buffer. The basis of the idea is to use the innate overflow of the unsigned char and unsigned short types, fix the element buffer size to the range of whichever type is used, and then you get a free wrap back to the beginning of the ring buffer.
The problem is - my solution doesn't work for multiple producers (it does though work for N consumers, and also single producer single consumer).
#include <atomic>

template<typename Element, typename Index = unsigned char> struct RingBuffer
{
    std::atomic<Index> readIndex;
    std::atomic<Index> writeIndex;
    std::atomic<Index> scratchIndex;
    Element elements[1 << (sizeof(Index) * 8)];

    RingBuffer() :
        readIndex(0),
        writeIndex(0),
        scratchIndex(0)
    {
        ;
    }

    bool push(const Element & element)
    {
        while(true)
        {
            const Index currentReadIndex = readIndex.load();
            Index currentWriteIndex = writeIndex.load();
            const Index nextWriteIndex = currentWriteIndex + 1;
            if(nextWriteIndex == currentReadIndex)
            {
                return false;
            }
            if(scratchIndex.compare_exchange_strong(
                currentWriteIndex, nextWriteIndex))
            {
                elements[currentWriteIndex] = element;
                writeIndex = nextWriteIndex;
                return true;
            }
        }
    }

    bool pop(Element & element)
    {
        Index currentReadIndex = readIndex.load();
        while(true)
        {
            const Index currentWriteIndex = writeIndex.load();
            const Index nextReadIndex = currentReadIndex + 1;
            if(currentReadIndex == currentWriteIndex)
            {
                return false;
            }
            element = elements[currentReadIndex];
            if(readIndex.compare_exchange_strong(
                currentReadIndex, nextReadIndex))
            {
                return true;
            }
        }
    }
};
The main idea for writing was to use a temporary index 'scratchIndex' that acts as a pseudo-lock, allowing only one producer at any one time to copy-construct into the elements buffer before updating the writeIndex and letting any other producer make progress. Before I am called a heathen for implying my approach is 'lock-free', I realise that this approach isn't exactly lock-free, but in practice (if it worked!) it would be significantly faster than a normal mutex!
I am aware of a (more complex) MPMC ringbuffer solution here http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue, but I am really experimenting with my idea to then compare against that approach and find out where each excels (or indeed whether my approach just flat out fails!).
Things I have tried:
Using compare_exchange_weak
Using more precise std::memory_order's that match the behaviour I want
Adding cacheline pads between the various indices I have
Making elements std::atomic instead of just Element array
I am sure that this boils down to a fundamental segfault in my head as to how to use atomic accesses to get around using mutexes, and I would be entirely grateful to whoever can point out which neurons are drastically misfiring in my head! :)
This is a form of the A-B-A problem. A successful producer looks something like this:
1. load currentReadIndex
2. load currentWriteIndex
3. cmpxchg store scratchIndex = nextWriteIndex
4. store element
5. store writeIndex = nextWriteIndex
If a producer stalls for some reason between steps 2 and 3 for long enough, it is possible for the other producers to produce an entire queue's worth of data and wrap back around to the exact same index so that the compare-exchange in step 3 succeeds (because scratchIndex happens to be equal to currentWriteIndex again).
By itself, that isn't a problem. The stalled producer is perfectly within its rights to increment scratchIndex to lock the queue—even if a magical ABA-detecting cmpxchg rejected the store, the producer would simply try again, reload exactly the same currentWriteIndex, and proceed normally.
The actual problem is the nextWriteIndex == currentReadIndex check between steps 2 and 3. The queue is logically empty if currentReadIndex == currentWriteIndex, so this check exists to make sure that no producer gets so far ahead that it overwrites elements that no consumer has popped yet. It appears to be safe to do this check once at the top, because all the consumers should be "trapped" between the observed currentReadIndex and the observed currentWriteIndex.
Except that another producer can come along and bump up the writeIndex, which frees the consumer from its trap. If a producer stalls between steps 2 and 3, when it wakes up the stored value of readIndex could be absolutely anything.
Here's an example, starting with an empty queue, that shows the problem happening:
Producer A runs steps 1 and 2. Both loaded indices are 0. The queue is empty.
Producer B interrupts and produces an element.
Consumer pops an element. Both indices are 1.
Producer B produces 255 more elements. The write index wraps around to 0, the read index is still 1.
Producer A awakens from its slumber. It had previously loaded both read and write indices as 0 (empty queue!), so it attempts step 3. Because the other producer coincidentally paused on index 0, the compare-exchange succeeds, and the store progresses. At completion the producer sets writeIndex = 1, and now both stored indices are 1, and the queue is logically empty. A full queue's worth of elements will now be completely ignored.
(I should mention that the only reason I can get away with talking about "stalling" and "waking up" is that all the atomics used are sequentially consistent, so I can pretend that we're in a single-threaded environment.)
Note that the way that you are using scratchIndex to guard concurrent writes is essentially a lock; whoever successfully completes the cmpxchg gets total write access to the queue until it releases the lock. The simplest way to fix this failure is to just replace scratchIndex with a spinlock—it won't suffer from A-B-A and it's what's actually happening.
bool push(const Element & element)
{
    while(true)
    {
        const Index currentReadIndex = readIndex.load();
        Index currentWriteIndex = writeIndex.load();
        const Index nextWriteIndex = currentWriteIndex + 1;
        if(nextWriteIndex == currentReadIndex)
        {
            return false;
        }
        if(scratchIndex.compare_exchange_strong(
            currentWriteIndex, nextWriteIndex))
        {
            elements[currentWriteIndex] = element;
            // Problem here!
            writeIndex = nextWriteIndex;
            return true;
        }
    }
}
I've marked the problematic spot. Multiple threads can get to the writeIndex = nextWriteIndex at the same time. The data will be written in any order, although each write will be atomic.
This is a problem because you're trying to update two values using the same atomic condition, which is generally not possible. Assuming the rest of your method is fine, one way around this would be to combine both scratchIndex and writeIndex into a single value of double-size. For example, treating two uint32_t values as a single uint64_t value and operating atomically on that.

Design pattern for asynchronous while loop

I have a function that boils down to:
while(doWork)
{
    config = generateConfigurationForTesting();
    result = executeWork(config);
    doWork = isDone(result);
}
How can I rewrite this for efficient asynchronous execution, assuming all functions are thread safe, independent of previous iterations, and probably require more iterations than the maximum number of allowable threads?
The problem here is we don't know how many iterations are required in advance so we can't make a dispatch_group or use dispatch_apply.
This is my first attempt, but it looks a bit ugly to me because of the arbitrarily chosen values and the sleeping:
int thread_count = 0;
bool doWork = true;
int max_threads = 20; // arbitrarily chosen number
dispatch_queue_t queue =
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

while(doWork)
{
    if(thread_count < max_threads)
    {
        dispatch_async(queue, ^{
            Config myconfig = generateConfigurationForTesting();
            Result myresult = executeWork();
            dispatch_async(queue, checkResult(myresult));
        });
        thread_count++;
    }
    else
        usleep(100); // don't consume too much CPU
}

void checkResult(Result value)
{
    if(value == good) doWork = false;
    thread_count--;
}
Based on your description, it looks like generateConfigurationForTesting is some kind of randomization technique or otherwise a generator which can produce a near-infinite number of configurations (hence your comment that you don't know ahead of time how many iterations you will need). With that as an assumption, you are basically stuck with the model that you've created, since your executor needs to be limited by some reasonable assumptions about the queue and you don't want to over-generate, as that would just extend the length of the run after you have succeeded in finding value == good measurements.
I would suggest you consider using a queue (or OSAtomicIncrement* and OSAtomicDecrement*) to protect access to thread_count and doWork. As it stands, the thread_count increment and decrement will happen in two different queues (main_queue for the main thread and the default queue for the background task) and thus could simultaneously increment and decrement the thread count. This could lead to an undercount (which would cause more threads to be created than you expect) or an overcount (which would cause you to never complete your task).
Another option to make this look a little nicer would be to have checkResult add new elements into the queue if value != good. This way, you load up the initial elements of the queue using dispatch_apply(20, queue, ^{ ... }) and you don't need the thread_count at all. The first 20 will be added using dispatch_apply (or whatever amount dispatch_apply feels is appropriate for your configuration), and then each time checkResult is called you can either set doWork = false or add another operation to the queue.
dispatch_apply() works for this, just pass ncpu as the number of iterations (apply never uses more than ncpu worker threads) and keep each instance of your worker block running for as long as there is more work to do (i.e. loop back to generateConfigurationForTesting() unless !doWork).
