Race condition with bufio.NewWriter but not with io.MultiWriter - multithreading

I am looking at goroutine safety and I have the below code to illustrate. I get a race condition detected when I run the below code which is understandable.
var buffer1 bytes.Buffer
// two writers to same buffer
writer1 := bufio.NewWriter(&buffer1)
writer2 := bufio.NewWriter(&buffer1)
c := func(dst io.Writer, src io.Reader) {
io.Copy(dst, src)
}
go c(writer1, os.Stdin)
go c(writer2, os.Stderr)
When I replace bufio.NewWriter with io.MultiWriter, as shown below, I saw a race condition at runtime only once. However, the race detector does not report any data race for this snippet.
var buffer1, buffer3 bytes.Buffer
// two multiwriters, both writing to buffer3 and buffer1
writer1 := io.MultiWriter(&buffer3, &buffer1)
writer2 := io.MultiWriter(&buffer3, &buffer1)
c := func(dst io.Writer, src io.Reader) {
io.Copy(dst, src)
}
go c(writer1, os.Stdin)
go c(writer2, os.Stderr)
In my opinion, there is a visible race condition in the second case with io.MultiWriter. Is it safe, and why does the race occur so seldom?
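Both snippets let two goroutines call Write on the same bytes.Buffer, which is not safe for concurrent use, and the race detector only reports races that actually occur during a given run. One way to make the shared destination safe is to serialize the writes. The following is a minimal sketch, assuming a hypothetical lockedWriter wrapper and a copyAll helper that are not part of the question or the standard library:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

// lockedWriter is a hypothetical helper that serializes Write calls to w.
type lockedWriter struct {
	mu sync.Mutex
	w  io.Writer
}

func (l *lockedWriter) Write(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Write(p)
}

// copyAll copies every input into one shared buffer concurrently and
// returns the number of bytes that ended up in the buffer.
func copyAll(inputs ...string) int {
	var buf bytes.Buffer
	dst := &lockedWriter{w: &buf}
	var wg sync.WaitGroup
	for _, in := range inputs {
		wg.Add(1)
		go func(src io.Reader) {
			defer wg.Done()
			io.Copy(dst, src) // error ignored for brevity
		}(strings.NewReader(in))
	}
	wg.Wait()
	return buf.Len()
}

func main() {
	fmt.Println(copyAll("from stdin\n", "from stderr\n"))
}
```

The lock only guarantees that individual Write calls do not overlap; the chunks from the two sources can still interleave in any order.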

Related

Peterson's algorithm and deadlock

I am experimenting with some mutual exclusion algorithms. I have implemented Peterson's algorithm. It prints the correct counter value, but sometimes it seems as if some kind of deadlock has occurred, which stalls the execution indefinitely. This should not be possible since this algorithm is deadlock free.
PS: Is this related to problems with compiler optimizations often mentioned when addressing the danger of "benign" data races? If this is the case then how to disable such optimizations?
PPS: When atomically storing/loading the victim field, the problem seems to disappear which makes the compiler's optimizations more suspicious
package main
import (
"fmt"
"sync"
)
type mutex struct {
flag [2]bool
victim int
}
func (m *mutex) lock(id int) {
m.flag[id] = true // I'm interested
m.victim = id // you can go before me if you want
for m.flag[1-id] && m.victim == id {
// while the other thread is inside the CS
// and the victim was me (I expressed my interest after the other one already did)
}
}
func (m *mutex) unlock(id int) {
m.flag[id] = false // I'm not interested anymore
}
func main() {
var wg sync.WaitGroup
var mu mutex
var cpt, n = 0, 100000
for i := 0; i < 2; i++ {
wg.Add(1)
go func(id int) {
defer wg.Done()
for j := 0; j < n; j++ {
mu.lock(id)
cpt = cpt + 1
mu.unlock(id)
}
}(i)
}
wg.Wait()
fmt.Println(cpt)
}
There is no "benign" data race. Your program has a data race, and the behavior is undefined.
At the core of the problem is the mutex implementation. Modifications made to a shared object from one goroutine are not necessarily observable from others until those goroutines communicate using one of the synchronization primitives. You are writing to mutex.victim from multiple goroutines, and those writes are not guaranteed to be observed by the other goroutine. You are also reading the mutex.flag elements written by other goroutines, and those writes are not necessarily seen either. That is, there may be cases where the for-loop won't terminate even if the other goroutine changes the variables.
And since the mutex implementation is broken, the updates to cpt will not necessarily be correct either.
To implement this correctly, you need the sync/atomic package.
See the Go Memory Model: https://go.dev/ref/mem
For Peterson's algorithm (the same goes for Dekker's), you need to ensure that your code is sequentially consistent. In Go you can do that using atomics. This prevents the compiler and the hardware from reordering the loads and stores.
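As an illustration of the advice above, here is a sketch of the same lock rewritten with sync/atomic. It assumes Go 1.19 or later (for atomic.Bool and atomic.Int32); the type name peterson and the helper count are made up for this example, and the runtime.Gosched call is an optional courtesy to the scheduler rather than part of the algorithm:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// peterson is a two-goroutine lock built on sequentially consistent atomics.
type peterson struct {
	flag   [2]atomic.Bool
	victim atomic.Int32
}

func (m *peterson) lock(id int) {
	m.flag[id].Store(true)    // I'm interested
	m.victim.Store(int32(id)) // you can go before me if you want
	for m.flag[1-id].Load() && m.victim.Load() == int32(id) {
		runtime.Gosched() // yield while the other goroutine is in the CS
	}
}

func (m *peterson) unlock(id int) {
	m.flag[id].Store(false) // I'm not interested anymore
}

// count increments a plain int n times from each of two goroutines.
func count(n int) int {
	var wg sync.WaitGroup
	var mu peterson
	cpt := 0
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < n; j++ {
				mu.lock(id)
				cpt++ // protected by the lock
				mu.unlock(id)
			}
		}(i)
	}
	wg.Wait()
	return cpt
}

func main() {
	fmt.Println(count(100000))
}
```

Every access to flag and victim now goes through an atomic operation, so the spin loop is guaranteed to observe the other goroutine's stores. In real code a sync.Mutex is the simpler choice.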

Synchronize write to file from heavy operations in different threads

I need to process a file (potentially a big file) one block at a time and write the result to a new file.
To put it simply, I have a basic function that processes a block:
func elaborateBlock(block []byte) []byte { ... }
Every block needs to be processed and then written to the output file sequentially (preserving the original order).
The single-threaded implementation is trivial:
for {
buffer := make([]byte, BlockSize)
_, err := inputFile.Read(buffer)
if err == io.EOF {
break
}
processedData := elaborateBlock(buffer)
outputFile.Write(processedData)
}
But the processing can be heavy and every block can be processed independently, so a multi-threaded implementation is the natural evolution.
The solution I came up with is to create a slice of channels, compute every block in a different goroutine, and synchronize the final write by looping over the slice of channels:
Utility function:
func blockThread(channel chan []byte, block []byte) {
channel <- elaborateBlock(block)
}
In the main program:
chans := []chan []byte{}
for {
buffer := make([]byte, BlockSize)
_, err := inputFile.Read(buffer)
if err == io.EOF {
break
}
channel := make(chan []byte)
chans = append(chans, channel)
go blockThread(channel, buffer)
}
for i := range chans {
data := <- chans[i]
outputFile.Write(data)
}
This approach works but can be problematic with large files, because it requires loading the whole file into memory before starting to write the output.
Is there a better solution, ideally with better overall performance?
If blocks do need to be written out in order
If you want to work on multiple blocks concurrently, obviously you need to hold multiple blocks in memory at the same time.
You may decide how many blocks you want to process concurrently, and it's enough to read that many into memory at the same time. E.g. you may decide to process 5 blocks concurrently. This limits memory usage while still potentially utilizing your CPU to the max. It is recommended to pick a number based on your available CPU cores (if processing a block does not already use multiple cores). This can be queried using runtime.GOMAXPROCS(0).
You should have a single goroutine that reads the input file sequentially and produces the blocks wrapped in Jobs (which also contain the block index).
You should have multiple worker goroutines, preferably as many as you have cores (but experiment with smaller and higher values too). Each worker goroutine just receives jobs, calls elaborateBlock() on the data, and delivers the result on the results channel.
There should be a single, designated consumer which receives completed jobs and writes them in order to the output file. Since goroutines run concurrently and we have no control over the order in which blocks are completed, the consumer should keep track of the index of the next block to be written to the output. Blocks arriving out of order should just be stored; writing proceeds only once the block with the next expected index has arrived.
This is an (incomplete) example of how to do all this:
const BlockSize = 1 << 20 // 1 MB
func elaborateBlock(in []byte) []byte { return in }
type Job struct {
Index int
Block []byte
}
func producer(jobsCh chan<- *Job) {
// Init input file:
var inputFile *os.File
for index := 0; ; index++ {
job := &Job{
Index: index,
Block: make([]byte, BlockSize),
}
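n, err := inputFile.Read(job.Block)
if err != nil {
break // io.EOF or a real read error
}
job.Block = job.Block[:n] // the last block may be shorter than BlockSize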
jobsCh <- job
}
}
func worker(jobsCh <-chan *Job, resultCh chan<- *Job) {
for job := range jobsCh {
job.Block = elaborateBlock(job.Block)
resultCh <- job
}
}
func consumer(resultCh <-chan *Job) {
// Init output file:
var outputFile *os.File
nextIdx := 0
jobMap := map[int]*Job{}
for job := range resultCh {
jobMap[job.Index] = job
// Write out all blocks we have in contiguous index range:
for {
j := jobMap[nextIdx]
if j == nil {
break
}
if _, err := outputFile.Write(j.Block); err != nil {
// handle error, maybe terminate?
}
delete(jobMap, nextIdx) // This job is written out
nextIdx++
}
}
}
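func main() {
jobsCh := make(chan *Job)
resultCh := make(chan *Job)
// Start the workers and keep track of when they have all finished:
workersWg := sync.WaitGroup{}
for i := 0; i < 5; i++ {
workersWg.Add(1)
go func() {
defer workersWg.Done()
worker(jobsCh, resultCh)
}()
}
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
defer wg.Done()
consumer(resultCh)
}()
// Start producing jobs:
producer(jobsCh)
// No more jobs:
close(jobsCh)
// Once all workers are done, no more results will arrive:
workersWg.Wait()
close(resultCh)
// Wait for consumer to complete:
wg.Wait()
}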
One thing to note here: this alone won't guarantee limiting the used memory. Imagine a case where the first block would require an enormous time to calculate, while subsequent blocks do not. What would happen? The first block would occupy a worker, and the other workers would "quickly" complete the subsequent blocks. The consumer would store all in memory, waiting for the first block to complete (as that has to be written out first). This could increase memory usage.
How could we avoid this?
By introducing a job pool. New jobs would not be created arbitrarily, but taken from a pool. If the pool is empty, the producer has to wait. So when the producer needs a new Job, it takes one from the pool. When the consumer has written out a Job, it puts it back into the pool. Simple as that. This would also reduce pressure on the garbage collector, as jobs (and large []byte buffers) are not created and thrown away but re-used.
For a simple Job pool implementation you could use a buffered channel. For details, see How to implement Memory Pooling in Golang.
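The job pool idea can be sketched with a buffered channel as follows. This is only an illustration under assumptions: the input and output are generic io.Reader and io.Writer values instead of files, blockSize is tiny, elaborateBlock is a stand-in that upper-cases its input, and the names process, run and poolSize are invented for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

const blockSize = 4 // tiny, for demonstration only

type job struct {
	index int
	buf   []byte // reusable backing buffer of blockSize bytes
	data  []byte // the block read into buf, later the processed result
}

// elaborateBlock stands in for the real, expensive transformation.
func elaborateBlock(in []byte) []byte { return bytes.ToUpper(in) }

// process copies r to w block by block, keeping at most poolSize jobs alive.
func process(r io.Reader, w io.Writer, workers, poolSize int) error {
	pool := make(chan *job, poolSize) // the buffered channel is the pool
	for i := 0; i < poolSize; i++ {
		pool <- &job{buf: make([]byte, blockSize)}
	}
	jobsCh, resultCh := make(chan *job), make(chan *job)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobsCh {
				j.data = elaborateBlock(j.data)
				resultCh <- j
			}
		}()
	}
	go func() { wg.Wait(); close(resultCh) }()

	readErr := make(chan error, 1)
	go func() { // producer
		defer close(jobsCh)
		for index := 0; ; index++ {
			j := <-pool // blocks while every job is in flight
			n, err := io.ReadFull(r, j.buf)
			if n > 0 {
				j.index, j.data = index, j.buf[:n]
				jobsCh <- j
			}
			if err != nil {
				if err == io.EOF || err == io.ErrUnexpectedEOF {
					err = nil // normal end of input
				}
				readErr <- err
				return
			}
		}
	}()

	pending, next := map[int]*job{}, 0
	var writeErr error
	for j := range resultCh { // consumer
		pending[j.index] = j
		for p := pending[next]; p != nil; p = pending[next] {
			if writeErr == nil {
				_, writeErr = w.Write(p.data)
			}
			delete(pending, next)
			next++
			pool <- p // hand the job back for reuse
		}
	}
	if writeErr != nil {
		return writeErr
	}
	return <-readErr
}

// run is a small convenience wrapper for trying process on a string.
func run(input string, workers, poolSize int) (string, error) {
	var out bytes.Buffer
	err := process(strings.NewReader(input), &out, workers, poolSize)
	return out.String(), err
}

func main() {
	out, err := run("hello, ordered world", 3, 2)
	fmt.Println(out, err)
}
```

Because the producer can only obtain a job from the pool, at most poolSize blocks exist at any time, no matter how long the block with the next expected index takes.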
If blocks can be written in any order
Another option could be to allocate the output file in advance. If the size of the output blocks is also deterministic, you can do so (e.g. outsize := (insize / blocksize) * outblockSize).
To what end?
If you have the output file pre-allocated, the consumer does not need to wait for the input blocks in order. Once an input block is calculated, you can calculate the position where it will go in the output, seek to that position and just write it. For this you may use File.Seek() (or File.WriteAt(), which takes the offset directly).
This solution still requires sending the block index from the producer to the consumer, but the consumer won't need to store blocks arriving out of order until the subsequent one arrives, so the consumer can be simpler.
Note that this solution naturally does not pose a memory threat, as completed jobs are never accumulated / cached, they are written out in the order of completion.
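A rough sketch of this out-of-order variant, using WriteAt rather than Seek plus Write so that no shared file offset is involved. It assumes that output blocks have the same size as input blocks, uses a temporary file as the output, and lets the workers write directly; processAt and run are hypothetical names for this example only.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"sync"
)

const blockSize = 4 // tiny, for demonstration only

// elaborateBlock is a stand-in that keeps the block length unchanged.
func elaborateBlock(in []byte) []byte { return bytes.ToUpper(in) }

type job struct {
	index int
	block []byte
}

// processAt writes every processed block at offset index*blockSize.
func processAt(input []byte, out io.WriterAt, workers int) error {
	jobs := make(chan job)
	errs := make(chan error, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var firstErr error
			for j := range jobs { // keep draining even after an error
				if firstErr == nil {
					_, firstErr = out.WriteAt(elaborateBlock(j.block), int64(j.index)*blockSize)
				}
			}
			errs <- firstErr
		}()
	}
	for i := 0; i*blockSize < len(input); i++ {
		end := (i + 1) * blockSize
		if end > len(input) {
			end = len(input)
		}
		jobs <- job{index: i, block: input[i*blockSize : end]}
	}
	close(jobs)
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// run processes input through a temporary file and returns the file content.
func run(input string) (string, error) {
	f, err := os.CreateTemp("", "blocks-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if err := processAt([]byte(input), f, 3); err != nil {
		return "", err
	}
	out, err := os.ReadFile(f.Name())
	return string(out), err
}

func main() {
	fmt.Println(run("hello world"))
}
```

Each block goes straight to its final position, so nothing has to be cached while waiting for earlier blocks.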
See related questions for more details and techniques:
Is this an idiomatic worker thread pool in Go?
How to collect values from N goroutines executed in a specific order?
Here is a working example that stays as close as possible to your original code.
The idea is to turn your array into a channel of channels of bytes. Then first fire up a consumer that reads from this channel of channels, gets the channel of bytes, reads from it and writes the result.
Back on the main thread you create a channel of bytes, send it on the channel of channels (so the consumer, reading sequentially from them, reads the results in order) and then fire up the goroutine that does the work and writes to the allocated channel (the producers).
What happens now is that there is a "race" between the producers and the consumer: as soon as a produced block is read by the consumer and written, the resources associated with it can be deallocated. This could be an improvement to your original design.
Here is the code:
package main
import (
"bytes"
"fmt"
"io"
"sync"
)
func elaborateBlock(b []byte) []byte {
return []byte("werkwerkwerk")
}
func blockThread(channel chan []byte, block []byte, wg *sync.WaitGroup) {
channel <- elaborateBlock(block)
wg.Done()
}
func main() {
chans := make(chan chan []byte)
BlockSize := 3
inputBytes := bytes.NewBuffer([]byte("transmutemetowerkwerkwerk"))
producewg := sync.WaitGroup{}
consumewg := sync.WaitGroup{}
consumewg.Add(1)
go func() {
chancount := 0
for ch := range chans {
data := <-ch
fmt.Printf("got %d block, result:%s\n", chancount, data)
chancount++
}
fmt.Printf("done receiving\n")
consumewg.Done()
}()
for {
buffer := make([]byte, BlockSize)
_, err := inputBytes.Read(buffer)
if err == io.EOF {
go func() {
//wait for all the producers to finish
producewg.Wait()
//then close the main channel to notify the consumer
close(chans)
}()
break
}
channel := make(chan []byte)
chans <- channel //give the channel that we return the result to the receiver
producewg.Add(1)
go blockThread(channel, buffer, &producewg)
}
consumewg.Wait()
fmt.Printf("main exiting")
}
As a minor point, I don't think the "read the whole file into memory" statement is quite right, because you are just reading one block at a time from the Reader; maybe "holding the result of the whole computation in memory" is more appropriate?

Why do you need condition variables when you have mutexes?

I've taken a look at a Wikipedia pseudo code showing the consumer-producer problem solution using semaphores and mutexes:
mutex buffer_mutex;
semaphore fillCount = 0;
semaphore emptyCount = BUFFER_SIZE;
procedure producer() {
while (true) {
item = produceItem();
down(emptyCount);
down(buffer_mutex);
putItemIntoBuffer(item);
up(buffer_mutex);
up(fillCount);
}
}
procedure consumer() {
while (true) {
down(fillCount);
down(buffer_mutex);
item = removeItemFromBuffer();
up(buffer_mutex);
up(emptyCount);
consumeItem(item);
}
}
This solution seems to me like it would work pretty well, but how exactly would condition variables help us here? From what I understand, a CV blocks the calling thread until a certain condition is met, but the 'down' operation or a 'lock' operation also blocks the calling thread if the value is 0. So is it all about integers vs. conditions, or is there more to it?
Thanks.
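For comparison, the same bounded buffer can be sketched with one mutex and two condition variables. This is an illustrative Go sketch, not taken from the Wikipedia article; boundedBuffer and sumThrough are made-up names. The point it tries to show is that the waiting condition is an arbitrary predicate over state protected by the mutex (here the length of a slice), rather than a counter maintained by the semaphore itself.

```go
package main

import (
	"fmt"
	"sync"
)

// boundedBuffer blocks producers when full and consumers when empty.
type boundedBuffer struct {
	mu       sync.Mutex
	notFull  *sync.Cond
	notEmpty *sync.Cond
	items    []int
	capacity int
}

func newBoundedBuffer(capacity int) *boundedBuffer {
	b := &boundedBuffer{capacity: capacity}
	b.notFull = sync.NewCond(&b.mu)
	b.notEmpty = sync.NewCond(&b.mu)
	return b
}

func (b *boundedBuffer) put(v int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.items) == b.capacity { // re-check the predicate after every wake-up
		b.notFull.Wait() // releases mu while waiting, re-acquires it afterwards
	}
	b.items = append(b.items, v)
	b.notEmpty.Signal()
}

func (b *boundedBuffer) get() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.items) == 0 {
		b.notEmpty.Wait()
	}
	v := b.items[0]
	b.items = b.items[1:]
	b.notFull.Signal()
	return v
}

// sumThrough sends 1..n through a buffer of the given capacity and sums them.
func sumThrough(n, capacity int) int {
	b := newBoundedBuffer(capacity)
	go func() {
		for i := 1; i <= n; i++ {
			b.put(i)
		}
	}()
	sum := 0
	for i := 0; i < n; i++ {
		sum += b.get()
	}
	return sum
}

func main() {
	fmt.Println(sumThrough(100, 3))
}
```

Wait atomically releases the mutex and suspends the caller, which is the piece a plain mutex cannot provide on its own.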

Lock-free programming: reordering and memory order semantics

I am trying to find my feet in lock-free programming. Having read different explanations for memory ordering semantics, I would like to clear up what possible reordering may happen. As far as I understood, instructions may be reordered by the compiler (due to optimization when the program is compiled) and CPU (at runtime?).
For the relaxed semantics cpp reference provides the following example:
// Thread 1:
r1 = y.load(memory_order_relaxed); // A
x.store(r1, memory_order_relaxed); // B
// Thread 2:
r2 = x.load(memory_order_relaxed); // C
y.store(42, memory_order_relaxed); // D
It is said that with x and y initially zero the code is allowed to produce r1 == r2 == 42 because, although A is sequenced-before B within thread 1 and C is sequenced before D within thread 2, nothing prevents D from appearing before A in the modification order of y, and B from appearing before C in the modification order of x. How could that happen? Does it imply that C and D get reordered, so the execution order would be DABC? Is it allowed to reorder A and B?
For the acquire-release semantics there is the following sample code:
std::atomic<std::string*> ptr;
int data;
void producer()
{
std::string* p = new std::string("Hello");
data = 42;
ptr.store(p, std::memory_order_release);
}
void consumer()
{
std::string* p2;
while (!(p2 = ptr.load(std::memory_order_acquire)))
;
assert(*p2 == "Hello"); // never fires
assert(data == 42); // never fires
}
I'm wondering: what if we used relaxed memory order instead of acquire? I guess the value of data could be read before p2 = ptr.load(std::memory_order_relaxed), but what about p2?
Finally, why is it fine to use relaxed memory order in this case?
template<typename T>
class stack
{
std::atomic<node<T>*> head;
public:
void push(const T& data)
{
node<T>* new_node = new node<T>(data);
// put the current value of head into new_node->next
new_node->next = head.load(std::memory_order_relaxed);
// now make new_node the new head, but if the head
// is no longer what's stored in new_node->next
// (some other thread must have inserted a node just now)
// then put that new head into new_node->next and try again
while(!head.compare_exchange_weak(new_node->next, new_node,
std::memory_order_release,
std::memory_order_relaxed))
; // the body of the loop is empty
}
};
I mean both head.load(std::memory_order_relaxed) and head.compare_exchange_weak(new_node->next, new_node, std::memory_order_release, std::memory_order_relaxed).
To summarize all the above, my question is essentially when do I have to care about potential reordering and when I don't?
For #1, the compiler may issue the store to y before the load from x (there are no dependencies), and even if it doesn't, the load from x can be delayed at the CPU/memory level.
For #2, p2 would be non-null, but neither *p2 nor data would necessarily have a meaningful value.
For #3, there is only one act of publishing the non-atomic stores made by this thread, and it is a release.
You should always care about reordering or, better, not assume any order: neither C++ nor the hardware executes code top to bottom; they only respect dependencies.
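For readers coming from the Go threads on this page, the publish pattern from #2 can be sketched in Go as well. Go's sync/atomic does not expose relaxed or acquire/release orderings; its operations behave as sequentially consistent, so the question of weakening the load does not arise there. The sketch assumes Go 1.19 or later for atomic.Pointer, and the names message and publish are invented for the example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

type message struct {
	text string
	data int
}

// publish hands a fully initialized message from one goroutine to another.
func publish() (string, int) {
	var ptr atomic.Pointer[message]
	go func() {
		m := &message{text: "Hello", data: 42} // plain, non-atomic writes
		ptr.Store(m)                            // publication point
	}()
	var m *message
	for m = ptr.Load(); m == nil; m = ptr.Load() {
		runtime.Gosched() // spin until the pointer has been published
	}
	// The Load that observed m is synchronized after the Store,
	// so the plain writes to text and data are visible here.
	return m.text, m.data
}

func main() {
	fmt.Println(publish())
}
```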

Go: Thread-Safe Concurrency Issue with Sparse Array Read & Write

I'm writing a search engine in Go in which I have an inverted index of words to the corresponding results for each word. There is a set dictionary of words and so the words are already converted into a StemID, which is an integer starting from 0. This allows me to use a slice of pointers (i.e. a sparse array) to map each StemID to the structure which contains the results of that query. E.g. var StemID_to_Index []*resultStruct. If aardvark is 0 then the pointer to the resultStruct for aardvark is located at StemID_to_Index[0], which will be nil if the result for this word is currently not loaded.
There is not enough memory on the server to store all of this in memory, so the structure for each StemID will be saved as separate files and these can be loaded into the StemID_to_Index slice. If StemID_to_Index is currently nil for this StemID then the result is not cached and needs to be loaded, otherwise it's already loaded (cached) and so can be used directly. Each time a new result is loaded the memory usage is checked and if it's over the threshold then 2/3 of the loaded results are thrown away (StemID_to_Index is set to nil for these StemIDs and a garbage collection is forced.)
My problem is the concurrency. What is the fastest and most efficient way in which I can have multiple threads searching at the same time without having problems with different threads trying to read and write to the same place at the same time? I'm trying to avoid using mutexes on everything as that would slow down every single access attempt.
Do you think I would get away with loading the results from disk in the working thread and then delivering the pointer to this structure to an "updater" thread using channels, which then updates the nil value in the StemID_to_Index slice to the pointer of the loaded result? This would mean that two threads would never attempt to write at the same time, but what would happen if another thread tried to read from that exact index of StemID_to_Index while the "updater" thread was updating the pointer? It doesn't matter if a thread is given a nil pointer for a result which is currently being loaded, because it will just be loaded twice, and while that is a waste of resources it would still deliver the same result, and since that is unlikely to happen very often, it's forgivable.
Additionally, how would the working thread which sends the pointer to the "updater" thread know when the "updater" thread has finished updating the pointer in the slice? Should it just sleep and keep checking, or is there an easy way for the updater to send a message back to the specific thread which pushed to the channel?
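On the last point, a common way to get a message back to the specific sender is to include a reply channel in each request. The sketch below is only an illustration: the update type, storeAndWait and demo are hypothetical names, and resultStruct is reduced to a single field. It shows the notification mechanism only; reads of the slice from other goroutines would still need their own synchronization.

```go
package main

import "fmt"

type resultStruct struct{ hits []int }

// update asks the updater to store res at stemID; done is closed once applied.
type update struct {
	stemID int
	res    *resultStruct
	done   chan struct{}
}

// updater is the only goroutine that writes to index.
func updater(index []*resultStruct, updates <-chan update) {
	for u := range updates {
		index[u.stemID] = u.res
		close(u.done) // tell exactly this sender that the write happened
	}
}

// storeAndWait sends one update and blocks until the updater has applied it.
func storeAndWait(updates chan<- update, stemID int, res *resultStruct) {
	u := update{stemID: stemID, res: res, done: make(chan struct{})}
	updates <- u
	<-u.done
}

// demo stores one result and reads it back after the acknowledgement.
func demo() int {
	index := make([]*resultStruct, 8)
	updates := make(chan update)
	go updater(index, updates)
	storeAndWait(updates, 3, &resultStruct{hits: []int{10, 20}})
	close(updates)
	return len(index[3].hits) // safe: the close of done happened before this read
}

func main() {
	fmt.Println(demo())
}
```

No sleeping or polling is needed: the receive on done blocks until the updater closes the channel.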
UPDATE
I made a little test script to see what would happen if attempting to access a pointer at the same time as modifying it... it seems to always be OK. No errors. Am I missing something?
package main
import (
"fmt"
"sync"
)
type tester struct {
a uint
}
var things *tester
func updater() {
var a uint
for {
what := new(tester)
what.a = a
things = what
a++
}
}
func test() {
var t *tester
for {
t = things
if t != nil {
if t.a < 0 {
fmt.Println(`Error1`)
}
} else {
fmt.Println(`Error2`)
}
}
}
func main() {
var wg sync.WaitGroup
things = new(tester)
go test()
go test()
go test()
go test()
go test()
go test()
go updater()
go test()
go test()
go test()
go test()
go test()
wg.Add(1)
wg.Wait()
}
UPDATE 2
Taking this further, even if I read and write from multiple threads to the same variable at the same time... it makes no difference, still no errors:
From above:
func test() {
var a uint
var t *tester
for {
t = things
if t != nil {
if t.a < 0 {
fmt.Println(`Error1`)
}
} else {
fmt.Println(`Error2`)
}
what := new(tester)
what.a = a
things = what
a++
}
}
This implies I don't have to worry about concurrency at all... again: am I missing something here?
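A test that prints no errors does not show the absence of a data race; running the program with the race detector (go run -race) is a more reliable check. If the goal is to read and replace individual pointers without a mutex, one option is to make each slot an atomic pointer. The following sketch assumes Go 1.19 or later; the index type, loadFromDisk and demo are placeholders invented for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type resultStruct struct{ stemID int }

// index holds one atomically accessed pointer per StemID.
type index struct {
	slots []atomic.Pointer[resultStruct]
}

func newIndex(n int) *index {
	return &index{slots: make([]atomic.Pointer[resultStruct], n)}
}

// loadFromDisk is a placeholder for reading the per-StemID file.
func loadFromDisk(stemID int) *resultStruct {
	return &resultStruct{stemID: stemID}
}

// get returns the cached result, loading and publishing it if necessary.
func (ix *index) get(stemID int) *resultStruct {
	if r := ix.slots[stemID].Load(); r != nil {
		return r // already cached
	}
	r := loadFromDisk(stemID)
	if ix.slots[stemID].CompareAndSwap(nil, r) {
		return r // we published our copy
	}
	// Another goroutine published first; prefer its copy if still present.
	if cur := ix.slots[stemID].Load(); cur != nil {
		return cur
	}
	return r
}

// evict drops the cached result for stemID.
func (ix *index) evict(stemID int) { ix.slots[stemID].Store(nil) }

// demo lets several goroutines request the same StemID concurrently and
// counts how many of them received a usable result.
func demo(goroutines int) int {
	ix := newIndex(16)
	var ok atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if r := ix.get(5); r != nil && r.stemID == 5 {
				ok.Add(1)
			}
		}()
	}
	wg.Wait()
	return int(ok.Load())
}

func main() {
	fmt.Println(demo(8))
}
```

As in the question, two goroutines may occasionally load the same result twice, but every reader sees either nil or a fully initialized struct.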
This sounds like a perfect use case for a memory mapped file:
package main
import (
"log"
"os"
"unsafe"
"github.com/edsrzf/mmap-go"
)
func main() {
// Open the backing file
f, err := os.OpenFile("example.txt", os.O_RDWR|os.O_CREATE, 0644)
if err != nil {
log.Fatalln(err)
}
defer f.Close()
// Set its size
f.Truncate(1024)
// Memory map it
m, err := mmap.Map(f, mmap.RDWR, 0)
if err != nil {
log.Fatalln(err)
}
defer m.Unmap()
// m is a byte slice
copy(m, "Hello World")
m.Flush()
// here's how to use it with a pointer
type Coordinate struct{ X, Y int }
// first get the memory address as a *byte pointer and convert it to an unsafe
// pointer
ptr := unsafe.Pointer(&m[20])
// next convert it into a different pointer type
coord := (*Coordinate)(ptr)
// now you can use it directly
*coord = Coordinate{1, 2}
m.Flush()
// and vice-versa
log.Println(*(*Coordinate)(unsafe.Pointer(&m[20])))
}
The memory map can be larger than real memory and the operating system will handle all the messy details for you.
You will still need to make sure that separate goroutines never read/write to the same segment of memory at the same time.
My top answer would be to use elasticsearch with a client like elastigo.
If that's not an option, it would really help to know how much you care about race-y behavior. If you don't care that a write could happen right after a read finishes, so that the reader ends up with stale data, you can just have a queue of write and read operations, have multiple threads feed into that queue, and have one dispatcher issue the operations to the map one at a time as they come in. In all other scenarios, you will need a mutex if there are multiple readers and writers. Maps aren't thread safe in Go.
Honestly though, I would just add a mutex to make things simple for now and optimize by analyzing where your bottlenecks actually lie. Checking a threshold and then purging 2/3 of your cache seems a bit arbitrary, and I wouldn't be surprised if you kill performance by doing something like that. Here's one situation where that would break down:
Requesters 1, 2, 3, and 4 are frequently accessing many of the same words on files A & B.
Requester 5, 6, 7 and 8 are frequently accessing many of the same words stored on files C & D.
Now when requests interleaved between these requesters and files happen in rapid succession, you may end up purging 2/3 of your cache over and over again, throwing away results that may be requested shortly after. There are a couple of other approaches:
1. Cache words that are frequently accessed at the same time on the same box and have multiple caching boxes.
2. Cache on a per-word basis with some sort of ranking of how popular that word is. If a new word is accessed from a file while the cache is full, see if other more popular words live in that file and purge less popular entries in the cache in hopes that those words will have a higher hit rate.
3. Both approaches 1 & 2.
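The "just add a mutex" starting point could look roughly like this. It is a sketch under assumptions, not a tuned design: cache, purge and demo are hypothetical names, resultStruct is reduced to one field, and the keep callback stands in for whatever popularity ranking is chosen.

```go
package main

import (
	"fmt"
	"sync"
)

type resultStruct struct{ stemID int }

// cache guards the sparse slice with a read-write mutex.
type cache struct {
	mu    sync.RWMutex
	index []*resultStruct
}

func newCache(n int) *cache { return &cache{index: make([]*resultStruct, n)} }

// get may run concurrently with other readers.
func (c *cache) get(stemID int) *resultStruct {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.index[stemID]
}

// set takes the exclusive lock for the short time of one pointer write.
func (c *cache) set(stemID int, r *resultStruct) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.index[stemID] = r
}

// purge drops every entry for which keep returns false and reports how many
// entries were dropped.
func (c *cache) purge(keep func(stemID int) bool) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	dropped := 0
	for id, r := range c.index {
		if r != nil && !keep(id) {
			c.index[id] = nil
			dropped++
		}
	}
	return dropped
}

// demo fills nine slots concurrently, keeps one third, and counts the rest.
func demo() (cached, dropped int) {
	c := newCache(9)
	var wg sync.WaitGroup
	for id := 0; id < 9; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if c.get(id) == nil {
				c.set(id, &resultStruct{stemID: id}) // "load from disk"
			}
		}(id)
	}
	wg.Wait()
	dropped = c.purge(func(id int) bool { return id%3 == 0 })
	for id := 0; id < 9; id++ {
		if c.get(id) != nil {
			cached++
		}
	}
	return cached, dropped
}

func main() {
	fmt.Println(demo())
}
```

Readers only contend with each other on the read lock, so profiling is needed to tell whether the lock is a bottleneck at all before replacing it with something more elaborate.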

Resources