Why is the race detector not detecting this race condition?

I am currently learning the Go programming language and I am now experimenting with the atomic package.
In this example, I am spawning a number of Goroutines that all need to increment a package level variable. There are several methods to avoid race conditions but for now I want to solve this using the atomic package.
When running the following code on my Windows PC (go run main.go) the results are not what I expect them to be (I expect the final result to be 1000). The final number is somewhere between 900 and 1000. When running the code in the Go Playground the value is 1000.
Here is the code I am using: https://play.golang.org/p/8gW-AsKGzwq
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

var counter int64
var wg sync.WaitGroup

func main() {
    num := 1000
    wg.Add(num)
    for i := 0; i < num; i++ {
        go func() {
            v := atomic.LoadInt64(&counter)
            v++
            atomic.StoreInt64(&counter, v)
            // atomic.AddInt64(&counter, 1)
            // fmt.Println(v)
            wg.Done()
        }()
    }
    wg.Wait()
    fmt.Println("final", counter)
}
go run main.go
final 931
go run main.go
final 960
go run main.go
final 918
I would have expected the race detector to give an error, but it doesn't:
go run -race main.go
final 1000
And it outputs the correct value (1000).
I am using go version go1.12.7 windows/amd64 (latest version at this moment)
My questions:
Why is the race detector not giving an error, and why do I see different values when running the code without the race detector?
My theory for why the Load/Store combination is not working is that the two atomic calls are not atomic as a whole. In that case I should be using the atomic.AddInt64 method instead, is that right?
Any help would be greatly appreciated :)

There is nothing racy in your code, which is why the race detector detects nothing. Your counter variable is always accessed via the atomic package from the launched goroutines, never directly.
The reason why you sometimes get 1000 and sometimes less is the number of active threads that run goroutines: GOMAXPROCS. On the Go Playground it is 1, so at any time you have one active goroutine (basically your app executes sequentially, without any parallelism), and the current goroutine scheduler does not park goroutines at arbitrary points.
On your local machine you probably have a multicore CPU, and GOMAXPROCS defaults to the number of available logical CPUs, so it is greater than 1 and you have multiple goroutines running in parallel (truly parallel, not just concurrent).
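If you want to verify this locally, you can query (and force) these values; a minimal sketch, only for inspection:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println("NumCPU:    ", runtime.NumCPU())
    // GOMAXPROCS(0) queries the current setting without changing it.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

    // Forcing a single thread makes the Load/Store version appear to work,
    // because the goroutines no longer run in parallel -- but it is still
    // not a correct way to increment a shared counter.
    runtime.GOMAXPROCS(1)
}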
See this fragment:
v := atomic.LoadInt64(&counter)
v++
atomic.StoreInt64(&counter, v)
You load counter's value and assign it to v, you increment v, and you store back the incremented v. What happens if 2 parallel goroutines do this at the same time? Let's say both load the value 100. Both increment their local copy to 101. Both write back 101, even though the counter should now be 102.
Yes, the proper way to increment counters atomically would be to use atomic.AddInt64() like this:
for i := 0; i < num; i++ {
    go func() {
        atomic.AddInt64(&counter, 1)
        wg.Done()
    }()
}
This way you'll always get 1000, no matter what GOMAXPROCS is.
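As a side note, beyond the original answer: newer Go releases (1.19 and later, so well after the go1.12.7 mentioned in the question) add typed atomics, which make it harder to accidentally mix atomic and plain access. A sketch of the same counter with atomic.Int64:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var counter atomic.Int64 // the zero value is ready to use
    var wg sync.WaitGroup

    const num = 1000
    wg.Add(num)
    for i := 0; i < num; i++ {
        go func() {
            defer wg.Done()
            counter.Add(1) // atomic read-modify-write in a single call
        }()
    }
    wg.Wait()
    fmt.Println("final", counter.Load())
}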

Related

How to do idempotent microbenchmarks or measure/emulate CPU cycle used by a program in isolation?

My goal is the following:
Given a program with fixed input and output, produce a microbenchmark result in a unit that is idempotent relative to the CPU work performed to compute the output. In other words, if you run the program multiple times with the same input, the benchmark should always produce the same value.
For instance, let's say I have this code:
// Brute force: O(n^2) | O(1)
function twoSum(nums, target) {
    for (let i = 0; i < nums.length - 1; i++) { // O(n^2)
        for (let j = i + 1; j < nums.length; j++) { // O(n)
            if (nums[i] + nums[j] === target) {
                return [i, j];
            }
        }
    }
    return [];
}
twoSum(Array(1e7).fill(2), 4);
I can easily do time node benchmarks/two-sum-implementations/runner.js and get the time taken. But if I run it multiple times, I get different times depending on what else the OS is doing. Most frameworks will run the code multiple times and then average the times, but I don't want that.
Some ideas come to mind, but I'm not sure how to implement them or whether they would work at all. Maybe more experienced minds can shed some light here :)
Can I use Docker to run the program and track how much CPU time was used by the container that runs it and exits? Would that be a consistent metric?
Is there a program or tool that emulates CPU cycles, so I can know how much a program uses in isolation?
How do cloud providers like GCP and AWS bill CPU by time? What tools do they use to measure that?
Can you convert a program into its equivalent ASM (assembly) code and count the number of lines that were executed? Something similar to what code coverage frameworks do with high-level code: they can count how many times a line was executed during a test.
Based on the previous question: how deep can code coverage tools go? If they can go deep enough and are consistent, I can microbenchmark based on lines of code executed.
Any other ideas are welcome too!
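Regarding the CPU-time ideas above: one comparatively stable metric is the CPU time actually charged to the process (user + system) rather than wall-clock time, which also includes whatever else the OS was doing. It is still not perfectly deterministic, but it removes most scheduling noise. Here is a minimal sketch in Go that runs the command from the question as a child process and reads its CPU time afterwards (any language that exposes getrusage-style accounting works just as well):

package main

import (
    "fmt"
    "os/exec"
)

func main() {
    // Run the program under test as a child process.
    cmd := exec.Command("node", "benchmarks/two-sum-implementations/runner.js")
    if err := cmd.Run(); err != nil {
        fmt.Println("run failed:", err)
        return
    }
    // ProcessState reports the CPU time the child actually consumed,
    // independent of how long it sat descheduled by the OS.
    fmt.Println("user CPU time:  ", cmd.ProcessState.UserTime())
    fmt.Println("system CPU time:", cmd.ProcessState.SystemTime())
}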

What is the point of running same code under different threads - openMP?

From: https://bisqwit.iki.fi/story/howto/openmp/
The parallel construct
The parallel construct starts a parallel block. It creates a team
of N threads (where N is determined at runtime, usually from the
number of CPU cores, but may be affected by a few things), all of
which execute the next statement (or the next block, if the statement
is a {…} -enclosure). After the statement, the threads join back into
one.
#pragma omp parallel
{
    // Code inside this region runs in parallel.
    printf("Hello!\n");
}
I want to understand what the point is of running the same code under different threads. In what kind of cases can it be helpful?
By using omp_get_thread_num() you can retrieve the thread ID, which enables you to parametrize the so-called "same code" with respect to that thread ID.
Take this example:
A is a 1000-element integer array and you need to sum its values using 2 OpenMP threads.
You would design your code something like this:
// A is the 1000-element array from the example, assumed to be declared and filled elsewhere
int A_dim = 1000;
long sum[2] = {0, 0};

#pragma omp parallel num_threads(2)
{
    int threadID = omp_get_thread_num();
    int start = threadID * (A_dim / 2);
    int end = (threadID + 1) * (A_dim / 2);
    for (int i = start; i < end; i++)
        sum[threadID] += A[i];
}
start is the lower bound from which your thread will start summing (example: thread #0 starts summing from 0, while thread #1 starts summing from 500).
end is the counterpart of start: it is the upper bound up to which the thread will sum (example: thread #0 sums up to 500, i.e. the values from A[0] to A[499], while thread #1 sums up to 1000, i.e. the values from A[500] to A[999]).
I want to understand what the point is of running the same code under different threads. In what kind of cases can it be helpful?
When you are running the same code on different data.
For example, if I want to invert 10 matrices, I might run the matrix inversion code on 10 threads ... to get (ideally) a 10-fold speedup compared to 1 thread and a for loop.
The basic idea of OpenMP is to distribute work. For this you need to create some threads.
The parallel construct creates those threads. Afterwards you can distribute/share work with other constructs like omp for or omp task.
A possible benefit of this distinction is, e.g., when you have to allocate memory for each thread (i.e. thread-local data).
I want to understand what the point is of running the same code under different threads. In what kind of cases can it be helpful?
One example: in physics you have a random process (collisions, an initial Maxwellian distribution, etc.) in your code, and you need to run the code many times to get averaged results; in this case you need to run the same code several times.

Goroutines are cooperatively scheduled. Does that mean that goroutines that don't yield execution will cause goroutines to run one by one?

From: http://blog.nindalf.com/how-goroutines-work/
As the goroutines are scheduled cooperatively, a goroutine that loops continuously can starve other goroutines on the same thread.
Goroutines are cheap and do not cause the thread on which they are multiplexed to block if they are blocked on
network input
sleeping
channel operations or
blocking on primitives in the sync package.
So given the above, say that you have some code like this that does nothing but loop a random number of times and print the sum:
func sum(x int) {
    sum := 0
    for i := 0; i < x; i++ {
        sum += i
    }
    fmt.Println(sum)
}
if you use goroutines like
go sum(100)
go sum(200)
go sum(300)
go sum(400)
will the goroutines run one by one if you only have one thread?
A compilation and tidying of all of creker's comments.
Preemptive means that the kernel (runtime) allows threads to run for a specific amount of time and then yields execution to other threads, without them doing or knowing anything. In OS kernels that's usually implemented using hardware interrupts, so a process can't block the entire OS. In cooperative multitasking a thread has to explicitly yield execution to others; if it doesn't, it can block the whole process or even the whole machine. That's how Go does it: there are some very specific points where a goroutine can yield execution, but if a goroutine just executes for {} then it will lock the entire process.
However, the quote doesn't mention recent changes in the runtime. fmt.Println(sum) could cause other goroutines to be scheduled, as newer runtimes call the scheduler on function calls.
If you don't have any function calls, just some math, then yes, the goroutine will lock the thread until it exits or hits something that could yield execution to others. That's why for {} doesn't work in Go. Even worse, it can still lead to the whole process hanging even if GOMAXPROCS > 1, because of how the GC works, but in any case you shouldn't depend on that. It's good to understand this stuff, but don't count on it. There is even a proposal to insert scheduler calls in loops like yours.
The main thing Go's runtime does is try its best to let everyone execute and not starve anyone. How it does that is not specified in the language specification and might change in the future. If the proposal about loops is implemented, switching could occur even without function calls. At the moment the only thing you should remember is that in some circumstances function calls can cause a goroutine to yield execution.
To explain the switching in Akavall's answer: when fmt.Printf is called, the first thing it does is check whether it needs to grow the stack, and that calls the scheduler. It MIGHT switch to another goroutine. Whether it will switch depends on the state of the other goroutines and the exact implementation of the scheduler. Like any scheduler, it probably checks whether there are starving goroutines that should be executed instead. With many iterations a function call has a greater chance to make a switch, because the others have been starving longer. With few iterations the goroutine finishes before starvation happens.
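As an aside, not from the original comments: runtime.Gosched() is the explicit way for a goroutine to yield. A minimal sketch on a single thread:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(1)

    done := make(chan struct{})
    go func() {
        fmt.Println("the other goroutine got to run")
        close(done)
    }()

    for {
        select {
        case <-done:
            fmt.Println("main saw it finish")
            return
        default:
            // Without an explicit yield (or some other yield point) this busy
            // loop could starve the goroutine above on a single thread.
            runtime.Gosched()
        }
    }
}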
For what it's worth, I can produce a simple example showing that the goroutines are not run one by one:
package main

import (
    "fmt"
    "runtime"
)

func sum_up(name string, count_to int, print_every int, done chan bool) {
    my_sum := 0
    for i := 0; i < count_to; i++ {
        if i%print_every == 0 {
            fmt.Printf("%s working on: %d\n", name, i)
        }
        my_sum += 1
    }
    fmt.Printf("%s: %d\n", name, my_sum)
    done <- true
}

func main() {
    runtime.GOMAXPROCS(1)
    done := make(chan bool)
    const COUNT_TO = 10000000
    const PRINT_EVERY = 1000000
    go sum_up("Amy", COUNT_TO, PRINT_EVERY, done)
    go sum_up("Brian", COUNT_TO, PRINT_EVERY, done)
    <-done
    <-done
}
Result:
....
Amy working on: 7000000
Brian working on: 8000000
Amy working on: 8000000
Amy working on: 9000000
Brian working on: 9000000
Brian: 10000000
Amy: 10000000
Also if I add a function that just does a forever loop, that will block the entire process.
func dumb() {
    for {
    }
}
This blocks at some random point:
go dumb()
go sum_up("Amy", COUNT_TO, PRINT_EVERY, done)
go sum_up("Brian", COUNT_TO, PRINT_EVERY, done)
Well, let's say runtime.GOMAXPROCS is 1. The goroutines then run concurrently, one at a time. Go's scheduler gives one of the spawned goroutines the upper hand for a certain time, then another, and so on until all are finished.
So you never know which goroutine is running at a given time, and that's why you need to synchronize your variables. In your example, it's unlikely that sum(100) will run fully, then sum(200) will run fully, and so on.
Most probably, one goroutine will do some iterations, then another will do some, then another again, and so on.
So, overall, they are not sequential, even if there is only one goroutine active at a time (GOMAXPROCS=1).
So what's the advantage of using goroutines? Plenty. It means that you can just do an operation in a goroutine, because it is not crucial, and continue the main program. Imagine an HTTP web server: handling each request in a goroutine is convenient because you do not have to care about queueing requests and running them sequentially; you let Go's scheduler do the job.
Plus, sometimes goroutines are inactive, because you called time.Sleep or they are waiting for an event, like receiving something from a channel. Go can see this and just executes other goroutines while some are in those idle states.
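A tiny illustration of that last point (not from the original answer): with GOMAXPROCS(1), two loops still interleave, because time.Sleep is a yield point:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1)

    done := make(chan bool)
    go func() {
        for i := 0; i < 3; i++ {
            fmt.Println("worker:", i)
            time.Sleep(10 * time.Millisecond) // a yield point: main's goroutine runs meanwhile
        }
        done <- true
    }()

    for i := 0; i < 3; i++ {
        fmt.Println("main:", i)
        time.Sleep(10 * time.Millisecond)
    }
    <-done
}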
I know there are a handful of advantages I didn't present, but I don't know concurrency well enough to tell you about them.
EDIT:
Related to your example code: if you send each iteration's value into a channel, run that on one processor and then print the channel's contents, you'll see that there is no context switching between goroutines: each one runs sequentially, after another one is done.
However, this is not a general rule and is not specified in the language, so you should not rely on these results to draw general conclusions.
@Akavall: try adding a sleep after creating the dumb goroutine; the Go runtime then never executes the sum_up goroutines.
From that, it looks like the Go runtime spawns the next goroutines immediately; it may execute the sum_up goroutines until it schedules the dumb() goroutine to run. Once dumb() is scheduled, the runtime won't schedule the sum_up goroutines any more, because dumb runs for {} forever.

Speed-up from multi-threading

I have a highly parallelizable problem. Hundreds of separate problems need to be solved by the same function. The problems each take an average of perhaps 120 ms (0.12 s) on a single core, but there is substantial variation, and some extreme and rare ones may take 10 times as long. Each problem needs memory, but this is allocated ahead of time. The problems do not need disk I/O, and they do not pass back and forth any variables once they are running. They do access different parts (array elements) of the same global struct, though.
I have C++ code, based on someone else's code, that works. (The global array of structs is not shown.) It runs 20 problems (for instance) and then returns. I think 20 is enough to even out the variability on 4 cores. I see the execution time flattening out from about 10 already.
There is a Win32 and an OpenMP version, and they behave almost identically in terms of execution time. I run the program on a 4-core Windows system. I include some OpenMP code below since it is shorter. (I changed names etc. to make it more generic and I may have made mistakes -- it won't compile stand-alone.)
The speed-up over the single-threaded version flattens out at about a factor of 2.3. So if it takes 230 seconds single-threaded, it takes 100 s multi-threaded. I am surprised that the speed-up is not a lot closer to 4, the number of cores.
Am I right to be disappointed?
Is there anything I can do to get closer to my theoretical expectation?
int split_bigtask(Inputs * inputs, Outputs * results)
{
    for (int k = 0; k < MAXNO; k++)
        results->solved[k].value = 0;
    int res;
    #pragma omp parallel shared(inputs, results)
    {
        #pragma omp for schedule(dynamic)
        for (int k = 0; k < inputs->no; k++)
        {
            res = bigtask(inputs->values[k],
                          results->solved[k],
                          omp_get_thread_num()
                         );
        }
    }
    return TRUE;
}
I assume that there is no synchronization done within bigtask() (obvious, but I'd still check it first). Two likely causes:
1. You create too many threads (allocating a thread is quite an overhead, so creating one thread per core is a lot more efficient than creating 5 per core).
2. You run into a "dirty cache" problem: if you manipulate data that is close together (e.g. on the same cache line!) from multiple cores, each manipulation marks the cache line as dirty, which means that the processor needs to signal this to all other processors, which in turn involves synchronization again...
I personally would assume that you have case 2 ("Big Global Array").
Solutions to the problem (if it's indeed case 2):
Write the results to a local array which is merged into the "Big Global Array" by the main thread after the work is done (see the sketch after this list).
Split the global array into several smaller arrays (and give each thread one of these arrays).
Ensure that the records within the structure are aligned on cache-line boundaries (this is a bit of a hack, since cache line sizes may change for future processors).
You may want to try to create a local copy of the array for each thread (at least for the results).
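The first suggestion is language-agnostic. Here is a minimal sketch of the "accumulate locally, write once at the end" idea, written in Go for brevity (solve() and the sizes are made up for illustration):

package main

import (
    "fmt"
    "runtime"
    "sync"
)

// solve stands in for the expensive per-problem work (hypothetical).
func solve(x int) int64 { return int64(x) * int64(x) }

func main() {
    work := make([]int, 1000)
    for i := range work {
        work[i] = i
    }

    numWorkers := runtime.NumCPU()
    partial := make([]int64, numWorkers) // one slot per worker

    var wg sync.WaitGroup
    for w := 0; w < numWorkers; w++ {
        wg.Add(1)
        go func(w int) {
            defer wg.Done()
            var local int64 // accumulate privately; no shared writes in the hot loop
            for i := w; i < len(work); i += numWorkers {
                local += solve(work[i])
            }
            partial[w] = local // a single write per worker at the very end
        }(w)
    }
    wg.Wait()

    var total int64
    for _, p := range partial {
        total += p // the merge happens in one place, after the parallel work
    }
    fmt.Println("total:", total)
}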

Threads complete but loop doesn't end

I wrote some threading code with what seems to be an incorrect assumption, that integers were thread-safe. Now it seems that although they are, my use of them is NOT thread safe. I'm using a global integer ThreadCount to hold the number of threads. During thread create, I increment the ThreadCount. During thread destroy, I decrement it. After all threads are done being created, I wait for them to be done (ThreadCount should drop to 0) and then write my final report and exit.
Sometimes (5%) though, I never get to 0, even though a post-mortem examination of my log shows that all threads did run and complete. So all signs point to ThreadCount getting trampled. I have been telling myself that this is impossible since it's an integer, and I'm just using inc/dec.
Here some relevant code:
var // global
    ThreadCount : integer; // Number of active threads
...
constructor TTiesUpsertThread.Create(const CmdStr : string);
begin
    inherited create(false);
    Self.FreeOnTerminate := true;
    ...
    Inc(ThreadCount); // Number of threads created. Used for throttling.
end;

destructor TTiesUpsertThread.Destroy;
begin
    inherited destroy;
    Dec(ThreadCount); // When it reaches 0, the overall job is done.
end;
...
// down at the end of the main routine:
while (ThreadCount > 0) do // Sometimes this doesn't ever end.
begin
    SpinWheels('.'); // sleeps for 1000ms and writes dots... to console
end;
I THINK my problem is with inc/dec. I think I'm getting collisions where two or more dec() calls hit at the same time: both read the same value, so they replace it with the same value. Example: ThreadCount = 5, and two threads end at the same time; both read 5 and replace it with 4, but the new value should be 3.
This never runs into trouble in our test environment (different hardware, topology, load, etc.), so I'm looking for confirmation that this is likely the problem before I try to "sell" this solution to the business unit.
If this is my problem, do I use a critical section to protect the inc/dec?
Thanks for taking a look.
If multiple threads modify the variable without protection then yes, you have a data race. If two threads attempt to increment or decrement at the same instant then what happens is:
The variable is read into a register.
The modification is made in the register.
The new value is written back to the variable.
That read/modify/write is not atomic. If you have two threads executing it at the same time then you have the canonical data race:
Thread 1 reads the value, N say.
Thread 2 reads the value, the same value as was read by thread 1, N.
Thread 1 writes N+1 to the variable.
Thread 2 writes N+1 to the variable.
And instead of the variable being incremented twice, it is incremented only once.
In this case there's no need for a full-blown critical section. Use InterlockedIncrement and InterlockedDecrement to perform lock-free, thread-safe modification.
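For illustration only (this is Go, not Delphi): the same lock-free counter idea, where atomic.AddInt64 plays the role of InterlockedIncrement/InterlockedDecrement:

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

var threadCount int64 // analogous to the ThreadCount global

func worker() {
    defer atomic.AddInt64(&threadCount, -1) // interlocked-style decrement on exit
    time.Sleep(10 * time.Millisecond)       // stand-in for the real work
}

func main() {
    for i := 0; i < 5; i++ {
        atomic.AddInt64(&threadCount, 1) // interlocked-style increment before start
        go worker()
    }
    for atomic.LoadInt64(&threadCount) > 0 {
        time.Sleep(100 * time.Millisecond) // like SpinWheels, but shorter
    }
    fmt.Println("all workers done")
}

In the Delphi code, replacing Inc(ThreadCount) and Dec(ThreadCount) with the interlocked calls achieves the same effect.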
