Why do we need to call runtime.Gosched after a call to atomic.AddUint64 and other similar atomic ops?

I am going through Go by Example: Atomic Counters. The code example calls runtime.Gosched after calling atomic.AddUint64.
According to the example, runtime.Gosched is called to
"ensure that this goroutine doesn’t starve the scheduler"
Unfortunately, I find that explanation thin and unsatisfying.
I tried running the sample code (comments removed for conciseness):
package main
import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)
func main() {
	var ops uint64 = 0
	for i := 0; i < 50; i++ {
		go func() {
			for {
				atomic.AddUint64(&ops, 1)
				runtime.Gosched()
			}
		}()
	}
	time.Sleep(time.Second)
	opsFinal := atomic.LoadUint64(&ops)
	fmt.Println("ops:", opsFinal)
}
I ran it without the runtime.Gosched() call (go run conc.go), and the program never exited, even when I reduced the loop count from 50 to 1.
What happens under the hood after the call to atomic.AddUint64 that makes it necessary to call runtime.Gosched? And how does runtime.Gosched fix this? I did not find any hint of this in the sync/atomic documentation.

This is how cooperative multithreading works. If one thread remains ready to run, it continues to run and other threads don't. Explicit and implicit pre-emption points are used to allow other threads to run. If your thread stays in a loop for a long time with no implicit pre-emption points, it will starve other threads unless you add an explicit pre-emption point.
This answer has much more information about when Go uses cooperative multithreading.
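As a rough illustration of an explicit yield point, the sketch below runs a few counting goroutines that call runtime.Gosched after every increment and stops them after a short delay. The helper name countWithYield, the stop flag, and the durations are assumptions made for this example; they are not part of Go by Example. Note also that since Go 1.14 the runtime can preempt goroutines asynchronously, so on current releases a tight loop without the yield typically no longer hangs the program.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// countWithYield (hypothetical helper) starts `workers` goroutines that
// increment a shared counter and yield after every increment, then stops
// them after duration d and returns the final count.
func countWithYield(workers int, d time.Duration) uint64 {
	var ops uint64
	var stop int32
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for atomic.LoadInt32(&stop) == 0 {
				atomic.AddUint64(&ops, 1)
				runtime.Gosched() // explicit yield point: let other goroutines run
			}
		}()
	}
	time.Sleep(d)
	atomic.StoreInt32(&stop, 1) // ask the workers to finish
	wg.Wait()
	return atomic.LoadUint64(&ops)
}

func main() {
	fmt.Println("ops:", countWithYield(4, 50*time.Millisecond))
}
```

The exact count varies from run to run; the point is only that main gets scheduled again and the program exits.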


Peterson's algorithm and deadlock

I am experimenting with some mutual exclusion algorithms and have implemented Peterson's algorithm. It prints the correct counter value, but sometimes the program stalls indefinitely, as if a deadlock had occurred. This should not be possible, since the algorithm is deadlock-free.
PS: Is this related to the compiler optimization problems often mentioned when discussing the danger of "benign" data races? If so, how can such optimizations be disabled?
PPS: When I store and load the victim field atomically, the problem seems to disappear, which makes the compiler's optimizations even more suspicious.
package main
import (
	"fmt"
	"sync"
)
type mutex struct {
	flag   [2]bool
	victim int
}
func (m *mutex) lock(id int) {
	m.flag[id] = true // I'm interested
	m.victim = id     // you can go before me if you want
	for m.flag[1-id] && m.victim == id {
		// while the other thread is inside the CS
		// and the victim was me (I expressed my interest after the other one already did)
	}
}
func (m *mutex) unlock(id int) {
	m.flag[id] = false // I'm not interested anymore
}
func main() {
	var wg sync.WaitGroup
	var mu mutex
	var cpt, n = 0, 100000
	wg.Add(2)
	for i := 0; i < 2; i++ {
		go func(id int) {
			defer wg.Done()
			for j := 0; j < n; j++ {
				mu.lock(id)
				cpt = cpt + 1
				mu.unlock(id)
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(cpt)
}
There is no "benign" data race. Your program has a data race, and its behavior is undefined.
At the core of the problem is the mutex implementation. Modifications made to a shared object from one goroutine are not necessarily observable from other goroutines until those goroutines communicate using one of the synchronization primitives. You are writing to mutex.victim from multiple goroutines, and those writes will not necessarily be observed by the other goroutine. You are also reading the mutex.flag elements written by other goroutines, and those writes will not necessarily be seen either. That is, there may be cases where the for-loop won't terminate even if the other goroutine changes the variables.
And since the mutex implementation is broken, the updates to cpt will not necessarily be correct either.
To implement this correctly, you need the sync/atomic package.
See the Go Memory Model: https://go.dev/ref/mem
For Peterson's algorithm (the same goes for Dekker's), you need to ensure that your code is sequentially consistent. In Go you can do that using atomics, which prevent both the compiler and the hardware from reordering the relevant loads and stores.
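A minimal sketch of what the atomic version could look like, assuming the flags and the victim are stored as int32 values and accessed only through sync/atomic. The names petersonLock and runCounter are made up for this illustration, and the runtime.Gosched call in the spin loop is an added courtesy to the scheduler rather than part of the algorithm.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// petersonLock is a two-goroutine lock; ids must be 0 and 1.
type petersonLock struct {
	flag   [2]int32 // 1 means "interested"
	victim int32    // id of the goroutine that yields priority
}

func (m *petersonLock) lock(id int) {
	atomic.StoreInt32(&m.flag[id], 1)       // I'm interested
	atomic.StoreInt32(&m.victim, int32(id)) // you can go before me
	for atomic.LoadInt32(&m.flag[1-id]) == 1 &&
		atomic.LoadInt32(&m.victim) == int32(id) {
		runtime.Gosched() // spin politely while the other goroutine has priority
	}
}

func (m *petersonLock) unlock(id int) {
	atomic.StoreInt32(&m.flag[id], 0) // I'm not interested anymore
}

// runCounter lets two goroutines increment a plain int n times each,
// guarded only by the Peterson lock, and returns the final value.
func runCounter(n int) int {
	var wg sync.WaitGroup
	var mu petersonLock
	cpt := 0
	wg.Add(2)
	for i := 0; i < 2; i++ {
		go func(id int) {
			defer wg.Done()
			for j := 0; j < n; j++ {
				mu.lock(id)
				cpt++
				mu.unlock(id)
			}
		}(i)
	}
	wg.Wait()
	return cpt
}

func main() {
	fmt.Println(runCounter(100000))
}
```

With every access to flag and victim going through sync/atomic, the spin loop is guaranteed to observe the other goroutine's writes, so the stall described in the question should not occur. In real code sync.Mutex is still the better choice.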

Understanding mutex behaviour

I thought a mutex in Go would lock the data and prevent reads and writes by any other goroutine until the first goroutine releases the lock. It seems my understanding was wrong. The only way to block reads and writes from other goroutines is to call Lock in those goroutines as well. This ensures the critical section is accessed by one and only one goroutine at a time.
So, I would expect this code to have a deadlock:
package main
import (
	"fmt"
	"sync"
)
type myMap struct {
	m     map[string]string
	mutex sync.Mutex
}
func main() {
	done := make(chan bool)
	ch := make(chan bool)
	myM := &myMap{
		m: make(map[string]string),
	}
	go func() {
		myM.mutex.Lock()
		myM.m["x"] = "i"
		fmt.Println("Locked. Won't release the Lock")
		ch <- true
	}()
	go func() {
		<-ch
		fmt.Println("Trying to write to the myMap")
		myM.m["a"] = "b"
		done <- true
	}()
	<-done
}
Since the first goroutine locks the struct, I would expect the second goroutine to fail to read or write the struct, but that is not happening here.
If I add a myM.mutex.Lock() call in the second goroutine, there is a deadlock.
I find the way mutexes work in Go a little weird. If I lock, then Go shouldn't allow any other goroutine to read or write the data.
Can someone explain to me the mutex concept in Go?
There's no magical force field that surrounds a mutex, protecting any data structure it happens to be embedded in. If you lock a mutex, it prevents other code from locking it until it's unlocked. Nothing more, nothing less. It's well documented in the sync package.
So in your code, where there's exactly one myM.mutex.Lock(), the effect is the same as if there was no mutex.
A correct use of a mutex that protects data involves locking the mutex before updating or reading the data, and then unlocking it afterwards. Often this code will be wrapped in a function so that defer can be used:
func doSomething(myM *myMap) {
	myM.mutex.Lock()
	defer myM.mutex.Unlock()
	// ... read or update myM
}
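To make the pattern concrete, here is a small sketch in which every read and write of the map goes through a method that takes the lock. The type safeMap and its methods newSafeMap, set, and get are names invented for this example, not something from the question or the sync package.

```go
package main

import (
	"fmt"
	"sync"
)

// safeMap guards its map with a mutex; the protection only works because
// every access goes through set or get, which both take the lock.
type safeMap struct {
	mu sync.Mutex
	m  map[string]string
}

func newSafeMap() *safeMap {
	return &safeMap{m: make(map[string]string)}
}

func (s *safeMap) set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = value
}

func (s *safeMap) get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.m[key]
	return v, ok
}

func main() {
	sm := newSafeMap()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sm.set(fmt.Sprintf("key%d", i), "value") // concurrent writes are serialized
		}(i)
	}
	wg.Wait()
	fmt.Println(sm.get("key3"))
}
```

Any code that touched sm.m directly, without calling Lock, would bypass the protection entirely, which is exactly what happens in the second goroutine of the question.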

Self-Synchronizing Goroutines end up with Deadlock

I have a stress test issue that I want to solve with simple synchronization in Go. So far I have tried to find documentation on my specific use case regarding synchronization in Go, but didn't find anything that fits.
To be a bit more specific:
I have to start a large number of threads from the main routine (illustrated here with only two). Each worker first performs its own initialization, in no particular order, until it reaches a short sequence of commands that I want all goroutines to execute at the same time. That is why I want the goroutines to synchronize with each other. It is vital for my task that the delay introduced by the main routine, which starts all the other goroutines, does not affect the parallelism of the workers' execution (at the label #maximum parallel in the comment). For this purpose I initialize a wait group in the main routine with the number of running goroutines and pass it to all of them so they can synchronize their workflow with each other.
The code looks similar to this example:
import "sync"
func worker_action(wait_group *sync.WaitGroup) {
	// ...
	// initialization
	// ...
	defer wait_group.Done()
	wait_group.Wait() // #label: wait
	// sequence of maximum parallel instructions // #label: maximum parallel
	// ...
}
func main() {
	var numThreads int = 2 // the number of threads shall be much higher for the actual stress test
	var wait_group sync.WaitGroup
	wait_group.Add(numThreads)
	for i := 0; i < numThreads; i++ {
		go worker_action(&wait_group)
	}
	// ...
}
Unfortunately my setup runs into a deadlock, as soon as all goroutines have reached the Wait instruction (labeled with #wait in the comment). This is true for any amount of threads that I start with the main routine (even two threads are caught in a deadlock within no time).
From my point of view a deadlock should not occur, due to the fact that immediately before the wait instruction each goroutine executes the done function on the same wait group.
Do I have a wrong understanding of how wait groups work? Is it for instance not allowed to execute the wait function inside of a goroutine other than the main routine? Or can someone give me a hint on what else I am missing?
Thank you very much in advance.
Thanks a lot #tkausl. It was indeed the unnecessary "defer" that caused the problem. I do not know how I could not see it myself.
There are several issues in your code. First, the form: idiomatic Go uses camelCase, and wg is a better name for the WaitGroup.
More important is where your code waits. It should not wait inside your goroutines; it should wait inside the main func:
func workerAction(wg *sync.WaitGroup) {
	// ...
	// initialization
	// ...
	defer wg.Done()
	// wg.Wait() // #label: wait
	// sequence of maximum parallel instructions // #label: maximum parallel
	// ...
}
func main() {
	var numThreads int = 2 // the number of threads shall be much higher for the actual stress test
	var wg sync.WaitGroup
	wg.Add(numThreads)
	for i := 0; i < numThreads; i++ {
		go workerAction(&wg)
	}
	wg.Wait() // you need to wait here
	// ...
}
Again thanks #tkausl. The issue was resolved by removing the unnecessary "defer" instruction from the line that was meant to let the worker goroutines increment the number of finished threads.
I.e. "defer wait_group.Done()" -> "wait_group.Done()"
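For reference, a minimal sketch of the barrier as it works after that fix: each worker calls Done immediately (not deferred) and then Wait on the same WaitGroup, so nobody proceeds until everyone has finished initializing. The second WaitGroup (finished), the passed counter, and the function name runWorkers are additions for this illustration so that main can wait for the workers and the result can be checked.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWorkers starts n goroutines that meet at a barrier before running the
// "maximum parallel" section, and returns how many got past the barrier.
func runWorkers(n int) int64 {
	var ready, finished sync.WaitGroup
	var passed int64
	ready.Add(n)    // barrier: released once all n workers are initialized
	finished.Add(n) // lets the caller wait for all workers to exit
	for i := 0; i < n; i++ {
		go func() {
			defer finished.Done()
			// ... per-worker initialization would go here ...
			ready.Done() // NOT deferred: signal readiness right now
			ready.Wait() // block until every worker has signalled
			// sequence of maximum parallel instructions
			atomic.AddInt64(&passed, 1)
		}()
	}
	finished.Wait()
	return atomic.LoadInt64(&passed)
}

func main() {
	fmt.Println("workers past the barrier:", runWorkers(100))
}
```

With the deferred Done from the question, Done would only run when the function returns, which is after Wait, so the counter never reaches zero and every worker blocks forever.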

Why does this program run faster when it's allocated fewer threads?

I have a fairly simple Go program designed to compute random Fibonacci numbers to test some strange behavior I observed in a worker pool I wrote. When I allocate one thread, the program finishes in 1.78s. When I allocate 4, it finishes in 9.88s.
The code is as follows:
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
	for {
		var tgt = <-fibNum
		workerWG.Add(1)
		var a, b float64 = 0, 1
		for i := 0; i < tgt; i++ {
			a, b = a+b, a
		}
		workerWG.Done()
	}
}
func main() {
	runtime.GOMAXPROCS(1)
	var fibNum = make(chan int)
	for i := 0; i < 4; i++ {
		go worker(fibNum)
	}
	for i := 0; i < 500000; i++ {
		fibNum <- rand.Intn(1000)
	}
	workerWG.Wait()
}
If I replace runtime.GOMAXPROCS(1) with 4, the program takes four times as long to run.
What's going on here? Why does adding more available threads to a worker pool slow the entire pool down?
My personal theory is that it has to do with the processing time of the worker being less than the overhead of thread management, but I'm not sure. My reservation is caused by the following test:
When I replace the worker function with the following code:
for {
	time.Sleep(500 * time.Millisecond)
}
both one available thread and four available threads take the same amount of time.
I revised your program to look like the following:
package main
import (
	"math/rand"
	"sync"
)
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
	defer workerWG.Done()
	for tgt := range fibNum {
		var a, b float64 = 0, 1
		for i := 0; i < tgt; i++ {
			a, b = a+b, a
		}
	}
}
func main() {
	var fibNum = make(chan int)
	for i := 0; i < 4; i++ {
		workerWG.Add(1)
		go worker(fibNum)
	}
	for i := 0; i < 500000; i++ {
		fibNum <- rand.Intn(100000)
	}
	close(fibNum)
	workerWG.Wait()
}
I cleaned up the wait group usage.
I changed rand.Intn(1000) to rand.Intn(100000)
On my machine that produces:
$ time go run threading.go (GOMAXPROCS=1)
real 0m20.934s
user 0m20.932s
sys 0m0.012s
$ time go run threading.go (GOMAXPROCS=8)
real 0m10.634s
user 0m44.184s
sys 0m1.928s
This means that in your original code, the work performed was negligible compared to the synchronization (channel read/write). The slowdown came from having to synchronize across several threads instead of one while performing only a very small amount of work in between.
In essence, synchronization is expensive compared to calculating Fibonacci numbers up to 1000. This is why people tend to discourage micro-benchmarks. Upping that number gives a better perspective. But an even better idea is to benchmark actual work being done, i.e. including IO, syscalls, processing, crunching, writing output, formatting, etc.
Edit: As an experiment, I upped the number of workers to 8 with GOMAXPROCS set to 8 and the result was:
$ time go run threading.go
real 0m4.971s
user 0m35.692s
sys 0m0.044s
The code written by #thwd is correct and idiomatic Go.
Your code was being serialized due to the atomic nature of sync.WaitGroup. Both workerWG.Add(1) and workerWG.Done() will block until they're able to atomically update the internal counter.
Since the workload is between 0 and 1000 loop iterations, the bottleneck of a single core was enough to keep contention on the waitgroup counter to a minimum.
On multiple cores, the processor spends a lot of time spinning to fix the collisions of waitgroup calls. Add that to the fact that the waitgroup counter is kept on one core and you now have added communication between cores (taking up even more cycles).
A couple hints for simplifying code:
For a small, fixed number of goroutines, a completion channel (chan struct{} to avoid allocations) is cheaper to use.
Use closing the send channel as a kill signal for the goroutines and have them signal that they've exited (waitgroup or channel). Then close the completion channel to free them up for the GC.
If you need a waitgroup, aggressively minimize the number of calls to it. Those calls must be internally serialized, so extra calls force added synchronization.
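The hints above can be sketched roughly as follows: the job channel is closed to tell the workers to stop, and the WaitGroup is touched once per worker instead of once per job. The function names fib and runPool, the partial results channel, and the idea of summing the results (so there is something to check) are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// fib uses the same iterative scheme as the worker in the question.
func fib(n int) float64 {
	var a, b float64 = 0, 1
	for i := 0; i < n; i++ {
		a, b = a+b, a
	}
	return a
}

// runPool feeds jobs to a fixed number of workers and returns the sum of
// all computed Fibonacci numbers.
func runPool(workers int, jobs []int) float64 {
	in := make(chan int)
	partial := make(chan float64, workers) // buffered so workers never block on exit
	var wg sync.WaitGroup
	wg.Add(workers) // a single Add for the whole pool
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done() // one Done per worker, not per job
			var sum float64
			for n := range in { // loop ends when in is closed
				sum += fib(n)
			}
			partial <- sum
		}()
	}
	for _, n := range jobs {
		in <- n
	}
	close(in) // kill signal: no more work
	wg.Wait()
	close(partial)
	var total float64
	for s := range partial {
		total += s
	}
	return total
}

func main() {
	fmt.Println(runPool(4, []int{1, 2, 3, 10}))
}
```

Each job still costs one channel operation, so for very small jobs it may also be worth sending batches of numbers per channel send; that is a separate design choice not shown here.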
Your main computation routine in worker does not allow the scheduler to run.
Calling the scheduler manually like
for i := 0; i < tgt; i++ {
	a, b = a+b, a
	if i%300 == 0 {
		runtime.Gosched()
	}
}
reduces wall-clock time by 30% when switching from one to two threads.
Such artificial microbenchmarks are really hard to get right.

How to measure system overload when using Go

I am rewriting an old system in Go. In the old system I measured the system load average to know whether I should increase the number of threads in my thread pool.
In Go, people do not use thread pools or pools of goroutines, because starting a goroutine is very cheap.
But running too many goroutines is still less efficient than running just enough to keep CPU usage near 100%.
So is there a way to know how many goroutines are ready to run (not blocked) but not currently running? Or is there a way to get the number of scheduled runnable goroutines (the "run queue")?
Check out the runtime/pprof package.
To print "stack traces of all current goroutines" use:
pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
To print "stack traces that led to blocking on synchronization primitives" use:
pprof.Lookup("block").WriteTo(os.Stdout, 1)
You can combine these with the functions in the runtime package such as runtime.NumGoroutine to get some basic reporting.
This example deliberately creates many blocked goroutines and waits for them to complete. Every 5 seconds it prints the output of the block pprof profile, as well as the number of goroutines still in existence:
package main
import (
	"fmt"
	"math/rand"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)
var (
	wg sync.WaitGroup
	m  sync.Mutex
)
func randWait() {
	defer wg.Done()
	m.Lock() // every goroutine queues up on the same mutex
	defer m.Unlock()
	interval := time.Duration(rand.Intn(499)+1) * time.Millisecond
	time.Sleep(interval)
}
func blockStats() {
	for {
		pprof.Lookup("block").WriteTo(os.Stdout, 1)
		fmt.Println("# Goroutines:", runtime.NumGoroutine())
		time.Sleep(5 * time.Second)
	}
}
func main() {
	runtime.SetBlockProfileRate(1) // the block profile is empty unless a rate is set
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go randWait()
	}
	go blockStats()
	wg.Wait()
}
I'm not sure if that's what you're after, but you may be able to modify it to suit your needs.
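If the goal is to sample these numbers from inside the program rather than print them, something along the following lines may help. It is only a sketch: goroutineDump and countWhileBlocked are hypothetical helper names, and runtime.NumGoroutine counts all existing goroutines (blocked ones included), so it is not the run-queue length the question asks about.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// goroutineDump returns the "goroutine" profile in its text form (debug=1).
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// countWhileBlocked parks n goroutines on a channel and reports how many
// goroutines the runtime sees while they are all parked.
func countWhileBlocked(n int) int {
	release := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-release // block until the caller closes the channel
		}()
	}
	started.Wait() // all n goroutines exist and cannot exit yet
	count := runtime.NumGoroutine()
	close(release)
	done.Wait()
	return count
}

func main() {
	fmt.Println("# Goroutines while blocked:", countWhileBlocked(10))
	dump, err := goroutineDump()
	if err != nil {
		fmt.Println("profile error:", err)
		return
	}
	fmt.Print(dump)
}
```

The count includes the calling goroutine and any runtime or library goroutines, so it is at least n+1 here.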
"Is there a way to know how many goroutines are ready to run (not blocked) but not currently running?"
You will be able (Q4 2014/Q1 2015) to try and visualize those goroutines, with a new tracer being developed (Q4 2014): Go Execution Tracer
The trace contains:
events related to goroutine scheduling:
    a goroutine starts executing on a processor,
    a goroutine blocks on a synchronization primitive,
    a goroutine creates or unblocks another goroutine;
network-related events:
    a goroutine blocks on network IO,
    a goroutine is unblocked on network IO;
syscalls-related events:
    a goroutine enters into syscall,
    a goroutine returns from syscall;
garbage-collector-related events:
    GC start/stop,
    concurrent sweep start/stop; and
user events.
By "processor" I mean a logical processor, unit of GOMAXPROCS.
Each event contains event id, a precise timestamp, OS thread id, processor id, goroutine id, stack trace and other relevant information (e.g. unblocked goroutine id).
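The tracer described above is available in the standard library as the runtime/trace package. A minimal sketch of recording a trace around some goroutine activity might look like this; the helper name captureTrace and the toy workload are assumptions for the example, and the trace is written to an in-memory buffer only to keep the snippet self-contained.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/trace"
	"sync"
)

// captureTrace records an execution trace while `workers` goroutines do a
// little work, and returns the size of the recorded trace in bytes.
func captureTrace(workers int) (int, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return 0, err // e.g. tracing is already enabled
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sum := 0
			for j := 0; j < 1000; j++ {
				sum += j
			}
			_ = sum
		}()
	}
	wg.Wait()
	trace.Stop() // returns after all trace data has been written
	return buf.Len(), nil
}

func main() {
	n, err := captureTrace(4)
	if err != nil {
		fmt.Println("trace error:", err)
		return
	}
	fmt.Println("trace bytes recorded:", n)
}
```

To inspect the scheduling events, write the trace to a file instead of a buffer and open it with go tool trace.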
