Simple worker pool in Go - multithreading

I am trying to implement a simple worker pool in go and keep running into issues. All I want to do is have a set number of workers that do a set amount of work before getting more work to do. The code I am using looks similar to:
jobs := make(chan imageMessage, 1)
results := make(chan imageMessage, 1)
for w := 0; w < 2; w++ {
go worker(jobs, results)
}
for j := 0; j < len(images); j++ {
jobs <- imageMessage{path: paths[j], img: images[j]}
}
close(jobs)
for r := 0; r < len(images); r++ {
<-results
}
}
func worker(jobs <-chan imageMessage, results chan<- imageMessage) {
for j := range jobs {
processImage(j.path, j.img)
results <- j
}
}
My understanding is that this should create 2 workers that can do 1 "thing" at a time and will continue to get more work as they complete that 1 thing until there is nothing else to do. However, I get fatal error: all goroutines are asleep - deadlock!
If I set the buffer to something huge like 100, this works, but I want to be able to limit the work done at a time.
I feel like I'm close, but obviously missing something.

The problem is that you only start "draining" the results channel, once you successfully sent all jobs on the jobs channel. But for you to be able to send all jobs, either the jobs channel must have big enough buffer, or worker goroutines must be able to consume jobs from it.
But a worker goroutines when consuming a job, before it could take the next one, sends the result on the results channel. If the buffer of the results channel is full, sending the result will block.
But the last part – a worker goroutine blocked in sending the result – can only be "unblocked" by receiving from the results channel – which you don't until you can send all the jobs. Deadlock if the buffer of the jobs channel and the results channel cannot hold all your jobs. This also explains why it works if you increase the buffer size to a big value: if the jobs can fit into the buffers, deadlock will not occur, and after all jobs are sent successfully, your final loop will drain the results channel.
The solution? Run generating-and-sending jobs in its own goroutine, so you can start receiving from the results channel "immediately" without having to wait to send all jobs, which means the worker goroutines will not get blocked forever trying to send results:
go func() {
for j := 0; j < len(images); j++ {
jobs <- imageMessage{path: paths[j], img: images[j]}
}
close(jobs)
}()
Try it on the Go Playground.
Also check out a similar implementation in Is this an idiomatic worker thread pool in Go?

You can use this worker. Simple and efficient. https://github.com/tamnguyenvt/go-worker
NewWorkerManager(WorkerManagerParams{
WorkerSize: <number of workers>,
RelaxAfter: <sleep for awhile to relax server after given duration>,
RelaxDuration: <relax duration>,
WorkerFunc: <your worker function here>,
LogEnable: <enable log or not>,
StopTimeout: <timeout all workers after given duration>,
}

Related

golang multithreaded web crawler runs into deadlock

I just started to learn multithreaded programming using golang, and I'm trying to write a multithreaded web crawler using BFS traversal, however I cannot get the code working. The error I get is fatal error: all goroutines are asleep - deadlock!
I will paste the code below, but let me explain conceptually how it works:
I have one master thread (the main function itself) and N worker threads. I intentionally chose to use BFS approach with a fixed amount of worker threads, because it seems using a DFS approach I will have to spawn a new thread for each single new URL to crawl, which might become a huge burden for context switch.
I am using two channels:
urlsToCrawl: master thread sends URLs to crawl to worker threads.
urlsDiscovered: worker threads send discovered URLs back to master.
Here is the code implementation, I removed some non relevant details (e.g. how to parse html page etc..)
The trick I'm trying to do here is: I am using the channel as a queue to do BFS, and when the queue's size is 0, it is impossible to know whether it is because "A. there is really no more URLs to crawl" OR because "B. some worker thread(s) are still working so there might be more URLs to crawl soon". Therefore I introduced this count variable, basically whenever a new url is sent to workers to be crawled, count is incremented, therefore when count == 0 and channel is empty, it would mean "A. there is really no more URLs to crawl"; otherwise when count > 0 and channel is empty, it would mean "B. some worker thread(s) are still working so there might be more URLs to crawl soon".
However as I mentioned, this doesn't seem to work and I run into deadlock. Would anyone please shed some light? Thanks!
package main
import (
"fmt"
)
var (
count = 0 // This tracks how many worker threads are actively working right now
)
func crawlUrl(urlsToCrawl chan string, urlsDiscovered chan Pair) {
for url := range urlsToCrawl {
urls := getUrls(url) // This returns an array of string, if no URL found, it returns an empty array
urlsDiscovered <- urls
}
}
func main() {
urlsToCrawl := make(chan string)
urlsDiscovered := make(chan string[])
i := 0
for i < 8 {
go crawlUrl(urlsToCrawl, urlsDiscovered)
i++
}
visited := map[string]bool{"some_seed_url": true}
count++
urlsToCrawl <- "some_seed_url"
for urls := range urlsDiscovered {
count-- // One message is received by master, meaning one worker thread has finished an job item, therefore decrementing count
for _, url := range urls {
_, ok := visited[url]
if ok {
continue // This URL has been crawled before
}
visited[url] = true
count ++ // One more work item will be sent to worker, therefore first increment count
urlsToCrawl <- url
}
if count == 0 {
close(urlsDiscovered)
close(urlsToCrawl)
break
} // else some worker must be working so let's wait to see if there is new msg coming through the channel
}
}
Your channels are unbuffered
urlsToCrawl := make(chan string)
urlsDiscovered := make(chan string[])
So a goroutine which reads from or writes to a channel will block until a goroutine on the other side is doing the opposite.
So you start 8 crawlUrl goroutines which all block while reading from urlsToCrawl, meaning that main can send 8 urls before blocking. The crawlUrl goroutines are blocked until main reads from urlsDiscovered. So if you have more than 8 URL's going around, all goroutines are waiting on each other(deadlock).
The solution to this is to use buffered channels with a capacity you are very unlikely to exceed:
urlsToCrawl := make(chan string, 1000)
urlsDiscovered := make(chan string[], 100)
If you expect you might still exceed the capacity of the channel in extreme cases, you can perform non-blocking operations which allow you to for example discard discovered URL's if the channel is full instead of blocking.
select {
case: urlsDiscovered <- urls:
// on success (url written)
default:
// channel is full, can't write without blocking
}
Try to integrate the WaitGroup package.

Self-Synchronizing Goroutines end up with Deadlock

I have a stress test issue that I want to solve with simple synchronization in Go. So far I have tried to find documenation on my specific usecase regarding synchronization in Go, but didn't find anything that fits.
To be a bit more specific:
I must fulfill a task where I have to start a large amount of threads (in this example only illustrated with two threads) in the main routine. All of the initiated workers are supposed to prepare some initialization actions by themselves in unordered manner. Until they reach a small sequence of commands, which I want them to be executed by all goroutines at once, which is why I want to self-synchronize the goroutines with each other. It is very vital for my task that the delay through the main routine, which instantiates all other goroutines, does not affect the true parallelism of the workers execution (at the label #maximum parallel in the comment). For this purpose I do initialize a wait group with the amount of running goroutines in the main routine and pass it over to all routines so they can synchronize each others workflow.
The code looks similar to this example:
import sync
func worker_action(wait_group *sync.WaitGroup) {
// ...
// initialization
// ...
defer wait_group.Done()
wait_group.Wait() // #label: wait
// sequence of maximum parallel instructions // #label: maximum parallel
// ...
}
func main() {
var numThreads int = 2 // the number of threads shall be much higher for the actual stress test
var wait_group sync.WaitGroup
wait_group.Add(numThreads)
for i := 0; i < numThreads; i++ {
go worker_action(&wait_group)
}
// ...
}
Unfortunately my setup runs into a deadlock, as soon as all goroutines have reached the Wait instruction (labeled with #wait in the comment). This is true for any amount of threads that I start with the main routine (even two threads are caught in a deadlock within no time).
From my point of view a deadlock should not occur, due to the fact that immediately before the wait instruction each goroutine executes the done function on the same wait group.
Do I have a wrong understanding of how wait groups work? Is it for instance not allowed to execute the wait function inside of a goroutine other than the main routine? Or can someone give me a hint on what else I am missing?
Thank you very much in advance.
EDIT:
Thanks a lot #tkausl. It was indeed the unnecessary "defer" that caused the problem. I do not know how I could not see it myself.
There are several issues in your code. First the form. Idiomatic Go should use camelCase. wg is a better name for the WaitGroup.
But more important is the use where your code is waiting. Not inside your Goroutines. It should wait inside the main func:
func workerAction(wg *sync.WaitGroup) {
// ...
// initialization
// ...
defer wg.Done()
// wg.Wait() // #label: wait
// sequence of maximum parallel instructions // #label: maximum parallel
// ...
}
func main() {
var numThreads int = 2 // the number of threads shall be much higher for the actual stress test
var wg sync.WaitGroup
wg.Add(numThreads)
for i := 0; i < numThreads; i++ {
go workerAction(&wg)
}
wg.Wait() // you need to wait here
// ...
}
Again thanks #tkausl. The issue was resolved by removing the unnecessary "defer" instruction from the line that was meant to let the worker goroutines increment the number of finished threads.
I.e. "defer wait_group.Done()" -> "wait_group.Done()"

Synchronize write to file from heavy operations in different threads

I need to elaborate a file (potentially a big file) one block at a time and write the result to a new file.
To put it simply, I have the basic function to elaborate a block:
func elaborateBlock(block []byte) []byte { ... }
Every block needs to be elaborated and then written to the output file sequentially (preserving original order).
The one-thread implementation is trivial:
for {
buffer := make([]byte, BlockSize)
_, err := inputFile.Read(buffer)
if err == io.EOF {
break
}
processedData := elaborateBlock(buffer)
outputFile.Write(processedData)
}
But the elaboration can be heavy and every block can be processed separately, so a multi-threaded implementation is the natural evolution.
The solution I came up with is to create an array of channels, compute every block in a different thread and sync the final write by looping the channel array:
Utility function:
func blockThread(channel chan []byte, block []byte) {
channel <- elaborateBlock(block)
}
In the main program:
chans = []chan []byte {}
for {
buffer := make([]byte, BlockSize)
_, err := inputFile.Read(buffer)
if err == io.EOF {
break
}
channel := make(chan []byte)
chans = append(chans, channel)
go blockThread(channel, buffer)
}
for i := range chans {
data := <- chans[i]
outputFile.Write(data)
}
This approach works but can be problematic with large files because it requires to load the whole file in memory before starting writing the output.
Do you think there can be a better solution, with also better performance overall?
If blocks do need to be written out in order
If you want to work on multiple blocks concurrently, obviously you need to hold multiple blocks in memory at the same time.
You may decide how many blocks you want to process concurrently, and it's enough to read as many into memory at the same time. E.g. you may say you want to process 5 blocks concurrently. This will limit memory usage, and still utilize your CPU resources potentially to the max. Recommended to pick a number based on your available CPU cores (if processing a block does not already use multi cores). This can be queried using runtime.GOMAXPROCS(0).
You should have a single goroutine that reads the input file sequentially, and prodocue the blocks wrapped in Jobs (which also contain the block index).
You should have multiple worker goroutines, preferable as many as cores you have (but experiment with smaller and higher values too). Each worker goroutine just receives jobs, and calls elaborateBlock() on the data, and delivers it on the results channel.
There should be a single, designated consumer which receives completed jobs, and writes them in order to the output file. Since goroutines run concurrently and we have no control in which order the blocks are completed, the consumer should keep track of the index of the next block to be written to the output. Blocks arriving out of order should only be stored, and only proceed with writing if the subsequent block arrives.
This is an (incomplete) example how to do all these:
const BlockSize = 1 << 20 // 1 MB
func elaborateBlock(in []byte) []byte { return in }
type Job struct {
Index int
Block []byte
}
func producer(jobsCh chan<- *Job) {
// Init input file:
var inputFile *os.File
for index := 0; ; index++ {
job := &Job{
Index: index,
Block: make([]byte, BlockSize),
}
_, err := inputFile.Read(job.Block)
if err != nil {
break
}
jobsCh <- job
}
}
func worker(jobsCh <-chan *Job, resultCh chan<- *Job) {
for job := range jobsCh {
job.Block = elaborateBlock(job.Block)
resultCh <- job
}
}
func consumer(resultCh <-chan *Job) {
// Init output file:
var outputFile *os.File
nextIdx := 0
jobMap := map[int]*Job{}
for job := range resultCh {
jobMap[job.Index] = job
// Write out all blocks we have in contiguous index range:
for {
j := jobMap[nextIdx]
if j == nil {
break
}
if _, err := outputFile.Write(j.Block); err != nil {
// handle error, maybe terminate?
}
delete(nextIdx) // This job is written out
nextIdx++
}
}
}
func main() {
jobsCh := make(chan *Job)
resultCh := make(chan *Job)
for i := 0; i < 5; i++ {
go worker(jobsCh, resultCh)
}
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
defer wg.Done()
consumer(resultCh)
}()
// Start producing jobs:
producer(jobsCh)
// No more jobs:
close(jobsCh)
// Wait for consumer to complete:
wg.Wait()
}
One thing to note here: this alone won't guarantee limiting the used memory. Imagine a case where the first block would require an enormous time to calculate, while subsequent blocks do not. What would happen? The first block would occupy a worker, and the other workers would "quickly" complete the subsequent blocks. The consumer would store all in memory, waiting for the first block to complete (as that has to be written out first). This could increase memory usage.
How could we avoid this?
By introducing a job pool. New jobs could not be created arbitrarily, but taken from a pool. If the pool is empty, the producer has to wait. So when the producer needs a new Job, takes one from a pool. When the consumer has written out a Job, puts it back into the pool. Simple as that. This would also reduce pressure on the garbage collector, as jobs (and large []byte buffers) are not created and thrown away, they could be re-used.
For a simple Job pool implementation you could use a buffered channel. For details, see How to implement Memory Pooling in Golang.
If blocks can be written in any order
Another option could be to allocate the output file in advance. If the size of the output blocks are also deterministic, you can do so (e.g. outsize := (insize / blocksize) * outblockSize).
To what end?
If you have the output file pre-allocated, the consumer does not need to wait input blocks in order. Once an input block is calculated, you can calculate the position where it will go in the output, seek to that position and just write it. For this you may use File.Seek().
This solution still requires to send the block index from the producer to the consumer, but the consumer won't need to store blocks arriving out-of-order, so the consumer can be simpler, and does not need to store completed blocks until the subsequent one arrives in order to proceed with writing the output file.
Note that this solution naturally does not pose a memory threat, as completed jobs are never accumulated / cached, they are written out in the order of completion.
See related questions for more details and techniques:
Is this an idiomatic worker thread pool in Go?
How to collect values from N goroutines executed in a specific order?
here is a working example that should work and is as close as possible to your original code.
the idea is to turn your array into a channel of channels of bytes. then
first fire up a consumer that will read on this channel of channels , get the channel of bytes, read from it and write the result.
Back on the main thread you create a channel of bytes, write it to the channel of channels (now the consumer reading sequentially from them will read the results in order) and then fire up the process that will do the work and write on the allocated channel (producers).
what will happen now is that the there will be a "race" between the procuders and the consumer, as soon as a produced block is read from the consumer and written the resources associated with it will be deallocated. this could be an improvement to your original design.
here is the code and the playground link:
package main
import (
"bytes"
"fmt"
"io"
"sync"
)
func elaborateBlock(b []byte) []byte {
return []byte("werkwerkwerk")
}
func blockThread(channel chan []byte, block []byte, wg *sync.WaitGroup) {
channel <- elaborateBlock(block)
wg.Done()
}
func main() {
chans := make(chan chan []byte)
BlockSize := 3
inputBytes := bytes.NewBuffer([]byte("transmutemetowerkwerkwerk"))
producewg := sync.WaitGroup{}
consumewg := sync.WaitGroup{}
consumewg.Add(1)
go func() {
chancount := 0
for ch := range chans {
data := <-ch
fmt.Printf("got %d block, result:%s\n", chancount, data)
chancount++
}
fmt.Printf("done receiving\n")
consumewg.Done()
}()
for {
buffer := make([]byte, BlockSize)
_, err := inputBytes.Read(buffer)
if err == io.EOF {
go func() {
//wait for all the procuders to finish
producewg.Wait()
//then close the main channel to notify the consumer
close(chans)
}()
break
}
channel := make(chan []byte)
chans <- channel //give the channel that we return the result to the receiver
producewg.Add(1)
go blockThread(channel, buffer, &producewg)
}
consumewg.Wait()
fmt.Printf("main exiting")
}
playground link
as a minor point i don't feel right about the "read the whole file into memory" statement cause you are just reading a block every time from the Reader, maybe "holding the result of the whole computation in memory" is more appropriate?

How to safely interact with channels in goroutines in Golang

I am new to go and I am trying to understand the way channels in goroutines work. To my understanding, the keyword range could be used to iterate over a the values of the channel up until the channel is closed or the buffer runs out; hence, a for range c will repeatedly loops until the buffer runs out.
I have the following simple function that adds value to a channel:
func main() {
c := make(chan int)
go printchannel(c)
for i:=0; i<10 ; i++ {
c <- i
}
}
I have two implementations of printchannel and I am not sure why the behaviour is different.
Implementation 1:
func printchannel(c chan int) {
for range c {
fmt.Println(<-c)
}
}
output: 1 3 5 7
Implementation 2:
func printchannel(c chan int) {
for i:=range c {
fmt.Println(i)
}
}
output: 0 1 2 3 4 5 6 7 8
And I was expecting neither of those outputs!
Wanted output: 0 1 2 3 4 5 6 7 8 9
Shouldnt the main function and the printchannel function run on two threads in parallel, one adding values to the channel and the other reading the values up until the channel is closed? I might be missing some fundamental go/thread concept here and pointers to that would be helpful.
Feedback on this (and my understanding to channels manipulation in goroutines) is greatly appreciated!
Implementation 1. You're reading from the channel twice - range c and <-c are both reading from the channel.
Implementation 2. That's the correct approach. The reason you might not see 9 printed is that two goroutines might run in parallel threads. In that case it might go like this:
main goroutine sends 9 to the channel and blocks until it's read
second goroutine receives 9 from the channel
main goroutine unblocks and exits. That terminates whole program which doesn't give second goroutine a chance to print 9
In case like that you have to synchronize your goroutines. For example, like so
func printchannel(c chan int, wg *sync.WaitGroup) {
for i:=range c {
fmt.Println(i)
}
wg.Done() //notify that we're done here
}
func main() {
c := make(chan int)
wg := sync.WaitGroup{}
wg.Add(1) //increase by one to wait for one goroutine to finish
//very important to do it here and not in the goroutine
//otherwise you get race condition
go printchannel(c, &wg) //very important to pass wg by reference
//sync.WaitGroup is a structure, passing it
//by value would produce incorrect results
for i:=0; i<10 ; i++ {
c <- i
}
close(c) //close the channel to terminate the range loop
wg.Wait() //wait for the goroutine to finish
}
As to goroutines vs threads. You shouldn't confuse them and probably should understand the difference between them. Goroutines are green threads. There're countless blog posts, lectures and stackoverflow answers on that topic.
In implementation 1, range reads into channel once, then again in Println. Hence you're skipping over 2, 4, 6, 8.
In both implementations, once the final i (9) has been sent to goroutine, the program exits. Thus goroutine does not have the time to print out 9. To solve it, use a WaitGroup as has been mentioned in the other answer, or a done channel to avoid semaphore/mutex.
func main() {
c := make(chan int)
done := make(chan bool)
go printchannel(c, done)
for i:=0; i<10 ; i++ {
c <- i
}
close(c)
<- done
}
func printchannel(c chan int, done chan bool) {
for i := range c {
fmt.Println(i)
}
done <- true
}
The reason your first implementation only returns every other number is because you are, in effect "taking" from c twice each time the loop runs: first with range, then again with <-. It just happens that you're not actually binding or using the first value taken off the channel, so all you end up printing is every other one.
An alternative approach to your first implementation would be to not use range at all, e.g.:
func printchannel(c chan int) {
for {
fmt.Println(<-c)
}
}
I could not replicate the behavior of your second implementation, on my machine, but the reason for that is that both of your implementations are racy - they will terminate whenever main ends, regardless of what data may be pending in a channel or however many goroutines may be active.
As a closing note, I'd warn you not to think about goroutines as explicitly being "threads", though they have a similar mental model and interface. In a simple program like this it's not at all unlikely that Go might just do it all using a single OS thread.
Your first loop does not work as you have 2 blocking channel receivers and they do not execute at the same time.
When you call the goroutine the loop starts, and it waits for the first value to be sent to the channel. Effectively think of it as <-c .
When the for loop in the main function runs it sends 0 on the Chan. At this point the range c recieves the value and stops blocking the execution of the loop.
Then it is blocked by the reciever at fmt.println(<-c) . When 1 is sent on the second iteration of the loop in main the recieved at fmt.println(<-c) reads from the channel, allowing fmt.println to execute thus finishing the loop and waiting for a value at the for range c .
Your second implementation of the looping mechanism is the correct one.
The reason it exits before printing to 9 is that after the for loop in main finishes the program goes ahead and completes execution of main.
In Go func main is launched as a goroutine itself while executing. Thus when the for loop in main completes it goes ahead and exits, and as the print is within a parallel goroutine that is closed, it is never executed. There is no time for it to print as there is nothing to block main from completing and exiting the program.
One way to solve this is to use wait groups http://www.golangprograms.com/go-language/concurrency.html
In order to get the expected result you need to have a blocking process running in main that provides enough time or waits for confirmation of the execution of the goroutine before allowing the program to continue.

Why does this program run faster when it's allocated fewer threads?

I have a fairly simple Go program designed to compute random Fibonacci numbers to test some strange behavior I observed in a worker pool I wrote. When I allocate one thread, the program finishes in 1.78s. When I allocate 4, it finishes in 9.88s.
The code is as follows:
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for {
var tgt = <-fibNum
workerWG.Add(1)
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
workerWG.Done()
}
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // LINE IN QUESTION
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(1000)
}
workerWG.Wait()
}
If I replace runtime.GOMAXPROCS(1) with 4, the program takes four times as long to run.
What's going on here? Why does adding more available threads to a worker pool slow the entire pool down?
My personal theory is that it has to do with the processing time of the worker being less than the overhead of thread management, but I'm not sure. My reservation is caused by the following test:
When I replace the worker function with the following code:
for {
<-fibNum
time.Sleep(500 * time.Millisecond)
}
both one available thread and four available threads take the same amount of time.
I revised your program to look like the following:
package main
import (
"math/rand"
"runtime"
"sync"
"time"
)
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for tgt := range fibNum {
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
}
workerWG.Done()
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // LINE IN QUESTION
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
workerWG.Add(1)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(100000)
}
close(fibNum)
workerWG.Wait()
}
I cleaned up the wait group usage.
I changed rand.Intn(1000) to rand.Intn(100000)
On my machine that produces:
$ time go run threading.go (GOMAXPROCS=1)
real 0m20.934s
user 0m20.932s
sys 0m0.012s
$ time go run threading.go (GOMAXPROCS=8)
real 0m10.634s
user 0m44.184s
sys 0m1.928s
This means that in your original code, the work performed vs synchronization (channel read/write) was negligible. The slowdown came from having to synchronize across threads instead of one and only perform a very small amount of work inbetween.
In essence, synchronization is expensive compared to calculating fibonacci numbers up to 1000. This is why people tend to discourage micro-benchmarks. Upping that number gives a better perspective. But an even better idea is to benchmark actual work being done i.e. including IO, syscalls, processing, crunching, writing output, formatting, etc.
Edit: As an experiment, I upped the number of workers to 8 with GOMAXPROCS set to 8 and the result was:
$ time go run threading.go
real 0m4.971s
user 0m35.692s
sys 0m0.044s
The code written by #thwd is correct and idiomatic Go.
Your code was being serialized due to the atomic nature of sync.WaitGroup. Both workerWG.Add(1) and workerWG.Done() will block until they're able to atomically update the internal counter.
Since the workload is between 0 and 1000 recursive calls, the bottleneck of a single core was enough to keep data races on the waitgroup counter to a minimum.
On multiple cores, the processor spends a lot of time spinning to fix the collisions of waitgroup calls. Add that to the fact that the waitgroup counter is kept on one core and you now have added communication between cores (taking up even more cycles).
A couple hints for simplifying code:
For a small, set number of goroutines, a complete channel (chan struct{} to avoid allocations) is cheaper to use.
Use the send channel close as a kill signal for goroutines and have them signal that they've exited (waitgroup or channel). Then, close to complete channel to free them up for the GC.
If you need a waitgroup, aggressively minimize the number of calls to it. Those calls must be internally serialized, so extra calls forces added synchronization.
Your main computation routine in worker does not allow the scheduler to run.
Calling the scheduler manually like
for i := 0; i < tgt; i++ {
a, b = a+b, a
if i%300 == 0 {
runtime.Gosched()
}
}
Reduces wall clock by 30% when switching from one to two threads.
Such artificial microbenchmarks are really hard to get right.

Resources