How would you define a pool of goroutines to be executed at once?

TL;DR: Please just go to the last part and tell me how you would solve this problem.
I started using Go this morning, coming from Python. I want to call a closed-source executable from Go several times, with a bit of concurrency and with different command line arguments. My resulting code works, but I'd like your input on how to improve it. Since I'm at an early learning stage, I'll also explain my workflow.
For the sake of simplicity, assume here that this "external closed-source program" is zenity, a Linux command line tool that can display graphical message boxes from the command line.
Calling an executable file from Go
So, in Go, I would go like this:
package main
import "os/exec"
func main() {
cmd := exec.Command("zenity", "--info", "--text='Hello World'")
cmd.Run()
}
This should work as expected. Note that .Run() is functionally equivalent to .Start() followed by .Wait(). This is great, but if I only wanted to execute this program once, writing a program around it would not be worth the effort. So let's do it multiple times.
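The equivalence between .Run() and .Start() followed by .Wait() can be sketched as below. This is only an illustration: runStartWait is a hypothetical helper name, not part of os/exec. Both steps return an error worth checking, and because exec.Command does not go through a shell, the text argument needs no extra quoting.

```go
package main

import (
	"log"
	"os/exec"
)

// runStartWait (hypothetical helper) does by hand what cmd.Run() does:
// start the process, then block until it exits. Both steps can fail.
func runStartWait(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err // for example, the executable was not found
	}
	return cmd.Wait() // non-nil if the process exits with a non-zero status
}

func main() {
	// No shell is involved, so the text is passed to zenity as-is.
	if err := runStartWait("zenity", "--info", "--text=Hello World"); err != nil {
		log.Println("zenity failed:", err)
	}
}
```

Splitting the call this way is mostly useful when something has to happen between starting the process and waiting for it; otherwise cmd.Run() is the shorter spelling.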
Calling an executable multiple times
Now that I had this working, I'd like to call my program multiple times, with custom command line arguments (here just i for the sake of simplicity).
package main
import (
"os/exec"
"strconv"
)
func main() {
NumEl := 8 // Number of times the external program is called
for i:=0; i<NumEl; i++ {
cmd := exec.Command("zenity", "--info", "--text='Hello from iteration n." + strconv.Itoa(i) + "'")
cmd.Run()
}
}
Ok, we did it! But I still can't see the advantage of Go over Python … This piece of code is actually executed in a serial fashion. I have a multiple-core CPU and I'd like to take advantage of it. So let's add some concurrency with goroutines.
Goroutines, or a way to make my program parallel
a) First attempt: just add "go"s everywhere
Let's rewrite our code to make things easier to call and reuse and add the famous go keyword:
package main
import (
"os/exec"
"strconv"
)
func main() {
NumEl := 8
for i:=0; i<NumEl; i++ {
go callProg(i) // <--- There!
}
}
func callProg(i int) {
cmd := exec.Command("zenity", "--info", "--text='Hello from iteration n." + strconv.Itoa(i) + "'")
cmd.Run()
}
Nothing! What is the problem? All the goroutines are executed at once. I don't really know why zenity is not executed, but AFAIK the Go program exited before the external zenity program could even be initialized. This was confirmed by the use of time.Sleep: waiting for a couple of seconds was enough to let the 8 instances of zenity launch themselves. I don't know if this can be considered a bug, though.
To make it worse, the real program I'd actually like to call takes a while to execute itself. If I execute 8 instances of this program in parallel on my 4-core CPU, it's gonna waste some time doing a lot of context switching … I don't know how plain Go goroutines behave, but exec.Command will launch zenity 8 times in 8 different threads. To make it even worse, I want to execute this program more than 100,000 times. Doing all of that at once in goroutines won't be efficient at all. Still, I'd like to leverage my 4-core CPU!
b) Second attempt: use pools of goroutines
The online resources tend to recommend the use of sync.WaitGroup for this kind of work. The problem with that approach is that you are basically working with batches of goroutines: if I create a WaitGroup of 4 members, the Go program will wait for all 4 external programs to finish before calling a new batch of 4 programs. This is not efficient: CPU is wasted, once again.
Some other resources recommended the use of a buffered channel to do the work:
package main
import (
"os/exec"
"strconv"
)
func main() {
NumEl := 8 // Number of times the external program is called
NumCore := 4 // Number of available cores
c := make(chan bool, NumCore - 1)
for i:=0; i<NumEl; i++ {
go callProg(i, c)
c <- true // At the NumCoreth iteration, c is blocking
}
}
func callProg(i int, c chan bool) {
defer func () {<- c}()
cmd := exec.Command("zenity", "--info", "--text='Hello from iteration n." + strconv.Itoa(i) + "'")
cmd.Run()
}
This seems ugly. Channels were not intended for this purpose: I'm exploiting a side-effect. I love the concept of defer but I hate having to declare a function (even a lambda) to pop a value out of the dummy channel that I created. Oh, and of course, using a dummy channel is, by itself, ugly.
c) Third attempt: die when all the children are dead
Now we are nearly finished. I just have to take into account yet another side effect: the Go program closes before all the zenity pop-ups are closed. This is because when the loop is finished (at the 8th iteration), nothing prevents the program from finishing. This time, sync.WaitGroup will be useful.
package main
import (
"os/exec"
"strconv"
"sync"
)
func main() {
NumEl := 8 // Number of times the external program is called
NumCore := 4 // Number of available cores
c := make(chan bool, NumCore - 1)
wg := new(sync.WaitGroup)
wg.Add(NumEl) // Set the number of goroutines to (0 + NumEl)
for i:=0; i<NumEl; i++ {
go callProg(i, c, wg)
c <- true // At the NumCoreth iteration, c is blocking
}
wg.Wait() // Wait for all the children to die
close(c)
}
func callProg(i int, c chan bool, wg *sync.WaitGroup) {
defer func () {
<- c
wg.Done() // Decrease the number of alive goroutines
}()
cmd := exec.Command("zenity", "--info", "--text='Hello from iteration n." + strconv.Itoa(i) + "'")
cmd.Run()
}
Done.
My questions
Do you know any other proper way to limit the number of goroutines executed at once?
I don't mean threads; how Go manages goroutines internally is not relevant. I really mean limiting the number of goroutines launched at once: exec.Command creates a new thread each time it is called, so I should control the number of times it is called.
Does that code look fine to you?
Do you know how to avoid the use of a dummy channel in that case?
I can't convince myself that such dummy channels are the way to go.

I would spawn 4 worker goroutines that read the tasks from a common channel. Goroutines that are faster than others (because they are scheduled differently or happen to get simple tasks) will receive more tasks from this channel than others. In addition to that, I would use a sync.WaitGroup to wait for all workers to finish. The remaining part is just the creation of the tasks. You can see an example implementation of that approach here:
package main
import (
"os/exec"
"strconv"
"sync"
)
func main() {
tasks := make(chan *exec.Cmd, 64)
// spawn four worker goroutines
var wg sync.WaitGroup
for i := 0; i < 4; i++ {
wg.Add(1)
go func() {
for cmd := range tasks {
cmd.Run()
}
wg.Done()
}()
}
// generate some tasks
for i := 0; i < 10; i++ {
tasks <- exec.Command("zenity", "--info", "--text='Hello from iteration n."+strconv.Itoa(i)+"'")
}
close(tasks)
// wait for the workers to finish
wg.Wait()
}
There are probably other possible approaches, but I think this is a very clean solution that is easy to understand.

A simple approach to throttling (execute f() N times but maximum maxConcurrency concurrently), just a scheme:
package main
import (
"sync"
)
const maxConcurrency = 4 // for example
var throttle = make(chan int, maxConcurrency)
func main() {
const N = 100 // for example
var wg sync.WaitGroup
for i := 0; i < N; i++ {
throttle <- 1 // whatever number
wg.Add(1)
go f(i, &wg, throttle)
}
wg.Wait()
}
func f(i int, wg *sync.WaitGroup, throttle chan int) {
defer wg.Done()
// whatever processing
println(i)
<-throttle
}
I probably wouldn't call the throttle channel "dummy". IMHO it's an elegant way to limit concurrency (it's not my invention, of course).
BTW: Please note that you're ignoring the returned error from cmd.Run().

🧩 Modules
Golang Concurrency Manager
📃 Template
package main
import (
"fmt"
"github.com/zenthangplus/goccm"
"math/rand"
"runtime"
)
func main() {
semaphore := goccm.New(runtime.NumCPU())
for i := 0; i < 100; i++ { // bounded (100 is just an example) so that WaitAllDone() below is reached
semaphore.Wait()
go func() {
fmt.Println(rand.Int())
semaphore.Done()
}()
}
semaphore.WaitAllDone()
}
🎰 Optimal routine quantity
If the operation is CPU-bound: runtime.NumCPU()
Otherwise test with: time go run *.go
🔨 Configure
export GOPATH="$(pwd)/gopath"
go mod init *.go
go mod tidy
🧹 CleanUp
find "${GOPATH}" -exec chmod +w {} \;
rm --recursive --force "${GOPATH}"

try this:
https://github.com/korovkin/limiter
limiter := NewConcurrencyLimiter(10)
limiter.Execute(func() {
zenity(...)
})
limiter.Wait()

You could use the worker pool pattern described in this post.
This is what an implementation would look like:
package main
import (
	"os/exec"
	"strconv"
	"sync"
)
func main() {
	NumEl := 8
	pool := 4
	intChan := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < pool; i++ {
		wg.Add(1)
		go callProg(intChan, &wg) // <--- launch the worker routines
	}
	for i := 0; i < NumEl; i++ {
		intChan <- i // <--- push data which will be received by workers
	}
	close(intChan) // <--- will safely close the channel & terminate worker routines
	wg.Wait()      // <--- without this, main could return before the last commands have run
}
func callProg(intChan chan int, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := range intChan {
		cmd := exec.Command("zenity", "--info", "--text='Hello from iteration n."+strconv.Itoa(i)+"'")
		cmd.Run()
	}
}

Related

How to "keep main thread running" even though routine has "runtime error" occurred in Golang?

I am new to Golang; I did Java in the past. I've written a Go function that calculates the integer part of a division. My idea is to use a ticker to generate a random number and run the calculation periodically. The problem I ran into is that if the goroutine hits an error, the main thread stops. Is there any way to keep the main thread running even when an error occurs in a goroutine?
Below is the test code:
func main() {
ticker := time.NewTicker(1*1000 * time.Millisecond)
for _ = range ticker.C {
rand.Seed(time.Now().Unix())
divisor := rand.Intn(20)
go calculate(divisor)
}
}
func calculate(divisor int){
result:= 100/divisor
fmt.Print("1/"+strconv.Itoa(divisor)+"=")
fmt.Println(result)
}
Go's error handling really confuses me. As I understand it, the error occurs in the "thread"; the main function is only responsible for creating the thread and assigning the task, so it should not care whether exceptions occur in the "threads" and should always keep going. In Java, I could surround the calculation with try/catch:
try {
result = 1/divisor;
} catch (Exception e) {
e.printStackTrace();
}
Even if I give divisor a value of 0 in a separate thread every time, the main process will not exit. In Go, I think
go calculate(divisor)
opens a new "thread" and runs calculate inside it, so why does the main process quit?
Is there any way to prevent the main process from quitting?
Thanks.
Use the defer/recover feature:
package main
import (
"fmt"
"time"
"math/rand"
)
func main() {
ticker := time.NewTicker(1*1000 * time.Millisecond)
for _ = range ticker.C {
rand.Seed(time.Now().Unix())
divisor := rand.Intn(20)
go calculate(divisor)
}
fmt.Println("that's all")
}
func calculate(divisor int){
defer func() {
if r := recover(); r != nil {
fmt.Println("Recovered in f", r)
}
}()
result:= 100/divisor
fmt.Printf("1/%d=", divisor)
fmt.Println(result)
}
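One detail worth spelling out: recover only has an effect when it is called from a deferred function on the goroutine that panicked, which is why the defer sits inside calculate and not in main. As a variation on the answer above, the recovered panic can be handed back as an ordinary error value. A minimal sketch, where safeDivide is a hypothetical name chosen for this example:

```go
package main

import "fmt"

// safeDivide (hypothetical helper) returns a/b and turns the runtime panic
// raised by an integer division by zero into an ordinary error value.
func safeDivide(a, b int) (result int, err error) {
	defer func() {
		if r := recover(); r != nil {
			// The named result err is what the caller receives.
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return a / b, nil
}

func main() {
	for _, divisor := range []int{5, 0, 20} {
		result, err := safeDivide(100, divisor)
		if err != nil {
			fmt.Printf("100/%d failed: %v\n", divisor, err)
			continue
		}
		fmt.Printf("100/%d=%d\n", divisor, result)
	}
}
```

For a case as simple as this one, checking divisor == 0 before dividing is the more idiomatic route; recover is better kept as a safety net for panics that cannot be ruled out up front.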

Golang scheduler mystery: Linux vs Mac OS X

I've run into some mysterious behavior with the Go scheduler, and I'm very curious about what's going on. The gist is that runtime.Gosched() doesn't work as expected in Linux unless it is preceded by a log.Printf() call, but it works as expected in both cases on OS X. Here's a minimal setup that reproduces the behavior:
The main goroutine sleeps for 1000 periods of 1ms, and after each sleep pushes a dummy message onto another goroutine via a channel. The second goroutine listens for new messages, and every time it gets one it does 10ms of work. So without any runtime.Gosched() calls, the program will take 10 seconds to run.
When I add periodic runtime.Gosched() calls in the second goroutine, as expected the program runtime shrinks down to 1 second on my Mac. However, when I try running the same program on Ubuntu, it still takes 10 seconds. I made sure to set runtime.GOMAXPROCS(1) in both cases.
Here's where it gets really strange: if I just add a logging statement before the runtime.Gosched() calls, then suddenly the program runs in the expected 1 second on Ubuntu as well.
package main
import (
"time"
"log"
"runtime"
)
func doWork(c chan int) {
for {
<-c
// This outer loop will take ~10ms.
for j := 0; j < 100 ; j++ {
// The following block of CPU work takes ~100 microseconds
for i := 0; i < 300000; i++ {
_ = i * 17
}
// Somehow this print statement saves the day in Ubuntu
log.Printf("donkey")
runtime.Gosched()
}
}
}
func main() {
runtime.GOMAXPROCS(1)
c := make(chan int, 1000)
go doWork(c)
start := time.Now().UnixNano()
for i := 0; i < 1000; i++ {
time.Sleep(1 * time.Millisecond)
// Queue up 10ms of work in the other goroutine, which will backlog
// this goroutine without runtime.Gosched() calls.
c <- 0
}
// Whole program should take about 1 second to run if the Gosched() calls
// work, otherwise 10 seconds.
log.Printf("Finished in %f seconds.", float64(time.Now().UnixNano() - start) / 1e9)
}
Additional details: I'm running go1.10 darwin/amd64, and compiling the linux binary with
env GOOS=linux GOARCH=amd64 go build ...
I've tried a few simple variants:
Just making a log.Printf() call, without the Gosched()
Making two calls to Gosched()
Keeping the Gosched() call but replacing the log.Printf() call to a dummy function call
All of these are ~10x slower than calling log.Printf() and then Gosched().
Any insights would be appreciated! This example is of course very artificial, but the issue came up while writing a websocket broadcast server which led to significantly degraded performance.
EDIT: I got rid of the extraneous bits in my example to make things more transparent. I've discovered that without the print statement, the runtime.Gosched() calls are still getting run; it's just that they seem to be delayed by a fixed 5ms, leading to a total runtime of almost exactly 5 seconds in the example below, when the program should finish instantaneously (and does on my Mac, or on Ubuntu with the print statement).
package main
import (
"log"
"runtime"
"time"
)
func doWork() {
for {
// This print call makes the code run 20x faster
log.Printf("donkey")
// Without this line, the program never terminates (as expected). With this line
// and the print call above it, the program takes <300ms as expected, dominated by
// the sleep calls in the main goroutine. But without the print statement, it
// takes almost exactly 5 seconds.
runtime.Gosched()
}
}
func main() {
runtime.GOMAXPROCS(1)
go doWork()
start := time.Now().UnixNano()
for i := 0; i < 1000; i++ {
time.Sleep(10 * time.Microsecond)
runtime.Gosched()
}
log.Printf("Finished in %f seconds.", float64(time.Now().UnixNano() - start) / 1e9)
}
"When I add periodic runtime.Gosched() calls in the second goroutine, as expected the program runtime shrinks down to 1 second on my Mac. However, when I try running the same program on Ubuntu, it still takes 10 seconds."
On Ubuntu, I'm unable to reproduce your issue: the program takes one second, not ten.
Output:
$ uname -srvm
Linux 4.13.0-37-generic #42-Ubuntu SMP Wed Mar 7 14:13:23 UTC 2018 x86_64
$ go version
go version devel +f1deee0e8c Mon Apr 2 20:18:14 2018 +0000 linux/amd64
$ go build rampatowl.go && time ./rampatowl
2018/04/02 16:52:04 Finished in 1.122870 seconds.
real 0m1.128s
user 0m1.116s
sys 0m0.012s
$
rampatowl.go:
package main
import (
"log"
"runtime"
"time"
)
func doWork(c chan int) {
for {
<-c
// This outer loop will take ~10ms.
for j := 0; j < 100; j++ {
// The following block of CPU work takes ~100 microseconds
for i := 0; i < 300000; i++ {
_ = i * 17
}
// Somehow this print statement saves the day in Ubuntu
//log.Printf("donkey")
runtime.Gosched()
}
}
}
func main() {
runtime.GOMAXPROCS(1)
c := make(chan int, 1000)
go doWork(c)
start := time.Now().UnixNano()
for i := 0; i < 1000; i++ {
time.Sleep(1 * time.Millisecond)
// Queue up 10ms of work in the other goroutine, which will backlog
// this goroutine without runtime.Gosched() calls.
c <- 0
}
// Whole program should take about 1 second to run if the Gosched() calls
// work, otherwise 10 seconds.
log.Printf("Finished in %f seconds.", float64(time.Now().UnixNano()-start)/1e9)
}

How does Golang share variables between goroutines? [duplicate]

I'm learning Go and trying to understand its concurrency features.
I have the following program.
package main
import (
"fmt"
"sync"
)
func main() {
var wg sync.WaitGroup
for i := 0; i < 5; i++ {
wg.Add(1)
x := i
go func() {
defer wg.Done()
fmt.Println(x)
}()
}
wg.Wait()
fmt.Println("Done")
}
When executed I got:
4
0
1
3
2
It's just what I want. However, if I make slight modification to it:
package main
import (
"fmt"
"sync"
)
func main() {
var wg sync.WaitGroup
for i := 0; i < 5; i++ {
wg.Add(1)
go func() {
defer wg.Done()
fmt.Println(i)
}()
}
wg.Wait()
fmt.Println("Done")
}
What I got will be:
5
5
5
5
5
I don't quite understand the difference. Can anyone help to explain what happened here and how Go runtime execute this code?
You get a new variable on each execution of x := i.
This code shows the difference well by printing the address of x inside the goroutine:
package main
import (
"fmt"
"sync"
)
func main() {
var wg sync.WaitGroup
for i := 0; i < 5; i++ {
wg.Add(1)
x := i
go func() {
defer wg.Done()
fmt.Println(&x)
}()
}
wg.Wait()
fmt.Println("Done")
}
output:
0xc0420301e0
0xc042030200
0xc0420301e8
0xc0420301f0
0xc0420301f8
Done
If you build your second example with go build -race and run it, you will see: WARNING: DATA RACE
This version is fine:
//go build -race
package main
import (
"fmt"
"sync"
)
func main() {
var wg sync.WaitGroup
for i := 0; i < 5; i++ {
wg.Add(1)
go func(i int) {
defer wg.Done()
fmt.Println(i)
}(i)
}
wg.Wait()
fmt.Println("Done")
}
output:
0
4
1
2
3
Done
The general rule is, don't share data between goroutines. In the first example, you essentially give each goroutine their own copy of x, and they print it out in whatever order they get to the print statement. In the second example, they all reference the same loop variable, and it is incremented to 5 by the time any of them print it. I don't believe the output there is guaranteed, it just happens that the loop creating goroutines finished faster than the goroutines themselves got to the printing part.
It's a bit hard to explain in plain English, but I'll try my best.
You see, every time you spawn a new goroutine, there is an initialization time; no matter how minuscule it might be, it's always there. So, in your second case, the entire loop has finished incrementing the variable 5 times before any of the goroutines has even started. And when the goroutines finish initialization, all they see is the final variable value, which is 5.
In your first case though, the x variable keeps a copy of the i variable, so that when the goroutines start, x gets passed to them. Remember, it is i that is being incremented here, not x. x is fixed. So, when the goroutines start, they get a fixed value.
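The two explanations above boil down to "give each goroutine its own copy". A small sketch of that rule follows; collect is a hypothetical helper that gathers what the goroutines saw and sorts it, so the result does not depend on scheduling. Also worth knowing: since Go 1.22, modules whose go.mod declares go 1.22 or later get a fresh loop variable on every iteration, so the second program from the question behaves like the first one there. Passing the value as an argument works on every Go version.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect (hypothetical helper) starts n goroutines, hands each one its
// own copy of the loop index as an argument, and returns the values the
// goroutines saw, sorted.
func collect(n int) []int {
	var (
		mu   sync.Mutex
		seen []int
		wg   sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(v int) { // v is copied when the go statement is executed
			defer wg.Done()
			mu.Lock()
			seen = append(seen, v) // arrival order depends on scheduling
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	sort.Ints(seen)
	return seen
}

func main() {
	fmt.Println(collect(5)) // prints [0 1 2 3 4]
}
```

The mutex is needed because all goroutines append to the same slice; that is shared data, unlike v, which is private to each goroutine.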

Can Go spawn and communicate with external processes without starting one OS-thread per external process?

Short version:
Is it possible in Golang to spawn a number of external processes (shell commands) in parallel, such that it does not start one operating system thread per external process ... and still be able to receive its output when it is finished?
Longer version:
In Elixir, if you use ports, you can spawn thousands of external processes without really increasing the number of threads in the Erlang virtual machine.
E.g. the following code snippet, which starts 2500 external sleep processes, is managed by only 20 operating system threads under the Erlang VM:
defmodule Exmultiproc do
for _ <- 1..2500 do
cmd = "sleep 3600"
IO.puts "Starting another process ..."
Port.open({:spawn, cmd}, [:exit_status, :stderr_to_stdout])
end
System.cmd("sleep", ["3600"])
end
(Provided you set ulimit -n to a high number, such as 10000)
On the other hand, the following code in Go, which is supposed to do the same thing - starting 2500 external sleep processes - does also start 2500 operating system threads. So it obviously starts one operating system thread per (blocking?) system call (so as not to block the whole CPU, or similar, if I understand correctly):
package main
import (
"fmt"
"os/exec"
"sync"
)
func main() {
wg := new(sync.WaitGroup)
for i := 0; i < 2500; i++ {
wg.Add(1)
go func(i int) {
fmt.Println("Starting sleep ", i, "...")
cmd := exec.Command("sleep", "3600")
_, err := cmd.Output()
if err != nil {
panic(err)
}
fmt.Println("Finishing sleep ", i, "...")
wg.Done()
}(i)
}
fmt.Println("Waiting for WaitGroup ...")
wg.Wait()
fmt.Println("WaitGroup finished!")
}
Thus, I was wondering if there is a way to write the Go code so that it does a similar thing to the Elixir code, without opening one operating system thread per external process?
I'm basically looking for a way to manage at least a few thousand external long-running (up to 10 days) processes, in a way that causes as little problems as possible with any virtual or physical limits in the operating system.
(Sorry for any mistakes in the code, as I'm new to Elixir and quite new to Go. I'm eager to learn about any mistakes I'm making.)
EDIT: Clarified about the requirement to run the long-running processes in parallel.
I find that if we do not wait for the processes, the Go runtime does not start 2500 operating system threads, so please use cmd.Start() rather than cmd.Output().
However, it seems impossible to read a process's stdout with the os package without consuming an OS thread. I think this is because the os package does not use non-blocking I/O to read the pipe.
The following program runs well on my Linux machine, although it blocks the process's stdout, as @JimB said in a comment; maybe it works because the output is small and fits in the system buffers.
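package main
import (
	"bytes"
	"fmt"
	"io"
	"os"
	"os/exec"
	"time"
)
// result is not shown in the original answer; this definition is inferred
// from how it is used below (a started process plus its stdout pipe).
type result struct {
	p *os.Process
	b io.ReadCloser
}
func main() {
	concurrentProcessCount := 50
	wtChan := make(chan *result, concurrentProcessCount)
	for i := 0; i < concurrentProcessCount; i++ {
		go func(i int) {
			fmt.Println("Starting process ", i, "...")
			cmd := exec.Command("bash", "-c", "for i in 1 2 3 4 5; do echo to sleep $i seconds;sleep $i;echo done;done;")
			outPipe, err := cmd.StdoutPipe()
			if err != nil {
				panic(err)
			}
			// Start() returns immediately; nothing waits on the process here.
			if err := cmd.Start(); err != nil {
				panic(err)
			}
			time.Sleep(time.Second)
			fmt.Println("Finishing process ", i, "...")
			wtChan <- &result{cmd.Process, outPipe}
		}(i)
	}
	fmt.Println("root:", os.Getpid())
	// Wait for each process in turn, then read what it wrote to stdout.
	// Reading only after Wait() relies on the output fitting in the pipe buffer.
	for waitDone := 1; waitDone <= concurrentProcessCount; waitDone++ {
		r := <-wtChan
		r.p.Wait()
		output := &bytes.Buffer{}
		io.Copy(output, r.b)
		fmt.Println(waitDone, output.String())
	}
}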

Why does this program run faster when it's allocated fewer threads?

I have a fairly simple Go program designed to compute random Fibonacci numbers to test some strange behavior I observed in a worker pool I wrote. When I allocate one thread, the program finishes in 1.78s. When I allocate 4, it finishes in 9.88s.
The code is as follows:
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for {
var tgt = <-fibNum
workerWG.Add(1)
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
workerWG.Done()
}
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // LINE IN QUESTION
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(1000)
}
workerWG.Wait()
}
If I replace runtime.GOMAXPROCS(1) with 4, the program takes four times as long to run.
What's going on here? Why does adding more available threads to a worker pool slow the entire pool down?
My personal theory is that it has to do with the processing time of the worker being less than the overhead of thread management, but I'm not sure. My reservation is caused by the following test:
When I replace the worker function with the following code:
for {
<-fibNum
time.Sleep(500 * time.Millisecond)
}
both one available thread and four available threads take the same amount of time.
I revised your program to look like the following:
package main
import (
"math/rand"
"runtime"
"sync"
"time"
)
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for tgt := range fibNum {
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
}
workerWG.Done()
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // LINE IN QUESTION
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
workerWG.Add(1)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(100000)
}
close(fibNum)
workerWG.Wait()
}
I cleaned up the wait group usage.
I changed rand.Intn(1000) to rand.Intn(100000)
On my machine that produces:
$ time go run threading.go (GOMAXPROCS=1)
real 0m20.934s
user 0m20.932s
sys 0m0.012s
$ time go run threading.go (GOMAXPROCS=8)
real 0m10.634s
user 0m44.184s
sys 0m1.928s
This means that in your original code, the work performed was negligible compared to the synchronization (channel read/write). The slowdown came from having to synchronize across several threads instead of one while performing only a very small amount of work in between.
In essence, synchronization is expensive compared to calculating fibonacci numbers up to 1000. This is why people tend to discourage micro-benchmarks. Upping that number gives a better perspective. But an even better idea is to benchmark actual work being done i.e. including IO, syscalls, processing, crunching, writing output, formatting, etc.
Edit: As an experiment, I upped the number of workers to 8 with GOMAXPROCS set to 8 and the result was:
$ time go run threading.go
real 0m4.971s
user 0m35.692s
sys 0m0.044s
The code written by @thwd is correct and idiomatic Go.
Your code was being serialized due to the atomic nature of sync.WaitGroup. Both workerWG.Add(1) and workerWG.Done() will block until they're able to atomically update the internal counter.
Since the workload is between 0 and 1000 loop iterations, the bottleneck of a single core was enough to keep contention on the waitgroup counter to a minimum.
On multiple cores, the processor spends a lot of time spinning to fix the collisions of waitgroup calls. Add that to the fact that the waitgroup counter is kept on one core and you now have added communication between cores (taking up even more cycles).
A couple hints for simplifying code:
For a small, fixed number of goroutines, a completion channel (chan struct{} to avoid allocations) is cheaper to use.
Use the closing of the send channel as a kill signal for goroutines and have them signal that they've exited (waitgroup or channel). Then, close the completion channel to free it up for the GC.
If you need a waitgroup, aggressively minimize the number of calls to it. Those calls must be internally serialized, so extra calls force added synchronization.
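The first two hints can be sketched as follows. This is only an illustration under assumed names (runWorkers, jobCh, done and partial are not from the original code): closing the job channel tells the workers to stop, and each worker sends exactly one value on a chan struct{} when it exits, so no WaitGroup is touched inside the hot loop.

```go
package main

import "fmt"

// runWorkers (hypothetical helper) sums the jobs using numWorkers
// goroutines and a completion channel instead of a WaitGroup.
func runWorkers(numWorkers int, jobs []int) int {
	jobCh := make(chan int)
	done := make(chan struct{}) // completion channel: one signal per worker
	partial := make([]int, numWorkers)
	for w := 0; w < numWorkers; w++ {
		go func(w int) {
			for j := range jobCh { // the loop ends once jobCh is closed
				partial[w] += j // each worker touches only its own slot
			}
			done <- struct{}{} // signal that this worker has exited
		}(w)
	}
	for _, j := range jobs {
		jobCh <- j
	}
	close(jobCh) // closing the send channel is the kill signal
	for w := 0; w < numWorkers; w++ {
		<-done // after numWorkers receives, every worker has finished
	}
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	fmt.Println(runWorkers(4, []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10})) // prints 55
}
```

Synchronization here happens once per job (the channel send) and once per worker (the done signal), instead of twice per job as with the Add/Done pair in the question.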
Your main computation routine in worker does not allow the scheduler to run.
Calling the scheduler manually like
for i := 0; i < tgt; i++ {
a, b = a+b, a
if i%300 == 0 {
runtime.Gosched()
}
}
Reduces wall clock by 30% when switching from one to two threads.
Such artificial microbenchmarks are really hard to get right.

Resources