Combining multiple maps stored on a channel in Go (same key's values get summed)

My objective is to create a program that counts every unique word's occurrences in a text file in a parallelised fashion; all occurrences have to be presented in a single map.
What I do here is read the text file into a string and then split it into an array of words. That array is then divided into two slices of equal length and fed concurrently to the mapper function.
func WordCount(text string) map[string]int {
	wg := new(sync.WaitGroup)
	s := strings.Fields(text)
	freq := make(map[string]int, len(s))
	channel := make(chan map[string]int, 2)

	wg.Add(1)
	go mappers(s[0:(len(s)/2)], freq, channel, wg)
	wg.Add(1)
	go mappers(s[(len(s)/2):], freq, channel, wg)

	wg.Wait()
	actualMap := <-channel
	return actualMap
}

func mappers(slice []string, occurrences map[string]int, ch chan map[string]int, wg *sync.WaitGroup) {
	var l = sync.Mutex{}
	for _, word := range slice {
		l.Lock()
		occurrences[word]++
		l.Unlock()
	}
	ch <- occurrences
	wg.Done()
}
The bottom line is that I get a huge multiline error that starts with
fatal error: concurrent map writes
when I run the code, which I thought I had guarded against through mutual exclusion:
l.Lock()
occurrences[word]++
l.Unlock()
What am I doing wrong here? And furthermore, how can I combine all the maps in a channel? By combine I mean that the same key's values get summed in the new map.

The main problem is that you use a separate lock in each goroutine. That does nothing to serialize access to the map. The same lock has to be used in each goroutine.
And since you use the same map in each goroutine, you don't have to merge them, and you don't need a channel to deliver the result.
Even if you use the same mutex in each goroutine, since you use a single map, this probably won't help performance: the goroutines will have to compete with each other for the map's lock.
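For illustration only, a minimal sketch of that shared-lock variant: the mutex is declared once, outside the goroutines, so every writer contends for the same lock (this fixes the crash, but keeps the contention just described):

var mu sync.Mutex // one mutex, shared by every goroutine

func mappers(slice []string, occurrences map[string]int, ch chan map[string]int, wg *sync.WaitGroup) {
	defer wg.Done()
	for _, word := range slice {
		mu.Lock() // all goroutines now synchronize on the same lock
		occurrences[word]++
		mu.Unlock()
	}
	ch <- occurrences
}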
You should create a separate map in each goroutine, use that to count locally, and then deliver the result map on the channel. This might give you a performance boost.
But then you don't need a lock, since each goroutine will have its own map which it can read/write without a mutex.
But then you do have to deliver the result maps on the channel and merge them.
And since goroutines deliver results on the channel, the waitgroup becomes unnecessary.
func WordCount(text string) map[string]int {
	s := strings.Fields(text)
	channel := make(chan map[string]int, 2)

	go mappers(s[0:(len(s)/2)], channel)
	go mappers(s[(len(s)/2):], channel)

	total := map[string]int{}
	for i := 0; i < 2; i++ {
		m := <-channel
		for k, v := range m {
			total[k] += v
		}
	}
	return total
}

func mappers(slice []string, ch chan map[string]int) {
	occurrences := map[string]int{}
	for _, word := range slice {
		occurrences[word]++
	}
	ch <- occurrences
}
Example testing it:
fmt.Println(WordCount("aa ab cd cd de ef a x cd aa"))
Output (try it on the Go Playground):
map[a:1 aa:2 ab:1 cd:3 de:1 ef:1 x:1]
Also note that in theory this looks "good", but in practice you may still not achieve any performance boost, as the goroutines do too "little" work, and launching them and merging the results requires effort which may outweigh the benefits.

Related

golang multithreaded web crawler runs into deadlock

I just started to learn multithreaded programming using golang, and I'm trying to write a multithreaded web crawler using BFS traversal, however I cannot get the code working. The error I get is fatal error: all goroutines are asleep - deadlock!
I will paste the code below, but let me explain conceptually how it works:
I have one master thread (the main function itself) and N worker threads. I intentionally chose a BFS approach with a fixed number of worker threads, because with a DFS approach it seems I would have to spawn a new thread for each new URL to crawl, which might become a huge context-switching burden.
I am using two channels:
urlsToCrawl: master thread sends URLs to crawl to worker threads.
urlsDiscovered: worker threads send discovered URLs back to master.
Here is the code implementation; I removed some non-relevant details (e.g. how to parse the HTML page, etc.).
The trick I'm trying to do here is: I am using the channel as a queue to do BFS, but when the queue's size is 0, it is impossible to know whether that is because "A. there really are no more URLs to crawl" or because "B. some worker thread(s) are still working, so there might be more URLs to crawl soon". Therefore I introduced this count variable: whenever a new URL is sent to the workers to be crawled, count is incremented. So when count == 0 and the channel is empty, it means "A. there really are no more URLs to crawl"; when count > 0 and the channel is empty, it means "B. some worker thread(s) are still working, so there might be more URLs to crawl soon".
However as I mentioned, this doesn't seem to work and I run into deadlock. Would anyone please shed some light? Thanks!
package main

import (
	"fmt"
)

var (
	count = 0 // This tracks how many work items are in flight right now
)

func crawlUrl(urlsToCrawl chan string, urlsDiscovered chan []string) {
	for url := range urlsToCrawl {
		urls := getUrls(url) // This returns a slice of strings; if no URL is found, it returns an empty slice
		urlsDiscovered <- urls
	}
}

func main() {
	urlsToCrawl := make(chan string)
	urlsDiscovered := make(chan []string)
	for i := 0; i < 8; i++ {
		go crawlUrl(urlsToCrawl, urlsDiscovered)
	}

	visited := map[string]bool{"some_seed_url": true}
	count++
	urlsToCrawl <- "some_seed_url"
	for urls := range urlsDiscovered {
		count-- // One message received by the master, meaning one worker has finished a job item, therefore decrement count
		for _, url := range urls {
			_, ok := visited[url]
			if ok {
				continue // This URL has been crawled before
			}
			visited[url] = true
			count++ // One more work item will be sent to a worker, therefore increment count first
			urlsToCrawl <- url
		}
		if count == 0 {
			close(urlsDiscovered)
			close(urlsToCrawl)
			break
		} // else some worker must be working, so let's wait to see if a new message comes through the channel
	}
}
Your channels are unbuffered
urlsToCrawl := make(chan string)
urlsDiscovered := make(chan []string)
So a goroutine which reads from or writes to a channel will block until a goroutine on the other side is doing the opposite.
So you start 8 crawlUrl goroutines which all block while reading from urlsToCrawl, meaning that main can send 8 URLs before blocking. The crawlUrl goroutines are blocked until main reads from urlsDiscovered. So if you have more than 8 URLs going around, all goroutines end up waiting on each other (deadlock).
The solution to this is to use buffered channels with a capacity you are very unlikely to exceed:
urlsToCrawl := make(chan string, 1000)
urlsDiscovered := make(chan []string, 100)
If you expect you might still exceed the capacity of the channel in extreme cases, you can perform non-blocking operations, which allow you to, for example, discard discovered URLs if the channel is full instead of blocking:
select {
case urlsDiscovered <- urls:
	// on success (urls written)
default:
	// channel is full, can't write without blocking
}
Try to integrate sync.WaitGroup; a sketch of the idea follows.
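For illustration, a hedged sketch of that idea (everything except the asker's getUrls is made up; it assumes package main with "sync" imported, plus the buffered channels suggested above): a sync.WaitGroup counts in-flight URLs, and a helper goroutine closes the channels once the counter drains, so the master's range loop ends cleanly instead of deadlocking.

func crawl(seed string) map[string]bool {
	urlsToCrawl := make(chan string, 1000)
	urlsDiscovered := make(chan []string, 1000)
	var wg sync.WaitGroup

	// Workers, as in the original code:
	for i := 0; i < 8; i++ {
		go func() {
			for url := range urlsToCrawl {
				urlsDiscovered <- getUrls(url)
			}
		}()
	}

	visited := map[string]bool{seed: true}
	wg.Add(1) // one unit of in-flight work per URL sent to the workers
	urlsToCrawl <- seed

	// When the counter drains, no URL is in flight anywhere: shut down.
	go func() {
		wg.Wait()
		close(urlsToCrawl)
		close(urlsDiscovered)
	}()

	for urls := range urlsDiscovered {
		for _, url := range urls {
			if !visited[url] {
				visited[url] = true
				wg.Add(1)
				urlsToCrawl <- url
			}
		}
		wg.Done() // the URL that produced this result is fully processed
	}
	return visited
}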

Synchronize write to file from heavy operations in different threads

I need to elaborate a file (potentially a big file) one block at a time and write the result to a new file.
To put it simply, I have the basic function to elaborate a block:
func elaborateBlock(block []byte) []byte { ... }
Every block needs to be elaborated and then written to the output file sequentially (preserving original order).
The one-thread implementation is trivial:
for {
	buffer := make([]byte, BlockSize)
	_, err := inputFile.Read(buffer)
	if err == io.EOF {
		break
	}
	processedData := elaborateBlock(buffer)
	outputFile.Write(processedData)
}
But the elaboration can be heavy and every block can be processed separately, so a multi-threaded implementation is the natural evolution.
The solution I came up with is to create an array of channels, compute every block in a different thread and sync the final write by looping the channel array:
Utility function:
func blockThread(channel chan []byte, block []byte) {
	channel <- elaborateBlock(block)
}
In the main program:
chans := []chan []byte{}
for {
	buffer := make([]byte, BlockSize)
	_, err := inputFile.Read(buffer)
	if err == io.EOF {
		break
	}
	channel := make(chan []byte)
	chans = append(chans, channel)
	go blockThread(channel, buffer)
}
for i := range chans {
	data := <-chans[i]
	outputFile.Write(data)
}
This approach works, but it can be problematic with large files because it requires loading the whole file into memory before the output writing can start.
Do you think there can be a better solution, with also better performance overall?
If blocks do need to be written out in order
If you want to work on multiple blocks concurrently, obviously you need to hold multiple blocks in memory at the same time.
You may decide how many blocks you want to process concurrently, and it's enough to read that many into memory at the same time. E.g. you may say you want to process 5 blocks concurrently. This will limit memory usage and still potentially utilize your CPU resources to the max. It's recommended to pick a number based on your available CPU cores (if processing a block does not already use multiple cores). This can be queried using runtime.GOMAXPROCS(0), as the small sketch right below shows.
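For instance, a small sketch of sizing the worker pool this way (worker, jobsCh and resultCh refer to the example further below):

// Start one worker per usable CPU thread; GOMAXPROCS(0) only queries
// the current setting without changing it.
numWorkers := runtime.GOMAXPROCS(0)
for i := 0; i < numWorkers; i++ {
	go worker(jobsCh, resultCh)
}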
You should have a single goroutine that reads the input file sequentially and produces the blocks wrapped in Jobs (which also contain the block index).
You should have multiple worker goroutines, preferably as many as you have cores (but experiment with smaller and higher values too). Each worker goroutine just receives jobs, calls elaborateBlock() on the data, and delivers the result on the results channel.
There should be a single, designated consumer which receives completed jobs and writes them in order to the output file. Since the goroutines run concurrently and we have no control over the order in which blocks are completed, the consumer should keep track of the index of the next block to be written. Blocks arriving out of order should only be stored; writing should proceed only when the next block in line arrives.
This is an (incomplete) example how to do all these:
const BlockSize = 1 << 20 // 1 MB

func elaborateBlock(in []byte) []byte { return in }

type Job struct {
	Index int
	Block []byte
}

func producer(jobsCh chan<- *Job) {
	// Init input file:
	var inputFile *os.File
	for index := 0; ; index++ {
		job := &Job{
			Index: index,
			Block: make([]byte, BlockSize),
		}
		_, err := inputFile.Read(job.Block)
		if err != nil {
			break
		}
		jobsCh <- job
	}
}

func worker(jobsCh <-chan *Job, resultCh chan<- *Job) {
	for job := range jobsCh {
		job.Block = elaborateBlock(job.Block)
		resultCh <- job
	}
}

func consumer(resultCh <-chan *Job) {
	// Init output file:
	var outputFile *os.File
	nextIdx := 0
	jobMap := map[int]*Job{}
	for job := range resultCh {
		jobMap[job.Index] = job
		// Write out all blocks we have in a contiguous index range:
		for {
			j := jobMap[nextIdx]
			if j == nil {
				break
			}
			if _, err := outputFile.Write(j.Block); err != nil {
				// handle error, maybe terminate?
			}
			delete(jobMap, nextIdx) // This job is written out
			nextIdx++
		}
	}
}

func main() {
	jobsCh := make(chan *Job)
	resultCh := make(chan *Job)

	// Start workers, and close resultCh once all of them have returned,
	// so the consumer's range loop can terminate:
	workersWg := sync.WaitGroup{}
	for i := 0; i < 5; i++ {
		workersWg.Add(1)
		go func() {
			defer workersWg.Done()
			worker(jobsCh, resultCh)
		}()
	}
	go func() {
		workersWg.Wait()
		close(resultCh)
	}()

	wg := sync.WaitGroup{}
	wg.Add(1)
	go func() {
		defer wg.Done()
		consumer(resultCh)
	}()

	// Start producing jobs:
	producer(jobsCh)
	// No more jobs:
	close(jobsCh)
	// Wait for consumer to complete:
	wg.Wait()
}
One thing to note here: this alone won't guarantee limited memory usage. Imagine a case where the first block requires an enormous time to calculate while subsequent blocks do not. What would happen? The first block would occupy a worker, and the other workers would "quickly" complete the subsequent blocks. The consumer would store them all in memory, waiting for the first block to complete (as that has to be written out first). This could increase memory usage.
How could we avoid this?
By introducing a job pool. New jobs are not created arbitrarily, but taken from a pool. If the pool is empty, the producer has to wait. So when the producer needs a new Job, it takes one from the pool. When the consumer has written out a Job, it puts it back into the pool. Simple as that. This also reduces pressure on the garbage collector, as jobs (and their large []byte buffers) are not created and thrown away; they can be re-used.
For a simple Job pool implementation you could use a buffered channel. For details, see How to implement Memory Pooling in Golang.
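For illustration, a minimal sketch of such a pool (newJobPool is a made-up helper name; Job and BlockSize are as defined in the example above):

// newJobPool pre-fills a buffered channel with n reusable Jobs;
// taking from an empty pool blocks the producer, capping memory usage.
func newJobPool(n int) chan *Job {
	pool := make(chan *Job, n)
	for i := 0; i < n; i++ {
		pool <- &Job{Block: make([]byte, BlockSize)}
	}
	return pool
}

// Producer side: job := <-pool  (blocks while all n Jobs are in flight)
// Consumer side: pool <- job    (recycle after the block is written out)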
If blocks can be written in any order
Another option could be to allocate the output file in advance. If the size of the output blocks is also deterministic, you can do so (e.g. outsize := (insize / blocksize) * outblockSize).
To what end?
If you have the output file pre-allocated, the consumer does not need to wait input blocks in order. Once an input block is calculated, you can calculate the position where it will go in the output, seek to that position and just write it. For this you may use File.Seek().
This solution still requires the block index to be sent from the producer to the consumer, but the consumer no longer needs to buffer blocks that arrive out of order, so it can be simpler: it never has to hold on to completed blocks while waiting for a predecessor.
Note that this solution naturally does not pose a memory threat, as completed jobs are never accumulated or cached; they are written out in the order of completion.
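For illustration, a hedged sketch of that simpler consumer (consumerSeek is a made-up name; it assumes fixed-size output blocks, reuses the Job type from above, and uses File.WriteAt as a convenient Seek+Write combination):

func consumerSeek(resultCh <-chan *Job, outputFile *os.File, outBlockSize int64) {
	for job := range resultCh {
		// The output position follows directly from the block index,
		// so blocks can be written in completion order with no buffering.
		off := int64(job.Index) * outBlockSize
		if _, err := outputFile.WriteAt(job.Block, off); err != nil {
			// handle error, maybe terminate?
		}
	}
}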
See related questions for more details and techniques:
Is this an idiomatic worker thread pool in Go?
How to collect values from N goroutines executed in a specific order?
Here is a working example that is as close as possible to your original code.
The idea is to turn your array into a channel of channels of bytes. Then:
First fire up a consumer that reads from this channel of channels, gets each channel of bytes, reads from it and writes the result.
Back on the main thread you create a channel of bytes, write it to the channel of channels (the consumer, reading sequentially from them, will read the results in order), and then fire up the goroutine that does the work and writes to the allocated channel (the producers).
What happens now is that there is a "race" between the producers and the consumer: as soon as a produced block is read by the consumer and written, the resources associated with it are deallocated. This could be an improvement over your original design.
Here is the code and the playground link:
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

func elaborateBlock(b []byte) []byte {
	return []byte("werkwerkwerk")
}

func blockThread(channel chan []byte, block []byte, wg *sync.WaitGroup) {
	channel <- elaborateBlock(block)
	wg.Done()
}

func main() {
	chans := make(chan chan []byte)
	BlockSize := 3
	inputBytes := bytes.NewBuffer([]byte("transmutemetowerkwerkwerk"))

	producewg := sync.WaitGroup{}
	consumewg := sync.WaitGroup{}
	consumewg.Add(1)
	go func() {
		chancount := 0
		for ch := range chans {
			data := <-ch
			fmt.Printf("got %d block, result:%s\n", chancount, data)
			chancount++
		}
		fmt.Printf("done receiving\n")
		consumewg.Done()
	}()

	for {
		buffer := make([]byte, BlockSize)
		_, err := inputBytes.Read(buffer)
		if err == io.EOF {
			go func() {
				// wait for all the producers to finish
				producewg.Wait()
				// then close the main channel to notify the consumer
				close(chans)
			}()
			break
		}
		channel := make(chan []byte)
		chans <- channel // give the channel that we return the result on to the receiver
		producewg.Add(1)
		go blockThread(channel, buffer, &producewg)
	}
	consumewg.Wait()
	fmt.Printf("main exiting")
}
playground link
As a minor point, I don't feel right about the "read the whole file into memory" statement, because you are just reading a block at a time from the Reader; maybe "holding the result of the whole computation in memory" is more appropriate?

Can go channel keep a value for multiple reads [duplicate]

This question already has answers here:
Multiple goroutines listening on one channel
(7 answers)
Closed 5 years ago.
I understand the regular behavior of a channel is that it empties after a read. Is there a way to keep an unbuffered channel value for multiple reads without the value been removed from the channel?
For example, I have a goroutine that generates a single piece of data for multiple downstream goroutines to use. I don't want to create multiple channels or use a buffered channel, which would require me to duplicate the source data (I don't even know how many copies I will need). Effectively, I want to be able to do something like the following:
func main() {
	ch := make(chan dType)
	ch <- sourceDataGenerator()
	for range DynamicRange {
		go TargetGoRoutine(ch)
	}
	close(ch) // would want this to remove the value and the channel
}

func TargetGoRoutine(ch chan dType) {
	targetCollection <- <-ch // want the value to remain in ch after this read
}
EDIT
Some feel this is a duplicate question. Perhaps, but I'm not sure. The solution here seems simple in the end, as n-canter pointed out: all it needs is for every goroutine to "recycle" the data by putting it back into the channel after use. None of the supposed "duplicates" provided this solution. Here is a sample:
package main

import (
	"fmt"
	"sync"
)

func main() {
	c := make(chan string)
	var wg sync.WaitGroup
	wg.Add(5)
	for i := 0; i < 5; i++ {
		go func(i int) {
			wg.Done()
			msg := <-c
			fmt.Printf("Data:%s, From go:%d\n", msg, i)
			c <- msg
		}(i)
	}
	c <- "Original"
	wg.Wait()
	fmt.Println(<-c)
}
https://play.golang.org/p/EXBbf1_icG
You may re-add the value back to the channel after reading, but then all your goroutines will read the shared value sequentially, and you'll also need some synchronization primitive so that the last goroutine doesn't block.
As far as I know, the only case in which you can use a single channel for broadcasting is closing it. In this case all readers will be notified, as the sketch below illustrates.
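For illustration, a minimal runnable sketch of broadcasting by closing a channel (the data itself travels via a shared variable that is written once before the broadcast):

package main

import (
	"fmt"
	"sync"
)

func main() {
	done := make(chan struct{})
	shared := "Original" // written once before the broadcast, read-only after

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			<-done // receiving from a closed channel never blocks
			fmt.Printf("Data:%s, From go:%d\n", shared, i)
		}(i)
	}

	close(done) // broadcast: all 5 goroutines proceed
	wg.Wait()
}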
If you don't want to duplicate large data, maybe you'd better use some global variable. But use it carefully, because it violates golang rule: "Don't communicate by sharing memory; share memory by communicating."
Also look at this question How to broadcast message using channel

Can a zero-length and zero-cap slice still point to an underlying array and prevent garbage collection?

Let's take the following scenario:
a := make([]int, 10000)
a = a[len(a):]
As we know from "Go Slices: Usage and Internals", there's a "possible gotcha" in downslicing. For any slice a, a[start:end] still points to the original memory, so if you don't copy, a small downslice could potentially keep a very large array in memory for a long time.
However, this case is chosen to result in a slice that should not only have zero length, but zero capacity. A similar question could be asked for the construct a = a[0:0:0].
Does the current implementation still maintain a pointer to the underlying memory, preventing it from being garbage collected, or does it recognize that a slice with no len or cap could not possibly reference anything, and thus garbage collect the original backing array during the next GC pause (assuming no other references exist)?
Edit: Playing with reflect and unsafe on the Playground reveals that the pointer is non-zero:
func main() {
	a := make([]int, 10000)
	a = a[len(a):]
	aHeader := *(*reflect.SliceHeader)(unsafe.Pointer(&a))
	fmt.Println(aHeader.Data)

	a = make([]int, 0, 0)
	aHeader = *(*reflect.SliceHeader)(unsafe.Pointer(&a))
	fmt.Println(aHeader.Data)
}
http://play.golang.org/p/L0tuzN4ULn
However, this doesn't necessarily answer the question, because the second slice, which NEVER had anything in it, also has a non-zero pointer as its data field. Even so, the pointer could simply be uintptr(&a[len(a)-1]) + sizeof(int), which would be outside the block of backing memory and thus not prevent actual garbage collection; but this seems unlikely, since such a pointer would prevent garbage collection of other things. The non-zero value could also conceivably just be Playground weirdness.
As seen in your example, re-slicing copies the slice header, including the data pointer, to the new slice, so I put together a small test to try to force the runtime to reuse the memory if possible.
I'd like this to be more deterministic, but at least with go1.3 on x86_64, it shows that the memory used by the original array is eventually reused (it does not work in the playground in this form).
package main

import (
	"fmt"
	"unsafe"
)

func check(i uintptr) {
	fmt.Printf("Value at %d: %d\n", i, *(*int64)(unsafe.Pointer(i)))
}

func garbage() string {
	s := ""
	for i := 0; i < 100000; i++ {
		s += "x"
	}
	return s
}

func main() {
	s := make([]int64, 100000)
	s[0] = 42
	p := uintptr(unsafe.Pointer(&s[0]))
	check(p)
	z := s[0:0:0]
	s = nil
	fmt.Println(z)
	garbage()
	check(p)
}
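As an aside, if the goal is simply to avoid pinning a large backing array, the usual idiom (the one recommended in "Go Slices: Usage and Internals") is to copy the part you need; a minimal sketch:

package main

import "fmt"

func main() {
	a := make([]int, 10000)
	small := make([]int, 10)
	copy(small, a[:10]) // small gets its own minimal backing array
	a = nil             // the 10000-element array is now collectible
	fmt.Println(len(small))
}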

Why do my goroutines wait for each other instead of finishing when done?

I'm pretty new to Go and there is one thing in my code which I don't understand.
I wrote a simple bubblesort algorithm (I know it's not really efficient ;)).
Now I want to start 3 goroutines. Each should sort its array independently of the others. When finished, the function should print a "done" message.
Here is my Code:
package main

import (
	"fmt"
	"math/rand" // for pseudo-random numbers
	"time"      // for time functions, e.g. Now()
)

/* Simple bubblesort algorithm */
func bubblesort(str string, a []int) []int {
	for n := len(a); n > 1; n-- {
		for i := 0; i < n-1; i++ {
			if a[i] > a[i+1] {
				a[i], a[i+1] = a[i+1], a[i] // swap
			}
		}
	}
	fmt.Println(str + " done") // done message
	return a
}

/* fill slice with pseudo-random numbers */
func random_fill(a []int) []int {
	for i := 0; i < len(a); i++ {
		a[i] = rand.Int()
	}
	return a
}

func main() {
	rand.Seed(time.Now().UTC().UnixNano()) // set seed for rand
	a1 := make([]int, 34589)               // create slice
	a2 := make([]int, 42)                  // create slice
	a3 := make([]int, 9999)                // create slice
	a1 = random_fill(a1)                   // fill slice
	a2 = random_fill(a2)                   // fill slice
	a3 = random_fill(a3)                   // fill slice
	fmt.Println("Slices filled ...")

	go bubblesort("Thread 1", a1) // 1. routine start
	go bubblesort("Thread 2", a2) // 2. routine start
	go bubblesort("Thread 3", a3) // 3. routine start
	fmt.Println("Main working ...")
	time.Sleep(1 * 60 * 1e9) // wait 1 minute for the "done" messages
}
This is what I get:
Slices filled ...
Main working ...
Thread 1 done
Thread 2 done
Thread 3 done
Shouldn't Thread 2 finish first, since its slice is the smallest?
It seems that all the threads are waiting for the others to finish, because the "done" messages appear at the same time, no matter how big the slices are.
Where is my brainbug? =)
Thanks in advance.
Edit:
When putting time.Sleep(1) in the for loop of the bubblesort function, it seems to work. But I want to clock the duration of this code on different machines (I know, I have to change the random thing), so sleeping would falsify the results.
Indeed, there is no guarantee regarding the order in which your goroutines will be executed.
However, if you force true parallel processing by explicitly letting 2 processor cores run:
import (
	"fmt"
	"math/rand" // for pseudo-random numbers
	"runtime"
	"time" // for time functions, e.g. Now()
)

...

func main() {
	runtime.GOMAXPROCS(2)
	rand.Seed(time.Now().UTC().UnixNano()) // set seed for rand
	...
Then you will get the expected result:
Slices filled ...
Main working ...
Thread 2 done
Thread 3 done
Thread 1 done
Best regards
The important thing is the ability to "yield" the processor to other goroutines before the whole, potentially long-running workload is finished. This holds true in a single-core context as well as a multi-core context (because concurrency is not the same as parallelism).
This is exactly what the runtime.Gosched() function does:
Gosched yields the processor, allowing other goroutines to run. It
does not suspend the current goroutine, so execution resumes
automatically.
Be aware that a "context switch" is not free: it costs a little time each time.
On my machine without yielding, your program runs in 5.1s.
If you yield in the outer loop (for n := len(a); n > 1; n--), it runs in 5.2s: small overhead.
If you yield in the inner loop (for i := 0; i < n-1; i++), it runs in 61.7s: huge overhead!
Here is the modified program, correctly yielding, with the small overhead:

package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"time"
)

/* Simple bubblesort algorithm */
func bubblesort(str string, a []int, ch chan []int) {
	for n := len(a); n > 1; n-- {
		for i := 0; i < n-1; i++ {
			if a[i] > a[i+1] {
				a[i], a[i+1] = a[i+1], a[i] // swap
			}
		}
		runtime.Gosched() // yield after part of the workload
	}
	fmt.Println(str + " done") // done message
	ch <- a
}

/* fill slice with pseudo-random numbers */
func random_fill(a []int) []int {
	for i := 0; i < len(a); i++ {
		a[i] = rand.Int()
	}
	return a
}

func main() {
	rand.Seed(time.Now().UTC().UnixNano()) // set seed for rand
	a1 := make([]int, 34589) // create slice
	a2 := make([]int, 42)    // create slice
	a3 := make([]int, 9999)  // create slice
	a1 = random_fill(a1)     // fill slice
	a2 = random_fill(a2)     // fill slice
	a3 = random_fill(a3)     // fill slice
	fmt.Println("Slices filled ...")

	ch1 := make(chan []int) // create channel for result
	ch2 := make(chan []int) // create channel for result
	ch3 := make(chan []int) // create channel for result
	go bubblesort("Thread 1", a1, ch1) // 1. routine start
	go bubblesort("Thread 2", a2, ch2) // 2. routine start
	go bubblesort("Thread 3", a3, ch3) // 3. routine start
	fmt.Println("Main working ...")

	<-ch1 // wait for result 1
	<-ch2 // wait for result 2
	<-ch3 // wait for result 3
}
Output:
Slices filled ...
Main working ...
Thread 2 done
Thread 3 done
Thread 1 done
I also used channels to implement the rendez-vous, as suggested in my previous comment.
Best regards :)
Since the release of Go 1.2, the original program may now work fine without modification. You can try it in the Playground.
This is explained in the Go 1.2 release notes :
In prior releases, a goroutine that was looping forever could starve
out other goroutines on the same thread, a serious problem when
GOMAXPROCS provided only one user thread. In Go 1.2, this is partially
addressed: The scheduler is invoked occasionally upon entry to a
function.
