Can go channel keep a value for multiple reads [duplicate] - multithreading

This question already has answers here:
Multiple goroutines listening on one channel
(7 answers)
Closed 5 years ago.
I understand the regular behavior of a channel is that it empties after a read. Is there a way to keep an unbuffered channel value for multiple reads without the value been removed from the channel?
For example, I have a goroutine that generates a single data for multiple down stream go routines to use. I don't want to have to create multiple channels or use a buffered channel which would require me to duplicate the source data (I don't even know how many copies I will need). Effectively, I want to be able to do something like the following:
main{
ch := make(ch chan dType)
ch <- sourceDataGenerator()
for _,_ := range DynamicRange{
go TargetGoRoutine(ch)
}
close(ch) // would want this to remove the value and the channel
}
func(ch chan dType) TargetGoRoutine{
targetCollection <- ch // want to keep the channel value after read
}
EDIT
Some feel this is a duplicate question. Perhaps, but not sure. The solution here seems simple in the end as n-canter pointed out. All it needs is for every go routine to "recycle" the data by putting it back to the channel after use. None of the supposedly "duplicates" provided this solution. Here is a sample:
package main
import (
"fmt"
"sync"
)
func main() {
c := make(chan string)
var wg sync.WaitGroup
wg.Add(5)
for i := 0; i < 5; i++ {
go func(i int) {
wg.Done()
msg := <-c
fmt.Printf("Data:%s, From go:%d\n", msg, i)
c <-msg
}(i)
}
c <- "Original"
wg.Wait()
fmt.Println(<-c)
}
https://play.golang.org/p/EXBbf1_icG

You may readd value back to the channel after reading, but then all your gouroutines will read shared value sequentially and also you'll need some synchronization primitives for last goroutine not to block.
As far as I know the only case when you can use the single channel for broadcasting is closing it. In this case all readers will be notified.
If you don't want to duplicate large data, maybe you'd better use some global variable. But use it carefully, because it violates golang rule: "Don't communicate by sharing memory; share memory by communicating."
Also look at this question How to broadcast message using channel

Related

golang multithreaded web crawler runs into deadlock

I just started to learn multithreaded programming using golang, and I'm trying to write a multithreaded web crawler using BFS traversal, however I cannot get the code working. The error I get is fatal error: all goroutines are asleep - deadlock!
I will paste the code below, but let me explain conceptually how it works:
I have one master thread (the main function itself) and N worker threads. I intentionally chose to use BFS approach with a fixed amount of worker threads, because it seems using a DFS approach I will have to spawn a new thread for each single new URL to crawl, which might become a huge burden for context switch.
I am using two channels:
urlsToCrawl: master thread sends URLs to crawl to worker threads.
urlsDiscovered: worker threads send discovered URLs back to master.
Here is the code implementation, I removed some non relevant details (e.g. how to parse html page etc..)
The trick I'm trying to do here is: I am using the channel as a queue to do BFS, and when the queue's size is 0, it is impossible to know whether it is because "A. there is really no more URLs to crawl" OR because "B. some worker thread(s) are still working so there might be more URLs to crawl soon". Therefore I introduced this count variable, basically whenever a new url is sent to workers to be crawled, count is incremented, therefore when count == 0 and channel is empty, it would mean "A. there is really no more URLs to crawl"; otherwise when count > 0 and channel is empty, it would mean "B. some worker thread(s) are still working so there might be more URLs to crawl soon".
However as I mentioned, this doesn't seem to work and I run into deadlock. Would anyone please shed some light? Thanks!
package main
import (
"fmt"
)
var (
count = 0 // This tracks how many worker threads are actively working right now
)
func crawlUrl(urlsToCrawl chan string, urlsDiscovered chan Pair) {
for url := range urlsToCrawl {
urls := getUrls(url) // This returns an array of string, if no URL found, it returns an empty array
urlsDiscovered <- urls
}
}
func main() {
urlsToCrawl := make(chan string)
urlsDiscovered := make(chan string[])
i := 0
for i < 8 {
go crawlUrl(urlsToCrawl, urlsDiscovered)
i++
}
visited := map[string]bool{"some_seed_url": true}
count++
urlsToCrawl <- "some_seed_url"
for urls := range urlsDiscovered {
count-- // One message is received by master, meaning one worker thread has finished an job item, therefore decrementing count
for _, url := range urls {
_, ok := visited[url]
if ok {
continue // This URL has been crawled before
}
visited[url] = true
count ++ // One more work item will be sent to worker, therefore first increment count
urlsToCrawl <- url
}
if count == 0 {
close(urlsDiscovered)
close(urlsToCrawl)
break
} // else some worker must be working so let's wait to see if there is new msg coming through the channel
}
}
Your channels are unbuffered
urlsToCrawl := make(chan string)
urlsDiscovered := make(chan string[])
So a goroutine which reads from or writes to a channel will block until a goroutine on the other side is doing the opposite.
So you start 8 crawlUrl goroutines which all block while reading from urlsToCrawl, meaning that main can send 8 urls before blocking. The crawlUrl goroutines are blocked until main reads from urlsDiscovered. So if you have more than 8 URL's going around, all goroutines are waiting on each other(deadlock).
The solution to this is to use buffered channels with a capacity you are very unlikely to exceed:
urlsToCrawl := make(chan string, 1000)
urlsDiscovered := make(chan string[], 100)
If you expect you might still exceed the capacity of the channel in extreme cases, you can perform non-blocking operations which allow you to for example discard discovered URL's if the channel is full instead of blocking.
select {
case: urlsDiscovered <- urls:
// on success (url written)
default:
// channel is full, can't write without blocking
}
Try to integrate the WaitGroup package.

Combining multiple maps that are stored on channel (Same key's values get summed.) in Go

My objective is to create a program that counts every unique word's occurrence in a text file in a parallellised fashion, all occurrences have to be presented in a single map.
What I do here is dividing the textfile into string and then to an array. That array is then divided into two slices of equal length and fed concurrently to the mapper function.
func WordCount(text string) (map[string]int) {
wg := new(sync.WaitGroup)
s := strings.Fields(newText)
freq := make(map[string]int,len(s))
channel := make(chan map[string]int,2)
wg.Add(1)
go mappers(s[0:(len(s)/2)], freq, channel,wg)
wg.Add(1)
go mappers(s[(len(s)/2):], freq, channel,wg)
wg.Wait()
actualMap := <-channel
return actualMap
func mappers(slice []string, occurrences map[string]int, ch chan map[string]int, wg *sync.WaitGroup) {
var l = sync.Mutex{}
for _, word := range slice {
l.Lock()
occurrences[word]++
l.Unlock()
}
ch <- occurrences
wg.Done()
}
The bottom line is, is that I get a huge multiline error that starts with
fatal error: concurrent map writes
When I run the code. Which I thought I guarded for through mutual exclusion
l.Lock()
occurrences[word]++
l.Unlock()
What am I doing wrong here? And furthermore. How can I combine all the maps in a channel? And with combine I mean same key's values get summed in the new map.
The main problem is that you use a separate lock in each goroutine. That doesn't do any help to serialize access to the map. The same lock has to be used in each goroutine.
And since you use the same map in each goroutine, you don't have to merge them, and you don't need a channel to deliver the result.
Even if you use the same mutex in each goroutine, since you use a single map, this probably won't help in performance, the goroutines will have to compete with each other for the map's lock.
You should create a separate map in each goroutine, use that to count locally, and then deliver the result map on the channel. This might give you a performance boost.
But then you don't need a lock, since each goroutine will have its own map which it can read/write without a mutex.
But then you'll do have to deliver the result on the channel, and then merge it.
And since goroutines deliver results on the channel, the waitgroup becomes unnecessary.
func WordCount(text string) map[string]int {
s := strings.Fields(text)
channel := make(chan map[string]int, 2)
go mappers(s[0:(len(s)/2)], channel)
go mappers(s[(len(s)/2):], channel)
total := map[string]int{}
for i := 0; i < 2; i++ {
m := <-channel
for k, v := range m {
total[k] += v
}
}
return total
}
func mappers(slice []string, ch chan map[string]int) {
occurrences := map[string]int{}
for _, word := range slice {
occurrences[word]++
}
ch <- occurrences
}
Example testing it:
fmt.Println(WordCount("aa ab cd cd de ef a x cd aa"))
Output (try it on the Go Playground):
map[a:1 aa:2 ab:1 cd:3 de:1 ef:1 x:1]
Also note that in theory this looks "good", but in practice you may still not achieve any performance boost, as the goroutines do too "little" work, and launching them and merging the results requires effort which may outweight the benefits.

Go Memory Model Happens Before (Channels with Shared State)

I'm trying to more fully understand the nature of the Happens-Before relationship between channels and other shared state. Specifically, I want to see if some kind of memory fence is created on a channel send and receive operation.
For example, if I send a message on a channel, do all other operations surrounding modification of shared state "happen before" the send/receive operation. In my particular example, I'm only ever writing from a single go routine and then reading from a single go routine.
(Aside: the obvious answer in the example below is to put an instance of the Person struct on the channel directly, but that's not what I'm asking.)
package main
func main() {
channel := make(chan int, 128)
go func() {
person := &sharedState[0]
person.Name = "Hello, World!"
channel <- 0
}()
index := <-channel
person := sharedState[index]
if person.Name != "Hello, World!" {
// unintended race condition
}
}
type Person struct{ Name string }
var sharedState = make([]Person, 1024)
The memory model guarantees that when the channel write operation executes, all operations in that goroutine that comes before the channel operation are visible. So in your example, the "unintended race condition" cannot happen, because when the channel is read, the assignment happened in the goroutine is visible. This, of course, assumes that there isn't another goroutine that's writing to that same variable. If there was another goroutine writing to that same variable, then you would need to synchronize that goroutine as well to avoid the race.

How to safely interact with channels in goroutines in Golang

I am new to go and I am trying to understand the way channels in goroutines work. To my understanding, the keyword range could be used to iterate over a the values of the channel up until the channel is closed or the buffer runs out; hence, a for range c will repeatedly loops until the buffer runs out.
I have the following simple function that adds value to a channel:
func main() {
c := make(chan int)
go printchannel(c)
for i:=0; i<10 ; i++ {
c <- i
}
}
I have two implementations of printchannel and I am not sure why the behaviour is different.
Implementation 1:
func printchannel(c chan int) {
for range c {
fmt.Println(<-c)
}
}
output: 1 3 5 7
Implementation 2:
func printchannel(c chan int) {
for i:=range c {
fmt.Println(i)
}
}
output: 0 1 2 3 4 5 6 7 8
And I was expecting neither of those outputs!
Wanted output: 0 1 2 3 4 5 6 7 8 9
Shouldnt the main function and the printchannel function run on two threads in parallel, one adding values to the channel and the other reading the values up until the channel is closed? I might be missing some fundamental go/thread concept here and pointers to that would be helpful.
Feedback on this (and my understanding to channels manipulation in goroutines) is greatly appreciated!
Implementation 1. You're reading from the channel twice - range c and <-c are both reading from the channel.
Implementation 2. That's the correct approach. The reason you might not see 9 printed is that two goroutines might run in parallel threads. In that case it might go like this:
main goroutine sends 9 to the channel and blocks until it's read
second goroutine receives 9 from the channel
main goroutine unblocks and exits. That terminates whole program which doesn't give second goroutine a chance to print 9
In case like that you have to synchronize your goroutines. For example, like so
func printchannel(c chan int, wg *sync.WaitGroup) {
for i:=range c {
fmt.Println(i)
}
wg.Done() //notify that we're done here
}
func main() {
c := make(chan int)
wg := sync.WaitGroup{}
wg.Add(1) //increase by one to wait for one goroutine to finish
//very important to do it here and not in the goroutine
//otherwise you get race condition
go printchannel(c, &wg) //very important to pass wg by reference
//sync.WaitGroup is a structure, passing it
//by value would produce incorrect results
for i:=0; i<10 ; i++ {
c <- i
}
close(c) //close the channel to terminate the range loop
wg.Wait() //wait for the goroutine to finish
}
As to goroutines vs threads. You shouldn't confuse them and probably should understand the difference between them. Goroutines are green threads. There're countless blog posts, lectures and stackoverflow answers on that topic.
In implementation 1, range reads into channel once, then again in Println. Hence you're skipping over 2, 4, 6, 8.
In both implementations, once the final i (9) has been sent to goroutine, the program exits. Thus goroutine does not have the time to print out 9. To solve it, use a WaitGroup as has been mentioned in the other answer, or a done channel to avoid semaphore/mutex.
func main() {
c := make(chan int)
done := make(chan bool)
go printchannel(c, done)
for i:=0; i<10 ; i++ {
c <- i
}
close(c)
<- done
}
func printchannel(c chan int, done chan bool) {
for i := range c {
fmt.Println(i)
}
done <- true
}
The reason your first implementation only returns every other number is because you are, in effect "taking" from c twice each time the loop runs: first with range, then again with <-. It just happens that you're not actually binding or using the first value taken off the channel, so all you end up printing is every other one.
An alternative approach to your first implementation would be to not use range at all, e.g.:
func printchannel(c chan int) {
for {
fmt.Println(<-c)
}
}
I could not replicate the behavior of your second implementation, on my machine, but the reason for that is that both of your implementations are racy - they will terminate whenever main ends, regardless of what data may be pending in a channel or however many goroutines may be active.
As a closing note, I'd warn you not to think about goroutines as explicitly being "threads", though they have a similar mental model and interface. In a simple program like this it's not at all unlikely that Go might just do it all using a single OS thread.
Your first loop does not work as you have 2 blocking channel receivers and they do not execute at the same time.
When you call the goroutine the loop starts, and it waits for the first value to be sent to the channel. Effectively think of it as <-c .
When the for loop in the main function runs it sends 0 on the Chan. At this point the range c recieves the value and stops blocking the execution of the loop.
Then it is blocked by the reciever at fmt.println(<-c) . When 1 is sent on the second iteration of the loop in main the recieved at fmt.println(<-c) reads from the channel, allowing fmt.println to execute thus finishing the loop and waiting for a value at the for range c .
Your second implementation of the looping mechanism is the correct one.
The reason it exits before printing to 9 is that after the for loop in main finishes the program goes ahead and completes execution of main.
In Go func main is launched as a goroutine itself while executing. Thus when the for loop in main completes it goes ahead and exits, and as the print is within a parallel goroutine that is closed, it is never executed. There is no time for it to print as there is nothing to block main from completing and exiting the program.
One way to solve this is to use wait groups http://www.golangprograms.com/go-language/concurrency.html
In order to get the expected result you need to have a blocking process running in main that provides enough time or waits for confirmation of the execution of the goroutine before allowing the program to continue.

Go: Thread-Safe Concurrency Issue with Sparse Array Read & Write

I'm writing a search engine in Go in which I have an inverted index of words to the corresponding results for each word. There is a set dictionary of words and so the words are already converted into a StemID, which is an integer starting from 0. This allows me to use a slice of pointers (i.e. a sparse array) to map each StemID to the structure which contains the results of that query. E.g. var StemID_to_Index []*resultStruct. If aardvark is 0 then the pointer to the resultStruct for aardvark is located at StemID_to_Index[0], which will be nil if the result for this word is currently not loaded.
There is not enough memory on the server to store all of this in memory, so the structure for each StemID will be saved as separate files and these can be loaded into the StemID_to_Index slice. If StemID_to_Index is currently nil for this StemID then the result is not cached and needs to be loaded, otherwise it's already loaded (cached) and so can be used directly. Each time a new result is loaded the memory usage is checked and if it's over the threshold then 2/3 of the loaded results are thrown away (StemID_to_Index is set to nil for these StemIDs and a garbage collection is forced.)
My problem is the concurrency. What is the fastest and most efficient way in which I can have multiple threads searching at the same time without having problems with different threads trying to read and write to the same place at the same time? I'm trying to avoid using mutexes on everything as that would slow down every single access attempt.
Do you think I would get away with loading the results from disk in the working thread and then delivering the pointer to this structure to an "updater" thread using channels, which then updates the nil value in the StemID_to_Index slice to the pointer of the loaded result? This would mean that two threads would never attempt to write at the same time, but what would happen if another thread tried to read from that exact index of StemID_to_Index while the "updater" thread was updating the pointer? It doesn't matter if a thread is given a nil pointer for a result which is currently being loaded, because it will just be loaded twice and while that is a waste of resources it would still deliver the same result and since that is unlikely to happen very often, it's forgiveable.
Additionally, how would the working thread which send the pointer to be updated to the "updater" thread know when the "updater" thread has finished updating the pointer in the slice? Should it just sleep and keep checking, or is there an easy way for the updater to send a message back to the specific thread which pushed to the channel?
UPDATE
I made a little test script to see what would happen if attempting to access a pointer at the same time as modifying it... it seems to always be OK. No errors. Am I missing something?
package main
import (
"fmt"
"sync"
)
type tester struct {
a uint
}
var things *tester
func updater() {
var a uint
for {
what := new(tester)
what.a = a
things = what
a++
}
}
func test() {
var t *tester
for {
t = things
if t != nil {
if t.a < 0 {
fmt.Println(`Error1`)
}
} else {
fmt.Println(`Error2`)
}
}
}
func main() {
var wg sync.WaitGroup
things = new(tester)
go test()
go test()
go test()
go test()
go test()
go test()
go updater()
go test()
go test()
go test()
go test()
go test()
wg.Add(1)
wg.Wait()
}
UPDATE 2
Taking this further, even if I read and write from multiple threads to the same variable at the same time... it makes no difference, still no errors:
From above:
func test() {
var a uint
var t *tester
for {
t = things
if t != nil {
if t.a < 0 {
fmt.Println(`Error1`)
}
} else {
fmt.Println(`Error2`)
}
what := new(tester)
what.a = a
things = what
a++
}
}
This implies I don't have to worry about concurrency at all... again: am I missing something here?
This sounds like a perfect use case for a memory mapped file:
package main
import (
"log"
"os"
"unsafe"
"github.com/edsrzf/mmap-go"
)
func main() {
// Open the backing file
f, err := os.OpenFile("example.txt", os.O_RDWR|os.O_CREATE, 0644)
if err != nil {
log.Fatalln(err)
}
defer f.Close()
// Set it's size
f.Truncate(1024)
// Memory map it
m, err := mmap.Map(f, mmap.RDWR, 0)
if err != nil {
log.Fatalln(err)
}
defer m.Unmap()
// m is a byte slice
copy(m, "Hello World")
m.Flush()
// here's how to use it with a pointer
type Coordinate struct{ X, Y int }
// first get the memory address as a *byte pointer and convert it to an unsafe
// pointer
ptr := unsafe.Pointer(&m[20])
// next convert it into a different pointer type
coord := (*Coordinate)(ptr)
// now you can use it directly
*coord = Coordinate{1, 2}
m.Flush()
// and vice-versa
log.Println(*(*Coordinate)(unsafe.Pointer(&m[20])))
}
The memory map can be larger than real memory and the operating system will handle all the messy details for you.
You will still need to make sure that separate goroutines never read/write to the same segment of memory at the same time.
My top answer would be to use elasticsearch with a client like elastigo.
If that's not an option, it would really help to know how much you care about race-y behavior. If you don't care, a write could happen right after a read finishes, the user finishing the read will get stale data. You can just have a queue of write and read operations and have multiple threads feed into that queue and one dispatcher issue the operations to the map one-at-a-time as they come it. In all other scenarios, you will need a mutex if there are multiple readers and writers. Maps aren't thread safe in go.
Honestly though, I would just add a mutex to make things simple for now and optimize by analyzing where your bottlenecks actually lie. It seems like you checking a threshold and then purging 2/3 of your cache is a bit arbitrary, and I wouldn't be surprised if you kill performance by doing something like that. Here's on situation where that would break down:
Requesters 1, 2, 3, and 4 are frequently accessing many of the same words on files A & B.
Requester 5, 6, 7 and 8 are frequently accessing many of the same words stored on files C & D.
Now when requests interleaved between these requesters and files happen in rapid succession, you may end up purging your 2/3 of your cache over and over again of results that may be requested shortly after. There are a couple other approaches:
Cache words that are frequently accessed at the same time on the same box and have multiple caching boxes.
Cache on a per-word basis with some sort of ranking of how popular that word is. If a new word is accessed from a file while the cache is full, see if other more popular words live in that file and purge less popular entries in the cache in hopes that those words will have a higher hit rate.
Both approaches 1 & 2.

Resources