Troubles with gitlab scraping via golang - multithreading

I'm newbie in programming and I need help. Trying to write gitlab scraper on golang.
Something goes wrong when i'm trying to get information about projects in multithreading mode.
Here is the code:
func (g *Gitlab) getAPIResponce(url string, structure interface{}) error {
responce, responce_error := http.Get(url)
if responce_error != nil {
return responce_error
}
ret, _ := ioutil.ReadAll(responce.Body)
if string(ret) != "[]" {
err := json.Unmarshal(ret, structure)
return err
}
return errors.New(error_emptypage)
}
...
func (g *Gitlab) GetProjects() {
projects_chan := make(chan Project, g.LatestProjectID)
var waitGroup sync.WaitGroup
queue := make(chan struct{}, 50)
for i := g.LatestProjectID; i > 0; i-- {
url := g.BaseURL + projects_url + "/" + strconv.Itoa(i) + g.Token
waitGroup.Add(1)
go func(url string, channel chan Project) {
queue <- struct{}{}
defer waitGroup.Done()
var oneProject Project
err := g.getAPIResponce(url, &oneProject)
if err != nil {
fmt.Println(err.Error())
}
fmt.Printf(".")
channel <- oneProject
<-queue
}(url, projects_chan)
}
go func() {
waitGroup.Wait()
close(projects_chan)
}()
for project := range projects_chan {
if project.ID != 0 {
g.Projects = append(g.Projects, project)
}
}
}
And here is the output:
$ ./gitlab-auditor
latest project = 1532
Gathering projects...
.......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Get https://gitlab.example.com/api/v4/projects/563&private_token=SeCrEt_ToKeN: unexpected EOF
Get https://gitlab.example.com/api/v4/projects/558&private_token=SeCrEt_ToKeN: unexpected EOF
..Get https://gitlab.example.com/api/v4/projects/531&private_token=SeCrEt_ToKeN: unexpected EOF
Get https://gitlab.example.com/api/v4/projects/571&private_token=SeCrEt_ToKeN: unexpected EOF
.Get https://gitlab.example.com/api/v4/projects/570&private_token=SeCrEt_ToKeN: unexpected EOF
..Get https://gitlab.example.com/api/v4/projects/467&private_token=SeCrEt_ToKeN: unexpected EOF
Get https://gitlab.example.com/api/v4/projects/573&private_token=SeCrEt_ToKeN: unexpected EOF
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Every time it's different projects, but it's id is around 550.
When I'm trying to curl links from output, i'm getting normal JSON. When I'm trying to run this code with queue := make(chan struct{}, 1) (in single thread) - everything is fine.
What can it be?

i would say this not a very clear way to achieve concurrency.
what seems to be happening here is
you create a buffered channel that has a size of 50.
then you fire up 1532 goroutines
the first 50 of them enqueue themselves and start processing. by the time they <-queue and free up somespace a random one from the next manages to get on the queue.
as people say in the comments most certainly you hit some limits around the time it the blast has made it around id 550. Then gitlab's API is angry at you and rate limits.
then another goroutine is fired that will close the channel to notify the main goroutine
the main goroutine reads messages.
the talk go concurrency patterns
as well as this blog post concurrency in go might help.
personally i rarely use buffered channels. for your problem i would go like:
define a number of workers
have the main goroutine fire up the workers with a func listening on a channel of ints , doing the api call, writing to a channel of projects
have the main goroutine send to a channel of ints the project number to be fetched and read from the channel of projects.
maybe ratelimit by firing a ticker and have main read from it before it sends the next request?
main closes the number channel to notify the others to die.

Related

How to make work concurrent while sending data on a stream in golang?

I have a golang grpc server which has streaming endpoint. Earlier I was doing all the work sequentially and sending on the stream but then I realize I can make the work concurrent and then send on stream. From grpc-go docs: I understood that I can make the work concurrent, but you can't make sending on the stream concurrent so I got below code which does the job.
Below is the code I have in my streaming endpoint which sends data back to client in a streaming way. This does all the work concurrently.
// get "allCids" from lot of files and load in memory.
allCids := .....
var data = allCids.([]int64)
out := make(chan *custPbV1.CustomerResponse, len(data))
wg := &sync.WaitGroup{}
wg.Add(len(data))
go func() {
wg.Wait()
close(out)
}()
for _, cid := range data {
go func (id int64) {
defer wg.Done()
pd := repo.GetCustomerData(strconv.FormatInt(cid, 10))
if !pd.IsCorrect {
return
}
resources := us.helperCom.GenerateResourceString(pd)
val, err := us.GenerateInfo(clientId, resources, cfg)
if err != nil {
return
}
out <- val
}(cid)
}
for val := range out {
if err := stream.Send(val); err != nil {
log.Printf("send error %v", err)
}
}
Now problem I have is size of data slice can be approx a million so I don't want to spawn million go routine doing the job. How do I handle that scenario here? If instead of len(data) I use 100 then will that work for me or I need to slice data as well in 100 sub arrays? I am just confuse on what is the best way to deal with this problem?
I recently started with golang so pardon me if there are any mistakes in my above code while making it concurrent.
Please check this pseudo code
func main() {
works := make(chan int, 100)
errChan := make(chan error, 100)
out := make(chan *custPbV1.CustomerResponse, 100)
// spawn fixed workers
var workerWg sync.WaitGroup
for i := 0; i < 100; i++ {
workerWg.Add(1)
go worker(&workerWg, works, errChan, out)
}
// give input
go func() {
for _, cid := range data {
// this will be blocked if all the workers are busy and no space is left in the channel.
works <- cid
}
close(works)
}()
var analyzeResults sync.WaitGroup
analyzeResults.Add(2)
// process errors
go func() {
for err := range errChan {
log.Printf("error %v", err)
}
analyzeResults.Done()
}()
// process outout
go func() {
for val := range out {
if err := stream.Send(val); err != nil {
log.Printf("send error %v", err)
}
}
analyzeResults.Done()
}()
workerWg.Wait()
close(out)
close(errChan)
analyzeResults.Wait()
}
func worker(job *sync.WaitGroup, works chan int, errChan chan error, out chan *custPbV1.CustomerResponse) {
defer job.Done()
// Idle worker takes the work from this channel.
for cid := range works {
pd := repo.GetCustomerData(strconv.FormatInt(cid, 10))
if !pd.IsCorrect {
errChan <- errors.New(fmt.Sprintf("pd %d is incorrect", pd))
// we can not return here as the total number of workers will be reduced. If all the workers does this then there is a chance that no workers are there to do the job
continue
}
resources := us.helperCom.GenerateResourceString(pd)
val, err := us.GenerateInfo(clientId, resources, cfg)
if err != nil {
errChan <- errors.New(fmt.Sprintf("got error", err))
continue
}
out <- val
}
}
Explanation:
This is a worker pool implementation where we spawn a fixed number of goroutines(100 workers here) to do the same job(GetCustomerData() & GenerateInfo() here) but with different input data(cid here). 100 workers here does not mean that it is parallel but concurrent(depends on the GOMAXPROCS). If one worker is waiting for io result(basically some blocking operation)then that particular goroutine will be context switched and other worker goroutine gets a chance to execute. But increasing goroutuines (workers) may not give much performance but can leads to contention on the channel as more workers are waiting for the input job on that channel.
The benefit over splitting the 1 million data to subslice is that. Lets say we have 1000 jobs and 100 workers. each worker will get assigned to the jobs 1-10, 11-20 etc... What if the first 10 jobs is taking more time than others. In that case the first worker is overloaded and the other workers will finish the tasks and will be idle even though there are pending tasks. So to avoid this situation, this is the best solution as the idle worker will take the next job. So that no worker is more overloaded compared to the other workers

Stop channels when ws not able to connect

I have the following code which works ok, the issue is that when the socket.Connect() fails to connect I want to stop the process, I’ve tried with the following code
but it’s not working, I.e. if the socket connect fails to connect the program still runs.
What I want to happen is that if the connect fails, the process stops and the channe…what am I missing here?
func run (appName string) (err error) {
done = make(chan bool)
defer close(done)
serviceURL, e := GetContext().getServiceURL(appName)
if e != nil {
err = errors.New("process failed" + err.Error())
LogDebug("Exiting %v func[err =%v]", methodName, err)
return err
}
url := "wss://" + serviceURL + route
socket := gowebsocket.New(url)
addPass(&socket, user, pass)
socket.OnConnectError = OnConnectErrorHandler
socket.OnConnected = OnConnectedHandler
socket.OnTextMessage = socketTextMessageHandler
socket.OnDisconnected = OnDisconnectedHandler
LogDebug("In %v func connecting to URL %v", methodName, url)
socket.Connect()
jsonBytes, e := json.Marshal(payload)
if e != nil {
err = errors.New("build process failed" + e.Error())
LogDebug("Exiting %v func[err =%v]", methodName, err)
return err
}
jsonStr := string(jsonBytes)
LogDebug("In %v Connecting to payload JSON is %v", methodName, jsonStr)
socket.SendText(jsonStr)
<-done
LogDebug("Exiting %v func[err =%v]", methodName, err)
return err
}
func OnConnectErrorHandler(err error, socket gowebsocket.Socket) {
methodName := "OnConnectErrorHandler"
LogDebug("Starting %v parameters [err = %v , socket = %v]", methodName, err, socket)
LogInfo("Disconnected from server ")
done <- true
}
The process should open one ws connection for process that runs about 60-90 sec (like execute npm install) and get the logs of the process via web socket and when it finish , and of course handle the issue that could happen like network issue or some error running the process
So, #Slabgorb is correct - if you look here (https://github.com/sacOO7/GoWebsocket/blob/master/gowebsocket.go#L87) you will see that the OnConnectErrorHandler is called synchronously during the execution of your call to Connect(). The Connect() function doesn't kick off a separate goroutine to handle the websocket until after the connection is fully established and the OnConnected callback has completed. So when you try to write to the unbuffered channel done, you are blocking the same goroutine that called into the run() function to begin with, and you deadlock yourself, because no goroutine will ever be able to read from the channel to unblock you.
So you could go with his solution and turn it into a buffered channel, and that will work, but my suggestion would be not to write to a channel for this sort of one-time flag behavior, but use close signaling instead. Define a channel for each condition you want to terminate run(), and in the appropriate websocket handler function, close the channel when that condition happens. At the bottom of run(), you can select on all the channels, and exit when the first one closes. It would look something like this:
package main
import "errors"
func run(appName string) (err error) {
// first, define one channel per socket-closing-reason (DO NOT defer close these channels.)
connectErrorChan := make(chan struct{})
successDoneChan := make(chan struct{})
surpriseDisconnectChan := make(chan struct{})
// next, wrap calls to your handlers in a closure `https://gobyexample.com/closures`
// that captures a reference to the channel you care about
OnConnectErrorHandler := func(err error, socket gowebsocket.Socket) {
MyOnConnectErrorHandler(connectErrorChan, err, socket)
}
OnDisconnectedHandler := func(err error, socket gowebsocket.Socket) {
MyOnDisconectedHandler(surpriseDisconnectChan, err, socket)
}
// ... declare any other handlers that might close the connection here
// Do your setup logic here
// serviceURL, e := GetContext().getServiceURL(appName)
// . . .
// socket := gowebsocket.New(url)
socket.OnConnectError = OnConnectErrorHandler
socket.OnConnected = OnConnectedHandler
socket.OnTextMessage = socketTextMessageHandler
socket.OnDisconnected = OnDisconnectedHandler
// Prepare and send your message here...
// LogDebug("In %v func connecting to URL %v", methodName, url)
// . . .
// socket.SendText(jsonStr)
// now wait for one of your signalling channels to close.
select { // this will block until one of the handlers signals an exit
case <-connectError:
err = errors.New("never connected :( ")
case <-successDone:
socket.Close()
LogDebug("mission accomplished! :) ")
case <-surpriseDisconnect:
err = errors.New("somebody cut the wires! :O ")
}
if err != nil {
LogDebug(err)
}
return err
}
// *Your* connect error handler will take an extra channel as a parameter
func MyOnConnectErrorHandler(done chan struct{}, err error, socket gowebsocket.Socket) {
methodName := "OnConnectErrorHandler"
LogDebug("Starting %v parameters [err = %v , socket = %v]", methodName, err, socket)
LogInfo("Disconnected from server ")
close(done) // signal we are done.
}
This has a few advantages:
1) You don't need to guess which callbacks happen in-process and which happen in background goroutines (and you don't have to make all your channels buffered 'just in case')
2) Selecting on the multiple channels lets you find out why you are exiting and maybe handle cleanup or logging differently.
Note 1: If you choose to use close signaling, you have to use different channels for each source in order to avoid race conditions that might cause a channel to get closed twice from different goroutines (e.g. a timeout happens just as you get back a response, and both handlers fire; the second handler to close the same channel causes a panic.) This is also why you don't want to defer close all the channel at the top of the function.
Note 2: Not directly relevant to your question, but -- you don't need to close every channel - once all the handles to it go out of scope, the channel will get garbage collected whether or not it has been closed.
Ok, what is happening is the channel is blocking when you try to add something to it. Try initializing the done channel with a buffer (I used 1) like this:
done = make(chan bool, 1)

golang: how to bind code with thread?

I have nearly implemented a face recognition Go server. My face recognition algorithm uses caffe, caffe is a thread-binding graphical library, that means I have to init and call algorithm in same thread, so I checked LockOSThread().
LockOSThread uses 1 thread, but my server owns 4 GPU.
In C/C++, I could create 4 threads, initialize algorithm in each thread, use sem_wait and sem_post to assign task, 1 thread use 1 GPU.
How to do the same thing in Go, how to bind code with thread?
You spawn some number of goroutines, run runtime.LockOSThread() in each
and then initialize your graphical library in each.
You then use regular Go communication primitives to send tasks to those
goroutines. Usually, the most simple way is to have each goroutine read "tasks" from a channel and send the results back, like in
type Task struct {
Data DataTypeToContainRecognitionTask
Result chan<- DataTypeToContainRecognitionResult
}
func GoroutineLoop(tasks <-chan Task) {
for task := range tasks {
task.Result <- recognize(Data)
}
}
tasks := make(chan Task)
for n := 4; n > 0; n-- {
go GoroutineLoop(tasks)
}
for {
res := make(chan DataTypeToContainRecognitionResult)
tasks <- Task{
Data: makeRecognitionData(),
Result: res,
}
result <- res
// Do something with the result
}
As to figuring out how much goroutines to start, there exist different
strategies.
The simplest approach is probably to query runtime.NumCPU()
and using this number.

golang concurrent http request handling

I am creating a simple http server using golang. I have two questions, one is more theoretic and another one about the real program.
Concurrent request handling
I create a server and use s.ListenAndServe() to handle the requests.
As much as I understand the requests served concurrently. I use a simple handler to check it:
func ServeHTTP(rw http.ResponseWriter, request *http.Request) {
fmt.Println("1")
time.Sleep(1 * time.Second) //Phase 2 delete this line
fmt.Fprintln(rw, "Hello, world.")
fmt.Println("2")
}
I see that if I send several requests, I will see all the "1" appear and only after a second all the "2" appear.
But if I delete the Sleep line, I see that program never start a request before it finishes with the previous one (the output is 1 2 1 2 1 2 ...).
So I don't understand, if they are concurrent or not really. If they are I would expect to see some mess in prints...
The real life problem
In the real handler, I send the request to another server and return the answer to the user (with some changes to request and the answer but in idea it is kind of a proxy). All this of course takes time and from what can see (by adding some prints to the handler), the requests are handled one by one, with no concurrency between them (my prints show me that a request starts, go through all the steps, ends and only then I see a new start....).
What can I do to make them really concurrent?
Putting the handler function as goroutine gives an error, that body of the request is already closed. Also if it is already concurrent adding more goroutines will make things only worse.
Thank you!
Your example makes it very hard to tell what is happening.
The below example will clearly illustrate that the requests are run in parallel.
package main
import (
"fmt"
"log"
"net/http"
"time"
)
func main() {
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
if len(r.FormValue("case-two")) > 0 {
fmt.Println("case two")
} else {
fmt.Println("case one start")
time.Sleep(time.Second * 5)
fmt.Println("case one end")
}
})
if err := http.ListenAndServe(":8000", nil); err != nil {
log.Fatal(err)
}
}
Make one request to http://localhost:8000
Make another request to http://localhost:8000?case-two=true within 5 seconds
console output will be
case one start
case two
case one end
It does serve requests in a concurrent fashion as can be seen here in the source https://golang.org/src/net/http/server.go#L2293.
Here is a contrived example:
package main
import (
"fmt"
"log"
"net/http"
"sync"
"time"
)
func main() {
go startServer()
sendRequest := func() {
resp, _ := http.Get("http://localhost:8000/")
defer resp.Body.Close()
}
start := time.Now()
var wg sync.WaitGroup
ch := make(chan int, 10)
for i := 0; i < 10; i++ {
wg.Add(1)
go func(n int) {
defer wg.Done()
sendRequest()
ch <- n
}(i)
}
go func() {
wg.Wait()
close(ch)
}()
fmt.Printf("completion sequence :")
for routineNumber := range ch {
fmt.Printf("%d ", routineNumber)
}
fmt.Println()
fmt.Println("time:", time.Since(start))
}
func startServer() {
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
time.Sleep(1 * time.Second)
})
if err := http.ListenAndServe(":8000", nil); err != nil {
log.Fatal(err)
}
}
Over several runs it is easy to visualize that the completion ordering of the go routines which send the requests are completely random and given the fact that channels are fifo, we can summarize that the server handled the requests in a concurrent fashion, irrespective of whether HandleFunc sleeps or not.
(Assumption being all the requests start at about the same time).
In addition to the above, if you did sleep for a second in HandleFunc, the time it takes to complete all 10 routines is consistently 1.xxx seconds which further shows that the server handled the requests concurrently as else the total time to complete all requests should have been 10+ seconds.
Example:
completion sequence :3 0 6 2 9 4 5 1 7 8
time: 1.002279359s
completion sequence :7 2 3 0 6 4 1 9 5 8
time: 1.001573873s
completion sequence :6 1 0 8 5 4 2 7 9 3
time: 1.002026465s
Analyzing concurrency by printing without synchronization is almost always indeterminate.
While Go serves requests concurrently, the client might actually be blocking (wait for the first request to complete before sending the second), and then one will see exactly the behavior reported initially.
I was having the same problem, and my client code was sending a "GET" request via XMLHttpRequest to a slow handler (I used a similar handler code as posted above, with 10 sec timeout). In turns out that such requests block each other. Example JavaScript client code:
for (var i = 0; i < 3; i++) {
var xhr = new XMLHttpRequest();
xhr.open("GET", "/slowhandler");
xhr.send();
}
Note that xhr.send() will return immediately, since this is an asynchronous call, but that doesn't guarantee that the browser will send the actual "GET" request right away.
GET requests are subject to caching, and if one tries to GET the same URL, caching might (in fact, will) affect how the requests to the server are made. POST requests are not cached, so if one changes "GET" to "POST" in the example above, the Go server code will show that the /slowhandler will be fired concurrently (you will see "1 1 1 [...pause...] 2 2 2" printed).

Golang WaitGroup causing memory leak, what can I do to improve this function

I've been struggling to find a memory leak in our application and have been using the pprof tool to understand what's going on.
When I look at the heap, I constantly see the following function and I don't understand why (or if) it's actually a problem.
func CreateClients(raw []byte) bool {
macs := []string{}
conn := FormatConn(raw)
if conn.Ap_Mac != "" {
var wg sync.WaitGroup
var array []Client
c1 := make(chan Client)
clients := FormatClients(conn)
wg.Add(len(clients))
for _, c := range clients {
go func(d Client) {
defer wg.Done()
c1 <- UpdateClients(d)
}(c)
}
go func() {
defer wg.Done()
for {
select {
case client := <-c1:
array = append(array, client)
macs = append(macs, client.Client_Mac)
}
}
}()
wg.Wait()
// Do some other stuff
...
}
The UpdateClients function updates the client model in Mongo. When it returns, I need each client - so I can index it with ES plus I need an array of the macs to do some other stuff.
I've gone through the online examples and thought this was the recommended way to loop through a channel.
My pprof heap looks like this, and grows steadily over a few days:
7.53MB of 9.53MB total (79.00%)
Dropped 234 nodes (cum <= 0.05MB)
Showing top 5 nodes out of 28 (cum >= 1MB)
flat flat% sum% cum cum%
2MB 21.00% 21.00% 2MB 21.00% strings.Replace
1.51MB 15.89% 36.89% 1.51MB 15.89% github.com/PolkaSpots/worker/worker.func·006
1.51MB 15.87% 52.76% 1.51MB 15.87% github.com/PolkaSpots/worker/worker.func·008
1.50MB 15.75% 68.51% 1.50MB 15.75% newproc_m
1MB 10.50% 79.00% 1MB 10.50% gopkg.in/mgo.v2/bson.(*decoder).readStr
Is there a more efficient / recommended way to achieve this?
EDIT_
As suggested, I've altered the loop as so
done := make(chan bool)
go func() {
for {
select {
case client := <-c1:
array = append(array, client)
macs = append(macs, client.Client_Mac)
case <-done:
return
}
}
}()
wg.Wait()
close(done)
The receive loop never breaks:
for {
select {
case client := <-c1:
...
}
It has no stop condition, no timeout, nothing - so it will just hang there forever - even if your whole function exits. and it will leak both the goroutine and the channel.
On top of that, you're deferring a wg.Done when this loop exits, but you're not doing wg.Add to match it. So if this loop ever exits, you will panic.
What you need to do is find some way to stop the for/select loop. Simplest way IMO - add a second channel that receives data after wg.Wait(), but do not do wg.Done() in that goroutine.

Resources