I need to download files, chunk by chunk, in multiple threads.
For example, I have 1k files, each ~100Mb-1Gb, and I can download these files only in chunks of 4096Kb (each HTTP GET request gives me only 4Kb).
It might take too long to download them in one thread, so I want to download them in, say, 20 threads (one thread per file), and I also need to download a few chunks simultaneously within each of those threads.
Is there any example that shows such logic?
This is an example of how to set up a concurrent downloader. Things to be aware of are bandwidth, memory, and disk space. You can kill your bandwidth by trying to do too much at once, and the same goes for memory. You're downloading pretty big files, so memory can be an issue. Another thing to note is that by using goroutines you are losing request order. So if the order of the returned bytes matters, then this will not work as-is, because you would have to know the byte order to assemble the file at the end. That would mean downloading one chunk at a time is best, unless you implement a way to keep track of the order (maybe some kind of global map[order int][]byte guarded by a mutex to prevent race conditions; a sketch of that idea follows the example below). An alternative that doesn't involve Go (assuming you have a unix machine, for ease) is to use curl, see here: http://osxdaily.com/2014/02/13/download-with-curl/
package main

import (
    "bytes"
    "fmt"
    "io"
    "io/ioutil"
    "log"
    "net/http"
    "sync"
    "time"
)

// Now you're going to have to be careful, because you can potentially
// run out of memory downloading too many files at once...
// However, here is an example that can be modded.
func downloader(wg *sync.WaitGroup, sema chan struct{}, fileNum int, URL string) {
    sema <- struct{}{}
    defer func() {
        <-sema
        wg.Done()
    }()
    // Note: the timeout must be a time.Duration; a bare 10 would mean 10ns.
    client := &http.Client{Timeout: 10 * time.Second}
    res, err := client.Get(URL)
    if err != nil {
        // log.Fatal would kill every other download, so just log and bail.
        log.Println(err)
        return
    }
    defer res.Body.Close()
    var buf bytes.Buffer
    // I'm copying to a buffer before writing it to file.
    // I could also just io.Copy straight into the file
    // and save memory by dumping to the disk directly.
    io.Copy(&buf, res.Body)
    // write the bytes to file
    ioutil.WriteFile(fmt.Sprintf("file%d.txt", fileNum), buf.Bytes(), 0644)
}

func main() {
    links := []string{
        "url1",
        "url2", // etc...
    }
    var wg sync.WaitGroup
    // Limit to four downloads at a time; this is called a semaphore.
    limiter := make(chan struct{}, 4)
    for i, link := range links {
        wg.Add(1)
        go downloader(&wg, limiter, i, link)
    }
    wg.Wait()
}
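To illustrate the ordering idea mentioned above, here is a minimal sketch, assuming one URL per chunk in a known order (the chunkURLs below are hypothetical): each goroutine records its chunk under its index in a mutex-guarded map, and the file is assembled in index order once every download has finished.

package main

import (
    "bytes"
    "io/ioutil"
    "log"
    "net/http"
    "sync"
    "time"
)

// fetchChunk downloads one chunk and stores it under its index.
// The map is shared between goroutines, so a mutex guards every write.
func fetchChunk(client *http.Client, url string, idx int,
    mu *sync.Mutex, chunks map[int][]byte, wg *sync.WaitGroup) {
    defer wg.Done()
    res, err := client.Get(url)
    if err != nil {
        log.Println(err)
        return
    }
    defer res.Body.Close()
    data, err := ioutil.ReadAll(res.Body)
    if err != nil {
        log.Println(err)
        return
    }
    mu.Lock()
    chunks[idx] = data
    mu.Unlock()
}

func main() {
    // Hypothetical: one URL per chunk, already in the right order.
    chunkURLs := []string{"url-part-0", "url-part-1", "url-part-2"}
    client := &http.Client{Timeout: 30 * time.Second}
    chunks := make(map[int][]byte, len(chunkURLs))
    var mu sync.Mutex
    var wg sync.WaitGroup
    for i, u := range chunkURLs {
        wg.Add(1)
        go fetchChunk(client, u, i, &mu, chunks, &wg)
    }
    wg.Wait()
    // All goroutines are done, so it is safe to read the map
    // and reassemble the chunks in index order.
    var buf bytes.Buffer
    for i := range chunkURLs {
        buf.Write(chunks[i])
    }
    if err := ioutil.WriteFile("assembled.bin", buf.Bytes(), 0644); err != nil {
        log.Fatal(err)
    }
}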
You can check the implementation in aws-sdk-go-v2:
https://github.com/aws/aws-sdk-go-v2/blob/main/feature/s3/manager/download.go
- create n concurrent goroutines
- download the chunks in those n goroutines
- use writer.WriteAt() to seek and write to a certain position
A minimal sketch of that approach is below.
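This sketch (the URL and chunk size are assumptions, and a real implementation should check that the server actually supports Range requests and returns 206) downloads byte ranges concurrently and lets each goroutine write its chunk at its own offset with WriteAt:

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "strconv"
    "sync"
)

const chunkSize = 4 << 20 // 4 MB per request; an assumption

func main() {
    url := "https://example.com/big.file" // hypothetical URL
    // HEAD request to learn the total size.
    head, err := http.Head(url)
    if err != nil {
        log.Fatal(err)
    }
    head.Body.Close()
    size, err := strconv.ParseInt(head.Header.Get("Content-Length"), 10, 64)
    if err != nil {
        log.Fatal(err)
    }
    out, err := os.Create("big.file")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    var wg sync.WaitGroup
    sema := make(chan struct{}, 4) // at most 4 chunks in flight
    for off := int64(0); off < size; off += chunkSize {
        wg.Add(1)
        go func(off int64) {
            defer wg.Done()
            sema <- struct{}{}
            defer func() { <-sema }()
            end := off + chunkSize - 1
            if end >= size {
                end = size - 1
            }
            req, err := http.NewRequest("GET", url, nil)
            if err != nil {
                log.Println(err)
                return
            }
            req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, end))
            res, err := http.DefaultClient.Do(req)
            if err != nil {
                log.Println(err)
                return
            }
            defer res.Body.Close()
            buf, err := ioutil.ReadAll(res.Body)
            if err != nil {
                log.Println(err)
                return
            }
            if _, err := out.WriteAt(buf, off); err != nil {
                log.Println(err)
            }
        }(off)
    }
    wg.Wait()
}

*os.File is safe for concurrent WriteAt calls at distinct offsets, which is why no extra locking is needed around the writes.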
I'm a newbie in programming and I need help. I'm trying to write a GitLab scraper in Go.
Something goes wrong when I try to get information about projects in multithreaded mode.
Here is the code:
func (g *Gitlab) getAPIResponce(url string, structure interface{}) error {
    responce, responce_error := http.Get(url)
    if responce_error != nil {
        return responce_error
    }
    // close the body so connections are not leaked
    defer responce.Body.Close()
    ret, _ := ioutil.ReadAll(responce.Body)
    if string(ret) != "[]" {
        err := json.Unmarshal(ret, structure)
        return err
    }
    return errors.New(error_emptypage)
}
...
func (g *Gitlab) GetProjects() {
    projects_chan := make(chan Project, g.LatestProjectID)
    var waitGroup sync.WaitGroup
    queue := make(chan struct{}, 50)
    for i := g.LatestProjectID; i > 0; i-- {
        url := g.BaseURL + projects_url + "/" + strconv.Itoa(i) + g.Token
        waitGroup.Add(1)
        go func(url string, channel chan Project) {
            queue <- struct{}{}
            defer waitGroup.Done()
            var oneProject Project
            err := g.getAPIResponce(url, &oneProject)
            if err != nil {
                fmt.Println(err.Error())
            }
            fmt.Printf(".")
            channel <- oneProject
            <-queue
        }(url, projects_chan)
    }
    go func() {
        waitGroup.Wait()
        close(projects_chan)
    }()
    for project := range projects_chan {
        if project.ID != 0 {
            g.Projects = append(g.Projects, project)
        }
    }
}
And here is the output:
$ ./gitlab-auditor
latest project = 1532
Gathering projects...
.......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Get https://gitlab.example.com/api/v4/projects/563&private_token=SeCrEt_ToKeN: unexpected EOF
Get https://gitlab.example.com/api/v4/projects/558&private_token=SeCrEt_ToKeN: unexpected EOF
..Get https://gitlab.example.com/api/v4/projects/531&private_token=SeCrEt_ToKeN: unexpected EOF
Get https://gitlab.example.com/api/v4/projects/571&private_token=SeCrEt_ToKeN: unexpected EOF
.Get https://gitlab.example.com/api/v4/projects/570&private_token=SeCrEt_ToKeN: unexpected EOF
..Get https://gitlab.example.com/api/v4/projects/467&private_token=SeCrEt_ToKeN: unexpected EOF
Get https://gitlab.example.com/api/v4/projects/573&private_token=SeCrEt_ToKeN: unexpected EOF
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Every time it's different projects, but their IDs are around 550.
When I try to curl the links from the output, I get normal JSON. When I run this code with queue := make(chan struct{}, 1) (i.e. single-threaded), everything is fine.
What can it be?
I would say this is not a very clear way to achieve concurrency. What seems to be happening here is:
- You create a buffered channel with a capacity of 50.
- Then you fire up 1532 goroutines.
- The first 50 of them enqueue themselves and start processing. By the time they do <-queue and free up some space, a random one of the rest manages to get onto the queue.
- As people say in the comments, you almost certainly hit some limit around the time the blast has made it to around ID 550; GitLab's API gets angry at you and rate-limits.
- Then another goroutine is fired that closes the channel to notify the main goroutine.
- The main goroutine reads the messages.
The talk "Go Concurrency Patterns" as well as the blog post "Concurrency in Go" might help.
Personally, I rarely use buffered channels. For your problem I would go like this:
- Define a number of workers.
- Have the main goroutine fire up the workers, each listening on a channel of ints, doing the API call, and writing to a channel of projects.
- Have the main goroutine send the project numbers to be fetched on the channel of ints, and read from the channel of projects.
- Maybe rate-limit by firing a ticker and having main read from it before sending the next request?
- Main closes the ints channel to notify the workers to die.
A rough sketch of that shape is below.
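This is only a sketch of the worker-pool shape described above; the Project type and fetchProject are stand-ins for the question's real API call, and the worker count and ticker interval are arbitrary:

package main

import (
    "fmt"
    "sync"
    "time"
)

type Project struct{ ID int }

// fetchProject stands in for the http.Get + json.Unmarshal call
// in the question.
func fetchProject(id int) Project {
    return Project{ID: id}
}

func main() {
    const workers = 10 // an assumption; tune to the API's limits
    ids := make(chan int)
    results := make(chan Project)

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for id := range ids { // workers exit when ids is closed
                results <- fetchProject(id)
            }
        }()
    }
    // Close results once every worker has finished.
    go func() {
        wg.Wait()
        close(results)
    }()
    // Rate-limit the producer with a ticker, then close ids
    // so the workers know there is no more work.
    go func() {
        tick := time.Tick(50 * time.Millisecond)
        for id := 1532; id > 0; id-- {
            <-tick
            ids <- id
        }
        close(ids)
    }()
    for p := range results {
        if p.ID != 0 {
            fmt.Println("got project", p.ID)
        }
    }
}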
I recently realized that I don't know how to properly Read and Close in Go concurrently. In my particular case, I need to do that with a serial port, but the problem is more generic.
If we do that without any extra effort to synchronize things, it leads to a race condition. Simple example:
package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    f, err := os.Open("/dev/ttyUSB0")
    if err != nil {
        panic(err)
    }
    // Start a goroutine which keeps reading from a serial port
    go reader(f)
    time.Sleep(1000 * time.Millisecond)
    fmt.Println("closing")
    f.Close()
    time.Sleep(1000 * time.Millisecond)
}

func reader(f *os.File) {
    b := make([]byte, 100)
    for {
        f.Read(b)
    }
}
If we save the above as main.go, and run go run --race main.go, the output will look as follows:
closing
==================
WARNING: DATA RACE
Write at 0x00c4200143c0 by main goroutine:
  os.(*file).close()
      /usr/local/go/src/os/file_unix.go:143 +0x124
  os.(*File).Close()
      /usr/local/go/src/os/file_unix.go:132 +0x55
  main.main()
      /home/dimon/mydata/projects/go/src/dmitryfrank.com/testfiles/main.go:20 +0x13f

Previous read at 0x00c4200143c0 by goroutine 6:
  os.(*File).read()
      /usr/local/go/src/os/file_unix.go:228 +0x50
  os.(*File).Read()
      /usr/local/go/src/os/file.go:101 +0x6f
  main.reader()
      /home/dimon/mydata/projects/go/src/dmitryfrank.com/testfiles/main.go:27 +0x8b

Goroutine 6 (running) created at:
  main.main()
      /home/dimon/mydata/projects/go/src/dmitryfrank.com/testfiles/main.go:16 +0x81
==================
Found 1 data race(s)
exit status 66
Ok, but how to handle that properly? Of course, we can't just lock some mutex before calling f.Read(), because the mutex would end up locked basically all the time. To make it work properly, we'd need some sort of cooperation between reading and locking, like condition variables provide: the mutex gets unlocked before putting the goroutine to wait, and is locked back when the goroutine wakes up.
I would implement something like this manually, but then I need some way to select things while reading. Like this: (pseudocode)
select {
case b := <-f.NextByte():
    // process the byte somehow
default:
}
I examined docs of the packages os and sync, and so far I don't see any way to do that.
I believe you need two signals:
- main -> reader, to tell it to stop reading
- reader -> main, to tell it that the reader has terminated
Of course, you can pick whichever Go signaling primitive (channel, WaitGroup, context, etc.) you prefer.
In the example below I use a WaitGroup and a context. The reason is that you can spin up multiple readers and only need to cancel the context to tell all the reader goroutines to stop. I created multiple goroutines just as an example to show that you can even coordinate multiple goroutines this way.
package main

import (
    "context"
    "fmt"
    "os"
    "sync"
    "time"
)

func main() {
    ctx, cancelFn := context.WithCancel(context.Background())
    f, err := os.Open("/dev/ttyUSB0")
    if err != nil {
        panic(err)
    }
    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        wg.Add(1)
        // Start a goroutine which keeps reading from a serial port
        go func(i int) {
            defer wg.Done()
            reader(ctx, f)
            fmt.Printf("reader %d closed\n", i)
        }(i)
    }
    time.Sleep(1000 * time.Millisecond)
    fmt.Println("closing")
    cancelFn() // signal all readers to stop
    wg.Wait()  // wait until all readers have finished
    f.Close()
    fmt.Println("file closed")
    time.Sleep(1000 * time.Millisecond)
}

func reader(ctx context.Context, f *os.File) {
    b := make([]byte, 100)
    for {
        select {
        case <-ctx.Done():
            return
        default:
            f.Read(b)
        }
    }
}
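As an addendum: the select-able f.NextByte() from the question's pseudocode can be approximated by a goroutine that pumps reads into a channel. A sketch (byteStream is a hypothetical helper; the blocking Read is merely pushed into a goroutine that exits once the read fails, e.g. after Close, so this is a workaround rather than true cancellation):

package main

import (
    "fmt"
    "os"
    "time"
)

// byteStream wraps f.Read in a goroutine and exposes the bytes on a
// channel, so callers can select on them. The Read itself still blocks;
// the goroutine simply exits when Read returns an error.
func byteStream(f *os.File) <-chan byte {
    ch := make(chan byte)
    go func() {
        defer close(ch)
        buf := make([]byte, 100)
        for {
            n, err := f.Read(buf)
            if err != nil {
                return // e.g. EOF, or the file was closed
            }
            for _, b := range buf[:n] {
                ch <- b
            }
        }
    }()
    return ch
}

func main() {
    f, err := os.Open("/dev/ttyUSB0")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    stream := byteStream(f)
    for {
        select {
        case b, ok := <-stream:
            if !ok {
                return // reader stopped
            }
            fmt.Printf("%02x ", b) // process the byte somehow
        case <-time.After(5 * time.Second):
            fmt.Println("no data for 5s, giving up")
            return
        }
    }
}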
I've been struggling to find a memory leak in our application and have been using the pprof tool to understand what's going on.
When I look at the heap, I constantly see the following function and I don't understand why (or if) it's actually a problem.
func CreateClients(raw []byte) bool {
    macs := []string{}
    conn := FormatConn(raw)
    if conn.Ap_Mac != "" {
        var wg sync.WaitGroup
        var array []Client
        c1 := make(chan Client)
        clients := FormatClients(conn)
        wg.Add(len(clients))
        for _, c := range clients {
            go func(d Client) {
                defer wg.Done()
                c1 <- UpdateClients(d)
            }(c)
        }
        go func() {
            defer wg.Done()
            for {
                select {
                case client := <-c1:
                    array = append(array, client)
                    macs = append(macs, client.Client_Mac)
                }
            }
        }()
        wg.Wait()
        // Do some other stuff
        ...
}
The UpdateClients function updates the client model in Mongo. When it returns, I need each client so I can index it with ES, plus I need an array of the macs to do some other stuff.
I've gone through the online examples and thought this was the recommended way to loop over a channel.
My pprof heap looks like this, and grows steadily over a few days:
7.53MB of 9.53MB total (79.00%)
Dropped 234 nodes (cum <= 0.05MB)
Showing top 5 nodes out of 28 (cum >= 1MB)
flat flat% sum% cum cum%
2MB 21.00% 21.00% 2MB 21.00% strings.Replace
1.51MB 15.89% 36.89% 1.51MB 15.89% github.com/PolkaSpots/worker/worker.func·006
1.51MB 15.87% 52.76% 1.51MB 15.87% github.com/PolkaSpots/worker/worker.func·008
1.50MB 15.75% 68.51% 1.50MB 15.75% newproc_m
1MB 10.50% 79.00% 1MB 10.50% gopkg.in/mgo.v2/bson.(*decoder).readStr
Is there a more efficient / recommended way to achieve this?
EDIT:
As suggested, I've altered the loop like so:
done := make(chan bool)
go func() {
    for {
        select {
        case client := <-c1:
            array = append(array, client)
            macs = append(macs, client.Client_Mac)
        case <-done:
            return
        }
    }
}()
wg.Wait()
close(done)
The receive loop never breaks:
for {
    select {
    case client := <-c1:
        ...
    }
}
It has no stop condition, no timeout, nothing, so it will just hang there forever, even after your whole function exits, and it will leak both the goroutine and the channel.
On top of that, you're deferring a wg.Done() when this loop exits, but you're not doing a wg.Add() to match it. So if this loop ever exits, you will panic.
What you need to do is find some way to stop the for/select loop. The simplest way, IMO: add a second channel that receives data after wg.Wait(), and do not call wg.Done() in that goroutine. Another option, sketched below, is to close c1 itself once all senders are done and simply range over it.
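For completeness, a sketch of that alternative under the question's names (Client and UpdateClients are stubbed here so the example runs standalone):

package main

import (
    "fmt"
    "sync"
)

// Stubs standing in for the question's types and Mongo call.
type Client struct{ Client_Mac string }

func UpdateClients(c Client) Client { return c }

func main() {
    clients := []Client{{"aa:bb"}, {"cc:dd"}}
    var array []Client
    var macs []string

    c1 := make(chan Client)
    var wg sync.WaitGroup
    wg.Add(len(clients)) // count the senders only
    for _, c := range clients {
        go func(d Client) {
            defer wg.Done()
            c1 <- UpdateClients(d)
        }(c)
    }
    // Close c1 only after every sender has finished, so the
    // range loop below terminates instead of leaking.
    go func() {
        wg.Wait()
        close(c1)
    }()
    for client := range c1 {
        array = append(array, client)
        macs = append(macs, client.Client_Mac)
    }
    fmt.Println(len(array), macs)
}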
I'm trying to develop a program which runs continuously.
It should pull some data from a database every sleepPool seconds and 'process' the information in a non-blocking way (at least, that's what I'm trying to do).
The issue is that the memory keeps growing, so I'm wondering if I'm doing something wrong.
Below is a snippet from my program.
var uCh = make(chan *user, buffLimit)     //emits new users to process
var statsCh = make(chan *user, buffLimit) //emits new users to store

func main() {
    go emitUser(db)
    go consumeUser(db)
    for ur := range statsCh {
        log.Infoln(ur)
    }
}

func emitUser(db *sql.DB) {
    for {
        time.Sleep(sleepPool * time.Second)
        log.Infoln("looking for new users")
        rows, err := rowsD.Query()
        for rows.Next() {
            // (scanning the row into usr is elided in this snippet)
            uCh <- usr
        }
    }
}

func consumeUser(db *sql.DB) {
    for usr := range uCh {
        go func(usr *user) {
            //do something with the user
            statsCh <- usr
        }(usr)
    }
}
I've read that I may need to close the channels so that the GC can recycle the memory, but I'm not sure how to do that (the program should run continuously), or whether I really need to, because the data is always read (guaranteed by the range in main), so I assume the memory is recycled.
You didn't give enough time for the GC to kick in; wait for an hour, then check the memory.
If you really, really want to force it to free the memory (a bad idea that is going to slow your program down), you can use something like:
import (
    "runtime/debug"
    "time"
)

//........

func forceFree() {
    for range time.Tick(30 * time.Second) {
        debug.FreeOSMemory()
    }
}

func init() {
    go forceFree()
}
This code does a lot of harmless passing-around of things, but it should not leak anything.
My guess is that the rowsD.Query method has some kind of cache, or some other leak.
Of course, it could also just be fragmentation, or an artefact of the garbage collector (in which case you should see memory use level off, or even drop, over time).
I am currently studying Go, and I miss setTimeout from Node.js. I haven't read much yet, and I'm wondering if I could implement the same thing in Go, like an interval or a loopback.
Is there a way I can write this Node code in Go? I heard Go handles concurrency very well, and this might take some goroutines or something else?
//Nodejs
function main() {
    //Do something
    setTimeout(main, 3000)
    console.log('Server is listening to 1337')
}
Thank you in advance!
//Go version
func main() {
    for t := range time.Tick(3 * time.Second) {
        fmt.Printf("working %s \n", t)
    }
    //basically this will not execute..
    fmt.Printf("will be called 1st")
}
The closest equivalent is the time.AfterFunc function:
import "time"
...
time.AfterFunc(3*time.Second, somefunction)
This will spawn a new goroutine and run the given function after the specified amount of time. There are other related functions in the package that may be of use:
- time.After: this version returns a channel that will send a value after the given amount of time. This can be useful in combination with the select statement if you want a timeout while waiting on one or more channels.
- time.Sleep: this version simply blocks until the timer expires. In Go it is more common to write synchronous code and rely on the scheduler to switch to other goroutines, so sometimes simply blocking is the best solution.
There are also the time.Timer and time.Ticker types, which can be used for less trivial cases where you may need to cancel the timer; a minimal sketch of that follows.
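A minimal sketch of cancelling a pending timer (the 3-second duration and the early-stop condition are just for illustration):

package main

import (
    "fmt"
    "time"
)

func main() {
    t := time.NewTimer(3 * time.Second)
    // Suppose something makes the timed work unnecessary before it fires:
    if t.Stop() {
        fmt.Println("timer cancelled before firing")
        return
    }
    // Stop reported false: the timer already fired, so drain the channel.
    <-t.C
    fmt.Println("timer fired")
}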
This website provides an interesting example and explanation of timeouts involving channels and the select statement.
// _Timeouts_ are important for programs that connect to
// external resources or that otherwise need to bound
// execution time. Implementing timeouts in Go is easy and
// elegant thanks to channels and `select`.
package main

import "time"
import "fmt"

func main() {

    // For our example, suppose we're executing an external
    // call that returns its result on a channel `c1`
    // after 2s.
    c1 := make(chan string, 1)
    go func() {
        time.Sleep(2 * time.Second)
        c1 <- "result 1"
    }()

    // Here's the `select` implementing a timeout.
    // `res := <-c1` awaits the result and `<-time.After`
    // awaits a value to be sent after the timeout of
    // 1s. Since `select` proceeds with the first
    // receive that's ready, we'll take the timeout case
    // if the operation takes more than the allowed 1s.
    select {
    case res := <-c1:
        fmt.Println(res)
    case <-time.After(1 * time.Second):
        fmt.Println("timeout 1")
    }

    // If we allow a longer timeout of 3s, then the receive
    // from `c2` will succeed and we'll print the result.
    c2 := make(chan string, 1)
    go func() {
        time.Sleep(2 * time.Second)
        c2 <- "result 2"
    }()
    select {
    case res := <-c2:
        fmt.Println(res)
    case <-time.After(3 * time.Second):
        fmt.Println("timeout 2")
    }
}
You can also run it on the Go Playground
Another solution could be to use an Immediately-Invoked Function Expression (IIFE) in a goroutine, like:
go func() {
    time.Sleep(time.Second * 3)
    // your code here
}()
You can do this by using the Sleep function, giving it whatever duration you need:
package main

import (
    "fmt"
    "time"
)

func main() {
    fmt.Println("First")
    time.Sleep(5 * time.Second)
    fmt.Println("second")
}