Go memory leak when doing concurrent os/exec.Command.Wait() - linux

I am running into a situation where a go program is taking up 15gig of virtual memory and continues to grow. The problem only happens on our CentOS server. On my OSX devel machine, I can't reproduce it.
Have I discovered a bug in go, or am I doing something incorrectly?
I have boiled the problem down to a simple demo, which I'll describe now. First build and run this go server:
package main
import (
"net/http"
"os/exec"
)
func main() {
http.HandleFunc("/startapp", startAppHandler)
http.ListenAndServe(":8081", nil)
}
func startCmd() {
cmd := exec.Command("/tmp/sleepscript.sh")
cmd.Start()
cmd.Wait()
}
func startAppHandler(w http.ResponseWriter, r *http.Request) {
startCmd()
w.Write([]byte("Done"))
}
Make a file named /tmp/sleepscript.sh and chmod it to 755
#!/bin/bash
sleep 5
And then make several concurrent requests to /startapp. In a bash shell, you can do it this way:
for i in {1..300}; do (curl http://localhost:8081/startapp &); done
The VIRT memory should now be several gigabytes. If you re-run the above for loop, the VIRT memory will continue to grow by gigabytes every time.
Update 1: The problem is that I am hitting OOM issues on CentOS. (thanks #nos)
Update 2: Worked around the problem by using daemonize and syncing the calls to Cmd.Run(). Thanks #JimB for confirming that .Wait() running in it's own thread is part of the POSIX api and there isn't a way to avoid calling .Wait() without leaking resources.

Each request you make requires Go to spawn a new OS thread to Wait on the child process. Each thread will consume a 2MB stack, and a much larger chunk of VIRT memory (that's less relevant, since it's virtual, but you may still be hitting a ulimit setting). Threads are reused by the Go runtime, but they are currently never destroyed, since most programs that use a large number of threads will do so again.
If you make 300 simultaneous requests, and wait for them to complete before making any others, memory should stabilize. However if you continue to send more requests before the others have completed, you will exhaust some system resource: either memory, file descriptors, or threads.
The key point is that spawning a child process and calling wait isn't free, and if this were a real-world use case you need to limit the number of times startCmd() can be called concurrently.

Related

Will Go's scheduler yield control from one goroutine to another for CPU-intensive work?

The accepted answer at golang methods that will yield goroutines explains that Go's scheduler will yield control from one goroutine to another when a syscall is encountered. I understand that this means if you have multiple goroutines running, and one begins to wait for something like an HTTP response, the scheduler can use this as a hint to yield control from that goroutine to another.
But what about situations where there are no syscalls involved? What if, for example, you had as many goroutines running as logical CPU cores/threads available, and each were in the middle of a CPU-intensive calculation that involved no syscalls. In theory, this would saturate the CPU's ability to do work. Would the Go scheduler still be able to detect an opportunity to yield control from one of these goroutines to another, that perhaps wouldn't take as long to run, and then return control back to one of these goroutines performing the long CPU-intensive calculation?
There are few if any promises here.
The Go 1.14 release notes says this in the Runtime section:
Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is supported on all platforms except windows/arm, darwin/arm, js/wasm, and plan9/*.
A consequence of the implementation of preemption is that on Unix systems, including Linux and macOS systems, programs built with Go 1.14 will receive more signals than programs built with earlier releases. This means that programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors. ...
I quoted part of the third paragraph here because this gives us a big clue as to how this asynchronous preemption works: the runtime system has the OS deliver some OS signal (SIGALRM, SIGVTALRM, etc.) on some sort of schedule (real or virtual time). This allows the Go runtime to implement the same kind of schedulers that real OSes implement with real (hardware) or virtual (virtualized hardware) timers. As with OS schedulers, it's up to the runtime to decide what to do with the clock ticks: perhaps just run the GC code, for instance.
We also see a list of platforms that don't do it. So we probably should not assume it will happen at all.
Fortunately, the runtime source is actually available: we can go look to see what does happen, should any given platform implement it. This shows that in runtime/signal_unix.go:
// We use SIGURG because it meets all of these criteria, is extremely
// unlikely to be used by an application for its "real" meaning (both
// because out-of-band data is basically unused and because SIGURG
// doesn't report which socket has the condition, making it pretty
// useless), and even if it is, the application has to be ready for
// spurious SIGURG. SIGIO wouldn't be a bad choice either, but is more
// likely to be used for real.
const sigPreempt = _SIGURG
and:
// doSigPreempt handles a preemption signal on gp.
func doSigPreempt(gp *g, ctxt *sigctxt) {
// Check if this G wants to be preempted and is safe to
// preempt.
if wantAsyncPreempt(gp) && isAsyncSafePoint(gp, ctxt.sigpc(), ctxt.sigsp(), ctxt.siglr()) {
// Inject a call to asyncPreempt.
ctxt.pushCall(funcPC(asyncPreempt))
}
// Acknowledge the preemption.
atomic.Xadd(&gp.m.preemptGen, 1)
atomic.Store(&gp.m.signalPending, 0)
}
The actual asyncPreempt function is in assembly, but it just does some assembly-only trickery to save user registers, and then calls asyncPreempt2 which is in runtime/preempt.go:
//go:nosplit
func asyncPreempt2() {
gp := getg()
gp.asyncSafePoint = true
if gp.preemptStop {
mcall(preemptPark)
} else {
mcall(gopreempt_m)
}
gp.asyncSafePoint = false
}
Compare this to runtime/proc.go's Gosched function (documented as the way to voluntarily yield):
//go:nosplit
// Gosched yields the processor, allowing other goroutines to run. It does not
// suspend the current goroutine, so execution resumes automatically.
func Gosched() {
checkTimeouts()
mcall(gosched_m)
}
We see the main differences include some "async safe point" stuff and that we arrange for an M-stack-call to gopreempt_m instead of gosched_m. So, apart from the safety check stuff and a different trace call (not shown here) the involuntary preemption is almost exactly the same as voluntary preemption.
To find this, we had to dig rather deep into the (Go 1.14, in this case) implementation. One might not want to depend too much on this.
A little bit more on this to complete #torek's answer.
Goroutines are interruptible when there is a syscall, but also when a routine is waiting on a lock, a chan or sleeping.
As #torek's said, since 1.14 routines can also be preempted even when they do none of the above. The scheduler can mark any routine as preemptible after it ran for more than 10ms.
More reading there: https://medium.com/a-journey-with-go/go-goroutine-and-preemption-d6bc2aa2f4b7

Golang daemon with goroutines won't stop executing

I've created a daemon that has the goal to consume queues in parallel. To test if it keeps executing in the background I've implemented a function that creates a file every 10 seconds until it reaches X, where X is the highest number of processes that I configured for the queues. The parameters for the queues are defined on the config.yaml file.
The problem now is that even though I stop and remove the daemon, it seems that the program keeps running and creating files... I've tried building and running the program again, exiting it, ending the processes, deleting the files, but nothing seems to work, files keep being created on the program directory.
You can check the program code here, and the config file here.
Do you have any idea how I can solve this problem?
Thanks in advance!
This code will never exit until it does processing for len(queues) times. It is not a concurrent code - all in main body - and there is no signal to tell the code to stop. The problem is here:
case "run":
// Installing the service
installed, err := service.Install()
logError(err, installed)
// Starting the service
started, err := service.Start()
logError(err, started)
if err == nil {
// Creating a goroutine and executing the queue's processes in parallel
for i := 0; i < len(queues); i++ {
go startProcessing(queues[i])
time.Sleep(time.Second) // Waiting for other functions to execute
}
select {} // To prevent the goroutine from exiting the main func
}
fmt.Println(started)
As it can be seen the select{} line will sit there and run for ever! :) It is better to move this case clauses, into their's own goroutines and have a quit signal there like this:
select {
case <-quit:
return
}
Though this is not the cleanest way to handle start/stop in Go apps; it just shows the problem.
When asking questions like this you should consider putting together a MCVE.
That way, since the problem size is a lot smaller, you might figure out the problem on your own.
If not, at least the people here will have an easier time helping you.

Goroutines are cooperatively scheduled. Does that mean that goroutines that don't yield execution will cause goroutines to run one by one?

From: http://blog.nindalf.com/how-goroutines-work/
As the goroutines are scheduled cooperatively, a goroutine that loops continuously can starve other goroutines on the same thread.
Goroutines are cheap and do not cause the thread on which they are multiplexed to block if they are blocked on
network input
sleeping
channel operations or
blocking on primitives in the sync package.
So given the above, say that you have some code like this that does nothing but loop a random number of times and print the sum:
func sum(x int) {
sum := 0
for i := 0; i < x; i++ {
sum += i
}
fmt.Println(sum)
}
if you use goroutines like
go sum(100)
go sum(200)
go sum(300)
go sum(400)
will the goroutines run one by one if you only have one thread?
A compilation and tidying of all of creker's comments.
Preemptive means that kernel (runtime) allows threads to run for a specific amount of time and then yields execution to other threads without them doing or knowing anything. In OS kernels that's usually implemented using hardware interrupts. Process can't block entire OS. In cooperative multitasking thread have to explicitly yield execution to others. If it doesn't it could block whole process or even whole machine. That's how Go does it. It has some very specific points where goroutine can yield execution. But if goroutine just executes for {} then it will lock entire process.
However, the quote doesn't mention recent changes in the runtime. fmt.Println(sum) could cause other goroutines to be scheduled as newer runtimes will call scheduler on function calls.
If you don't have any function calls, just some math, then yes, goroutine will lock the thread until it exits or hits something that could yield execution to others. That's why for {} doesn't work in Go. Even worse, it will still lead to process hanging even if GOMAXPROCS > 1 because of how GC works, but in any case you shouldn't depend on that. It's good to understand that stuff but don't count on it. There is even a proposal to insert scheduler calls in loops like yours
The main thing that Go's runtime does is it gives its best to allow everyone to execute and don't starve anyone. How it does that is not specified in the language specification and might change in the future. If the proposal about loops will be implemented then even without function calls switching could occur. At the moment the only thing you should remember is that in some circumstances function calls could cause goroutine to yield execution.
To explain the switching in Akavall's answer, when fmt.Printf is called, the first thing it does is checks whether it needs to grow the stack and calls the scheduler. It MIGHT switch to another goroutine. Whether it will switch depends on the state of other goroutines and exact implementation of the scheduler. Like any scheduler, it probably checks whether there're starving goroutines that should be executed instead. With many iterations function call has greater chance to make a switch because others are starving longer. With few iterations goroutine finishes before starvation happens.
For what its worth it. I can produce a simple example where it is clear that the goroutines are not ran one by one:
package main
import (
"fmt"
"runtime"
)
func sum_up(name string, count_to int, print_every int, done chan bool) {
my_sum := 0
for i := 0; i < count_to; i++ {
if i % print_every == 0 {
fmt.Printf("%s working on: %d\n", name, i)
}
my_sum += 1
}
fmt.Printf("%s: %d\n", name, my_sum)
done <- true
}
func main() {
runtime.GOMAXPROCS(1)
done := make(chan bool)
const COUNT_TO = 10000000
const PRINT_EVERY = 1000000
go sum_up("Amy", COUNT_TO, PRINT_EVERY, done)
go sum_up("Brian", COUNT_TO, PRINT_EVERY, done)
<- done
<- done
}
Result:
....
Amy working on: 7000000
Brian working on: 8000000
Amy working on: 8000000
Amy working on: 9000000
Brian working on: 9000000
Brian: 10000000
Amy: 10000000
Also if I add a function that just does a forever loop, that will block the entire process.
func dumb() {
for {
}
}
This blocks at some random point:
go dumb()
go sum_up("Amy", COUNT_TO, PRINT_EVERY, done)
go sum_up("Brian", COUNT_TO, PRINT_EVERY, done)
Well, let's say runtime.GOMAXPROCS is 1. The goroutines run concurrently one at a time. Go's scheduler just gives the upper hand to one of the spawned goroutines for a certain time, then to another, etc until all are finished.
So, you never know which goroutine is running at a given time, that's why you need to synchronize your variables. From your example, it's unlikely that sum(100) will run fully, then sum(200) will run fully, etc
The most probable is that one goroutine will do some iterations, then another will do some, then another again etc.
So, the overall is that they are not sequential, even if there is only one goroutine active at a time (GOMAXPROCS=1).
So, what's the advantage of using goroutines ? Plenty. It means that you can just do an operation in a goroutine because it is not crucial and continue the main program. Imagine an HTTP webserver. Treating each request in a goroutine is convenient because you do not have to care about queueing them and run them sequentially: you let Go's scheduler do the job.
Plus, sometimes goroutines are inactive, because you called time.Sleep, or they are waiting for an event, like receiving something for a channel. Go can see this and just executes other goroutines while some are in those idle states.
I know there are a handful of advantages I didn't present, but I don't know concurrency that much to tell you about them.
EDIT:
Related to your example code, if you add each iteration at the end of a channel, run that on one processor and print the content of the channel, you'll see that there is no context switching between goroutines: Each one runs sequentially after another one is done.
However, it is not a general rule and is not specified in the language. So, you should not rely on these results for drawing general conclusions.
#Akavall Try adding sleep after creating dumb goroutine, goruntime never executes sum_up goroutines.
From that it looks like go runtime spawns next go routines immediately, it might execute sum_up goroutine until go runtime schedules dumb() goroutine to run. Once dumb() is scheduled to run then go runtime won't schedule sum_up goroutines to run, as dumb runs for{}

Do we need a sleep() while running a forever process in Linux?

I have read that a forever process like daemon should run with a sleep() in their while(1) or for(;;) loop. They say, it is required because otherwise this process will always be in a run queue and the kernel will always run it. This will block the other process. I don't agree that it will block the other process completely. If there is a time slicing, then it will execute other process. But, certainly it will steal a time from others. Making a delay for other process since this process is always in the run state. By default, the Linux runs as a round-robin. The first task is swapd, then other tasks . This is a circular link list with first task as swapd(process-id is 0) and then other tasks. I believe this is still based as time sliced. A particular time for each process. These tasks are nothing but the process-descriptor. I believe this link list is maintained by the init process. Please do correct me here If I am wrong. Other question is if we need to give a sleep() then what should be its value? How can we determine the sleep value to get the best results?
If your program has useful things to do, don't throttle it. A program can move out of the run queue by doing blocking stuff like IO and waiting.
If you are writing a polling loop that can spin an arbitrary number of times you probably want to throttle it a bit with sleep because spinning too often has little value.
That said, polling loops are a means of last resort. Normally, programs perform useful work with every instruction, so they don't sleep at all.
Sleep is almost certainly the wrong solution.
Usually what you do it call a blocking function which wakes you up when there's something for you to do.
For example, if you're a network service you'd want to remain inactive until a request arrives.
In other words, the core of your daemon should not look like this:
while(1)
{
if (checkIfSomethingToDo())
doSomething();
else
sleep(1);
}
but rather a little like this:
while(1)
{
int ret = poll(fds, nfds, -1);
if (ret > 0)
doSomething();
}
Have the kernel put you to sleep until there's actual work to do. It's not hard to implement, you'd be a lot more efficient (not stealing CPU time from others, only to waste it doing no actual work) and your response latency will go down too.
A sleep forces the os to pass execution to another thread and therefore is helpfull, or at least fair. Start with sleep one. Should be ok.

640 enterprise library caching threads - how?

We have an application that is undergoing performance testing. Today, I decided to take a dump of w3wp & load it in windbg to see what is going on underneath the covers. Imagine my surprise when I ran !threads and saw that there are 640 background threads, almost all of which seem to say the following:
OS Thread Id: 0x1c38 (651)
Child-SP RetAddr Call Site
0000000023a9d290 000007ff002320e2 Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.WaitUntilInterrupted()
0000000023a9d2d0 000007ff00231f7e Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.Dequeue()
0000000023a9d330 000007fef727c978 Microsoft.Practices.EnterpriseLibrary.Caching.BackgroundScheduler.QueueReader()
0000000023a9d380 000007fef9001552 System.Threading.ExecutionContext.runTryCode(System.Object)
0000000023a9dc30 000007fef72f95fd System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0000000023a9dc80 000007fef9001552 System.Threading.ThreadHelper.ThreadStart()
If i had to give a guess, I'm thinkign that one of these threads are getting spawned for each run of our app - we have 2 app servers, 20 concurrent users, and ran the test approximately 30 times...it's in the neighborhood.
Is this 'expected behavior', or perhaps have we implemented something improperly? The test ran hours ago, so i would have expected any timeouts to have occurred already.
Edit: Thank you all for your replies. It has been requested that more detail be shown about the callstack - here is the output of !mk from sosex.dll.
ESP RetAddr
00:U 0000000023a9cb38 00000000775f72ca ntdll!ZwWaitForMultipleObjects+0xa
01:U 0000000023a9cb40 00000000773cbc03 kernel32!WaitForMultipleObjectsEx+0x10b
02:U 0000000023a9cc50 000007fef8f5f595 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1
03:U 0000000023a9ccf0 000007fef8f59f49 mscorwks!Thread::DoAppropriateAptStateWait+0x41
04:U 0000000023a9cd50 000007fef8e55b99 mscorwks!Thread::DoAppropriateWaitWorker+0x191
05:U 0000000023a9ce50 000007fef8e2efe8 mscorwks!Thread::DoAppropriateWait+0x5c
06:U 0000000023a9cec0 000007fef8f0dc7a mscorwks!CLREvent::WaitEx+0xbe
07:U 0000000023a9cf70 000007fef8fba72e mscorwks!Thread::Block+0x1e
08:U 0000000023a9cfa0 000007fef8e1996d mscorwks!SyncBlock::Wait+0x195
09:U 0000000023a9d0c0 000007fef9463d3f mscorwks!ObjectNative::WaitTimeout+0x12f
0a:M 0000000023a9d290 000007ff002321b3 *** ERROR: Module load completed but symbols could not be loaded for Microsoft.Practices.EnterpriseLibrary.Caching.DLL
Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.WaitUntilInterrupted()(+0x0 IL)(+0x11 Native)
0b:M 0000000023a9d2d0 000007ff002320e2 Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.Dequeue()(+0xf IL)(+0x18 Native)
0c:M 0000000023a9d330 000007ff00231f7e Microsoft.Practices.EnterpriseLibrary.Caching.BackgroundScheduler.QueueReader()(+0x9 IL)(+0x12 Native)
0d:M 0000000023a9d380 000007fef727c978 System.Threading.ExecutionContext.runTryCode(System.Object)(+0x18 IL)(+0x106 Native)
0e:U 0000000023a9d440 000007fef9001552 mscorwks!CallDescrWorker+0x82
0f:U 0000000023a9d490 000007fef8e9e5e3 mscorwks!CallDescrWorkerWithHandler+0xd3
10:U 0000000023a9d530 000007fef8eac83f mscorwks!MethodDesc::CallDescr+0x24f
11:U 0000000023a9d790 000007fef8f0cbd2 mscorwks!ExecuteCodeWithGuaranteedCleanupHelper+0x12a
12:U 0000000023a9da20 000007fef945e572 mscorwks!ReflectionInvocation::ExecuteCodeWithGuaranteedCleanup+0x172
13:M 0000000023a9dc30 000007fef7261722 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)(+0x60 IL)(+0x51 Native)
14:M 0000000023a9dc80 000007fef72f95fd System.Threading.ThreadHelper.ThreadStart()(+0x8 IL)(+0x2a Native)
15:U 0000000023a9dcd0 000007fef9001552 mscorwks!CallDescrWorker+0x82
16:U 0000000023a9dd20 000007fef8e9e5e3 mscorwks!CallDescrWorkerWithHandler+0xd3
17:U 0000000023a9ddc0 000007fef8eac83f mscorwks!MethodDesc::CallDescr+0x24f
18:U 0000000023a9e010 000007fef8f9ae8d mscorwks!ThreadNative::KickOffThread_Worker+0x191
19:U 0000000023a9e330 000007fef8f59374 mscorwks!TypeHandle::GetParent+0x5c
1a:U 0000000023a9e380 000007fef8e52045 mscorwks!SVR::gc_heap::make_heap_segment+0x155
1b:U 0000000023a9e450 000007fef8f66139 mscorwks!ZapStubPrecode::GetType+0x39
1c:U 0000000023a9e490 000007fef8e1c985 mscorwks!ILCodeStream::GetToken+0x25
1d:U 0000000023a9e4c0 000007fef8f594e1 mscorwks!Thread::DoADCallBack+0x145
1e:U 0000000023a9e630 000007fef8f59399 mscorwks!TypeHandle::GetParent+0x81
1f:U 0000000023a9e680 000007fef8e52045 mscorwks!SVR::gc_heap::make_heap_segment+0x155
20:U 0000000023a9e750 000007fef8f66139 mscorwks!ZapStubPrecode::GetType+0x39
21:U 0000000023a9e790 000007fef8e20e15 mscorwks!ThreadNative::KickOffThread+0x401
22:U 0000000023a9e7f0 000007fef8e20ae7 mscorwks!ThreadNative::KickOffThread+0xd3
23:U 0000000023a9e8d0 000007fef8f814fc mscorwks!Thread::intermediateThreadProc+0x78
24:U 0000000023a9f7a0 00000000773cbe3d kernel32!BaseThreadInitThunk+0xd
25:U 0000000023a9f7d0 00000000775d6a51 ntdll!RtlUserThreadStart+0x1d
Yes, the caching block has some - issues - with regard to the scavenger threads in older versions of Entlib, particularly if things are coming in faster than the scavenging settings let them come out.
This was completely rewritten in Entlib 5, so that now you'll never have more than two threads sitting in the caching block, regardless of the load, and usually it'll only be one.
Unfortunately there's no easy tweak to change the behavior in earlier versions. The best you can do is change the cache settings so that each scavenge will clean out more items at a time so not as many scavenge requests need to get scheduled.
640 threads is very bad for performance. If they are all waiting for something, then I'd say it's a fair bet that you have a deadlock and they will never exit. If they are all running (not waiting)... well, with 600+ threads on a 2 or 4 core processor none of them will get enough time slices to run very far! ;>
If your app is set up with a main thread that waits on the thread handles to find out when the threads exit, and the background threads get caught up in a loop or in a wait state and never exit the thread proc, then the process and all of its threads will never exit.
Check your thread code to make sure that every threadproc has a clear path to exit the threadproc. It's bad form to write an infinite loop in a background thread on the assumption that the thread will be forcibly terminated when the process shuts down.
If the background thread code spins in a loop waiting for an event handle to signal, make sure that you have some way to signal that event so that the thread can perform a normal orderly exit. Otherwise, you need to write the background thread to wait on multiple events and unblock when any one of the events signals. One of those events can be the activity that the background thread is primarily interested in and the other can be a shutdown event.
From the names of things in the stack dump you posted, it would appear that the thread is waiting for something to appear in the ProducerConsumerQueue. Investigate how that queue object is supposed to be shut down, probably on the producer side, and whether shutting down the queue will automatically release all consumers that are waiting on that queue.
My guess is that either the queue is not being shut down correctly or shutting it down does not implicitly release the consumers that are waiting on it. If the latter case, you may need to pump a terminate message through the queue to wake up all the consumers waiting on that queue and tell them to break out of their wait loop and exit.
You have an major issue. Every Thread occupies 1MB of stack and there is significant cost paid for Context Switching every thread in and out. Especially it becomes worst with managed code because every time GC has to run , it would have walk the threads stack to look for roots and when these threads are paged to the disk the cost to read from the disk is expensive,which adds up Perf issue.
Creating threads are Bad unless you know what you are doing? Jeffery Richter has written in detail about this.
To solve the above issue I would look what these threads are blocked on and also put a break-point on Thread Create (example sxe ct within windbg)
And later rearchitect from avoid creating threads , instead use the thread pool.
It would have been nice to some callstacks of these threads.
In Microsoft Enterprise Library 4.1, the BackgroundScheduler class creates a new thread each time an object is instantiated. It will be fixed in version 5.0. I do not know enough of this Microsoft Library to advise you how to avoid that behavior, but you may try the beta version: http://entlib.codeplex.com/wikipage?title=EntLib5%20Beta2

Resources