I am using go to parallelize 2d convolutions where the convolution (implemented in go) is happening in a c-archive included in a C binary (where the go code is called). No calls are made from the go code to any c function
Before spawning goroutines, all matrixes are loaded into memory by the c code and all goroutines access it through the shared memory.
I use the GOMAXPROCS-1 to decide how many go routines to spawn and each routine is assigned a ID. The goroutines are assigned rows of the matrix based on their ID in a striped fashion. The go routines are locked to a OS thread when spawned and release the thread once finished.
e.g.
if GOMAXPROCS is set to 4, goroutine 0 takes row 0, 4, 8, 12 etc and goroutine 1 takes row 1, 5, 9, 13 and so on.
My issue is that when GOMAXPROCS is set to 4, go spawns 11 OS threads
htop and atop:
My understanding is that these OS threads are spawned because the scheduler is trying to make sure that there are always threads available that are not blocked.
There is no I/O or system calls happening after the goroutines have been spawned so I don't understand why the scheduler is creating all these processes or what is blocking the threads.
The number of threads being spawned is slowing down the execution when executing with GOMAXPROCS >=20 on a machine with 40 cores
Why is the scheduler spawning all these threads?
How can I debug where/how the routines are being blocked?
Source code
GOMAXPROCS limits the number of threads running
Go code, but cgo calls do not count as Go code, so
you can still see multiple threads with GOMAXPROCS=1.
https://groups.google.com/g/golang-nuts/c/8Tdb-aCeYFY?pli=1
Related
I have a computer with 1 cpu, 4 cores and 2 threads per core that can run. So I have effiency with maximum 8 running threads.
When I write a program in C and create threads using pthred_create function, how many threads is recommended to be created: 7 or 8? Do I have to substract the main thread, thus create 7, or main thread should not be counted and I can effiently crete 8? I know that in theory you can create much more, like thousands, but I want to be effiently planned, according with my computer architecture.
Which thread started which is not much relevant. A program's initial thread is a thread: while it is scheduled on an execution unit, no other thread can use that execution unit. You cannot have more threads executing concurrently than you have execution units, and if you have more than that eligible to run at any given time then you will pay the cost of extra context switches without receiving any offsetting gain from additional concurrency.
To a first approximation, then, yes, you must count the initial thread. But read the above carefully. The relevant metric is not how many threads exist at any given time, but rather how many are contending for execution resources. Threads that are currently blocked (on I/O, on acquiring a mutex, on pthread_join(), etc.) do not contend for execution resources.
More precisely, then, it depends on your threads' behavior. For example, if the initial thread follows the pattern of launching a bunch of other threads and then joining them all without itself performing any other work, then no, you do not count that thread, because it does not contend for CPU to any significant degree while the other threads are doing so.
If coroutines is still using threads to run code in parallel why it is considered light weight?
In my understanding kotlin's suspend functions is transformed by compiler into state machine where each branch can be run on the same or different thread defined by developer. Coroutine builder, e,g, launch{}, responsible for that and CoroutineContext is what defines a thread to run on.
Parallelism achieved by sending block of code to the thread pool which leverage the same threads
There was a benchmark on 100k coroutines and 100k threads where coroutines pass without issue and threads throw exception (likely OutOfMemory). It brings me to idea I am missing something here.
Could you help me to understand what is missed here what makes coroutines run code block 100k in parallel without exceeding memory limits like threads do?
Pointing from the article
Every Thread has its own stack, typically 1MB. 64k is the least amount of stack space allowed per thread in the JVM while a simple coroutine in Kotlin occupies only a few dozen bytes of heap memory.
The Coroutine Dispatcher has a limit that only a certain amount of threads can be created.
Such as Dispatchers.IO has limit of 64 threads, Dispatchers.Default has limit of number of cores on your processor (2, 4, 6, 8, etc.) Dispatchers.Unconfined cannot create new thread it runs on threads previously created by other dispatchers, here's proof: 500 operations having 10ms of sleep takes approx 5s (single-thread because it can't spawn a new) try it yourself.
Coroutines stick to a thread, and as soon as suspension point is reached, it leaves the Thread and frees it up letting it to pick up another coroutine if it is waiting. This way with less threads and less memory usage, that much concurrent work can be done.
The coroutines are managed to be suspended and resumed by a callback like object Continuation which is added as the last parameter to the function marked with suspend keyword at the time of compilation which lives in heap as other objects do and is responsible for the resume of coroutine, so Thousands of MBs space is not required in RAM to keep all the Threads alive. A typical 60-70 threads are created at max using CommonPool and are reused (if new coroutine is created it waits till another finishes).
The main saving comes from the fact that a single thread can run any number of coroutines, by way of cooperative multitasking. When you launch 100,000 coroutines, they run on as many threads as there are CPU cores, but when you start 100,000 threads, the JVM creates that many native threads. Note that the level of parallelism is equal in both cases, and limited to the number of CPU cores.
The only thing that changes is scheduling: in the classical case, the OS suspends and resumes threads, assigning them to CPU cores. With coroutines, the coroutines suspend themselves (this is their cooperative aspect) and the Dispatcher resumes them later on, running other coroutines in the meantime.
Lightweight: You can run many coroutines on a single thread due to support for suspension, which doesn't block the thread where the coroutine is running. Suspending saves memory over blocking while supporting many concurrent operations.
Fewer memory leaks: Use structured concurrency to run operations within a scope.
I have written a small application on go, which starts 4 threads for doing various things + one main thread. So in total there are 5 threads. But if I'll start activity monitor and monitor the process, this is what I see
First of all why 7 threads. And it is not constant. Sometimes it is 5 and other times it is 7. Also all 4 threads started by main thread ends after doing hat they are suppose to. I verify that threads end by putting a differ statement on the top of thread. Still thread count in Activity monitor stays 7.
Does anyone knows what is going on over here? Are these extra threads started by go runtime? Is there a way to find out how many threads are active my program that are started by my code and not by go runtime.
Yes they are started by the runtime, for example http://play.golang.org/p/c0cIngo_sO it will print 4 goroutines are running.
Goroutines aren't threads, 1 OS thread can handle 100s of goroutines, however if you're doing something heavy or using a blocking system call, the runtime will start a new thread to handle the other goroutines.
I suppose you mean Goroutines when you say threads.
The Go runtime transparently multiplexes lightweight Goroutines onto OS threads. That's also why you don't need to call functions like select()—that's the runtime's job.
If you spawn 7 Go routines and some of them block, the runtime might decide to terminate the idle OS threads. This is why you see less threads than Go routines.
I think you mistake Goroutines for thread.
In your go program, the thread you mean is actually goroutine ,which is a coroutine and is not a real thread , which is implemented by go's runtime(you need to know about go runtime, every go program is running on a runtime, and runtime actually use thread to implement goroutines).Diffrent goroutine may be running in the same thread, or may be not ,but you never know . You can use runtime.GOMAXPROCS for multi-core cpu .
And the threads you see in the monitor are real threads .
I know linux scheduler will schedule the task_struct which is a thread. Then if we have two processes, e.g., A contains 100 threads while B is single thread, how can the two processes be scheduled fairly, considering if each thread would be scheduled fairly?
In addition, so in Linux, context switch between threads from the same process would be faster than that between threads from different processes, right? Since the latter will have something to do with process control block while the former wouldn't.
The point you are missing here is, how scheduler looks at threads or tasks. Well, the Linux kernel scheduler will treat them as individual scheduling entity, therefore will be counted and scheduled differently.
Now let's see what CFS documentation says - it has a simplistic approach of giving out even slice of CPU time to each runnable process, therefore, if there are 4 runnable process/threads they'll get 25% of cpu time each. But on real hardware it's not possible and to fix the issue vruntime was introduced (take more on this from here
Now come back to your example, if process A creates 100 threads and B creates 1 thread then the # of running processes or threads becomes 103 (assuming all are runnable state) then CFS will evenly share the cpu using formula 1/103 (cpu/number of running tasks). And the context switching is same for all the scheduling entities, threads only shares task's internal mm_struct and when they run they have their own sets of registers, task status to load up to start with. Hope this will help to understand better.
I have two C++ codes one called a and one called b. I am running in in a 64 bits Linux, using the Boost threading library.
The a code creates 5 threads which stay in a non-ending loop doing some operation.
The b code creates 5 threads which stay in a non-ending loop invoking yield().
I am on a quadcore machine... When a invoke the a code alone, it gets almost 400% of the CPU usage. When a invoke the b code alone, it gets almost 400% of the CPU usage. I already expected it.
But when running both together, I was expecting that the b code used almost nothing of CPU and a use the 400%. But actually both are using equals slice of the CPU, almost 200%.
My question is, doesn't yield() works between different process? Is there a way to make it work the way I expected?
So you have 4 cores running 4 threads that belong to A. There are 6 threads in the queue - 1 A and 5 B. One of running A threads exhausts its timeslice and returns to the queue. The scheduler chooses the next runnable thread from the queue. What is the probability that this tread belongs to B? 5/6. Ok, this thread is started, it calls sched_yield() and returns back to the queue. What is the probability that the next thread will be a B thread again? 5/6 again!
Process B gets CPU time again and again, and also forces the kernel to do expensive context switches.
sched_yield is intended for one particular case - when one thread makes another thread runnable (for example, unlocks a mutex). If you want to make B wait while A is working on something important - use some synchronization mechanism that can put B to sleep until A wakes it up
Linux uses dynamic thread priority. The static priority you set with nice is just to limit the dynamic priority.
When a thread use his whole timeslice, the kernel will lower it's priority and when a thread do not use his whole timeslice (by doing IO, calling wait/yield, etc) the kernel will increase it's priority.
So my guess is that process b threads have higher priority, so they execute more often.