How many threads can the Go runtime (scheduler, garbage collector, etc.) use? For example, if GOMAXPROCS is 10, how many of those kernel threads would be used by the runtime?
Edit:
I was reading the rationale for changing GOMAXPROCS to runtime.NumCPU() in Go 1.5. There was a sentence that claimed that “the performance of single-goroutine programs can improve by raising GOMAXPROCS due to parallelism of the runtime, especially the garbage collector.”
My real question is: If I have a single-goroutine program running in a Docker container that has a CPU quota, what is the minimum number of logical processors that I need in order to have maximum performance?
There is no direct correlation. The number of threads used by your app may be less than, equal to, or more than 10.
Quoting from the package doc of runtime:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.
So if your application does not start any new goroutines, the thread count will likely be less than 10.
If your app starts many goroutines (>10) where none is blocking (e.g. in system calls), 10 operating system threads will execute your goroutines simultaneously.
If your app starts many goroutines where many (>10) are blocked in system calls, more than 10 OS threads will be spawned (but only at most 10 will be executing user-level Go code).
See this question for an example and details: Why does it not create many threads when many goroutines are blocked in writing file in golang?
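A minimal, Linux-only sketch of that effect (the goroutine count and sleep duration are arbitrary choices, not anything the runtime requires): each goroutine enters a blocking system call, so the runtime has to create threads beyond GOMAXPROCS, which you can observe through the threadcreate profile.

package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"syscall"
)

func main() {
	before := pprof.Lookup("threadcreate").Count()

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A direct blocking syscall holds its OS thread for the full
			// 2 seconds, unlike time.Sleep, which parks the goroutine
			// and frees the thread for other work.
			ts := syscall.Timespec{Sec: 2}
			syscall.Nanosleep(&ts, nil)
		}()
	}
	wg.Wait()

	after := pprof.Lookup("threadcreate").Count()
	fmt.Printf("OS threads created: before=%d after=%d\n", before, after)
}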
Edit (in response to your edit):
I believe the default value of GOMAXPROCS is the number of logical CPUs for a reason: in general that provides the highest performance, so you may leave it at that. If you only have 1 goroutine and you're sure your code doesn't spawn more, GOMAXPROCS=1 may be sufficient, but you should test it and not take anyone's word for it.
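For reference, a small sketch of how you would query these values at runtime; all of these are standard runtime package calls:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Number of logical CPUs usable by the current process.
	fmt.Println("logical CPUs:", runtime.NumCPU())

	// Passing 0 queries the current GOMAXPROCS value without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// Setting a new value returns the previous one.
	prev := runtime.GOMAXPROCS(1)
	fmt.Println("previous GOMAXPROCS:", prev)
}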
Related
Imagine that I have two tasks, each of which needs 2 seconds to finish its job.
In this case, if I create two threads, one for each of them, and my PC is single-core, this won't save any time. Am I right?
What if I use fork to create two processes (the machine is still single-core) and each process takes charge of one task? Can this save any time?
If not, I have a question:
On a current modern machine (including multi-core), if I have several heavy tasks, which method should I use?
- fork?
- threads?
- fork + threads, i.e. create several processes where each process contains more than one thread?
Even with a single core, having two threads may speed up execution. If your routine is purely CPU-bound then two threads won't improve anything; indeed the performance will be worse because of context-switching overhead. But if the routine has to wait for memory, disk or network (which is usually the case), then two threads will provide performance gains even with a single core.
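A rough Go illustration of that claim, assuming the 2-second sleep stands in for an I/O wait: even when restricted to a single core, the two waits overlap, so the total elapsed time stays close to 2 seconds rather than 4.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func task() {
	time.Sleep(2 * time.Second) // stand-in for waiting on disk or network
}

func main() {
	runtime.GOMAXPROCS(1) // simulate a single-core machine

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			task()
		}()
	}
	wg.Wait()
	fmt.Println("elapsed:", time.Since(start)) // ~2s, not ~4s
}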
About fork vs. threads: threads require fewer resources, so in principle they should be the first choice. But there are two caveats: 1) maybe you want to be able to terminate a parallel routine, which is much safer to do with processes than with threads, and 2) some languages (notably Python and Ruby) provide pseudo-thread libraries which do not use real threads but switch between routines on the same thread. This simulated threading can be very useful, for example when waiting for network requests, but keep in mind that it is not real multithreading.
Amendment: As commented by Sergio Tulentsev, Ruby and Python do indeed provide real threads, not only coroutines.
"job takes 2 seconds" - If those 2 seconds are fully occupying the CPU (100% load), you won't gain anything with either thread nor fork if you have no cores to share. The single-core CPU is simply busy and you cannnot make it more busy.
If these 2 seconds include waiting time (for example on I/O, storage, whatever), you could gain something, even with a single core. The amount of gain depends on the ratio of CPU working time to CPU waiting time and on the overhead of your multiprocessing. Most non-trivial programs have at least some amount of "CPU waiting", so multithreading is often useful even on single-core CPUs.
The overhead of setting up a thread or process and of context switching can be considerable and needs to be measured. Obviously, the shorter the run time of your actual task, the larger the ratio of overhead to useful work, and the smaller your multiprocessing gain.
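As a sketch of that "measure it" advice, here is a crude Go benchmark of spawn-and-join overhead. Note that goroutines are far cheaper than OS threads or processes, so treat this as illustrating the method, not as thread-creation numbers:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const n = 100000
	start := time.Now()

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() { wg.Done() }() // empty task: pure spawn/join overhead
	}
	wg.Wait()

	elapsed := time.Since(start)
	fmt.Printf("%d spawns in %v (%v per spawn)\n", n, elapsed, elapsed/n)
}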
Traditionally, threads used to have considerably less overhead than processes (after all, that was why they were invented), but the "considerably" has perhaps vanished over time: on modern Linux systems, processes are only a tad slower to set up than threads (in fact, both use the same system calls). You should decide between threads and processes based on the requirements for protection (or sharing) of data rather than on execution speed.
I'm tripping up on the multithreading concept.
For example, my processor has 2 cores and (with hyper-threading) 2 threads per core, totaling 4 threads. So does this mean my CPU can execute four separate instructions simultaneously? Is each thread capable of being multi-threaded?
So does this mean my CPU can execute four separate instructions simultaneously? Is each thread capable of being multi-threaded?
In short: yes to both.
A CPU can only execute a single instruction per phase in a clock cycle. Due to techniques like pipelining, a CPU might be able to pass multiple instructions through the different phases in a single clock cycle, and the clock frequency might be extremely fast, but it's still only 1 instruction at a time.
As an example, NOP is an x86 assembly instruction which the CPU interprets as "no operation this cycle". That's 1 instruction out of the hundreds or thousands (or more) that are executed by something even as simple as:
int main(void)
{
    while (1) { /* eat CPU */ }
    return 0;
}
A CPU thread of execution is one in which a series of instructions (a thread of instructions) is being executed. It does not matter what "application" the instructions come from; a CPU does not know about high-level concepts (like applications), which are a function of the OS.
So if you have a computer with 2 (or 4/8/128/etc.) CPUs that share the same memory (cache/RAM), then you can have 2 (or more) CPUs running 2 (or more) instructions at (literally) the exact same time. Keep in mind that these are machine instructions running at the same time (i.e. the physical side of the software).
An OS-level thread is something a bit different. While the CPU handles the physical side of execution, the OS handles the logical side. The above code breaks down into more than 1 instruction and, when executed, may actually run on more than 1 CPU (in a multi-CPU-aware environment), even though it's a single "thread" (at the OS level). The OS schedules when to run the next instructions and on which CPU (based on the OS's thread-scheduling policy, which differs among the various OSs). So the above code will eat up 100% CPU usage for a given "time slice" on the CPU it's running on.
This "slicing" of "time" (also known as preemptive computing) is why an OS can run multiple applications "at the same time", it's not literally1 at the same time since a CPU can only handle 1 instruction at a time, but to a human (who can barely comprehend the length of 1 second), it appears "at the same time".
1) except in the case with a multi-CPU setup, then it might be literally the same time.
When an application is run, the kernel (the OS) actually spawns a separate thread (a kernel thread) to run the application on. Additionally, the application can request to create another external thread (i.e. spawning another process or forking), or create an internal thread by calling the OS's (or programming language's) APIs, which in turn call lower-level kernel routines that spawn and maintain the context switching of the spawned thread. Furthermore, any created thread is itself capable of calling the same APIs to spawn other separate threads (thus a thread is capable of being "multi-threaded").
Multi-threading (in the sense of applications and operating systems) is not necessarily portable, so while you might learn Java or C# and use their APIs (i.e. Thread.Start or Runnable), utilizing the actual OS APIs as provided (i.e. CreateThread or pthread_create and the slew of other concurrency functions) opens a different door for design decisions (i.e. "does platform X support thread library Y?"); just something to keep in mind as you explore the different APIs.
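The "a thread can itself be multi-threaded" point translates directly to Go, where goroutines (which the runtime multiplexes onto OS threads, rather than mapping 1:1) play the role of threads; a minimal sketch:

package main

import (
	"fmt"
	"sync"
)

func main() {
	var outer sync.WaitGroup
	outer.Add(1)
	go func() { // a spawned routine...
		defer outer.Done()
		var inner sync.WaitGroup
		for i := 0; i < 3; i++ {
			inner.Add(1)
			go func(id int) { // ...spawning routines of its own
				defer inner.Done()
				fmt.Println("inner worker", id)
			}(i)
		}
		inner.Wait()
	}()
	outer.Wait()
}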
I hope that can help add some clarity.
I actually researched this very topic in my Operating Systems class.
When using threads, a good rule of thumb for increased performance of CPU-bound processes is to use a number of threads equal to the number of cores, except in the case of a hyper-threaded system, in which case one should use twice as many threads as cores. The other rule of thumb applies to I/O-bound processes: use roughly four threads per core, or four threads per hardware thread on a hyper-threaded system.
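Those rules of thumb are easy to encode; a sketch in Go, where the multipliers are the heuristics quoted above and should be treated as starting points for benchmarking, not hard rules:

package main

import (
	"fmt"
	"runtime"
)

// cpuBoundWorkers: one worker per logical CPU. runtime.NumCPU already
// counts hyper-threaded logical CPUs, so no extra doubling is needed.
func cpuBoundWorkers() int {
	return runtime.NumCPU()
}

// ioBoundWorkers: roughly four workers per logical CPU, per the
// I/O-bound rule of thumb above.
func ioBoundWorkers() int {
	return 4 * runtime.NumCPU()
}

func main() {
	fmt.Println("CPU-bound workers:", cpuBoundWorkers())
	fmt.Println("I/O-bound workers:", ioBoundWorkers())
}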
I have a question about scheduling threads. On the one hand, I learned that threads are scheduled and treated as processes in Linux, meaning they get scheduled like any other process using the conventional methods (for example, the Completely Fair Scheduler in Linux).
On the other hand, I also know that the CPU might switch between threads using methods like switch-on-event or fine-grained multithreading. For example, on a cache-miss event the CPU can switch threads. But what if the scheduler doesn't want to switch the thread? How do they agree on one action?
I'm really confused between the two: who schedules a thread, the OS or the CPU?
Thanks a lot :)
The answer is both.
What happens is really fairly simple: on a CPU that supports multiple threads per core (e.g., an Intel with Hyperthreading) the CPU appears to the OS as having some number of virtual cores. For example, an Intel i7 has 4 actual cores, but looks to the OS like 8 cores.
The OS schedules 8 threads onto those 8 (virtual) cores. When it's time to do a task switch, the OS's scheduler looks through the threads and finds the 8 that are...the most eligible to run (taking into account things like thread priority, time since they last ran, etc.)
The CPU has only 4 real cores, but those cores support executing multiple instructions simultaneously (and out of order, in the absence of dependencies). Incoming instructions get decoded and thrown into a "pool". Each clock cycle, the CPU tries to find some instructions in that pool that don't depend on the results of some previous instruction.
With multiple threads per core, what happens is basically that each actual core has two input streams to put into the "pool" of instructions it might be able to execute in a given clock cycle. Each clock cycle it still looks for instructions from that pool that don't depend on the results of previous instructions that haven't finished executing yet. If it finds some, it puts them into the execution units and executes them. The only major change is that each instruction now needs some sort of tag attached to designate which "virtual core" will be used to store results into--that is, each of the two threads has (for example) its own set of registers, and instructions from each thread have to write to the registers for that virtual core.
It is possible, however, for a CPU to support some degree of thread priority so that (for example) if the pool of available instructions includes some instructions from both input threads (or all N input threads, if there are more than two) it will prefer to choose instructions from one thread over instructions from another thread in any given clock cycle. This can be absolute, so it runs thread A as fast as possible, and thread B only with cycles A can't use, or it can be a "milder" preference, such as attempting to maintain a 2:1 ratio of instructions executed (or, of course, essentially any other ratio preferred).
Of course, there are other ways of setting up priorities as well (such as partitioning execution resources), but the general idea remains the same.
An OS that's aware of shared cores like this can also modify its scheduling to suit, such as scheduling only one thread on a pair of cores if that thread has higher priority.
The OS handles scheduling and dispatching of ready threads (those that require CPU) onto cores, managing CPU execution in a similar fashion to how it manages other resources. A cache miss is no reason to swap out a thread. A page fault, where the desired page is not loaded into RAM at all, may cause a thread to be blocked until the page gets loaded from disk. The memory-management hardware does that by generating a hardware interrupt to an OS driver that handles the page fault.
Is there a way/program to find out the maximum number of threads a system can spawn? I am creating an application and am in a dilemma over whether to go with an event-loop model or a multi-threaded model, so I wanted to test the system's capability for how many threads it can handle.
The "maximum number of threads" is not as useful a metric as you might think:
There is usually a system-wide maximum number of threads imposed by either the operating system or the available hardware resources.
The per-process maximum number of threads is often configurable and can even change on-the-fly.
In most cases the actual restriction comes from your hardware resources rather than any imposed limit. Much like with any other resource (e.g. memory), you have to check whether you were successful, rather than rely on some kind of limit.
In general, multi-threading has only two advantages when compared to event loops:
It can utilise more than one processor. Depending on the operating system you can also use multiple processes (rather than the more lightweight threads) to do that.
Depending on the operating system, it may offer some degree of privilege separation.
Other than that, multi-threading is usually more expensive in both memory and processing resources. A large number of threads can bring your system to a halt regardless of whether what they are doing is resource-intensive or not.
In most cases the best solution is a hybrid of both models i.e. a number of threads with an event loop in each one.
EDIT:
On modern Linux systems, the /proc/sys/kernel/threads-max file provides a system-wide limit for the number of threads. The root user can change that value if they wish to:
echo 100000 > /proc/sys/kernel/threads-max
As far as I know, the kernel does not specifically impose a per-process limit on the number of threads.
sysconf() can be used to query system limits. There are some semi-documented thread-related query variables defined in /usr/include/bits/confname.h (the _SC_THREAD* variables).
getrlimit() can be used to query per-process resource limits - in this case the RLIMIT_NPROC resource (which on Linux actually counts threads per user) is the relevant one.
The threading implementation in glibc may also impose its own limits on a per-process basis.
Keep in mind that, depending on your hardware and software configuration, none of these limits may come into play. On Linux, a main limiting factor on the number of threads is the fact that each thread requires memory for its stack - if you start launching threads, you can easily hit this limit before any others.
If you really want to find the actual limit, then the only way is to start launching threads until you cannot do so any more. Even that will only give you a rough limit that is only valid at the time you run the program. It can easily change if e.g. your threads start doing actual work and increase their resource usage.
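If you do want to run that probe, here is a Go sketch under the assumption that each goroutine locked to its own OS thread consumes exactly one thread. Be warned: when the system finally refuses to create a thread, the Go runtime aborts the whole process, so the last count printed is your rough answer.

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	debug.SetMaxThreads(1000000) // lift Go's own default cap of 10000 threads

	park := make(chan struct{}) // never closed; keeps every thread alive
	for i := 1; ; i++ {
		go func() {
			runtime.LockOSThread() // dedicate an OS thread to this goroutine
			<-park                 // block forever, holding the thread
		}()
		if i%1000 == 0 {
			fmt.Println("threads so far:", i)
		}
	}
}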
In my opinion if you are launching more than 3-4 threads per processor you should reconsider your design.
On Linux, you can find the total number of threads running by checking /proc/loadavg
# cat /proc/loadavg
0.02 0.03 0.04 1/389 7017
In the output above, the fourth field has the form runnable/total; 389 is the total number of kernel scheduling entities (processes and threads) on the system.
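Reading that field programmatically is straightforward; a Linux-only Go sketch:

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The fourth field has the form "runnable/total", e.g. "1/389".
	fields := strings.Fields(string(data))
	parts := strings.Split(fields[3], "/")
	fmt.Println("total threads:", parts[1])
}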
It can handle as many threads as the OS provides, and you never know the limit.
But as a general measure, a normal LOB (line-of-business) application having more than 25 threads at a time can lead to problems and points to a serious design issue.
The current upper global limit on Linux is about 4 million threads: once reached, you have run out of PIDs (see futex.h).
Is there any formula, maybe involving RAM & number of CPUs, which can give me a rough idea of how many threads I can spawn before it starts to be inefficient and slows the PC?
I want to load-test another machine, so I want to send requests as quickly as possible. But there's no point in spawning a million threads if they will just get in each other's way.
Edit: The threads are making Remote Procedure Calls (SOAP), so they will block waiting for each call to return.
It depends on what the threads are doing. If they're doing calculations, then the number will be lower. If they're waiting on I/O, then you can have more.
However, if they're waiting on I/O, then you may be able to achieve the same result better with async I/O APIs than with multiple threads.
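In Go this distinction mostly disappears: goroutines let you write blocking-style code while the runtime multiplexes their network waits onto a small number of OS threads. A minimal load-generation sketch along those lines; the URL, worker count, and request count are placeholder assumptions:

package main

import (
	"fmt"
	"net/http"
	"sync"
)

const (
	workers  = 100                           // hypothetical concurrency level
	requests = 1000                          // hypothetical total request count
	target   = "http://example.com/endpoint" // placeholder URL
)

func main() {
	jobs := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				resp, err := http.Get(target)
				if err != nil {
					fmt.Println("request failed:", err)
					continue
				}
				resp.Body.Close() // close so connections can be reused
			}
		}()
	}

	for i := 0; i < requests; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
}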
If all threads are active and not blocking waiting for something then basically one thread per CPU (core really). Any more than that and you're relying on the operating system to context switch between the threads on a given CPU.
But it all depends on what the threads are doing. If they're sleeping most of the time or waiting on asynchronous IO operations, then you mostly just need to worry about the memory used for the stack which defaults to about 1MB per thread I believe.
The other answers are of course correct; "it depends". If the threads are busy doing CPU-intensive work, there's no point having more than the number of cores available. But assuming they are waiting on external results, it can vary widely.
I often find that this question is answered by the architecture and requirements of an application; you need as many threads as you need.
But if you potentially have an unlimited number of threads you might end up spawning, I think that probably sounds like a task for the ThreadPool myself; let it decide how many threads to actually have running.
First of all, starting a thread may be quite a slow operation in itself. When you start a thread, stack space must be allocated, entry points in DLLs may be called, etc. If you have many more threads than available cores, then the majority of your threads will not be running at any given moment, i.e. they use resources and contribute nothing.
It is hard to give an exact number of threads for optimal performance, because it depends on what the threads are doing, but generally you shouldn't go way above the number of available cores. Keep in mind that you cannot have more running threads than available cores.