Number of threads in a thread pool - multithreading

I have two questions..
1. What is the difference between a thread and a thread pool? Can I have multiple thread pools (not threads) in my system?
2. I have been reading that the general size of a thread pool should be the same as the number of processors, or one more than that. I am using a quad-core processor, which means I could have 4 or 5 threads in a thread pool. However, Task Manager shows more than 1000 threads active on my system at any time. How is that possible?

What is the difference between a thread and a thread pool?
A thread is a single flow of execution. A thread pool is a group of these threads; usually the threads in a thread pool are kept alive indefinitely (i.e. until program shutdown) so that as each new work-request comes in, it can be handed to the next available thread in the thread-pool for processing. (This is beneficial because it's more efficient to just wake up an existing thread and hand it some work than it is to always create a new thread every time a new work-request comes in, and then destroy the thread afterwards)
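To make that concrete, here is a minimal sketch in Kotlin on the JVM, using java.util.concurrent.Executors as the pool implementation (the 4-thread size and task count are just illustrative):

```kotlin
import java.util.concurrent.Executors

fun main() {
    // A pool of 4 long-lived worker threads, created once and reused.
    val pool = Executors.newFixedThreadPool(4)

    // Each submitted task is handed to the next available pool thread;
    // no thread is created or destroyed per task.
    repeat(10) { taskId ->
        pool.execute {
            println("Task $taskId ran on ${Thread.currentThread().name}")
        }
    }

    pool.shutdown()  // finish queued tasks, then let the threads exit
}
```

Running this shows the same few worker-thread names repeating across all 10 tasks, which is exactly the reuse described above.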
Can I have multiple thread pools (not threads) in my system?
Yes.
I have been reading that the general size of a thread pool should be the same as the number of processors, or one more than that.
That's a good heuristic, but it's not a requirement; your thread pool can have as many or as few threads in it as you like. The reason people suggest that number is that if you have fewer threads in your thread pool than you have physical CPU cores, then under heavy load not all of your CPU cores will get used (e.g. if you have a 3-thread pool and 4 CPU cores, then under heavy load you'll have 3 CPU cores busy and 1 CPU core idle/wasted, and your program will take ~33% longer to finish the work than it would have with 4 threads in the pool). On the other hand, if you have more threads than CPU cores, then under heavy load the "extra" threads merely end up time-sharing a CPU core, slowing each other down and not providing any additional benefit in terms of work-completion rate.
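On the JVM, for example, you can apply that heuristic at runtime instead of hard-coding a number (a sketch; note that availableProcessors() reports logical cores, which may include hyperthreads):

```kotlin
import java.util.concurrent.Executors

fun main() {
    // Size the pool to the machine, per the "threads ≈ cores" heuristic.
    val cores = Runtime.getRuntime().availableProcessors()
    val pool = Executors.newFixedThreadPool(cores)
    println("Pool sized to $cores threads for $cores logical cores")
    pool.shutdown()
}
```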
However, under Task Manager my system shows more than 1000 threads active at any time?
The thing to notice about those 1000 threads is that probably 99% of them are actually asleep at any given moment. Most threads aren't doing work all the time; rather they spend most of their lives waiting for some particular event to occur, quickly handling it, and then going back to sleep until the next event comes along for them to handle. That's the reason why you can have 1000 threads present on just a handful of CPU cores without everything bogging down.
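That sleep-until-an-event-arrives pattern looks roughly like this (a Kotlin sketch using a blocking queue; the names are made up, but the key point is that a thread blocked in take() consumes no CPU):

```kotlin
import java.util.concurrent.LinkedBlockingQueue

fun main() {
    val events = LinkedBlockingQueue<String>()

    // This worker spends most of its life asleep inside take(),
    // waking only when an event is enqueued for it to handle.
    val worker = Thread {
        while (true) {
            val event = events.take()   // blocks (sleeps) until work arrives
            if (event == "quit") break
            println("Handled: $event")
        }
    }
    worker.start()

    events.put("event-1")
    events.put("event-2")
    events.put("quit")
    worker.join()
}
```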

Related

Do we count the main thread when we compute the recommended number of threads that we can create in C using Pthreads?

I have a computer with 1 CPU, 4 cores, and 2 hardware threads per core. So I can run at most 8 threads efficiently.
When I write a program in C and create threads using the pthread_create function, how many threads is it recommended to create: 7 or 8? Do I have to subtract the main thread, and thus create 7, or should the main thread not be counted, so I can efficiently create 8? I know that in theory you can create many more, like thousands, but I want to plan efficiently, according to my computer's architecture.
Which thread started which is not much relevant. A program's initial thread is a thread: while it is scheduled on an execution unit, no other thread can use that execution unit. You cannot have more threads executing concurrently than you have execution units, and if you have more than that eligible to run at any given time then you will pay the cost of extra context switches without receiving any offsetting gain from additional concurrency.
To a first approximation, then, yes, you must count the initial thread. But read the above carefully. The relevant metric is not how many threads exist at any given time, but rather how many are contending for execution resources. Threads that are currently blocked (on I/O, on acquiring a mutex, on pthread_join(), etc.) do not contend for execution resources.
More precisely, then, it depends on your threads' behavior. For example, if the initial thread follows the pattern of launching a bunch of other threads and then joining them all without itself performing any other work, then no, you do not count that thread, because it does not contend for CPU to any significant degree while the other threads are doing so.
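To illustrate that last pattern (shown here as a JVM/Kotlin sketch rather than C with pthreads, but the counting argument is identical): the initial thread below launches one CPU-bound worker per execution unit and then just blocks in join(), so it is not counted against the core budget.

```kotlin
fun main() {
    val units = Runtime.getRuntime().availableProcessors()

    // One CPU-bound worker per execution unit.
    val workers = List(units) { id ->
        Thread {
            var acc = 0L
            repeat(100_000_000) { acc += it }  // stand-in for real CPU work
            println("worker $id finished ($acc)")
        }.apply { start() }
    }

    // The initial thread contends for no CPU here: it is blocked in join().
    workers.forEach { it.join() }
}
```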

Why are Kotlin coroutines considered lightweight?

If coroutines still use threads to run code in parallel, why are they considered lightweight?
In my understanding, Kotlin's suspend functions are transformed by the compiler into a state machine, where each branch can run on the same or a different thread, as chosen by the developer. A coroutine builder, e.g. launch {}, is responsible for that, and the CoroutineContext defines which thread it runs on.
Parallelism is achieved by sending blocks of code to a thread pool, which reuses the same threads.
There was a benchmark of 100k coroutines versus 100k threads, where the coroutines completed without issue and the threads threw an exception (likely OutOfMemoryError). That suggests I am missing something here.
Could you help me understand what I am missing? What makes coroutines run 100k blocks of code concurrently without exceeding memory limits the way threads do?
Points from the article:
Every thread has its own stack, typically 1 MB. 64 KB is the least amount of stack space allowed per thread in the JVM, while a simple coroutine in Kotlin occupies only a few dozen bytes of heap memory.
A coroutine dispatcher has a limit on how many threads it can create. For example, Dispatchers.IO has a limit of 64 threads, and Dispatchers.Default has a limit equal to the number of cores on your processor (2, 4, 6, 8, etc.). Dispatchers.Unconfined cannot create a new thread; it runs on threads previously created by other dispatchers. Here's evidence: 500 operations, each sleeping 10 ms, take approximately 5 s (single-threaded, because it can't spawn a new thread); try it yourself.
Coroutines stick to a thread, and as soon as a suspension point is reached, the coroutine leaves the thread and frees it up to pick up another coroutine that is waiting. This way, the same amount of concurrent work can be done with fewer threads and less memory.
Coroutines are suspended and resumed by a callback-like Continuation object, which at compile time is added as the last parameter of every function marked with the suspend keyword. It lives on the heap like other objects do and is responsible for resuming the coroutine, so thousands of megabytes of RAM are not required to keep all the threads alive. Typically 60-70 threads at most are created via CommonPool and are reused (if a new coroutine is created, it waits until another finishes).
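You can make that Continuation visible yourself with suspendCoroutine (a sketch assuming the kotlinx-coroutines-core dependency for runBlocking; the function name is made up, and in real code the continuation would be stored and resumed later from a callback rather than immediately):

```kotlin
import kotlin.coroutines.resume
import kotlin.coroutines.suspendCoroutine
import kotlinx.coroutines.runBlocking

// suspendCoroutine hands us the compiler-generated Continuation: a plain
// heap object, which is all that needs to survive while we are suspended.
suspend fun waitForAnswer(): Int = suspendCoroutine { continuation ->
    continuation.resume(42)   // resuming is what "wakes" the coroutine
}

fun main() = runBlocking {
    println(waitForAnswer())  // prints 42
}
```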
The main saving comes from the fact that a single thread can run any number of coroutines, by way of cooperative multitasking. When you launch 100,000 coroutines, they run on as many threads as there are CPU cores, but when you start 100,000 threads, the JVM creates that many native threads. Note that the level of parallelism is equal in both cases, and limited to the number of CPU cores.
The only thing that changes is scheduling: in the classical case, the OS suspends and resumes threads, assigning them to CPU cores. With coroutines, the coroutines suspend themselves (this is their cooperative aspect) and the Dispatcher resumes them later on, running other coroutines in the meantime.
Lightweight: You can run many coroutines on a single thread due to support for suspension, which doesn't block the thread where the coroutine is running. Suspending saves memory over blocking while supporting many concurrent operations.
Fewer memory leaks: Use structured concurrency to run operations within a scope.
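The 100k benchmark mentioned in the question is easy to reproduce (a sketch assuming the kotlinx-coroutines-core dependency): each coroutine below is just a small heap object while suspended in delay(), whereas 100,000 Thread objects would each reserve a native stack and typically die with OutOfMemoryError.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    val jobs = List(100_000) {
        launch {
            delay(1_000L)   // suspends without blocking the carrier thread
        }
    }
    jobs.forEach { it.join() }
    println("All ${jobs.size} coroutines completed")
}
```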

Threads vs cores when threads are asleep

I am looking to confirm my assumptions about threads and CPU cores.
All the threads are the same. No disk I/O is used, the threads do not share memory, and each thread does CPU-bound work only.
1. If I have a CPU with 10 cores and I spawn 10 threads, each thread will have its own core and run simultaneously.
2. If I launch 20 threads on a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
3. If I have 20 threads but 10 of the threads are asleep and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
4. A thread that is asleep costs only memory, not CPU time, for as long as it stays asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
5. In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you reach a state where all the cores are busy 100% of the time.
Are any of my assumptions incorrect? If so, why?
Edit
When I say a thread is asleep, I mean that the thread is blocked for a specific amount of time. In C++ I would use sleep_for, which "blocks the execution of the current thread for at least the specified sleep_duration".
If we assume that you are talking about threads that are implemented using native thread support in a modern OS, then your statements are more or less correct.
There are a few factors that could cause the behavior to deviate from the "ideal".
If there are other user-space processes, they may compete for resources (CPU, memory, etcetera) with your application. That will reduce (for example) the CPU available to your application. Note that this will include things like the user-space processes responsible for running your desktop environment etc.
There are various overheads that will be incurred by the operating system kernel. There are many places where this happens including:
Managing the file system.
Managing the physical / virtual memory system.
Dealing with network traffic.
Scheduling processes and threads.
That will reduce the CPU available to your application.
The thread scheduler typically doesn't do entirely fair scheduling. So one thread may get a larger percentage of the CPU than another.
There are some complicated interactions with the hardware when the application has a large memory footprint, and threads don't have good memory locality. For various reasons, memory intensive threads compete with each other and can slow each other down. These interactions are all accounted as "user process" time, but they result in threads being able to do less actual work.
So:
1) If I have a CPU with 10 cores and I spawn 10 threads, each thread will have its own core and run simultaneously.
Probably not all of the time, due to other user processes and OS overheads.
2) If I launch 20 threads on a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
Approximately. There are the overheads (see above). There is also the issue that time slicing between different threads of the same priority is fairly coarse grained, and not necessarily fair.
3) If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
Approximately: see above.
4) A thread that is asleep costs only memory, not CPU time, for as long as it stays asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
There is also the issue that the OS consumes CPU to manage the sleeping threads; e.g. putting them to sleep, deciding when to wake them, rescheduling.
Another issue is that the memory used by the threads may also come at a cost. For instance, if the sum of the memory used by all processes (including all of the 10,000 threads' stacks) is larger than the available physical RAM, then there is likely to be paging, and that also uses CPU resources.
5) In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you reach a state where all the cores are busy 100% of the time.
Not necessarily. If the virtual memory usage is out of whack (i.e. you are paging heavily), the system may have to idle some of the CPU while waiting for memory pages to be read from and written to the paging device. In short, you need to take account of memory utilization, or it will impact the CPU utilization.
This also doesn't take account of thread scheduling and context switching between threads. Each time the OS switches a core from one thread to another it has to:
Save the old thread's registers.
Flush the processor's memory cache.
Invalidate the VM mapping registers, etcetera. This includes the TLBs that @bazza mentioned.
Load the new thread's registers.
Take performance hits from the extra main-memory reads and VM page translations caused by the earlier cache invalidations.
These overheads can be significant. According to https://unix.stackexchange.com/questions/506564/ this is typically around 1.2 microseconds per context switch. That may not sound like much, but it adds up: 10,000 context switches per second at 1.2 microseconds each is 12 milliseconds of CPU per second, before counting the indirect cost of the cache and TLB invalidations.
As already mentioned in the comments, it depends on a number of factors. But in a general sense your assumptions are correct.
Sleep
In the bad old days a sleep() might have been implemented by the C library as a loop doing pointless work (e.g. multiplying 1 by 1 until the required time had elapsed). In that case the CPU would still be 100% busy; platforms such as MS-DOS worked this way. Nowadays a sleep() actually results in the thread being descheduled for the requisite time; any multitasking OS has had a proper implementation for decades.
10,000 sleeping threads will take up some additional CPU time, because the OS has to make scheduling decisions on every timeslice tick (every few milliseconds to tens of milliseconds, depending on the OS). The more threads it has to check for being ready to run, the more CPU time that checking takes.
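You can see both effects with a small experiment (a Kotlin/JVM sketch; the thread count is arbitrary): thousands of sleeping threads leave the CPU essentially idle, while each still costs stack memory.

```kotlin
fun main() {
    // Park a couple of thousand threads in sleep(); they are descheduled
    // by the OS, so CPU usage stays near zero while memory usage does not.
    val sleepers = List(2_000) {
        Thread { Thread.sleep(60_000) }.apply {
            isDaemon = true   // let the JVM exit without waiting for them
            start()
        }
    }
    println("${sleepers.size} threads sleeping; compare CPU vs. memory in Task Manager/top")
    Thread.sleep(10_000)      // observe while the sleepers are parked
}
```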
Translation Lookaside Buffers
Adding more threads than cores is generally seen as OK. But you can run into a problem with Translation Lookaside Buffers (or their equivalents on other CPUs). These are part of the virtual memory management side of the CPU, and they are effectively content-addressable memory. That is really hard to implement, so there's never very much of it. Thus the more memory allocations there are (which there will be if you add more and more threads), the more this resource is consumed, to the point where the OS may have to start swapping different loadings in and out of the TLB in order for all the virtual memory allocations to be accessible. If this starts happening, everything in the process becomes really, really slow. This is likely less of a problem these days than it was, say, 20 years ago.
Also, modern memory allocators in C libraries (and hence everything built on top, e.g. Java, C#, the lot) are quite careful about how requests for virtual memory are managed, minimising the number of times they actually have to ask the OS for more virtual memory. Basically, they try to serve requested allocations out of pools they've already got, rather than having each malloc() result in a call to the OS. This takes the pressure off the TLBs.

Does a process run threads in a sequential order?

The question is about multithreading. Say I have 3 threads: the main one, a child1, and a child2. Does the process executing these threads work on one thread for a short amount of time, then work on another, and keep switching back and forth, or do the threads run without ever being stopped by the process? Somewhere I read that a thread gets stopped before finishing, then another thread is worked on and stopped, then it's back to thread1, and so on. But that wouldn't make any sense if threads are stopped, since the point of multithreading is that they are all concurrent and all run at the same time. How does the processor do that?
This is in .Net/C#.
The scenario you describe is the way the OS ran threads in the old days, before multi-core: the OS scheduled threads sequentially, based on their priorities. But now I suppose you have at least 2 cores, where 2 threads can run concurrently, and the 3rd thread will be scheduled by interrupting one of the others.
The scenario you're describing is correct, except that normally one thread will be running at a time per processor core.
Simplified: if 3 threads are active on 4 cores, they will all be allowed to run at all times, since there's always an available core for them; if 3 threads are active on 2 cores, only two can run at any given time, so they will have to take turns.
Operating systems schedule threads to execute on the available CPU cores (either real or virtual). In the past, most computers had single core CPUs, and thus only one thread could be executed at a time. Modern CPUs are typically 2, 4, or 8 core systems. Some of these cores are virtual, like Intel's hyperthreading CPUs which have twice as many virtual cores as physical cores.
However, there are almost always more threads than CPU cores available, so the OS will prioritize all of the threads on the system in order to run them as efficiently as possible. The threads created by your process may or may not truly run in parallel over any given time span, but you should assume that they will.

[CLR Threading] When a thread pool thread blocks, the thread pool creates additional threads

I saw this in the book "CLR via C#" and I don't get it. If there are still threads available in the thread pool, why does it create additional threads?
It might just be poor wording.
On a given machine the thread pool has a good guess of the optimum number of threads the machine can run without overextending resources. If, for some reason, a thread becomes I/O blocked (for instance, it is waiting a long time to save or retrieve data from disk, or for a response from a network device), the thread pool can start up another thread to take advantage of the unused CPU time. When the blocked thread is no longer blocking, the thread pool will take the next freed thread out of the pool to reduce the size back to "optimum" levels.
This is part of the thread pool's management: it keeps the system from being over-tasked (which would reduce efficiency through all the context switches between too many threads), while also reducing wasted cycles (while a thread is blocked, there might not be enough other work to task the processor(s) fully, even though there are tasks waiting to run) and wasted memory (threads spun up and ready but never used, because they would over-task the CPU).
More info on the Managed Thread Pool from MSDN.
The book lied.
The thread pool only creates additional threads when all available threads have been blocked for more than 1 second. If there are free threads, it will use them to process your additional tasks. Note that after a thread has been idle for 30 seconds, the CLR retires it (terminates it, gracefully of course).
