What is the reason for keeping the thread pool size equal to the number of processors/cores for CPU-intensive tasks? And why should I/O-bound tasks have a larger pool size?
There is a correlation between the optimal number of threads and the number of central processing units, because a thread can be thought of as a program. Programs require run time, and run time is provided by a central processing unit.
A producer-consumer analogy would have the program as the consumer and the central processing units as the producers. So, theoretically, if a producer (CPU) can handle T consumers (threads) and there are C producers, the optimal number of consumers would be T * C.
Too many threads would cause too much context-switch overhead, which is essentially CPU time wasted managing the threads themselves. Too few would leave CPUs idle while tasks are still in the queue.
I/O-bound tasks communicate with slow devices (that's the reason they're called I/O bound). While requests are made to a slow device (such as the hard drive), the scheduler can have the CPU run other threads instead of waiting for the device's output.
An analogy for that would be you (the scheduler) ordering food in a restaurant (thread 1) and then sending an SMS to your friend (thread 2). The fact that you're waiting for your food shouldn't prevent you from completing other tasks, such as sending the SMS to your friend.
To have deeper knowledge about possible optimizations you may want to read about affinity and scheduling.
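As a rough illustration, here is a minimal Java sketch of that sizing rule (the pool names and the 4x multiplier for the I/O-bound pool are just illustrative assumptions, not a fixed rule):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        // Number of logical cores reported by the JVM.
        int cores = Runtime.getRuntime().availableProcessors();

        // CPU-bound work: roughly one thread per core, so every core stays
        // busy without paying for extra context switches.
        ExecutorService cpuBoundPool = Executors.newFixedThreadPool(cores);

        // I/O-bound work: threads spend most of their time blocked on slow
        // devices, so a larger pool keeps the CPU busy while others wait.
        // The multiplier here is illustrative; tune it by measurement.
        ExecutorService ioBoundPool = Executors.newFixedThreadPool(cores * 4);

        cpuBoundPool.shutdown();
        ioBoundPool.shutdown();
    }
}
```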
Every process has at least one thread of execution, and I read somewhere that modern operating systems only schedule threads, not processes.
So if there are two processes running on the system - P1 with 1 thread and P2 with 100 threads - how will the OS scheduling algorithm ensure that both P1 and P2 get approximately the same amount of CPU time? If the OS blindly schedules threads, P2 will get 100 times more CPU time than P1.
Does it also take into account which process a particular thread belongs to? Otherwise, it seems too easy for a process to hog all the CPU by creating more threads.
Does it also take into account which process a particular thread belongs to? Otherwise, it seems too easy for a process to hog all the CPU by creating more threads.
Wrong question. Consider two jobs that are trying to solve the exact same problem by doing the same work and are perfectly identical except for one thing -- one uses dozens of threads, the other uses dozens of processes. Why should the one that uses dozens of processes get more CPU time than the one that uses dozens of threads?
Your notion of fairness is not really a sensible one.
Instead, scheduling is more designed around trying to get as much work done as possible per unit time. The assumption is that everything the computer is doing is useful and it benefits competing tasks to have other tasks competing with them finish as quickly as possible too.
This is actually all you need the vast majority of the time. But occasionally you have special situations where this doesn't work. One is ultra-high-priority tasks like keeping video or audio flowing or keeping a user interface responsive. Another is ultra-low-priority tasks where there's an enormous amount of work you want done and you don't want the system to be slow for a long time while you're working on it. Priorities are used for this, and generally the system allows higher-priority threads to interrupt lower-priority ones to keep responsiveness.
In general, "fair thread scheduling" attempts to give each thread an equal amount of CPU time (regardless of how much CPU time all threads in a process get); and "fair process scheduling" attempts to give each process the same amount of CPU time (e.g. by giving threads belonging to different processes unequal amounts of CPU time). These are mutually exclusive - you can't have both (unless each process has the same number of threads).
Note that it's all a broken joke anyway. For example, if one thread gets 10 ms of time on a CPU that is running slow due to thermal throttling (and/or because another logical CPU in the same core is busy) and another thread gets 10 ms of time on a CPU that is running faster than normal (e.g. due to "turbo-boost" and/or because the other logical CPU in the core is not being used), then these threads have received an equal amount of CPU time but have not received anything that could be considered "fair" (because one thread might be able to get 20 times as much work done as the other).
Note that it's all unwanted anyway. For example, in a good OS threads would be given a priority to indicate how important their work is, and you don't want a high-priority thread (doing very important work) to get the same "fair share" of CPU time as a low-priority thread (doing irrelevant/unimportant work). For cases where two threads have equal priority you might (in theory) want them to get an "equal" amount of CPU time; but in practice this isn't common, threads block and unblock so often that it isn't worth caring about, and it can lead to "two half-finished jobs instead of one completed job and one unstarted job" scenarios that increase the average amount of time a job (e.g. a request for work) takes to complete.
If the thread is the basic unit of scheduling (a generally safe assumption these days), then the process scheduler is the one that decides how to allocate the CPUs. How (and whether) it takes thread usage into account is entirely system specific. AND the behavior may depend upon the type of process. For example, in VMS (and adopted in Windoze), realtime processes are treated differently than other types of processes.
In VMS-type scheduling, a process with more threads gets more CPU by design. It is better for an application to use more threads than for it to use more processes.
Keep in mind that a system may impose limits on the number of threads in a process.
This is all entirely theoretical; the question just came to mind and I wasn't entirely sure what the answer is:
Assume you have an application that performs 4 independent calculations. (Totally independent: it doesn't matter what order you do them in, and you don't need one to calculate another.)
Also assume those calculations are long (minutes) and CPU-bound (not waiting for any kind of I/O).
1) Now, if you have a 1-processor computer, a single-threaded application will logically be faster than (or the same as) a multithreaded application. As the computer is not able to do more than one thing at a time with one processor, it would "waste" time on context switching and the like.
So far so good?
2) If you have a 4-processor computer, 4 threads will most likely be faster for this than a single thread. Right? Your computer can now do 4 operations at a time, so it's just logical to divide your application into 4 threads, and it should complete in the time the longest of the 4 calculations takes.
Still good so far?
3) And now the actual part I am confused about: why would I EVER have my application create more threads than the number of processors (well, actually, cores) available? I have programmed, and have seen, applications that create tens and hundreds of threads - but surely the perfect number is about 8 for an average computer?
P.S. I already read this: Threading vs single thread
but it didn't quite answer that.
Cheers
Why would I EVER have my application create more threads than the number of processors (well actually - cores) available?
One very good reason is if you have threads that wait on events. For example you might have a producer/consumer application in which the producer is reading from some data stream, and that data arrives in bursts: a few hundred (or thousand) records in a batch, followed by nothing for a while, and then another burst. Say you have a 4-core machine. You could have a single producer thread that reads the data and places it in a queue, and three consumer threads to process the queue.
Or, you could have a single producer thread and four consumer threads. Most of the time, the producer thread is idle, giving you four consumer threads to process items from the queue. But when items are available on the data stream, one of the consumer threads gets swapped out in favor of the producer.
That's a simplified example, but substantially similar to programs that I have in production.
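Here is a minimal Java sketch of that layout, assuming a BlockingQueue as the hand-off and placeholder readNextRecord()/process() methods standing in for the real data stream and processing:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BurstyPipeline {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        int consumers = 4;                 // one thread per core on a 4-core box
        ExecutorService pool = Executors.newFixedThreadPool(consumers + 1);

        // Producer: mostly idle, wakes up when a burst of records arrives.
        pool.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                String record = readNextRecord();   // placeholder: blocks until data arrives
                queue.put(record);
            }
            return null;
        });

        // Consumers: process items as they appear on the queue.
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    process(queue.take());          // placeholder processing
                }
                return null;
            });
        }
    }

    private static String readNextRecord() { /* placeholder: read from the data stream */ return "record"; }
    private static void process(String record) { /* placeholder: handle the record */ }
}
```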
More generally, it doesn't make any sense to create more continuously-working (i.e. CPU bound) threads than you have processing units (CPU cores in general, although the existence of hyperthreading muddies the waters a bit). If you know that your threads won't be waiting on external events, then having n+1 threads when you only have n cores will end up wasting time with thread context switches. Note that this is strictly in the context of your program. If there are other applications and OS services running, your application's threads will get swapped out from time to time so that those other apps and services can get a timeslice. But one assumes that, if you're running a CPU-intensive program, you'll limit the other apps and services that are running at the same time.
Your best bet, of course, is to set up a test. On a 4-core machine, test your app with 1, 2, 3, 4, 5, ... threads. Time how long it takes to complete with different numbers of threads. I think you'll find that on a 4-core machine the sweet spot will be 3 or 4; most likely 4 unless there are other apps or OS services that take a lot of CPU.
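A rough sketch of such a test in Java, with an invented cpuBoundWork() placeholder standing in for one of the long calculations:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadCountBenchmark {
    public static void main(String[] args) throws Exception {
        int tasks = 4;                              // the four independent calculations
        for (int threads = 1; threads <= 6; threads++) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            long start = System.nanoTime();

            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(ThreadCountBenchmark::cpuBoundWork));
            }
            for (Future<?> f : futures) {
                f.get();                            // wait for all calculations to finish
            }

            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + elapsedMs + " ms");
            pool.shutdown();
        }
    }

    // Placeholder for one of the long, CPU-bound calculations.
    private static void cpuBoundWork() {
        double x = 0;
        for (long i = 0; i < 200_000_000L; i++) {
            x += Math.sqrt(i);
        }
    }
}
```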
One reason I could come up with for having more threads than cores would be if some threads needed to interface with other parties: waiting for a response from a server, or querying something from the database. This allows the thread to sleep until an answer is provided, so other computations don't have to wait. In the 4-cores/4-threads case, a thread waiting for input could force other code to wait as well.
Adding threads to your application is not strictly about performance gains. Sometimes you want or need to perform more than one task at the same time because that is the most logical way to architect your program.
As an example, perhaps you are writing a game engine. If you take a multi-threaded approach, you may have one thread for physics, one thread for graphics, one thread for networking, one thread for user input, one thread for resource loading from disk, etc.
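A bare-bones Java sketch of that kind of split, with placeholder loop bodies (the thread names are just illustrative):

```java
public class EngineThreads {
    public static void main(String[] args) {
        // One long-running thread per subsystem: the split reflects the
        // program's architecture, not the number of cores.
        new Thread(EngineThreads::physicsLoop, "physics").start();
        new Thread(EngineThreads::renderLoop, "graphics").start();
        new Thread(EngineThreads::networkLoop, "networking").start();
    }

    private static void physicsLoop()  { /* placeholder: would loop, stepping the simulation */ }
    private static void renderLoop()   { /* placeholder: would loop, drawing frames */ }
    private static void networkLoop()  { /* placeholder: would loop, sending and receiving packets */ }
}
```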
Also, James Baxter's point is very true as well. Sometimes threads are waiting on a resource and cannot execute further until they access said resource. With only the same number of threads as cores, one core would be going to waste.
I think you are assuming that all programs are CPU bound - remember some of your threads will be waiting for I/O (disk/network/user traffic).
I am trying to get my head around the concepts of Grand Central Dispatch. I want to understand these quotes from Vandad's book on Concurrent Programming.
The real use for GCD is to dispatch tasks to multiple cores, without making you, the programmer, worry about which core is executing which task.
and
At the heart of GCD are dispatch queues. Dispatch queues are pools of threads.
and finally
You will not be working with these threads directly. You will just work with dispatch queues, dispatching tasks to these queues and asking queues to invoke your task.
I have bolded the key terms.
Are multiple cores the same as queues? Does a queue consist of many threads? Does each thread perform a task?
So multiple cores are the same as queues?
Not really. A queue is a programming abstraction, a core is a physical resource in your processor. There is no unique relationship between a queue and a core, although at any given point in time it can be said that a given queue is executing a given task on a given core.
A queue consists of many threads?
A queue consists of tasks. Tasks are assigned to threads by the queue-managing system when it comes time to execute them. Threads are OS resources and are allocated to cores, which effectively run them and have no notion of what a task is (except for Hyper-Threading CPUs).
If you do not account for hardware-multithreading (e.g., Hyper-threading), at any given point in time a core is running a specific thread; when it comes the time to run a different thread, a context-switch occurs in that core. If you account for hardware-multithreading, you can have multiple threads running on virtual cores hosted in the same physical core.
The relationship between queues and threads is opaque. A queue could manage several threads at once, or several threads one at a time, or just one thread all the time. In the first case you have a concurrent queue, able to execute parallel tasks on simultaneous threads; in the second and third cases you have a serial queue.
Each thread performs a task?
At any given point in time, a thread is performing a task. You can have threads that are spawned, execute their task, and die; or you can have long-running threads (e.g., the main thread) that execute several tasks.
It may be pretty puzzling at first; you might need to do some reading about operating systems, and maybe high-level processor architecture, to fully understand this.
GCD aims at letting you reason exclusively in abstract terms: i.e., in terms of tasks and queues, and forget about threads and cores, that are seen as a sort of "implementation means", or low-level details that you can leave to the system to use efficiently.
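GCD itself is Apple-specific, but if you happen to know Java's executors, the task/queue abstraction has a rough analogue there; the following is only an analogy under that assumption, not GCD's actual API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueueAnalogy {
    public static void main(String[] args) {
        // Rough analogue of a serial queue: tasks run one at a time, in order.
        ExecutorService serialQueue = Executors.newSingleThreadExecutor();

        // Rough analogue of a concurrent queue: tasks may run simultaneously
        // on however many threads the pool decides to use.
        ExecutorService concurrentQueue = Executors.newCachedThreadPool();

        // You submit tasks; which thread (and which core) runs them is not
        // your concern.
        serialQueue.submit(() -> System.out.println("task A"));
        serialQueue.submit(() -> System.out.println("task B"));
        concurrentQueue.submit(() -> System.out.println("task C"));

        serialQueue.shutdown();
        concurrentQueue.shutdown();
    }
}
```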
Queues are just lists of tasks to execute; how many cores you have depends on the processor - you can have 1 or many.
Queues are configurable, and you can decide whether tasks can be executed concurrently or not. If you allow concurrency in your queue, tasks in the queue can be executed at the same time on different cores.
I'm not sure those quotes really do GCD justice. For example, to take each quote in turn:
GCD is more than usable (and useful) even if you have only a single core available, since multi-threading certain tasks has its place in computer science regardless of the number of physical CPU cores available. It's better to think of it as an alternative to managing threads explicitly - GCD will do the thread management so you don't have to; you (as the programmer) just have to think in terms of queues and whether certain related tasks must be done serially or can be done concurrently.
Dispatch queues are not "pools of threads". Dispatch queues are "units of work aggregation" and should be thought of that way. How that work is physically performed, by one thread or multiple threads, is not the programmer's concern and, in fact, the fewer assumptions the programmer makes about that, the better, since GCD tries very hard to be efficient and use as few threads as possible while still effectively utilizing hardware resources.
The third quote is good - that is the appropriate idiom to embrace. Just submit your work (be it blocks or function/context tuples) to the appropriate queue, creating queues as necessary to associate with resources that require synchronization, and you've got the gist of GCD.
The following is an excerpt from the book Java Concurrency in Practice, Chapter 12.2 Testing for Performance where the author talks about throughput of a bounded buffer implementation.
Figure 12.1 shows some sample results on a 4-way machine, using buffer
capacities of 1, 10, 100, and 1000. We see immediately that a buffer
size of one causes very poor throughput; this is because each thread
can make only a tiny bit of progress before blocking and waiting for
another thread. Increasing buffer size to ten helps dramatically, but
increases past ten offer diminishing returns.
It may be somewhat puzzling at first that adding a lot more threads
degrades performance only slightly. The reason is hard to see from the
data, but easy to see on a CPU performance meter such as perfbar while
the test is running: even with many threads, not much computation is
going on, and most of it is spent blocking and unblocking threads. So
there is plenty of CPU slack for more threads to do the same thing
without hurting performance very much.
However, be careful about concluding from this data that you can
always add more threads to a producer-consumer program that uses a
bounded buffer. This test is fairly artificial in how it simulates the
application; the producers do almost no work to generate the item
placed on the queue, and the consumers do almost no work with the item
retrieved. If the worker threads in a real producer-consumer
application do some nontrivial work to produce and consume items (as
is generally the case), then this slack would disappear and the
effects of having too many threads could be very noticeable. The
primary purpose of this test is to measure what constraints the
producer-consumer handoff via the bounded buffer imposes on overall
throughput.
What does the author mean by CPU slack here? Why does throughput not degrade more and more as threads are added? I am not following the author's reasoning regarding the only slight degradation of performance while adding more and more threads, assuming that the bound on the buffer size is kept constant.
Edit: I can think of one reason: since in this case no real work is being done by the threads, the classic problems of increased traffic on the shared memory bus and cache misses due to context switching of threads are not playing a major role as more and more threads are added. The situation is going to change once the threads start doing more work. Is that what the author is trying to convey in the third paragraph?
There is no formal term such as "CPU slack". The author simply means that the CPU is not fully utilised doing meaningful work, because most of the time is spent waiting to successfully acquire a mutually exclusive lock. The author is calling the unused capacity of the CPU the CPU slack.
NOTE: The associated code tests a multiple producer / multiple consumer scenario, with an equal number of producers and consumers.
EDIT: In the later discussion they talk about the effect of adding more threads when (a) the threads do almost no work, and (b) the threads occupy the CPU substantially for every produced or consumed item. I will try to explain the difference with some slightly artificial scenarios.
Suppose that locking takes 1 time unit actively, and 8 time units passively by waiting. Passive waiting does not occupy the CPU.
Case 1: Producer-Consumer cost is 1 time unit.
So we currently account for 2 time units of CPU time, with an additional 8 time units of passive waiting time. So we have 8/10 available CPU time units.
If we now want to double the number of threads, we need to accommodate an additional 2 time units (1 for producer-consumer stuff, and 1 for active locking time). That would eat into our supply of available CPU time -- but we have enough.
Case 2: Producer-Consumer cost is 11 time units.
So we currently account for 11+1=12 time units of CPU time, with an additional 8 time units of passive waiting time. So we have 8/20 available CPU time units.
If we now want to double the number of threads, we need to accommodate an additional 12 time units (11 for producer-consumer stuff, and 1 for active locking time). That goes beyond the available CPU time units. Something has to give -- so waiting time will increase, and throughput will suffer.
So in case 2, the amount of real work reduces the amount of time available for new threads, thereby increasing the observed effect of lock contention on throughput. It would have been nice if they had also included figures for these imagined scenarios in the book; it would have made their hand-wavy argument easier to follow.
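This is not the book's test harness, just a rough Java sketch to make the arithmetic concrete; the workPerItem knob and burn() helper are invented for illustration. With workPerItem at 0 the threads do almost no work per item (plenty of slack); raising it makes the effect of the extra threads visible:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HandoffThroughput {
    public static void main(String[] args) throws Exception {
        int pairs = 4;                 // producer/consumer pairs
        long workPerItem = 0;          // ~0 matches the book's test; raise it to see slack vanish
        int itemsPerProducer = 100_000;
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(10);

        long start = System.nanoTime();
        Thread[] threads = new Thread[pairs * 2];
        for (int i = 0; i < pairs; i++) {
            threads[2 * i] = new Thread(() -> {
                try {
                    for (int n = 0; n < itemsPerProducer; n++) {
                        burn(workPerItem);       // "produce" the item
                        buffer.put(n);
                    }
                } catch (InterruptedException ignored) { }
            });
            threads[2 * i + 1] = new Thread(() -> {
                try {
                    for (int n = 0; n < itemsPerProducer; n++) {
                        burn(workPerItem);       // "consume" the item
                        buffer.take();
                    }
                } catch (InterruptedException ignored) { }
            });
        }
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();

        long items = (long) pairs * itemsPerProducer;
        System.out.println(items * 1_000_000_000L / (System.nanoTime() - start) + " items/s");
    }

    // Simulate CPU-bound work per produced or consumed item.
    private static void burn(long iterations) {
        double x = 0;
        for (long i = 0; i < iterations; i++) x += Math.sqrt(i);
    }
}
```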
I think CPU slack is the resource. According to Wikipedia, slack time refers to the amount of time left over after a job if the job were started now.
Plenty of CPU slack means plenty of spare computation resources. When the consumer/producer does something nontrivial, the CPU slack decreases and throughput is impacted.
I'm currently analyzing the pros and cons of writing a server using a threaded model versus an event-driven model. I already know many of the cons of the threaded model (it does not scale well due to context-switching overhead, virtual memory limitations, etc.), but I came upon another one in my analysis and would like to verify that my understanding of threads is correct.
If I have 5 threads - 1 which is doing work (not being blocked) and 4 which are blocked waiting for I/O (for example, waiting on data from a socket) - isn't the CPU time given to those 4 threads essentially wasted, since no work is actually being done (assuming no data arrives)? The timeslice given to those 4 blocked threads takes time away from the 1 thread actually doing work, correct?
In this case I'm explicitly saying that the socket is a blocking one.
No. Although it actually depends on the type of OS, the type of I/O (polled/DMA), and the device-driver architecture, most device I/O is performed using DMA plus interrupts. In such cases a thread is put into a sleep state until an interrupt is triggered for its I/O operation, and the scheduler does not visit those threads until their pending I/O is complete. Only polling I/O can consume CPU, such as PIO mode for hard disks.
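A small Java illustration of the same idea: a thread blocked on a queue is parked in the WAITING state and receives no timeslices until something wakes it:

```java
import java.util.concurrent.LinkedBlockingQueue;

public class BlockedThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>();

        // This thread blocks in take() until data arrives; while blocked it is
        // parked by the scheduler and consumes no CPU time.
        Thread blocked = new Thread(() -> {
            try {
                queue.take();
            } catch (InterruptedException ignored) { }
        });
        blocked.start();

        Thread.sleep(100);                       // give it time to block
        System.out.println(blocked.getState());  // prints WAITING, not RUNNABLE

        queue.put(1);                            // wake it up; now it can be scheduled again
        blocked.join();
    }
}
```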
Threads don't need to use their entire timeslice. I don't know the specifics, but if blocked threads even get time, they certainly don't use it all.
Obviously, these details vary platform-to-platform-to-environment-to-etc.