Why is the optimal thread count of a program related to the number of cores when there are thousands of background threads?

I've been reading about multi-threaded programming and the optimal number of threads. I understand that it is very subjective, varies on a case-by-case basis, and that the real optimum can be found only through trial and error.
However, I've found so many posts saying that if the task is non-I/O-bound, then
Optimal: numberOf(threads) ~= numberOf(cores)
Please take a look at Optimal number of threads per core
Q) How can the above equation be valid if hundreds/thousands of background (OS/other stuff) threads are already fighting to get their turn?
Q) Doesn't having a few more threads increase the probability of being allotted a core?

The "optimal" only applies to threads that are executing full throttle. The 1000+ threads you can see in use in, say, the Windows Task Manager are threads that are not executing. They are waiting for a notification, blocking on a synchronization object's wait() call.
That includes I/O, but it can also be a timer, a driver event, a process-interop synchronization object, a UI thread waiting for a message, etcetera. The latter are much less visible since they are usually wrapped by a friendly API.
Writing a program that has as many threads as the machine has cores, all burning 100% core, is not actually that common. You'd have to solve the kind of problem that requires pure calculation. Real programs are typically bogged down by the need to read/write the data to perform an operation or are throttled by the rate at which data arrives.
Overscheduling the processor is not a good strategy if you have threads burning 100% core. They'll start to fight with each other, and the context-switching overhead means less work gets done. It is fine when they block: blocking automatically makes a core available to do something else.
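To make that rule concrete, here is a minimal sketch (assuming Python and its standard concurrent.futures module; it uses a process pool rather than a thread pool only because CPython's GIL keeps threads from executing bytecode on several cores at once, the sizing logic is the same either way):

```python
# Minimal sketch: size a worker pool for CPU-bound work to the core count.
# Processes are used instead of threads because CPython's GIL prevents
# threads from running Python bytecode on multiple cores in parallel;
# the sizing rule (workers ~= cores) is the same either way.
import math
import os
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n: int) -> float:
    # Pure calculation: no I/O, no blocking, burns a core at 100%.
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    cores = os.cpu_count() or 1                      # fall back to 1 if unknown
    with ProcessPoolExecutor(max_workers=cores) as pool:
        results = list(pool.map(cpu_bound, [2_000_000] * cores))
    print(f"{cores} workers on {cores} cores -> {len(results)} results")
```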

Related

How is fairness of thread scheduling ensured across processes?

Every process has at least one thread of execution, and I read somewhere that modern operating systems schedule only threads, not processes.
So if there are two processes running in the system, P1 with 1 thread and P2 with 100 threads, how will the OS scheduling algorithm ensure that both P1 and P2 get approximately the same amount of CPU time? If the OS blindly schedules threads, P2 will get 100 times more CPU time than P1.
Does it also take into account which process a particular thread belongs to? Otherwise, it seems too easy for a process to hog all the CPU by creating more threads.
Wrong question. Consider two jobs that are trying to solve the exact same problem by doing the same work and are perfectly identical except for one thing -- one uses dozens of threads, the other uses dozens of processes. Why should the one that uses dozens of processes get more CPU time than the one that uses dozens of threads?
Your notion of fairness is not really a sensible one.
Instead, scheduling is designed around trying to get as much work done as possible per unit time. The assumption is that everything the computer is doing is useful, and that it benefits each task to have the tasks competing with it finish as quickly as possible too.
This is actually all you need the vast majority of the time. But occasionally you have special situations where this doesn't work. One is ultra-high-priority tasks like keeping video or audio flowing or keeping a user interface responsive. Another is ultra-low-priority tasks where there's an enormous amount of work you want done and you don't want the system to be slow for a long time while you're working on it. Priorities are used for this, and generally the system allows higher-priority threads to interrupt lower-priority ones to keep responsiveness.
In general, "fair thread scheduling" attempts to give each thread an equal amount of CPU time (regardless of how much CPU time all threads in a process get); and "fair process scheduling" attempts to give each process the same amount of CPU time (e.g. by giving threads belonging to different processes unequal amounts of CPU time). These are mutually exclusive - you can't have both (unless each process has the same number of threads).
Note that it's all a broken joke anyway. For example, if one thread gets 10 ms of time on a CPU that is running slow due to thermal throttling (and/or because another logical CPU in the same core is busy) and another thread gets 10 ms of time on a CPU that is running faster than normal (e.g. due to "turbo-boost" and/or because the other logical CPU in the core is not being used), then these threads have received an equal amount of CPU time but have not received anything that could be considered "fair" (because one thread might be able to get 20 times as much work done as the other).
Note that it's all unwanted anyway. For example, in a good OS threads would be given a priority to indicate how important their work is, and you don't want a high-priority thread (doing very important work) to get the same "fair share" of CPU time as a low-priority thread (doing irrelevant/unimportant work). For cases where two threads have equal priority you might (in theory) want them to get an "equal" amount of CPU time; but in practice this isn't common, and threads block and unblock so often that it isn't worth caring about; and in practice it can lead to "two half-finished jobs instead of one completed job and one unstarted job" scenarios that increase the average amount of time a job (e.g. a request for work) takes to complete.
If the thread is the basic unit of scheduling (a generally safe assumption these days), then the process scheduler is the one that decides which threads to allocate the CPUs to. How (and whether) it takes thread usage into account is entirely system-specific. And the behavior may depend upon the type of process. For example, in VMS (and as adopted in Windoze), realtime processes are treated differently than other types of processes.
In VMS-type scheduling, a process with more threads gets more CPU by design. It is better for an application to use more threads than more processes.
Keep in mind that a system may impose limits on the number of threads in a process.

fork vs thread on one single core

Imagine that I have two tasks, each of them needs 2 seconds to finish its job.
In this case, if I create two threads, one for each task, and my PC is single-core, this won't save any time. Am I right?
What if I use fork to create two processes (the machine is still single-core) and each process takes charge of one task? Can this save any time?
If not, I have a question:
On a current modern machine (including multi-core), if I have several heavy tasks, which method should I use?
fork?
thread?
fork + thread, meaning create several processes, each containing more than one thread?
Even with a single core, having two threads may speed up execution. If your routine is purely CPU-bound then two threads won't improve anything; indeed, performance will be worse because of context-switching overhead. But if the routine has to wait for memory, disk or network (which is usually the case) then two threads will provide performance gains even with a single core.
About fork vs threads: threads require fewer resources, so, in principle, they should be the first choice. But there are two caveats: 1) maybe you want to be able to terminate a parallel routine, which is much safer to do with processes than with threads, and 2) some languages (notably Python and Ruby) provide pseudo-thread libraries which do not use real threads but switch between routines on the same thread. This simulated threading can be very useful, for example when waiting for network requests, but it must be taken into account that it's not real multithreading.
Amendment: As commented by Sergio Tulentsev, Ruby and Python do indeed provide real threads and not only coroutines.
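A quick way to see the single-core gain for waiting-heavy work is the sketch below (Python; time.sleep() stands in for real I/O, and a purely CPU-bound loop would show no such gain on one core):

```python
# Rough sketch: even on a single core, threads can overlap waiting time.
# time.sleep() stands in for any blocking I/O (disk, network, ...).
import threading
import time

def fake_io_task():
    time.sleep(1.0)              # "waiting" frees the core (and the GIL)

start = time.perf_counter()
for _ in range(4):
    fake_io_task()               # sequential: roughly 4 seconds
print("sequential:", round(time.perf_counter() - start, 2), "s")

start = time.perf_counter()
threads = [threading.Thread(target=fake_io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                     # overlapped waits: roughly 1 second
print("threaded:  ", round(time.perf_counter() - start, 2), "s")
```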
"job takes 2 seconds" - If those 2 seconds are fully occupying the CPU (100% load), you won't gain anything with either thread nor fork if you have no cores to share. The single-core CPU is simply busy and you cannnot make it more busy.
In case these 2 seconds include waiting time (for example on I/O, storage, whatever), you could gain something, even with a single core. The amount of gain depends on the ratio of CPU working vs. CPU waiting and on the overhead of your multiprocessing. Most non-trivial programs have at least some amount of "CPU waiting", so multithreading is often useful even on single-core CPUs.
This overhead for setting up a thread or process and for context switching can be considerable and needs to be measured. Obviously, the shorter the run time of your actual task, the larger the ratio of overhead (for setting up a thread or process, etc.) and the smaller your multiprocessing gain.
Traditionally, threads used to have considerably less overhead than processes (that was, after all, why they were invented), but the "considerably" has perhaps vanished over time - on modern Linux systems, processes are only a tad slower to set up than threads (actually, both use the same system calls). You should rather decide between threads and processes based on the requirements for protection (or sharing) of data than on execution speed.
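If you want to see the start-up cost difference on your own machine, a crude micro-benchmark along these lines works (Python; multiprocessing uses fork on Linux and spawn on Windows/macOS, so the gap varies a lot by platform):

```python
# Crude micro-benchmark: cost of starting and joining threads vs processes.
# Treat the output as an illustration only; it depends heavily on the OS.
import multiprocessing
import threading
import time

def noop():
    pass

def measure(factory, n=50):
    start = time.perf_counter()
    workers = [factory(target=noop) for _ in range(n)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", round(measure(threading.Thread), 4), "s")
    print("processes:", round(measure(multiprocessing.Process), 4), "s")
```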

CPU usage vs Number of threads

In general, what is the relation between CPU usage and the number of threads in a program?
Assumptions:
Multi-core CPU
Threads do the exact same job (assume they fetch identical work items from a queue and process them)
It depends on the nature of the application.
An application that mostly does calculations - a ratio of 1 thread per core is a reasonable decision, since you don't want to spawn too many threads due to overhead, and you want to take advantage of all your cores.
An application that mostly does IO operations (like HTTP requests) can spawn many more threads than the number of cores and still increase efficiency, since the bottleneck is the waiting time per IO request, and you want as much other work as possible done during each wait.
That said, the CPU usage you are going to get still depends on many factors (IO, synchronization, non-parallel parts in your program).
If you are interested in how fast the application will run - always remember Amdahl's law, which gives you a strict bound on the speed-up your application can achieve, even with an infinite number of working cores.
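As a small worked example of that bound (a sketch; p is the fraction of the program that can be parallelized, n the number of cores):

```python
# Amdahl's law: with a parallelizable fraction p, the best possible speed-up
# on n cores is 1 / ((1 - p) + p / n).
def amdahl_speedup(p: float, n: float) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for cores in (2, 4, 8, 16, float("inf")):
    print(cores, "cores ->", round(amdahl_speedup(0.9, cores), 2), "x")
# With 90% parallel work the speed-up is capped at 10x, no matter how many
# cores (or threads) you throw at it.
```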
There is no such general relationship, except for the obvious ones:
an application can't use more CPU time (CPU seconds) than the number of available cores multiplied by the number of (wall clock) seconds that it runs, and
a single thread can't use more than one CPU second per second.
The actual amount of CPU that a multi-threaded application uses depends mostly on the nature of the application and the way that you've implemented it:
If the computation performed by each thread does not generate contention with other threads for locks, memory access and so on, then you should be able to approach the theoretical limit of available CPU resources.
Contention is liable to reduce effective CPU usage, sometimes dramatically.
But there are no general formulae that will tell you how much speed-up you can get.
I think there is no relation, or at least not an easy one. It depends on the jobs the threads are doing. A program with one thread can consume 100% of the CPU and a program with lots of threads can consume less.
If you are looking for an optimized relation between threads and work done, you must study your case and possibly find an empirical solution.
As the other answers already state, "it depends". In an ideal world, for n cores, you would get a throughput of factor n, given that you do the same job in a separate thread on each core (which already contains a false assumption, since you need to somehow synchronize the threads when they read from the same queue).
Understanding the Disruptor, a Beginner's Guide to Hardcore Concurrency gives some nice examples of what you need to consider when parallelizing tasks, and also shows some cases where the attempt to parallelize leads to longer execution time.

Why would I have to use multiple threads for one processing task if I can turn up the priority of the program?

Earlier I asked about processing a data stream, and someone suggested putting the data in a queue and processing it on a different thread. If that was too slow, I should use multiple threads.
However, I'm using a system that has one core.
So my question is: why not up the prio of my app, so it gets more CPU time from the OS?
I'm writing a server based app and it will be the only big thing running on there.
What would be the pros and cons of putting the prio up? :)
If you have only one core, then the only way that multi-threading can help you is if chunks of that work depend on something other than the CPU, so one thread can get some work done while another is waiting for data from a disk or network connection.
If your application has a GUI, then it can benefit from multi-threading in that while it would be no quicker to do the processing (slower in fact, though probably negligibly so if the task is very long), it can still react to user input in the meantime.
If you have two or more cores, then you can also gain in CPU-bound operations though doing so varies from trivial to impossible depending on just what that operation is. This is irrelevant to your case, but worth considering generally if code you write could later be run on a multi-core system.
Upping the priority is probably a bad idea though, especially if you have only one core (one advantage of multi-core systems is that people who up priorities can't do as much damage).
All threads have priorities, which are a function of both their process's priority and their priority within that process. A low-priority thread in a high-priority process trumps a high-priority thread in a low-priority process.
The scheduler doles out CPU slices in a round-robin fashion to the highest priority threads that have work to do. If there are CPUs left over (which in your case means if there are zero threads at that priority that need to run), then it doles out slices to the next lowest priority, and so on.
Most of the time, most threads aren't doing much anyway, which you can see from the fact that most of the time CPU usage on most systems is below the 100% mark (hyper-threading skews this; the internal scheduling within the cores means a hyper-threaded system can be fully saturated and seem to be running at as little as 70%). Anyway, generally stuff gets done, and a thread that suddenly has lots to do will get it done at normal priority in pretty much the same time it would at a higher one.
However, while the benefit of higher priority to that busy thread is generally little or nothing, the detriment is great. Since it's the only thread that gets any CPU time, all other threads are stuck. All other processes therefore hang for a while. Eventually the scheduler notices that they've all been waiting for around 3 seconds, and fixes this by boosting them all to highest priority and giving them larger slices than normal. Now we have a burst of activity as threads that got no time are all suddenly highest-priority threads that all want CPU time. There's a spurt of every thread except the high-priority one running, and the system is kept from keeling over, though there are likely still a lot of applications showing "Not Responding" in their title bars. It's far from ideal, but it is an effective way to deal with a thread of higher-than-usual priority grabbing the core for so long.
The threads gradually drop down in priority, and eventually we're back to the situation where the single higher priority thread is the only one that can work.
For extra fun, if our high priority thread in any way depended upon services provided by the lower priority threads, it would have ended up being stuck waiting on them. Hopefully in a way that made it block and stopped itself from doing any damage, but probably not.
In all, thread priorities are to be approached with great caution, and process priorities even more so. They're only really valid if the thread will yield quickly and is either essential to the workings of other threads (e.g. some OS processes run at a higher priority, finaliser threads in .NET run higher than the rest of the process, etc.) or if sub-millisecond delays can mess things up (some intensive media work requires this).
If you have multiple cores/processors in your system, upping the priority of a single threaded program will not improve your performance by much, because the other cores would still be unused.
The only way to take advantage of multiple processing units is to write your program using multiple threads/processes.
Having said this, setting your multithreaded application to very high priority may lead to some performance improvement, but I have never seen it be significant, at least in my own tests.
Edit: I see now that you are using only one core. Basically your program will be able to run more often on the CPU than the rest of the processes that are of lower priority. This may bring you a marginal improvement, but not a dramatic one. Since we cannot know what other applications are running at the same time on your system, the golden rule here is to try it yourself with various priority levels and see what happens. It's the only valid way to see if things will be faster or not.
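If you do want to experiment with priorities, a minimal sketch on a POSIX system looks like this (os.nice() adjusts the niceness of the current process; higher niceness means lower priority, raising priority usually needs root, and on Windows you would need something like the third-party psutil module instead):

```python
# POSIX-only sketch: nudge the priority of the current process for testing.
# os.nice() adds to the niceness; positive values lower the priority,
# negative values (raising priority) usually require root privileges.
import os

try:
    new_niceness = os.nice(5)    # drop our priority a little
    print("niceness is now", new_niceness)
except (AttributeError, PermissionError) as exc:
    # AttributeError: os.nice() does not exist on Windows.
    print("could not change priority:", exc)
```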
It all depends on why the data processing is slow.
If the data processing is slow because it is a genuinely CPU-intensive operation, then splitting it out into multiple threads on a single-core system is not going to get you any benefit. In this case increasing the task priority would provide some benefit, assuming that there is (user) CPU time being used by other processes.
However, if the data processing operation is slow because of some non-CPU restriction (e.g. if it is I/O-bound, or relies on another process), then:
Increasing the task priority is going to have negligible impact. Task priority won't affect I/O times and if there is a dependency on another process on the system you may actually harm performance.
Splitting the data processing out into multiple threads can allow the CPU-intensive parts to continue processing while waiting for the non-CPU-intensive (e.g. I/O) parts to complete, as sketched below.
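A minimal sketch of that split, the queue-plus-worker pattern the question mentions (Python; process() is a hypothetical stand-in for the real work):

```python
# Sketch of the queue-plus-worker pattern: the main thread keeps producing
# (or receiving) data while a worker thread drains the queue. This pays off
# when processing involves waiting on I/O, even on a single core.
import queue
import threading

def process(item):
    # Hypothetical stand-in for the real data processing.
    print("processed", item)

work_queue = queue.Queue()

def worker():
    while True:
        item = work_queue.get()
        if item is None:         # sentinel: no more work coming
            break
        process(item)

t = threading.Thread(target=worker)
t.start()

for item in ("a", "b", "c"):     # stand-in for the incoming data stream
    work_queue.put(item)
work_queue.put(None)             # tell the worker to shut down
t.join()
```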
Increasing the priority of a single-threaded process just gives you more (or bigger) time slices on the one core the process is running on. The core can still only do one thing at a time.
If you spin off a thread to handle the data processing, it can run on a different processor core (assuming a multi-core system), and it and your main thread are actually executing at the same time. Much more efficient.
If you use only one thread your server app will only be able to service one request at a time, no matter what its priority. If you use multiple threads you could service many at the same time.
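In Python's standard library, for instance, that difference is literally one class name (a sketch; the address and port are arbitrary):

```python
# Sketch: a thread-per-connection echo server using the standard library.
# socketserver.TCPServer handles one client at a time; the Threading variant
# spawns a thread per connection, so a slow client does not block the rest.
import socketserver

class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        data = self.request.recv(1024)   # blocks until the client sends data
        self.request.sendall(data)       # echo it back

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("127.0.0.1", 9000), EchoHandler) as srv:
        srv.serve_forever()              # one handler thread per connection
```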

How many simultaneous threads in an application is a lot?

5, 100, 1000?
I guess, "it depends", but on what?
What is common in applications that run as server daemons / services?
What are hard limits?
Given that the machine can handle the overall workload, how do I determine the number of threads at which the overhead starts to have an impact on performance?
What are important differences between OS's?
What else should be considered?
I'm asking because I would like to employ threads in an application to organize subcomponents of my application that do not share data and are designed to do their work in parallel. As the application would also use thread pools for parallelizing some tasks, I was wondering at what point I should start to think about the number of threads that's going to run in total.
I know the n+1 rule as a guideline for determining the number of threads that simultaneously work on the same task to gain performance. However, I want to use threads like one might use processes in a larger scope, i.e. to organize independent tasks that should not interfere with each other.
In this related question, some people advise minimising the number of threads because of the added complexity. To me it seems that threads can also help to keep things organized more cleanly and actually reduce interference. Isn't that correct?
I can't answer your question about "how much is many" but I agree that you should not use threads for every task possible.
The optimal number of threads for application performance is (n+1), where n is the number of processors/cores your computer/cluster has.
The more your actual thread count differs from n+1, the less optimal it gets and the more system resources are wasted on managing threads.
So usually you use 1 thread for the UI, 1 thread for some generic tasks, and (n+1) threads for some huge-calculation tasks.
Actually Ajmastrean is a little out of date. Quoting from his own link:
"The thread pool has a default size of 250 worker threads per available processor, and 1000 I/O completion threads. The number of threads in the thread pool can be changed by using the SetMaxThreads method."
But generally I think 25 is really where the law of diminishing returns (and programmers' ability to keep track of what is going on) starts coming into effect. Although Max is right that, as long as all of the threads are performing non-blocking calculations, n+1 is the optimal number, in the real world most of the threading tasks I perform tend to involve some kind of IO.
It also depends on your architecture. E.g. with NVIDIA's GPGPU library CUDA you can run 512 threads simultaneously on a multiprocessor that has only 8 scalar processors. You may ask why assign 64 threads to each of the scalar processors? The answer is easy: if the computation is not compute-bound but memory-IO-bound, you can hide the memory latencies by executing other threads. Something similar applies to normal CPUs. I can remember that the recommendation for make's parallel option "-j" is to use approximately 1.5 times the number of cores you've got. Many of the compilation tasks are heavily IO-bound, and if a task has to wait for the hard disk, memory... whatever, the CPU can work on a different thread.
Next you have to consider how expensive a task/thread switch is. It does not come for free: the CPU has to perform some work for a context switch. So in general you have to estimate whether the penalty of two task switches is longer than the time the thread would block (which depends heavily on your application).
Microsoft's ThreadPool class limits you to 25 threads per processor. The limit is based on context switching between threads and the memory consumed by each thread. So, that's a good guideline if you're on the Windows platform.

Resources