Is the execution-time improvement from a multi-threaded application limited by the number of physical cores? - multithreading

I was doing some testing with multi-threading on a Linux virtual machine, and I implemented a benchmark with 10 threads (in this application each instruction is executed 10x as many times as in the single-thread scenario) while tweaking the number of "physical cores" in the VM settings. In the single-thread case I get 3 s on average, independently of the number of physical cores. If the number of cores is set to 1 and I run the multi-thread version, the execution time is 30 s; with 2 cores I get 15 s, and with 8 cores (the maximum I can set) I get 6 s. Do I see this dependency because I am executing each instruction 10x as many times, or is it always like this?

If you have N threads running on N cores, and if they are all doing pure computation (i.e., not waiting for any I/O devices), and if they are all completely independent of each other, then they should be able to do N times as much work in a given amount of time as a single thread can do in the same amount of time.
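As a minimal sketch of that ideal case (hypothetical code, not the questioner's benchmark), each thread below works on its own slice of a purely CPU-bound loop and touches no shared state until the final join, so the run time should shrink roughly in proportion to the number of cores:

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        // One worker per core; hardware_concurrency() may report 0 on some systems.
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        const std::uint64_t total = 1'000'000'000;
        std::vector<std::uint64_t> partial(n, 0);
        std::vector<std::thread> workers;

        for (unsigned i = 0; i < n; ++i) {
            workers.emplace_back([&partial, i, n, total] {
                std::uint64_t local = 0;
                // Each thread processes its own slice and shares nothing while it runs.
                for (std::uint64_t k = i; k < total; k += n)
                    local += k % 7;
                partial[i] = local;   // one write at the end, no locking needed
            });
        }
        for (auto& t : workers) t.join();

        std::uint64_t sum = 0;
        for (auto p : partial) sum += p;
        std::cout << "checksum: " << sum << '\n';
    }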
But, that's if they are completely independent. That's a hard thing to achieve. For example, if the threads can't each do all of their work in their own, independent cache (e.g., in L1 cache), then they will compete with each other for access to main memory. They will sometimes have to wait for one another, because only one core can access main memory at any given moment. So, if the threads need to use memory, then the speedup will be somewhat less than N times.
If the threads need to share data in main memory, then it gets worse because then they will need to use mutual exclusion locks. One thread may keep a lock locked while it executes dozens of instructions, and any other thread that wants the same lock will have to wait until it is finished.
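As a hedged illustration of that point (hypothetical code, illustrative names): every update below has to acquire the same std::mutex, so the eight threads spend much of their time queuing for the lock rather than computing:

    #include <cstdint>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        std::uint64_t shared_total = 0;
        std::mutex m;
        std::vector<std::thread> workers;

        for (int i = 0; i < 8; ++i) {
            workers.emplace_back([&] {
                for (int k = 0; k < 1'000'000; ++k) {
                    std::lock_guard<std::mutex> lock(m); // only one thread at a time gets past this line
                    shared_total += k;                   // the work done while holding the lock
                }
            });
        }
        for (auto& t : workers) t.join();
    }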
If the threads need to synchronize with each other/communicate with each other, then it gets worse still because unless their work loads are carefully balanced, a thread with less work to do may spend long periods of time awaiting signals from threads that have more work to do.
It's not unusual for a novice programmer to invent a multi-threaded version of some single-threaded algorithm, and find out that the multi-threaded version actually is slower than the single-threaded version.
There are some algorithms for which even an expert programmer can't get much speedup by throwing more threads at them.

Related

How is fairness of thread scheduling ensured across processes?

Every process has at least one thread of execution, and I read somewhere that modern operating systems schedule only threads, not processes.
So if there are two processes running in the system - P1 with 1 thread and P2 with 100 threads - how will the OS scheduling algorithm ensure that both P1 and P2 get approximately the same amount of CPU time? If the OS blindly schedules threads, P2 will get 100 times more CPU time than P1.
Does it also take into account which process a particular thread belongs to? Otherwise, it seems too easy for a process to hog all the CPU by creating more threads.
Wrong question. Consider two jobs that are trying to solve the exact same problem by doing the same work and are perfectly identical except for one thing -- one uses dozens of threads, the other uses dozens of processes. Why should the one that uses dozens of processes get more CPU time than the one that uses dozens of threads?
Your notion of fairness is not really a sensible one.
Instead, scheduling is designed around trying to get as much work done as possible per unit time. The assumption is that everything the computer is doing is useful, and that competing tasks benefit when the tasks competing with them finish as quickly as possible too.
This is actually all you need the vast majority of the time. But occasionally you have special situations where this doesn't work. One is ultra-high-priority tasks like keeping video or audio flowing or keeping a user interface responsive. Another is ultra-low-priority tasks where there's an enormous amount of work you want done and you don't want the system to be slow for a long time while you're working on it. Priorities are used for this, and generally the system allows higher-priority threads to interrupt lower-priority ones to keep responsiveness.
In general, "fair thread scheduling" attempts to give each thread an equal amount of CPU time (regardless of how much CPU time all threads in a process get); and "fair process scheduling" attempts to give each process the same amount of CPU time (e.g. by giving threads belonging to different processes unequal amounts of CPU time). These are mutually exclusive - you can't have both (unless each process has the same number of threads).
Note that it's all a broken joke anyway. For example, if one thread gets 10 ms of time on a CPU that is running slow due to thermal throttling (and/or because another logical CPU in the same core is busy) and another thread gets 10 ms of time on a CPU that is running faster than normal (e.g. due to "turbo-boost" and/or because the other logical CPU in the core is not being used); then these threads have received an equal amount of CPU time but have not received anything that could be considered "fair" (because one thread might be able to get 20 times as much work done than the other).
Note that it's all unwanted anyway. For example, in a good OS threads would be given a priority to indicate how important their work is, and you don't want a high-priority thread (doing very important work) to get the same "fair share" of CPU time as a low-priority thread (doing irrelevant/unimportant work). For cases where two threads have equal priority you might (in theory) want them to get an "equal" amount of CPU time; but in practice this isn't common, and threads block and unblock so often that it isn't worth caring about; in practice it can also lead to "two half-finished jobs instead of one completed job and one unstarted job" scenarios that increase the average amount of time a job (e.g. a request for work) takes to complete.
If the thread is the basic unit of scheduling (a generally safe assumption these days), then the process scheduler is the one that decides which threads get the CPUs. How (and whether) it takes per-process thread usage into account is entirely system specific, and the behavior may also depend on the type of process. For example, in VMS (and in the scheduler Windows adopted from it), real-time processes are treated differently than other types of processes.
In VMS-style scheduling, a process with more threads gets more CPU by design. It is better for an application to use more threads than to use more processes.
Keep in mind that a system may impose limits on the number of threads in a process.

Performance of multi-threading exceeding cores

If I have a process that starts X threads, will there ever be a performance gain from making X higher than the number of CPU cores (assuming all the threads work synchronously, without async calls to storage/network)?
E.g., if I have a two-core CPU, will I just slow down the application by starting 3 or more constantly working threads?
It really depends on what your code does; the question is too broad to answer in general.
Having more threads than cores might speed up the program, for example, if some of the threads sleep or block on a lock. In that case, the OS scheduler can wake a different thread, and that thread can work while the other thread is sleeping.
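As a hedged illustration (hypothetical code): each task below spends most of its time blocked - sleep_for stands in for an I/O wait - so running far more threads than cores lets those waits overlap instead of happening one after another:

    #include <chrono>
    #include <thread>
    #include <vector>

    void handle_request() {
        std::this_thread::sleep_for(std::chrono::milliseconds(200)); // stand-in for a blocking I/O call
        volatile int x = 0;
        for (int i = 0; i < 1'000'000; ++i) x += i;                  // a small amount of real CPU work
    }

    int main() {
        const int requests = 32;            // far more threads than cores
        std::vector<std::thread> workers;
        for (int i = 0; i < requests; ++i)
            workers.emplace_back(handle_request);
        for (auto& t : workers) t.join();   // finishes in roughly one wait period, not 32 of them
    }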
Having more threads than the number of cores may also increase the program's execution time, because the OS scheduler has to do more work to switch between the threads, and that context switching can be a heavy operation.
As always, benchmarking your application with different numbers of threads is the best way to achieve maximum performance. There are also algorithms (like hill climbing) which may help the application fine-tune the best number of threads at runtime.
It is possible that such a thing happens.
Both Intel and AMD currently implement forms of SMT in their CPUs. This means that, in general, one single thread of execution may not be able to exploit 100% of the computing resources.
This happens because modern CPUs execute instructions in multiple pipelined steps, so that the clock frequency can be increased (less stuff gets done in every cycle, so you can do more cycles). The downside of this approach is that, if you have two consecutive instructions A and B, with the latter depending on the result of the former, you may have to wait some clock cycles without doing anything, just waiting for instruction A to complete. So, they came up with SMT, which allows the CPU to interleave instructions from two different threads/processes on the same pipeline, in order to fill such gaps.
Note: it is not exactly like this, CPUs don't just wait. They try to guess the result of the first operation and execute the second assuming that result. If their guess is wrong, they cancel the pending instructions and start over. Also, they have some feedback circuits that allow tighter execution of interdependent instructions. And nowadays branch predictors are surprisingly good. Things get better for the pipeline if you can just fill gaps with instructions from some other process, rather than going with a guess, but this potentially halves the amount of cache each executing thread can use.
It makes sense to run more threads if your threads make read/write/send/recv syscalls or similar, or sleep on locks, etc.
If your threads are pure computation threads, adding more of them will slow down the system because of context switches.
If you still need more threads by design, you might want to look into cooperative multitasking. Both Windows and Linux have APIs for that, and they are faster than OS context switches. On Windows they are called fibers:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682661(v=vs.85).aspx
In Linux it is a set of functions make/get/swapcontext():
http://man7.org/linux/man-pages/man3/makecontext.3.html
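A minimal sketch of the Linux API linked above (error handling omitted; on Windows the equivalent calls would be CreateFiber/SwitchToFiber):

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, task_ctx;
    static char task_stack[64 * 1024];

    static void task(void) {
        puts("task: first slice");
        swapcontext(&task_ctx, &main_ctx);   // cooperatively yield back to main
        puts("task: second slice");
    }                                        // returning resumes uc_link (main_ctx)

    int main(void) {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof task_stack;
        task_ctx.uc_link = &main_ctx;        // where control goes when task() returns
        makecontext(&task_ctx, task, 0);

        puts("main: switching to task");
        swapcontext(&main_ctx, &task_ctx);
        puts("main: task yielded, switching back");
        swapcontext(&main_ctx, &task_ctx);
        puts("main: done");
        return 0;
    }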
This question: Optimal number of threads per core might help you.
In that thread I wrote an answer describing a scenario in which having more threads than the available number of cores boosts performance.

fork vs thread on one single core

Imagine that I have two tasks, each of them needs 2 seconds to finish its job.
In this case, if I create one thread for each of them and my PC is single-core, this won't save any time. Am I right?
What if I use fork to create two processes (the machine is still single-core) and each process takes charge of one task ? Can this save any time ?
If not, I have a question:
On a current, modern machine (including multi-core ones), if I have several heavy tasks, which method should I use?
fork ?
thread ?
fork + thread, meaning that I create several processes, each containing more than one thread?
Even with a single core, having two threads may speed up execution. If your routine is purely CPU-bound then two threads won't improve anything - indeed, performance will be worse because of context-switching overhead. But if the routine has to wait for memory, disk or network (which is usually the case), then two threads will provide performance gains even with a single core.
About fork vs. threads: threads require fewer resources, so, in principle, they should be the first choice. But there are two caveats: 1) maybe you want to be able to terminate a parallel routine, which is much safer to do with processes than with threads, and 2) some languages (notably Python and Ruby) provide pseudo-thread libraries which do not use real threads but switch between routines on the same thread. This simulated threading can be very useful, for example while waiting for network requests, but it must be taken into account that it is not real multithreading.
Amendment: As commented by Sergio Tulentsev, Ruby and Python do indeed provide real threads and not only coroutines.
"job takes 2 seconds" - If those 2 seconds are fully occupying the CPU (100% load), you won't gain anything with either thread nor fork if you have no cores to share. The single-core CPU is simply busy and you cannnot make it more busy.
In case this 2 seconds include waiting time (for example on I/O, storage, whatever) you could gain something, even with a single core. The amount of gain depends on the CPU working vs. CPU waiting ratio and the overhead of your multiprocessing. Most non-trivial programs have at least some amount of "CPU waiting", so multithreading is often useful even on single-core CPUs.
This overhead for setting up a coroutine and context switching can be considerable and needs to be measured. Obviously, the shorter the run time of your actiual task is, the larger will be the ratio of overhead (for setting up a thread or process, etc.) and the smaller will be you multi-processing gain.
Traditionally, threads used to have considerably less overhead than processes (after all, that was why they were invented), but the "considerably" has maybe vanished over time - On modern Linux systems, processes are only a tad slower to set up than threads (actually, both use the same system calls). You rather decide between thread or process based on the requirements related to amount of protection (or sharing) of data than execution speed.
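For completeness, a hedged sketch of the fork() variant from the question (POSIX-only; the task body is a stand-in). The thread variant would simply replace the fork/wait pair with two std::thread objects and a join():

    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    void two_second_task(int id) {
        // Stand-in for the questioner's task: sleep(2) models a job that mostly
        // waits; a busy loop would model a CPU-bound one.
        sleep(2);
        std::printf("task %d done in pid %d\n", id, getpid());
    }

    int main() {
        for (int id = 0; id < 2; ++id) {
            pid_t pid = fork();
            if (pid == 0) {            // child: run one task, then exit
                two_second_task(id);
                _exit(0);
            }
        }
        while (wait(nullptr) > 0) {}   // parent: reap both children
        return 0;
    }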

CPU usage vs Number of threads

In general, what is the relation between CPU usage and the number of threads in a program?
Assumptions:
Multi-core CPU
Threads do the exact same job (assume they fetch identical work items from a queue and process them)
It depends on the nature of the application.
An application that mostly does calculations - a ratio of 1 thread per core is a reasonable choice, since you don't want to spawn too many threads due to overhead, but you do want to take advantage of all your cores.
An application that mostly does IO operations (like HTTP requests) can spawn many more threads than the number of cores and still increase efficiency, since the bottleneck is the waiting time per IO request and you want to overlap as much work as possible with each wait.
That said, the CPU usage you are going to get still depends on many factors (IO, synchronization, non-parallel parts of your program).
If you are interested in how fast the application will run, always remember Amdahl's law, which gives a strict bound on the speed-up your application can achieve, even with an infinite number of cores.
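For reference, with a parallelizable fraction p of the work and n cores, Amdahl's law gives the bound:

    S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}

So with p = 0.9, for example, the speed-up can never exceed 10, no matter how many cores you add.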
There is no such general relationship, except for the obvious ones:
an application can't use more CPU time (CPU seconds) than the number of available cores multiplied by the number of (wall clock) seconds that it runs, and
a single thread can't use more than one CPU second per second.
The actual amount of CPU that a multi-threaded application uses depends mostly on the nature of the application and the way that you've implemented it:
If the computation performed by each thread does not generate contention with other threads for locks, memory access and so on, then you should be able to approach the theoretical limit of available CPU resources.
Contention is liable to reduce effective CPU usage, sometimes dramatically.
But there are no general formulae that will tell you how much speed-up you can get.
I think there is no relation, or at least not an easy one. It depends on the jobs the threads are doing. A program with one thread can consume 100% of the CPU, and a program with lots of threads can consume less.
If you are looking for an optimal relation between threads and work done, you must study your particular case, and possibly find an empirical solution.
As the other answers already state, "it depends". In an ideal world, for n cores, you would get a throughput of factor n, given that you do the same job in a separate thread on each core (which already contains a false assumption, since you need to somehow synchronize the threads when they read from the same queue).
Understanding the Disruptor, a Beginner's Guide to Hardcore Concurrency gives some nice examples of what you need to consider when parallelizing tasks, and also shows some cases where the attempt to parallelize leads to a longer execution time.

Considerate, dynamic CPU load management

I am writing a CPU-intensive image processing library. To make the best use of the available CPU, I detect the total number of cores on my machine and have my library run with that number of threads. When my library allocates one thread for each core, it performs optimally, using 100% of the available processor time.
The above approach works fine when mine is the only CPU-heavy process running. If another CPU-intensive process is running, or even another instance of my own code, then the OS allocates us only a fraction of the available cores and my library then has too many threads running which is both inefficient and inconsiderate to other processes.
So I would like to find a way to determine the "fair share" number of threads to run given a specific load. For example, if two instances of my process are running on an 8-core machine, each would run with 4 threads. Each would need a way to adapt thread count dynamically according to fluctuations in machine load.
So, my question:
Is there any OS feature or third-party library which allows my process to adapt thread count dynamically to use its fair share of the CPU?
My focus is Windows, but I am interested in non-Windows solutions too.
Edit: to be clear, this is about optimization. I am trying to achieve peak efficiency by running the optimal number of threads appropriate to my fair share of the CPU.
In my eyes, the application shouldn't decide how many threads to spawn. This is information that the caller should provide. On Linux, the "-j" or "--jobs" parameter is widely used for this (default: 1).
What about also setting the priority of the processing tasks? If the caller knows the processing is mission-critical, they can increase the priority (accepting that this may block the whole system). Your processing library can never know how important the processing of a given image is.
If the caller doesn't care, then the default low priority is used, which shouldn't affect the rest of the system. If it does, you should look at what exactly is blocking the system (maybe writing image files to the HDD, or RAM usage causing swapping, ...). Once you have figured that out, you can optimize exactly that point.
If you start the processing with (cpu cores) * 2 threads at low to normal priority, your system should stay usable. No one would expect that to kill the system.
Just my 2 cents.
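A hedged sketch of that suggestion on Linux (process_images is a stand-in for the real work; on Windows, SetPriorityClass/SetThreadPriority would play the role of nice()):

    #include <unistd.h>   // nice()
    #include <algorithm>
    #include <thread>
    #include <vector>

    // Stand-in for the library's CPU-heavy work (illustrative only).
    void process_images() {
        volatile double x = 0;
        for (long i = 0; i < 100'000'000; ++i) x += i * 0.5;
    }

    int main() {
        nice(10);  // 0 is the default; larger values mean lower priority for this process

        // The suggestion above: (cpu cores) * 2 workers at low priority.
        unsigned workers = 2 * std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < workers; ++i)
            pool.emplace_back(process_images);
        for (auto& t : pool) t.join();
    }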
Actually it's not a problem of multithreading but a problem of running many programs simultaneously. This is hard on most PC operating systems because it conflicts with the idea of time-sharing.
Let's assume some workflow.
Suppose we have 8 cores and we create 8 threads to feed them; OK, that's easy. Next we choose to monitor core load to summarize how many tasks are running on each core; well, that needs some statistical assumptions - e.g. on Linux you can get 1/5/15-minute load averages - but it can be done. The statistics are clear, and now we have a picture of how many CPU-bound processes are running; say we see 3 other CPU-intensive processes.
Then we come to the point: we have to put 3 redundant threads to sleep, but which 3?
Usually we choose 3 threads arbitrarily, because the scheduler arranges the other 8 CPU-bound threads automatically. In some cases, we explicitly put threads on heavily loaded cores to sleep, assign other threads to certain lightly loaded cores, and let the scheduler do the rest. Most scheduling policies also try to "keep the CPU cache hot", which means they tend to avoid migrating threads between cores. We can reasonably expect our CPU-intensive threads to keep using the core caches, since the other processes are scheduled onto the 3 crowded cores. Everything looks good.
However, this can fail in tightly synchronized computation. In this scenario we need to run our 5 threads simultaneously - meaning the 5 threads have to get a CPU and run at almost the same time. I don't know of any scheduler on a PC that can do this for us. In most low-load cases things still work fine, because the cost of waiting for simultaneity is trivial. But when the load on a core is high and even 1 of our 5 threads is disturbed, we will occasionally find that we spend many cycles just waiting.
It may help to schedule your program as a real-time program, but it's not a perfect solution. Statistically, it leads to a wider time window for simultaneity, because the program gains more control over CPU priority. I have to say, it's not guaranteed.
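As a rough sketch of the load-average monitoring idea above (Linux-only; fair_share_threads is a hypothetical helper, not an OS facility), a process could periodically re-estimate its share of the cores and park or wake workers accordingly:

    #include <algorithm>
    #include <cstdlib>    // getloadavg() - Linux/BSD, not portable
    #include <iostream>
    #include <thread>

    // Hypothetical helper: estimate how many worker threads are "ours to use"
    // right now, given the 1-minute load average and our current thread count.
    unsigned fair_share_threads(unsigned current_threads) {
        const double cores = std::max(1u, std::thread::hardware_concurrency());
        double load[1] = {0.0};
        if (getloadavg(load, 1) != 1)
            return static_cast<unsigned>(cores);       // no load info: one thread per core
        // Treat the load average as the total number of runnable threads and take a
        // proportional slice of the cores. This is a rough heuristic: the load average
        // reacts slowly and also counts tasks that are not CPU-bound.
        const double total = std::max(load[0], static_cast<double>(current_threads));
        const double share = cores * current_threads / total;
        return static_cast<unsigned>(std::clamp(share, 1.0, cores));
    }

    int main() {
        // e.g. if we currently run 8 threads on an 8-core box with load ~16,
        // this suggests scaling down to about 4 threads.
        std::cout << "fair share right now: " << fair_share_threads(8) << " threads\n";
    }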
