Will a multi-threaded application actually be faster than a single-threaded application?

This is entirely theoretical; the question just came to mind and I wasn't entirely sure what the answer is:
Assume you have an application that performs 4 independent calculations. (Totally independent: it doesn't matter what order you do them in, and you don't need one to calculate another.)
Also assume those calculations are long (minutes) and CPU-bound (not waiting for any kind of IO)
1) Now, if you have a 1-processor computer, a single-threaded application will logically be faster than (or the same as) a multithreaded application. As the computer is not able to do more than one thing at a time with one processor, the multithreaded version would "waste" time on context switching and the like.
So far so good?
2) If you have a 4-processor computer, 4 threads will most likely be faster for this than a single thread. Right? Your computer can now do 4 operations at a time, so it's just logical to divide your application into 4 threads, and it should complete in the time the longest of the 4 calculations takes.
Still good so far?
3) And now the actual part I am confused about - why would I EVER have my application create more threads than the number of processors (well actually - cores) available? I have programmed and have seen applications that create tens and hundreds of threads, but actually - the perfect number is about 8 for an average computer?
P.S. I already read this: Threading vs single thread
but it didn't quite answer that.
Cheers

Why would I EVER have my application create more threads than the number of processors (well actually - cores) available?
One very good reason is if you have threads that wait on events. For example you might have a producer/consumer application in which the producer is reading from some data stream, and that data arrives in bursts: a few hundred (or thousand) records in a batch, followed by nothing for a while, and then another burst. Say you have a 4-core machine. You could have a single producer thread that reads the data and places it in a queue, and three consumer threads to process the queue.
Or, you could have a single producer thread and four consumer threads. Most of the time, the producer thread is idle, giving you four consumer threads to process items from the queue. But when items are available on the data stream, one of the consumer threads gets swapped out in favor of the producer.
That's a simplified example, but substantially similar to programs that I have in production.
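A minimal sketch of that layout, assuming Python's standard `queue` and `threading` modules. The names `producer`, `consumer`, `run`, and the squaring "work" are hypothetical stand-ins, not the production program described above. The producer sleeps between bursts, so it costs almost no CPU while idle:

```python
import queue
import threading
import time

STOP = object()  # sentinel telling a consumer to exit


def producer(q, bursts, pause_s, n_consumers):
    """Simulate a bursty data stream: enqueue a batch, then go idle."""
    for burst in bursts:
        for record in burst:
            q.put(record)
        time.sleep(pause_s)          # idle between bursts; uses no CPU
    for _ in range(n_consumers):     # one sentinel per consumer
        q.put(STOP)


def consumer(q, results, lock):
    while True:
        record = q.get()             # blocks while the queue is empty
        if record is STOP:
            break
        value = record * record      # stand-in for the real processing
        with lock:
            results.append(value)


def run(bursts, n_consumers=4, pause_s=0.01):
    q = queue.Queue()
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=consumer, args=(q, results, lock))
               for _ in range(n_consumers)]
    threads.append(threading.Thread(
        target=producer, args=(q, bursts, pause_s, n_consumers)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

With `n_consumers=4` on a 4-core machine this is five threads in total, one more than the number of cores, which is the arrangement the answer argues for.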
More generally, it doesn't make any sense to create more continuously-working (i.e. CPU bound) threads than you have processing units (CPU cores in general, although the existence of hyperthreading muddies the waters a bit). If you know that your threads won't be waiting on external events, then having n+1 threads when you only have n cores will end up wasting time with thread context switches. Note that this is strictly in the context of your program. If there are other applications and OS services running, your application's threads will get swapped out from time to time so that those other apps and services can get a timeslice. But one assumes that, if you're running a CPU-intensive program, you'll limit the other apps and services that are running at the same time.
Your best bet, of course, is to set up a test. On a 4-core machine, test your app with 1, 2, 3, 4, 5, ... threads. Time how long it takes to complete with different numbers of threads. I think you'll find that on a 4-core machine the sweet spot will be 3 or 4; most likely 4 unless there are other apps or OS services that take a lot of CPU.
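Such a test could be structured roughly as follows. This is only a sketch of the harness: `busy_work`, `time_with_threads`, and `sweep` are hypothetical names, and the workload is a placeholder. Note that in CPython the global interpreter lock serializes pure-Python bytecode, so this particular workload will probably not speed up with more threads; to see the effect the answer describes you would swap in native code that releases the lock, or a process pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def busy_work(n):
    """Hypothetical CPU-bound task: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_with_threads(n_threads, n_tasks=8, size=20_000):
    """Run n_tasks copies of busy_work on n_threads workers.

    Returns (elapsed seconds, list of results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(busy_work, [size] * n_tasks))
    return time.perf_counter() - start, results


def sweep(max_threads=5):
    """Time the same workload with 1..max_threads worker threads."""
    return {n: time_with_threads(n)[0] for n in range(1, max_threads + 1)}
```

Calling `sweep(5)` returns a dictionary mapping thread count to elapsed seconds, which you can compare across runs on an otherwise quiet machine.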

One reason I could come up with for more threads than cores would be if some threads need to interact with other parties: waiting for a response from a server, or querying something from a database. The thread can then sleep until an answer is provided, so other computations don't have to wait. In the 4-cores, 4-threads case, a thread would sit waiting for input, which possibly causes other code to have to wait too.

Adding threads to your application is not strictly about performance gains. Sometimes you want or need to perform more than one task at the same time because that is the most logical way to architect your program.
As an example, perhaps you are writing a game engine. If you take a multi-threaded approach, you may have one thread for physics, one thread for graphics, one thread for networking, one thread for user input, one thread for resource loading from disk, etc.
Also, James Baxter's point is very true as well. Sometimes threads are waiting on a resource and cannot execute further until they access said resource. With only the same number of threads as cores, one core would be going to waste.

I think you are assuming that all programs are CPU bound - remember some of your threads will be waiting for I/O (disk/network/user traffic).

Related

Thread synchronisation for very short tasks

I have a C++ application running on winapi. Portability is not an issue. All I want is maximum performance. I have a basic understanding of multithreading and synchronization issues, but limited experience with the multitude of options ranging from winapi over C++ threads to third party libraries.
In the performance critical core of my application I identified a loop, which could be parallelized. I managed to split the loop into 4 parts which do not depend on each other. I would like to delegate the job to 4 threads running in parallel. The main thread should wait until all 4 threads have done their job, before it continues.
Sounds very simple. However, currently the loop takes only about 10 microseconds when running on one thread. I'm afraid that synchronization methods which cause a switch to the kernel (events, mutexes, etc.) would produce more overhead than the parallelization could save. SRWLocks + condition variables claim to be very lightweight, but I didn't find a way to solve my synchronization with these tools.
Of course I could test all kinds of synchronization APIs, but I'm sure this has been done before.
So my question is: Is there a reasonable way to synchronize very short tasks and if so, what are the appropriate tools?
If you simply need to wait for threads to complete you would use WaitForMultipleObjects on the thread handles. The other direct option would be to use a synchronization barrier, a primitive that allows a group of threads to halt until all members of the group have reached the barrier, but that is generally for the case where there is more work for the spawned threads to perform after being released.
Your question of whether this would actually be of benefit in your particular case is one that can only be answered through implementation and timing. Note that if you are going to perform this testing, it should be done on a release build with optimizations enabled. It may well be that the amount of work to perform is short enough that the time involved in thread management dwarfs any benefit.
The update algorithm consists of two steps. Each of these steps can be applied to the knots in arbitrary order, but step 1 must be completed before step 2 can start. I can partition the whole net into four (or more) parts and delegate each part to a separate thread. My problem is: each thread has to pause after step 1 and wait until all threads have finished their job. Then each thread performs step 2, waits for completion of the other threads, and so on.
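The step 1 / step 2 requirement is the pattern a synchronization barrier is meant for. Below is a minimal sketch using Python's `threading.Barrier` rather than the WinAPI primitive; the function name `two_step_update` and the arithmetic in both steps are hypothetical placeholders for the real update, and the knots are assumed to form a ring so that every knot has a left neighbour.

```python
import threading


def two_step_update(values, n_threads=4):
    """Step 1 squares each knot; step 2 adds the left neighbour's step-1 result."""
    n = len(values)
    step1 = [0] * n
    step2 = [0] * n
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        mine = range(tid, n, n_threads)   # this thread's share of the knots
        for i in mine:
            step1[i] = values[i] ** 2
        barrier.wait()                    # nobody starts step 2 until all of step 1 is done
        for i in mine:
            # Reads a value that another thread wrote during step 1.
            step2[i] = step1[i] + step1[i - 1]

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return step2
```

Without the `barrier.wait()` call, a fast thread could read `step1[i - 1]` before its owner has written it. Whether a barrier is cheap enough for a 10-microsecond loop is exactly what has to be measured.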
You want to break the work into a large number of small chunks and have a fixed pool of threads take chunks of work. Do not make 8 threads on an 8-core machine and split the work into 8 chunks. That algorithm will work poorly if, for one reason or another, only 7 of those cores wind up doing work for you. Your algorithm will then need twice as long, because for the second half of the time only one core is working.
The easy way is to have an extra dispatch thread. Just keep a "work unit" count somewhere protected by a mutex. When a thread finishes a work unit, have it decrement the "work unit" count. When it hits zero, broadcast a condition variable. That will wake the dispatch thread which will then do whatever it takes to get the worker threads going again. It can start them by setting the "work unit" count to the right level and broadcasting another condition variable that the worker threads wait for.
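A minimal sketch of the "work unit" count, assuming Python's `threading.Condition` (which bundles the mutex and the condition variable). `WorkCounter` and `run_round` are hypothetical names, the doubling is placeholder work, and only a single round is shown; here the calling thread plays the role of the dispatch thread.

```python
import threading


class WorkCounter:
    """Count of outstanding work units, protected by one lock."""

    def __init__(self, units):
        self._remaining = units
        self._cond = threading.Condition()

    def unit_done(self):
        with self._cond:
            self._remaining -= 1
            if self._remaining == 0:
                self._cond.notify_all()   # wake the dispatcher

    def wait_all_done(self):
        with self._cond:
            # wait_for re-checks the predicate, so spurious wakeups are harmless
            self._cond.wait_for(lambda: self._remaining == 0)


def run_round(units, n_workers=4):
    counter = WorkCounter(len(units))
    done, lock = [], threading.Lock()

    def worker(my_units):
        for u in my_units:
            with lock:
                done.append(u * 2)        # stand-in for one work unit
            counter.unit_done()

    workers = [threading.Thread(target=worker, args=(units[i::n_workers],))
               for i in range(n_workers)]
    for w in workers:
        w.start()
    counter.wait_all_done()               # dispatcher sleeps until the count hits zero
    for w in workers:
        w.join()
    return sorted(done)
```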
You can also just keep a count of which node needs to be done next and the number of nodes currently doing work. That will require synchronization after each thread though (to figure out which node to do next) and it may make more sense to have each thread grab some number of nodes, iterate over them, and then synchronize to grab another few nodes.
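The "grab a few nodes at a time" variant might look like the sketch below. `process_in_chunks`, the chunk size of 3, and the `+ 1` placeholder work are all assumptions for illustration. Each thread takes the lock only once per chunk, not once per node.

```python
import threading


def process_in_chunks(nodes, n_workers=4, chunk=3):
    next_index = 0                 # index of the next node nobody has claimed
    lock = threading.Lock()
    out = [None] * len(nodes)

    def grab():
        """Claim the next chunk; returns an empty range when work runs out."""
        nonlocal next_index
        with lock:
            start = next_index
            next_index = min(start + chunk, len(nodes))
            return start, next_index

    def worker():
        while True:
            start, end = grab()
            if start == end:       # nothing left to claim
                return
            for i in range(start, end):
                out[i] = nodes[i] + 1   # stand-in for the per-node work

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return out
```

A larger `chunk` means less locking but a longer tail when the work runs out; a smaller one balances load better at the cost of more synchronization.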
Avoid breaking the work into large chunks early. That can lead to the problem where you have 8 cores but 2 large work units left at some point. Remember, many modern CPUs run their cores at different speeds based on temperature and power measurements.

How is fairness of thread scheduling ensured across processes?

Every process has at least one thread of execution, and I read somewhere that modern operating systems schedule only threads, not processes.
So if there are two processes running in the system - P1 with 1 thread and P2 with 100 threads, how will OS scheduling algorithm ensure that both P1 and P2 get approximately same amount of CPU time? If OS blindly schedules threads, P2 will get 100 times more CPU time than P1.
Does it also take into account which process a particular thread belongs to? Otherwise, it seems too easy for a process to hog all the CPU by creating more threads.
Wrong question. Consider two jobs that are trying to solve the exact same problem by doing the same work and are perfectly identical except for one thing -- one uses dozens of threads, the other uses dozens of processes. Why should the one that uses dozens of processes get more CPU time than the one that uses dozens of threads?
Your notion of fairness is not really a sensible one.
Instead, scheduling is more designed around trying to get as much work done as possible per unit time. The assumption is that everything the computer is doing is useful and it benefits competing tasks to have other tasks competing with them finish as quickly as possible too.
This is actually all you need the vast majority of the time. But occasionally you have special situations where this doesn't work. One is ultra-high-priority tasks like keeping video or audio flowing or keeping a user interface responsive. Another is ultra-low-priority tasks where there's an enormous amount of work you want done and you don't want the system to be slow for a long time while you're working on it. Priorities are used for this, and generally the system allows higher-priority threads to interrupt lower-priority ones to keep responsiveness.
In general, "fair thread scheduling" attempts to give each thread an equal amount of CPU time (regardless of how much CPU time all threads in a process get); and "fair process scheduling" attempts to give each process the same amount of CPU time (e.g. by giving threads belonging to different processes unequal amounts of CPU time). These are mutually exclusive - you can't have both (unless each process has the same number of threads).
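To make the difference concrete, here is a small sketch that computes each process's CPU share under the two policies. It assumes one CPU and that every thread is always ready to run; `process_shares` is a hypothetical helper, not any real scheduler's API.

```python
from fractions import Fraction


def process_shares(threads_per_process, policy):
    """Return each process's fraction of CPU time under an idealized policy.

    policy == "thread":  every thread gets an equal slice.
    policy == "process": every process gets an equal slice.
    """
    total_threads = sum(threads_per_process.values())
    n_procs = len(threads_per_process)
    if policy == "thread":
        return {p: Fraction(n, total_threads)
                for p, n in threads_per_process.items()}
    if policy == "process":
        return {p: Fraction(1, n_procs) for p in threads_per_process}
    raise ValueError(f"unknown policy: {policy!r}")
```

For the P1 (1 thread) and P2 (100 threads) case from the question, fair thread scheduling gives P1 a share of 1/101, while fair process scheduling gives it 1/2. The two coincide only when every process has the same number of threads.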
Note that it's all a broken joke anyway. For example, if one thread gets 10 ms of time on a CPU that is running slow due to thermal throttling (and/or because another logical CPU in the same core is busy) and another thread gets 10 ms of time on a CPU that is running faster than normal (e.g. due to "turbo-boost" and/or because the other logical CPU in the core is not being used), then these threads have received an equal amount of CPU time but have not received anything that could be considered "fair" (because one thread might be able to get 20 times as much work done as the other).
Note that it's all unwanted anyway. For example, in a good OS, threads would be given a priority to indicate how important the work they do is, and you don't want a high-priority thread (doing very important work) to get the same "fair share" of CPU time as a low-priority thread (doing irrelevant/unimportant work). For cases where two threads have equal priority you might (in theory) want them to get an "equal" amount of CPU time; but in practice this isn't common, and threads block and unblock so often that it isn't worth caring about; and in practice it can lead to "two half-finished jobs instead of one completed job and one unstarted job" scenarios that increase the average amount of time a job (e.g. a request for work) takes to complete.
If the thread is the basic unit of scheduling (a generally safe assumption these days), then the process scheduler is the one that decides how to allocate the CPUs. How (and whether) it takes thread usage into account is entirely system-specific. And the behavior may depend upon the type of process. For example, in VMS (and adopted in Windows), realtime processes are treated differently from other types of processes.
In VMS-type scheduling, a process with more threads gets more CPU by design. It is better for an application to use more threads than for it to use more processes.
Keep in mind that a system may impose limits on the number of threads in a process.

Why does Dropbox use so many threads?

My understanding of threads is that you can only have one thread per core, two with hyper threading, before you start losing efficiency.
This computer has eight cores and so should work best with 8/16 threads then, yet many applications use several times that, especially Dropbox.
It also uses 95 threads while idling on my laptop, which only has 4 cores.
Why is this the case? Does it have so many threads for programming convenience, have I misunderstood threading efficiency or is it something else entirely?
I took a peek at the Mac version of the client, and it seems to be written in Python and it uses several frameworks.
A bunch of threads seem to be used in some in-house actor system
They use nucleus for app analytics
There seems to be a p2p network
some networking threads (one per hyper-threaded core)
a global pool (one per physical core)
many threads for file monitoring and thumbnail generation
task schedulers
logging
metrics
db checkpointing
something called infinite configuration
etc.
Most are idle.
It looks like a hodgepodge of subsystems, each starting their own threads, but they don't seem too expensive in terms of memory or CPU.
My understanding of threads is that you can only have one thread per core, two with hyper threading, before you start losing efficiency.
Nope, this is not true. I'm not sure why you think that, but it's not true.
As just the most obvious way to show that it's false, suppose you had that number of threads and one of them accessed a page of memory that wasn't in RAM and had to be loaded from disk. If you don't have any other threads that can run, then one core is wasted for the entire time it takes to read that page of memory from disk.
It's hard to address the misconception directly without knowing what flawed chain of reasoning led to it. But the most common one is that if you have more threads ready-to-run than you can execute at once, then you have lots of context switches and context switches are expensive.
But that is obviously wrong. If all the threads are ready-to-run, then no context switches are necessary. A context switch is only necessary if a running thread stops being ready-to-run.
If all context switches are voluntary, then the implementation can select the optimum number of context switches. And that's precisely what it does.
Having large numbers of threads causes you to lose efficiency if, and only if, lots of threads do a small amount of work and then become no longer ready-to-run while other waiting threads are ready-to-run. That forces the implementation to do a context switch even where it is not optimal.
Some applications that use lots of threads do in fact do this. And that does result in poor performance. But Dropbox doesn't.

Limitation of max. threads one can create in Multi-threading

I have multiple threads being invoked by, say, several other processes at the same time. Generally, the rule of thumb for the maximum number of threads at which a processor still performs efficiently is no. of threads = no. of processors + 1 (not sure, though). All modern applications maintain a thread pool and keep reusing its threads.
How can we make sure that performance won't degrade because of this? When the count goes beyond the limit, threads keep context switching, and at any single point none of them will be executing the critical section of the code.
The number of threads depends more on the resources each thread uses.
If the thread processes data from disk or network, it depends on how long it has to wait on those resources. During the wait, another thread can do some work.
For pure number crunching, I would say one thread per processor/core.
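One common way to turn this into a starting number is the heuristic threads = cores x (1 + wait time / compute time). That formula is not from this answer; it is a widely quoted rule of thumb, shown here only as a sketch, and `suggested_threads` is a hypothetical helper. Treat the result as a first guess to be confirmed by measurement.

```python
def suggested_threads(cores, wait, compute):
    """Rule-of-thumb pool size: cores * (1 + wait / compute).

    wait and compute are the average time a task spends blocked and
    the average time it spends on the CPU, in the same unit.
    """
    if cores < 1 or compute <= 0 or wait < 0:
        raise ValueError("need cores >= 1, compute > 0, wait >= 0")
    return max(1, round(cores * (1 + wait / compute)))
```

With no waiting at all the formula reduces to one thread per core, matching the advice above for pure number crunching; a task that waits nine times as long as it computes would suggest about ten threads per core.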

Question about an app with multiple threads on a machine with few CPUs

Given a machine with 1 CPU and a lot of RAM. Besides other kinds of applications (web server etc.), there are 2 other server applications running on that machine doing the exact same kind of processing, although one uses 10 threads and the other uses 1 thread. Assume the processing logic for each request is 100% CPU-bound and typically takes no longer than 2 seconds to finish. The question is: whose throughput, in terms of transactions processed per minute, might be better? Why?
Note that the above is not a real environment, I just make up the data to make the question clear. My current thinking is that there should be no difference because the apps are 100% CPU-bound and therefore if the machine can handle 30 requests per minute for the 2nd app, it will also be able to handle 3 requests per minute for each of the 10 threads of the 1st app. But I'm glad to be proven wrong, given the fact that there are other applications running in the machine and one application might not be always given 100% CPU time.
There's always some overhead involved in task switching, so if the threads aren't blocking on anything, fewer threads is generally better. Also, if the threads aren't executing the same part of the code, you'll get some cache flushing each time you switch.
On the other hand, the difference might not be measurable.
Interesting question.
I wrote a sample program that does just this. It has a class that will go do some processor intensive work, then return. I specify the total number of threads I want to run, and the total number of times I want the work to run. The program will then equally divide the work between all the threads (if there's only one thread, it just gets it all) and start them all up.
I ran this on a single-proc VM since I couldn't find a real computer with only 1 processor in it anymore.
Run independently:
1 Thread 5000 Work Units - 50.4365sec
10 Threads 5000 Work Units - 49.7762sec
This seems to show that on a one-proc PC, with lots of threads that are doing processor-intensive work, Windows is smart enough not to rapidly switch them back and forth, and they take about the same amount of time.
Run together (or as close as I could get to pushing enter at the same time):
1 Thread 5000 Work Units - 99.5112sec
10 Threads 5000 Work Units - 56.8777sec
This is the meat of the question. When you run 10 threads + 1 thread, they all seem to be scheduled equally. The 10 threads each took 1/10th longer (because there was an 11th thread running) while the other thread took almost twice its time (really, it got 1/10th of its work done in the first 56sec, then did the other 9/10ths in the next 43sec...which is about right).
The result: the Windows scheduler is fair at the thread level, but not at the process level. If you make a lot of threads, you can leave the other processes that weren't smart enough to make lots of threads high and dry. Or just do it right and use a thread pool :-)
If you're interested in trying it for yourself, you can find my code:
http://teeks99.com/ThreadWorkTest.zip
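The "run together" numbers above are close to what an idealized equal-share model predicts. The sketch below is not the linked program; `finish_times` is a hypothetical helper that assumes one CPU split evenly among all unfinished threads, with no switching cost.

```python
def finish_times(work):
    """work[i] is the CPU seconds thread i needs.

    Returns the wall-clock finish time of each thread when one CPU is
    shared equally among all threads that have not yet finished.
    """
    order = sorted((w, i) for i, w in enumerate(work))
    finish = [0.0] * len(work)
    now = 0.0          # wall-clock time so far
    received = 0.0     # CPU seconds each still-running thread has received
    active = len(work)
    for w, i in order:
        # Each active thread advances at 1/active of real time.
        now += (w - received) * active
        received = w
        finish[i] = now
        active -= 1
    return finish
```

Taking roughly 50 s of work for the single-threaded run and 5 s for each of the 10 threads, `finish_times([50] + [5] * 10)` gives 55 s for the ten threads and 100 s for the lone thread, against the measured 56.9 s and 99.5 s.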
The scheduling overhead could make the app with 10 threads slower than the one with 1 thread. You won't know for sure unless you create a test.
For some background on multithreading see http://en.wikipedia.org/wiki/Thread_(computer_science)
This might very well depend on the operating system scheduler. For example, back in single-thread days the scheduler knew only about processes, and had measures like "niceness" to figure out how much to allocate.
In multithreaded code, there is probably a way in which one process that has 100 threads doesn't get 99% of the CPU time if there's another process that has a single thread. On the other hand, if you have only two processes and one of them is multithreaded I would suspect that the OS may give it more overall time. However, AFAIK nothing is really guaranteed.
Switching costs between threads in the same process may be cheaper than switching between processes (e.g., due to cache behavior).
One thing you must consider is wait time on the other end of the transaction. Having multiple threads will allow you to be waiting for a response on one while preparing the next transaction on the next. At least that's how I understand it. So I think a few threads will turn out better than one.
On the other hand, you must consider the overhead involved in dealing with multiple threads. The details of the application are an important part of the consideration here.
