I am currently thinking of reasons why multi threaded applications may not scale well.
Two reasons I am aware of and that I have been fighting with are:
Communication between threads is not done well and slows down the speed
Number of cores on a chip and memory bandwith to the cpu do not increase proportionally. This leads to a slower memory bandwith per core the more cores on a chip are heavily used.
What else are problems?
For point 1), they are not necessarily 'not done well', but in most cases there are critical sections that processes/threads have to wait for each other, e.g. update some critical data. This is described well by Amdahl's law.
Another point I'd like to add is the scalability of the task itself. If the task (the input) is not scalable, then increasing processing power (cores/threads) cannot improve the whole throughput. For example, an application is to handle data flows, but there is a constraint that data packets from same flow can not be handled in parallel (due to ordering consideration), then the scalability will be limited by the number of flows.
In addition, the scalability of the algorithm is even more fundamental, considering the difference between O(1) and O(n) algorithms. Of course, maybe the topic here focus on scalability of processing power, rather than data size.
I think that, in (1), you've nailed one of most important factors that can negatively influence the performance of multithreaded apps. Esp. Google for 'false sharing'.
(2), however only affects a set of multithreaded apps - those that that run CPU-bound threads in parallel. If an app uses many threads that are I/O bound, (2) does not matter too much.
Looking at my box here, it has 100 processes and 1403 threads, CPU use 3%. Only 7 out of the 100 processes are single-threaded. Most of the apps, therefore, are multithreaded but I/O waiting.
My box would work reasonably well, at the moment, if it had only one core. Sure, hitting a link that winds up my browser would probably be a bit slower to bring up a complex page, but not much.
In the commonest case then, where apps are multithreaded to take avantage of the high I/O performance of preemptive multitaskers, apps scale very well indeed, even on a single-core CPU.
Try not to fall into the trap of thinking that preemptive multitasking OS are all about 'doing CPU-bound tasks in parallel' - they actually make this difficult by forcing the need for locking, synchro, signalling etc. It's much more about high-performance I/O, something that a cooperative scheduler is spectacularly bad at.
Many multi-threaded applications are built around the "one user one thread" concept which means that once a user or chore needs to be handled a thread is allocated to the task. Every extra thread increases the load on the scheduler leading up to the point where all processing is done trying to determine which thread should be run at this moment. Call this "scheduler saturation."
Windows (the multi-threaded engine, not 95/98/Me etc) has a mechanism called I/O Completion ports which recommend one thread per processor for best performance. IOCP-based applications are usually tremendously fast though, as always, the bottlenecks instead appear in other places such as running out of certain types of OS memory or waiting on the communications medium.
You can search for IOCP here at SO, it has its own tag.
I would add:
The more threads, the smaller their share of CPU cache. A typical modern CPU's might have 3 levels of cache: L1, L2 and L3. L1 might be private to that core, but L2 and L3 might be shared between cores on the die or something. So a single thread can use the entire L2 & L3, but if you have many threads then you get many more cache misses, depending on the profile of your algorithm.
See also:
many-core CPU's: Programming techniques to avoid disappointing scalability
It could be limited by the fixed maximum bandwidth of main memory, where your program has run out of the memory bandwidth, and however you make more thread can't create more available memory bandwidth. This is related to your specific application, whether is a memory bounded one or a compute bounded one, see roofline model.
Related
I am working with Rust but this question would also apply to many other situations.
Suppose you have M available vCPUs and N threads (including the main thread) to schedule, and that N > M. Each thread does approximately equal amounts of work.
Is there any good reason then to pin threads to specific cores? I've written a number of toy benchmarks and it seems like the answer is no, as I cannot make a program under these assumptions that performs better with thread affinity; in fact, it always does much worse.
if your application is working on a system with a lot of cores and heavily relies on the core cache, a context switch will be too expensive, so pinning tasks to cores reduces the context switches and improves throughput.
but in an "average pc" running plain RAM-bound tasks then your OS scheduler will be much better at load balancing the cores than you ever will.
pinning threads to cores is also useful if you care about latency instead of throughput, in a heavily loaded system if you have a time-critical task then you want it to have its own core which won't be contented by other tasks on the system, hence it makes sense to pin it to a certain core, an example will be an in-memory Database that needs to responds to request in under a millisecond latency.
so the answer is, it's only useful for certain apps.
This question has an answer that says:
Hyper-threading duplicates internal resources to reduce context switch
time. Resources can be: Registers, arithmetic unit, cache.
Why did CPU designers end up with duplication of state resources for simultaneous multithreading (or hyper-threading on Intel)?
Why wouldn't tripling (quadrupling, and so on) those same resources give us three logical cores and, therefore, even faster throughput?
Is duplication that researchers arrived at in some sense optimal, or is it just a reflection of current possibilities (transistor size, etc.)?
The answer you're quoting sounds wrong. Hyperthreading competitively shares the existing ALUs, cache, and physical register file.
Running two threads at once on the same core lets it find more parallelism to keep those execution units fed with work instead of sitting idle waiting for cache misses, latency, and branch mispredictions. (See Modern Microprocessors
A 90-Minute Guide! for very useful background, and a section on SMT. Also this answer for more about how modern superscalar / out-of-order CPUs find and exploit instruction-level parallelism to run more than 1 instruction per clock.)
Only a few things need to be physically replicated or partitioned to track the architectural state of two CPUs in one core, and it's mostly in the front-end (before the issue/rename stage). David Kanter's Haswell writeup shows how Sandybridge always partitioned the IDQ (decoded-uop queue that feeds the issue/rename stage), but IvyBridge and Haswell can use it as one big queue when only a single thread is active. He also describes how cache is competitively shared between threads. For example, a Haswell core has 168 physical integer registers, but the architectural state of each logical CPU only needs 16. (Out-of-order execution for each thread of course benefits from lots of registers, that's why register renaming onto a big physical register file is done in the first place.)
Some things are statically partitioned, like the ROB, to stop one thread from filling up the back-end with work dependent on a cache-miss load.
Modern Intel CPUs have so many execution units that you can only barely saturate them with carefully tuned code that doesn't have any stalls and runs 4 fused-domain uops per clock. This is very rare in practice, outside something like a matrix multiply in a hand-tuned BLAS library.
Most code benefits from HT because it can't saturate a full core on its own, so the existing resources of a single core can run two threads at faster than half speed each. (Usually significantly faster than half).
But when only a single thread is running, the full power of a big core is available for that thread. This is what you lose out on if you design a multicore CPU that has lots of small cores. If Intel CPUs didn't implement hyperthreading, they would probably not include quite so many execution units for a single thread. It helps for a few single-thread workloads, but helps a lot more with HT. So you could argue that it is a case of replicating ALUs because the design supports HT, but it's not essential.
Pentium 4 didn't really have enough execution resources to run two full threads without losing more than you gained. Part of this might be the trace cache, but it also didn't have nearly the amount of execution units. P4 with HT made it useful to use prefetch threads that do nothing but prefetch data from an array the main thread is looping over, as described/recommended in What Every Programmer Should Know About Memory (which is otherwise still useful and relevant). A prefetch thread has a small trace-cache footprint and fetches into the L1D cache used by the main thread. This is what happens when you implement HT without enough execution resources to really make it good.
HT doesn't help at all for code that achieves very high throughput with a single thread per physical core. For example, saturating the front-end bandwidth of 4 uops / clock cycle without ever stalling.
Or if your code only bottlenecks on a core's peak FMA throughput or something (keeping 10 FMAs in flight with 10 vector accumulators). It can even hurt for code that ends up slowing down a lot from extra cache misses caused by competing for space in the L1D and L2 caches with another thread. (And also the uop cache and L1I cache).
Saturating the FMAs and doing something with the results typically takes some instructions other than vfma... so high-throughput FP code is often close to saturating the front-end as well.
Agner Fog's microarch pdf says the same thing about very carefully tuned code not benefiting from HT, or even being hurt by it.
Paul Clayton's comments on the question also make some good points about SMT designs in general.
If you have different threads doing different things, SMT can still be helpful. e.g. high-throughput FP code sharing a core with a thread that does mostly integer work and stalls a lot on branch and cache misses could gain significant overall throughput. The low-throughput thread leaves most of the core unused most of the time, so running another thread that uses the other 80% of a core's front-end and back-end resources can be very good.
Objective:
I am trying to estimate how fast my code will execute when run concurrently in multiple threads.
Question 1)
If I know exactly how fast my code runs for a single request in one thread is their any way of estimating how fast it will run amongst multiple threads?
Question 2)
What impact, if any, does the presence of other threads effect the execution speed of each other thread?
My Situation:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time. Takes Half a second on 1 thread and I was worried how this will scale with multiple users performing the same query. Every user requests is handled by a separate thread so 100 simultaneous users will require 100 simultaneous threads. Each thread is sharing the same resource but read only. No writing. Is there any chance I could get each user to see roughly the same execution time?
Note: I know it will depend upon a number of factors but surely there must be some way of identifying whether or not your code will scale if you find it takes x amount of time for a single thread given x hardware. As final note I'd like to add I have limited experience with computer hardware architecture and how multi-threading works under the hood.
These are all interesting questions, but there is, unfortunately, no straightforward answer, because the answer will depend on a lot of different factors.
Most modern machines are multi-core: in an ideal situation, a four-thread process has the ability to scale up almost linearly in a four-core machine (i.e. run four times as fast).
Most programs, though, spend most of their time waiting for things: disk or database access, the memory bus, network I/O, user input, and other resources. Faster machines don't generally make these things appreciably faster.
The way that most modern operating systems, including Windows, Unix/Linux, and MacOS, use the processor is by scheduling processor time to processes and threads in a more-or-less round-robin manner: at any given time there may be threads that are waiting for processor time (this is a bit simplistic, as they all have some notions of process prioritization, so that high-criticality processes get pushed up the queue earlier than less important ones).
When a thread is using a processor core, it gets it all for as long as its time slice lasts: indeed, only one thing at a time is actually running on a single core. When the process uses up its time slice, or requests some resource that isn't immediately available, it its turn at the processor core is ended, and the next scheduled task will begin. This tends to make pretty optimal use of the processor resources.
So what are the factors that determine how well a process will scale up?
What portion of its run time does a single process spend waiting for
I/O and user input?
Do multiple threads hit the same resources, or different ones?
How much communication has to happen between threads? Between individual threads and your processes main thread? This takes synchronization, and introduces waiting.
How "tight" are the hotspots of the active thread? Can the body of it fit into the processor's memory, or does the (much slower) bus memory have to be accessed?
As a general rule, the more independent individual threads are of one another, the more linearly your application will scale. In real-world business applications, though, that is far from the case. The best way to increase the scaling ability of your process is to understand it--and its dependencies--well, and then use a profiler to find out where the most waiting occurs, and see if you can devise technical strategies to obviate them.
If I know exactly how fast my code runs for a single request in one thread is their any way of estimating how fast it will run amongst multiple threads?
No, you should determine it empirically.
What impact, if any, does the presence of other threads effect the execution speed of each other thread?
Computation-bound tasks will likely scale very well and be mostly independent of other threads. Interestingly enough, some CPU manufacturers implement features which can increase the clock of a lone-busy CPU core to compensate for the all the idle cores. This sort of feature might confound your measurements and expectations about scaling.
Cache/Memory/disk-bound tasks will start to contend with each other except for where resource partitions exist.
I know it will depend upon a number of factors
Absolutely! So I recommend that you prototype it and measure it. And then find out why it didn't scale as well as you'd hoped and try a different algorithm. Iterate.
but surely there must be some way of identifying whether or not your code will scale
Yes, but unfortunately it requires a detailed description of the algorithm implemented by the code. Your results will be heavily dependent on the ratio of your code's activity among these general regions, and your target's capability for these:
disk I/O
network I/O
memory I/O
computation
My Situation: My application runs in an app server that assigns one thread for every user request. If my application executes in 2 seconds for 1 user I can't assume it will be always take 2 seconds if say 100 users are simultaneously running the same operation correct?
If your app server computes pi to 100 digits for each user request, it will likely scale reasonably well until you encounter the core limit of your target.
If your app server does database queries for each user request, it will likely scale only as well as the target hardware can sustain the necessary load.
EDIT given specifics:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time.
Your problem sounds memory+cache-bound. You should study the details of your target CPU/mem deployment or if you are designing it, opt for high memory throughput.
A NUMA system ("resource partitioning" for memory) can likely maximize your overall concurrent memory throughput. Note that since your problem seems to dictate concurrent access to the same memory pages, a NUMA system would penalize the process doing remote memory accesses. In this case, consider creating multiple copies of the data at initialization time.
Depending on the pattern of traversal, TLB pressure might be a factor. Consider experimenting with huge (aka "large") pages.
Cache contention may be a factor in scaling as well.
Your specific algorithm could easily end up dominating over any of the specific system effects, depending on how far apart the best and worst cases are.
limited experience with computer hardware architecture and how multi-threading works under the hood.
Profile the query using CPU performance counters with a tool like Intel's VTune, perf, or oprofile. It can tell you where expensive operations are executing in your code. With this information you can optimize your query to perform well (individually and in aggregate).
If you're spawning multiple threads (or processes) concurrently, is it better to spawn as many as the number of physical processors or the number of logical processors, assuming the task is CPU-bound? Or is it better to do something in between (say, 3 threads)?
Does the performance depend on the kind of instructions that are getting executed (say, would non-local memory access be much different from cache hits)? If so, in which cases is it better to take advantage of hyperthreading?
Update:
The reason I'm asking is, I remember reading somewhere that if you have as many tasks as the number of virtual processors, tasks on the same physical core can sometimes starve some CPU resources and prevent each other from getting as many resources as needed, possibly decreasing performance. That's why I'm wondering if having as many threads as virtual cores is a good idea.
The performance depends on a huge variety of factors. Most tasks are not strictly CPU bound, since even if all of the data is in memory it is usually not on-board in the processor cache. I have seen examples (like this one) where memory access patterns can dramatically change the performance profile of a given 'parallel' process.
In short, there is no perfect number for all situations.
Chances are pretty good that you will see a performance improvement running 2 threads per core with HyperThreading enabled. Jobs that appear to be entirely CPU bound usually aren't, and HyperThreading can extract a few "extra" cycles out of the occasional interrupt or context switch.
On the other hand, with a core iX processor that has Turbo Boost, you might actually do better running 1 thread per core to encourage the CPU to overclock itself.
At work, we routinely run many-core servers at full CPU doing various kinds of calculation for days at a time. A while back we measured the performance difference with and without HT. We found that on average, with HyperThreading, and running twice as many jobs at once, we could complete the same amount of jobs about 10% faster than than without HyperThreading.
Assume that 2 × cores is a good place to start, but the bottom line is: measure!
I remember info that hyperthreading can give you up to 30% of performance boost. in general you'd better to treat them as 4 different cores. of course in some specific circumstances (e.g. having the same long running task bound to each core) you can divide your processing better taking into account that some cores are just logical ones
more info about hyperthreading itself here
Using Hyperthreading to run two threads on the same core, when both threads have similar memory access patterns but access disjoint data structures, would be very roughly equivalent to running them on two separate cores each with half the cache. If the memory-access patterns are such that half the cache would be sufficient to prevent thrashing, performance may be good. If the memory-access patterns are such that halving the cache induces thrashing, there may be a ten-fold performance hit (implying one would have been much better off without hyperthreading).
On the other hand, there are some situations where hyperthreading may be a huge win. If many threads will all be reading and writing the same shared data using lock-free data structures, and all threads must see a consistent view of the data, trying to run threads on disjoint processor may cause thrashing since only one processor at a time may have read-write access to any given cache line; running such a threads on two cores may take longer than running only one at a time. Such cache arbitration is not required, however, when a piece of data is accessed by multiple threads on a single core. In those cases, hyperthreading can be a huge win.
Unfortunately, I don't know any way to give the scheduler any "hints" to suggest that some threads should share a core when possible, while others should run separately when possible.
HT allows a boost of approximately 10-30% for mostly cpu-bound tasks that use the extra virtual cores. Although these tasks may seem CPU-bound, unless they are custom made assembly, they will usually suffer from IO waits between RAM and local cache. This allows one thread running on a physical HT-enabled core to work while the other thread is waiting for IO. This does come with a disadvantage though, as two threads share the same cache/bus, which will result in less resources each which may cause both threads to pause while waiting for IO.
In the last case, running a single thread will decrease the maximum simultaneous theoretical processing power(by 10-30%) in favor of running a single thread without the slowdown of cache thrashing which may be very significant in some applications.
Choosing which cores to use is just as important as choosing how many threads to run. If each thread is CPU-bound for roughly the same duration it is best to set the affinity such that threads using mostly different resources find themselves on different physical cores and threads using common resources be grouped to the same physical cores(different virtual core) so that common resources can be used from the same cache without extra IO wait.
Since each program has different CPU-usage characteristics and cache thrashing may or may not be a major slowdown(it usually is) it is impossible to determine what the ideal number of threads should be without profiling first. One last thing to note is that the OS/Kernel will also require some CPU and cache space. It is usually ideal to keep a single (physical)core set aside for the OS if real-time latency is required on CPU-bound threads so as to avoid sharing cache/cpu resources. If threads are often waiting for IO and cache thrashing is not an issue, or if running a real-time OS specifically designed for the application, you can skip this last step.
http://en.wikipedia.org/wiki/Thrashing_(computer_science)
http://en.wikipedia.org/wiki/Processor_affinity
All of the other answers already give lots of excellent info. But, one more point to consider is that the SIMD unit is shared between logical cores on the same die. So, if you are running threads with SSE code, do you run them on all 4 logical cores, or just spawn 2 threads (assuming you have two chips)? For this odd case, best to profile with your app.
5, 100, 1000?
I guess, "it depends", but on what?
What is common in applications that run as server daemons / services?
What are hard limits?
Given that the machine can handle the overall workload, how do I determine at how many threads the overhead starts to have an impact on performance?
What are important differences between OS's?
What else should be considered?
I'm asking because I would like to employ threads in an application to organize subcomponents of my application that do not share data and are designed to do their work in parallel. As the application would also use thread pools for parallelizing some tasks, I was wondering at what point I should start to think about the number of threads that's going to run in total.
I know the n+1 rule as a guideline for determining the number of threads that simultaneously work on the same task to gain performance. However, I want to use threads like one might use processes in a larger scope, i. e. to organize independent tasks that should not interfere with each other.
In this related question, some people advise to minimise the number of threads because of the added complexity. To me it seems that threads can also help to keep things sorted more orderly and actually reduce interference. Isn't that correct?
I can't answer your question about "how much is many" but I agree that you should not use threads for every task possible.
The optimal amount of threads for performance of application is (n+1), where n is the amount of processors/cores your computer/claster has.
The more your actual thread amount differs from n+1, the less optimal it gets and gets your system resources wasted on thread calculations.
So usually you use 1 thread for the UI, 1 thread for some generic tasks, and (n+1) threads for some huge-calculation tasks.
Actually Ajmastrean is a little out of date. Quoting from his own link
The thread pool has a default size of
250 worker threads per available
processor, and 1000 I/O completion
threads. The number of threads in the
thread pool can be changed by using
the SetMaxThreads method.
But generally I think 25 is really where the law of diminishing returns (and programmers abilities to keep track of what is going on) starts coming into effect. Although Max is right, as long as all of the threads are performing non-blocking calculations n+1 is the optimal number, in the real world most of the threading tasks I perform tend to be done on stuff with some kind of IO.
Also depends on your architecture. E.g. in NVIDIA GPGPU lib CUDA you can put on an 8 thread multiprocessor 512 threads simoultanously. You may ask why assign each of the scalar processors 64 threads? The answer is easy: If the computation is not compute bound but memory IO bound, you can hide the mem latencies by executing other threads. Similar applies to normal CPUs. I can remember that a recommendation for the parallel option for make "-j" is to use approx 1.5 times the number of cores you got. Many of the compiling tasks are heavy IO burden and if a task has to wait for harddisk, mem ... whatever, CPU could work on a different thread.
Next you have to consider, how expensive a task/thread switch is. E.g. it is comes free, while CPU has to perform some work for a context switch. So in general you have to estimate if the penalty for two task switches is longer than the time the thread would block (which depends heavily on your applications).
Microsoft's ThreadPool class limits you to 25 threads per processor. The limit is based on context switching between threads and the memory consumed by each thread. So, that's a good guideline if you're on the Windows platform.