What is meant by CPU slack? - multithreading

The following is an excerpt from the book Java Concurrency in Practice, Chapter 12.2 Testing for Performance where the author talks about throughput of a bounded buffer implementation.
Figure 12.1 shows some sample results on a 4-way machine, using buffer
capacities of 1, 10, 100, and 1000. We see immediately that a buffer
size of one causes very poor throughput; this is because each thread
can make only a tiny bit of progress before blocking and waiting for
another thread. Increasing buffer size to ten helps dramatically, but
increases past ten offer diminishing returns.
It may be somewhat puzzling at first that adding a lot more threads
degrades performance only slightly. The reason is hard to see from the
data, but easy to see on a CPU performance meter such as perfbar while
the test is running: even with many threads, not much computation is
going on, and most of it is spent blocking and unblocking threads. So
there is plenty of CPU slack for more threads to do the same thing
without hurting performance very much.
However, be careful about concluding from this data that you can
always add more threads to a producer-consumer program that uses a
bounded buffer. This test is fairly artificial in how it simulates the
application; the producers do almost no work to generate the item
placed on the queue, and the consumers do almost no work with the item
retrieved. If the worker threads in a real producer-consumer
application do some nontrivial work to produce and consume items (as
is generally the case), then this slack would disappear and the
effects of having too many threads could be very noticeable. The
primary purpose of this test is to measure what constraints the
producer-consumer handoff via the bounded buffer imposes on overall
throughput.
What does the author mean by CPU slack here? Why doesn't the throughput degrade more and more as more threads are added? I am not following the author's reasoning about the only slight degradation of performance when adding more and more threads, assuming that the bound on the buffer size is kept constant.
Edit: I can think of one reason: since in this case no real work is being done by the threads, the classic problems of increased traffic on the shared memory bus and cache misses due to context switching do not play a major role as more and more threads are added. The situation is going to change once the threads start doing more work. Is that what the author is trying to convey in the third paragraph?

There is no formal term such as CPU slack. The author simply means that the CPU is not fully utilised doing meaningful work, because most of the time is spent waiting to successfully acquire a mutually exclusive lock. The author is calling this unused capacity of the CPU "CPU slack".
NOTE: The associated code tests a multiple producer / multiple consumer scenario, with an equal number of producers and consumers.
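For illustration, a minimal throughput harness in the same spirit (this is not the book's actual test code; the class name, buffer sizes and item counts below are invented) might look like this, using a standard ArrayBlockingQueue as the bounded buffer:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CyclicBarrier;

    // Hypothetical sketch of a producer-consumer throughput test,
    // loosely modelled on the kind of test the book describes.
    public class BoundedBufferThroughput {
        public static void main(String[] args) throws Exception {
            int capacity = Integer.parseInt(args.length > 0 ? args[0] : "10");
            int pairs    = Integer.parseInt(args.length > 1 ? args[1] : "4");
            int itemsPerProducer = 100_000;

            BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
            // The barrier lets all threads start together and lets us time the whole run.
            CyclicBarrier barrier = new CyclicBarrier(2 * pairs + 1);

            for (int i = 0; i < pairs; i++) {
                new Thread(() -> {
                    try {
                        barrier.await();
                        for (int n = 0; n < itemsPerProducer; n++) {
                            buffer.put(n);          // producers do almost no "real" work
                        }
                        barrier.await();
                    } catch (Exception e) { throw new RuntimeException(e); }
                }).start();

                new Thread(() -> {
                    try {
                        barrier.await();
                        for (int n = 0; n < itemsPerProducer; n++) {
                            buffer.take();          // consumers do almost no "real" work
                        }
                        barrier.await();
                    } catch (Exception e) { throw new RuntimeException(e); }
                }).start();
            }

            barrier.await();                        // release all threads
            long start = System.nanoTime();
            barrier.await();                        // wait for all threads to finish
            long elapsed = System.nanoTime() - start;

            long items = (long) pairs * itemsPerProducer;
            System.out.printf("capacity=%d pairs=%d -> %.0f items/s%n",
                    capacity, pairs, items * 1e9 / elapsed);
        }
    }

Running it with different capacity and pair-count arguments should reproduce the general shape of the book's observation: capacity 1 is slow, capacity 10 is much better, and adding more pairs changes throughput surprisingly little.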
EDIT: In the later discussion they talk about the effect of adding more threads if a) the threads do almost no work, and b) the threads occupy the CPU substantially for every produced or consumed item. I will try to explain the difference with some slightly artificial scenarios.
Suppose that locking takes 1 time unit actively, and 8 time units passively by waiting. Passive waiting does not occupy the CPU.
Case 1: Producer-Consumer cost is 1 time unit.
So we currently account for 2 time units of CPU time, with an additional 8 time units of passive waiting time. So we have 8/10 available CPU time units.
If we now want to double the number of threads, we need to accommodate an additional 2 time units (1 for producer-consumer stuff, and 1 for active locking time). That would eat into our supply of available CPU time -- but we have enough.
Case 2: Producer-Consumer cost is 11 time units.
So we currently account for 11+1=12 time units of CPU time, with an additional 8 time units of passive waiting time. So we have 8/20 available CPU time units.
If we now want to double the number of threads, we need to accommodate an additional 12 time units (11 for producer-consumer stuff, and 1 for active locking time). That goes beyond the available CPU time units. Something has to give -- so waiting time will increase, and throughput will suffer.
So in case 2, the amount of real work reduces the amount of time available for new threads, thereby increasing the observed effect of lock contention on throughput. It would have been nice if they had also included figures for these imagined scenarios in the book; it would have made their hand-wavy argument easier to follow.
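To make the arithmetic concrete, here is a trivial model of the two cases (all numbers are the invented "time units" from above, not measurements):

    // Back-of-the-envelope model of the two scenarios above.
    public class SlackModel {
        public static void main(String[] args) {
            int lockCpu  = 1;   // active CPU time spent locking per thread pair
            int waitTime = 8;   // passive waiting time (does not occupy the CPU)

            for (int work : new int[] {1, 11}) {         // Case 1 and Case 2
                int usedCpu = lockCpu + work;            // CPU time per thread pair
                int total   = usedCpu + waitTime;        // wall-clock budget
                int slack   = total - usedCpu;           // idle CPU = waiting time
                boolean doublingFits = usedCpu <= slack; // can a second pair run in the slack?
                System.out.printf("work=%2d: used=%d/%d, slack=%d, doubling fits: %b%n",
                        work, usedCpu, total, slack, doublingFits);
            }
        }
    }

It prints that in case 1 the extra pair fits into the slack (2 <= 8), while in case 2 it does not (12 > 8), which is exactly the point about throughput suffering.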

I think CPU slack is the spare resource. According to Wikipedia, slack time refers to the amount of time left over after a job if the job were started now.
Plenty of CPU slack means plenty of spare computation capacity. When the consumers/producers do something nontrivial, the CPU slack decreases and throughput suffers.

Related

Context switch: what happens in a worst case scenario?

I want to understand how a certain worst case scenario of context switch happens. Say I have 10 CPU cores running a single process. Everything is CPU intensive, no thread is sleeping (waiting for I/O).
(I am mainly concerned with mainstream modern personal computer architectures and systems, typically x64 with Windows, Linux...)
Correct me if I'm wrong: running 10 CPU/RAM-intensive independent threads is most often a near-optimal situation. The amount of time spent on context switches is rather negligible. While the system may sometimes decide to re-attribute threads to different cores in a round-robin fashion, causing a reset of the CPU caches, this has a minor effect and works almost as if each thread were running on a single fixed core.
Only the main RAM bus may be a limitation since all threads share it, but it's not the point I'm interested in here. Reducing the number of threads will not increase the throughput anyway.
Now assume you still have 10 cores but run 1000 threads. The scheduler could theoretically decide to switch rarely (say every second) running 10 threads for a second, then 10 others... and the whole thing would still be close to optimal performance (throughput).
But that does not seem to be the case: it looks like threads are switched intensively, causing strongly suboptimal performance (throughput). Am I right about this? What is the main cause of this suboptimal performance? A few numbers would be nice if you have any idea of the orders of magnitude involved (for example): switches per second, performance loss caused by switching...
I'm going to answer my own question (after some search).
On windows, the number of context switches can be measured with performance counters: https://technet.microsoft.com/en-us/library/cc938606.aspx
I measured it on my machine (Core i7/Windows 10) and the order of magnitude is around 1000 per second per core when the number of running threads is greater than the number of cores (and these threads are fully CPU-bound).
The time needed for a context switch varies quite a bit depending on:
what registers need to be saved
if FPU registers need to be saved
the processor model (of course)
You can read: https://www.quora.com/How-long-does-a-context-switch-take or http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
A slightly pessimistic avg. order of magnitude seems to be 1000 ns. Thus the total time for all context switches on each core is 1ms per second, that is 0.1%.
This does not depend on the number of threads: whether you run 100 or 1000 threads, the number of switches does not change. In conclusion, the time spent on context switching is essentially negligible.
This reasoning is correct as long as the threads are pure CPU with only small memory read/write like a few local variables. I ran a test with full CPU threads and the difference between a few and 1000 threads is not noticeable.
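A test along those lines might look roughly like this (a hypothetical Java sketch, not the exact code I ran): the total amount of pure-CPU work stays fixed and only the number of threads carrying it out changes.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: fixed total amount of pure-CPU work,
    // split across a varying number of threads.
    public class ContextSwitchTest {
        static final long TOTAL_ITERATIONS = 2_000_000_000L;

        public static void main(String[] args) throws InterruptedException {
            for (int threads : new int[] {4, 10, 100, 1000}) {
                long perThread = TOTAL_ITERATIONS / threads;
                List<Thread> workers = new ArrayList<>();
                long start = System.nanoTime();
                for (int i = 0; i < threads; i++) {
                    Thread t = new Thread(() -> {
                        long x = 0;
                        for (long n = 0; n < perThread; n++) {
                            x += n * 31 + 7;                     // pure CPU, tiny working set
                        }
                        if (x == 42) System.out.println(x);      // defeat dead-code elimination
                    });
                    workers.add(t);
                    t.start();
                }
                for (Thread t : workers) t.join();
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("%4d threads: %d ms%n", threads, ms);
            }
        }
    }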
But the situation changes when RAM is involved and the switches make the CPU (memory) caches less efficient. A worst case is when:
computation can be split into 1000 independent "data" parts
each part of the data fits just into the memory cache (say L1 or L2) of a core
each part needs to be read many times
In this situation, running 10 threads to completion, then ten others, and so on, would take full advantage of the cache, while running 1000 threads at a time would cause the cache to be useful only during each thread's 1 ms time slice.
But if the data of several threads could fit into the cache, or if the threads read common data to some degree, or if each thread reads the data just once, then it is possible that running 1000 threads vs. running 10 threads a hundred times will have similar throughput.
It is more a matter of adapting parallelism to memory access, and it depends very much on the way memory needs to be accessed.
The time spent on context switching is negligible; the time lost because of poor cache usage may sometimes be a problem and sometimes not, depending on how the memory is accessed and shared.
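If you want to reproduce the cache-bound case, a sketch could look like this (sizes and task counts are invented for illustration): each task repeatedly scans its own roughly cache-sized chunk, and a pool of 10 threads working through 1000 tasks is compared against 1000 threads all running at once.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch: 1000 cache-sized tasks, run either by a small pool
    // (tasks run mostly to completion, cache stays warm) or by 1000 threads at once.
    public class CacheBatchingTest {
        static final int TASKS = 1000;
        static final int INTS_PER_TASK = 32 * 1024;   // ~128 KB per task, roughly L2-sized
        static final int PASSES = 200;                // each part is read many times

        static long scan(int[] data) {
            long sum = 0;
            for (int p = 0; p < PASSES; p++)
                for (int v : data) sum += v;
            return sum;
        }

        static long run(int poolSize) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(poolSize);
            long start = System.nanoTime();
            for (int i = 0; i < TASKS; i++) {
                final int seed = i;
                pool.submit(() -> {
                    int[] data = new int[INTS_PER_TASK];
                    for (int j = 0; j < data.length; j++) data[j] = j ^ seed;
                    return scan(data);
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) throws InterruptedException {
            System.out.println("pool of 10:   " + run(10)   + " ms");
            System.out.println("pool of 1000: " + run(1000) + " ms");
        }
    }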

How is processor speed distributed across threads?

Objective:
I am trying to estimate how fast my code will execute when run concurrently in multiple threads.
Question 1)
If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run amongst multiple threads?
Question 2)
What impact, if any, does the presence of other threads have on the execution speed of each other thread?
My Situation:
I traverse a graph in memory of worst-case size 1 million nodes. It's simply accessing 1 million memory addresses one at a time. It takes half a second on 1 thread, and I am worried about how this will scale with multiple users performing the same query. Every user request is handled by a separate thread, so 100 simultaneous users will require 100 simultaneous threads. Each thread shares the same resource, but read-only; no writing. Is there any chance I could get each user to see roughly the same execution time?
Note: I know it will depend upon a number of factors, but surely there must be some way of identifying whether or not your code will scale if you find it takes x amount of time for a single thread on given hardware. As a final note, I'd like to add that I have limited experience with computer hardware architecture and how multi-threading works under the hood.
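Edit: for concreteness, here is roughly the kind of measurement I have in mind (a hypothetical sketch; the "graph" is just an array of indices standing in for my real data):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch: N reader threads each traverse the same read-only
    // array of 1 million "nodes" and report their average wall-clock time.
    public class SharedReadScaling {
        static final int NODES = 1_000_000;
        static final int[] next = new int[NODES];     // shared, read-only after init

        public static void main(String[] args) throws InterruptedException {
            for (int i = 0; i < NODES; i++) next[i] = (i * 31 + 17) % NODES;  // fake graph

            for (int users : new int[] {1, 10, 100}) {
                CountDownLatch start = new CountDownLatch(1);
                CountDownLatch done  = new CountDownLatch(users);
                AtomicLong totalMs   = new AtomicLong();

                for (int u = 0; u < users; u++) {
                    new Thread(() -> {
                        try {
                            start.await();
                            long t0 = System.nanoTime();
                            int node = 0;
                            for (int step = 0; step < NODES; step++) node = next[node];
                            totalMs.addAndGet((System.nanoTime() - t0) / 1_000_000);
                            if (node == -1) System.out.println(node);  // keep the loop alive
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        } finally {
                            done.countDown();
                        }
                    }).start();
                }
                start.countDown();
                done.await();
                System.out.printf("%3d users: avg %d ms per request%n",
                        users, totalMs.get() / users);
            }
        }
    }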
These are all interesting questions, but there is, unfortunately, no straightforward answer, because the answer will depend on a lot of different factors.
Most modern machines are multi-core: in an ideal situation, a four-thread process has the ability to scale up almost linearly in a four-core machine (i.e. run four times as fast).
Most programs, though, spend most of their time waiting for things: disk or database access, the memory bus, network I/O, user input, and other resources. Faster machines don't generally make these things appreciably faster.
The way that most modern operating systems, including Windows, Unix/Linux, and MacOS, use the processor is by scheduling processor time to processes and threads in a more-or-less round-robin manner: at any given time there may be threads that are waiting for processor time (this is a bit simplistic, as they all have some notions of process prioritization, so that high-criticality processes get pushed up the queue earlier than less important ones).
When a thread is using a processor core, it gets it all for as long as its time slice lasts: indeed, only one thing at a time is actually running on a single core. When the process uses up its time slice, or requests some resource that isn't immediately available, its turn at the processor core ends and the next scheduled task begins. This tends to make pretty optimal use of the processor resources.
So what are the factors that determine how well a process will scale up?
What portion of its run time does a single process spend waiting for I/O and user input?
Do multiple threads hit the same resources, or different ones?
How much communication has to happen between threads? Between individual threads and your processes main thread? This takes synchronization, and introduces waiting.
How "tight" are the hotspots of the active thread? Can the body of it fit into the processor's memory, or does the (much slower) bus memory have to be accessed?
As a general rule, the more independent individual threads are of one another, the more linearly your application will scale. In real-world business applications, though, that is far from the case. The best way to increase the scaling ability of your process is to understand it--and its dependencies--well, and then use a profiler to find out where the most waiting occurs, and see if you can devise technical strategies to obviate them.
If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run amongst multiple threads?
No, you should determine it empirically.
What impact, if any, does the presence of other threads have on the execution speed of each other thread?
Computation-bound tasks will likely scale very well and be mostly independent of other threads. Interestingly enough, some CPU manufacturers implement features which can increase the clock of a lone busy CPU core to compensate for all the idle cores. This sort of feature might confound your measurements and expectations about scaling.
Cache/Memory/disk-bound tasks will start to contend with each other except for where resource partitions exist.
I know it will depend upon a number of factors
Absolutely! So I recommend that you prototype it and measure it. And then find out why it didn't scale as well as you'd hoped and try a different algorithm. Iterate.
but surely there must be some way of identifying whether or not your code will scale
Yes, but unfortunately it requires a detailed description of the algorithm implemented by the code. Your results will be heavily dependent on the ratio of your code's activity among these general regions, and your target's capability for these:
disk I/O
network I/O
memory I/O
computation
My Situation: My application runs in an app server that assigns one thread to every user request. If my application executes in 2 seconds for 1 user, I can't assume it will always take 2 seconds if, say, 100 users are simultaneously running the same operation, correct?
If your app server computes pi to 100 digits for each user request, it will likely scale reasonably well until you encounter the core limit of your target.
If your app server does database queries for each user request, it will likely scale only as well as the target hardware can sustain the necessary load.
EDIT given specifics:
I traverse a graph in memory of worst-case size 1 million nodes. It's simply accessing 1 million memory addresses one at a time.
Your problem sounds memory+cache-bound. You should study the details of your target CPU/mem deployment or if you are designing it, opt for high memory throughput.
A NUMA system ("resource partitioning" for memory) can likely maximize your overall concurrent memory throughput. Note that since your problem seems to dictate concurrent access to the same memory pages, a NUMA system would penalize the process doing remote memory accesses. In this case, consider creating multiple copies of the data at initialization time.
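Java does not expose NUMA placement directly, but the "multiple copies at initialization time" idea can be sketched with per-thread copies of the read-only data (a hypothetical illustration; whether it actually helps depends on the data size, the JVM's NUMA settings and the hardware):

    import java.util.Arrays;

    // Hypothetical sketch: give each worker thread its own copy of the
    // read-only data so that threads do not all pull from the same pages.
    public class PerThreadCopies {
        static final int[] SHARED = buildGraph(1_000_000);

        static int[] buildGraph(int n) {
            int[] g = new int[n];
            for (int i = 0; i < n; i++) g[i] = (i * 31 + 17) % n;
            return g;
        }

        // Each thread gets a private copy, created lazily the first time it asks.
        static final ThreadLocal<int[]> LOCAL_COPY =
                ThreadLocal.withInitial(() -> Arrays.copyOf(SHARED, SHARED.length));

        static int traverse() {
            int[] graph = LOCAL_COPY.get();   // this thread's own copy
            int node = 0;
            for (int step = 0; step < graph.length; step++) node = graph[node];
            return node;
        }

        public static void main(String[] args) throws InterruptedException {
            Thread[] workers = new Thread[8];
            for (int i = 0; i < workers.length; i++) {
                workers[i] = new Thread(() -> System.out.println(traverse()));
                workers[i].start();
            }
            for (Thread t : workers) t.join();
        }
    }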
Depending on the pattern of traversal, TLB pressure might be a factor. Consider experimenting with huge (aka "large") pages.
Cache contention may be a factor in scaling as well.
Your specific algorithm could easily end up dominating over any of the specific system effects, depending on how far apart the best and worst cases are.
limited experience with computer hardware architecture and how multi-threading works under the hood.
Profile the query using CPU performance counters with a tool like Intel's VTune, perf, or oprofile. It can tell you where expensive operations are executing in your code. With this information you can optimize your query to perform well (individually and in aggregate).

CPU usage vs Number of threads

In general, what is the relation between CPU usage and the number of threads in a program?
Assumptions:
Multi-core CPU
Threads do the exact same job (assume they fetch identical work items from a queue and process them)
It depends on the nature of the application.
An application that mostly does calculations - a ratio of 1 thread per core is a reasonable decision, since you don't want to spawn too many threads due to the overhead, and you want to take advantage of all your cores.
An application that mostly does IO operations (like HTTP requests) can spawn many more threads than the number of cores and still increase efficiency, since the bottleneck is the waiting time per IO request, and you want to get as much work done as possible during each wait.
That said, the CPU usage you are going to get still depends on many factors (IO, synchronization, non-parallel parts of your program).
If you are interested in how fast the application will run - always remember Amdahl's law, which gives you a strict bound on the speed-up your application can achieve, even with an infinite number of working cores.
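As a quick illustration of Amdahl's law (just the standard formula, not code from any particular framework): if a fraction p of the work can be parallelized, the best possible speed-up on n cores is 1 / ((1 - p) + p / n).

    // Standard Amdahl's law formula: speed-up on n cores when a
    // fraction p of the work can be parallelized.
    public class Amdahl {
        static double speedup(double p, int n) {
            return 1.0 / ((1.0 - p) + p / n);
        }

        public static void main(String[] args) {
            for (int n : new int[] {2, 4, 8, 16, 1_000_000}) {
                // Even with 95% parallel work, an effectively infinite number of
                // cores cannot do better than a factor of about 20.
                System.out.printf("p=0.95, n=%7d -> %.2fx%n", n, speedup(0.95, n));
            }
        }
    }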
There is no such general relationship, except for the obvious ones:
an application can't use more CPU time (CPU seconds) than the number of available cores multiplied by the number of (wall clock) seconds that it runs, and
a single thread can't use more than one CPU second per second.
The actual amount of CPU that a multi-threaded application uses depends mostly on the nature of the application, and the way that you've implemented it:
If the computation performed by each thread does not generate contention with other threads for locks, memory access and so on, then you should be able to approach the theoretical limit of available CPU resources.
Contention is liable to reduce effective CPU usage, sometimes dramatically.
But there are no general formulae that will tell you how much speed-up you can get.
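As a small illustration of the contention point (a rough sketch, not a rigorous benchmark): the same amount of increment work, once against a single shared counter and once against per-thread counters.

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch: identical work, with and without contention
    // on a single shared variable.
    public class ContentionDemo {
        static final int THREADS = Runtime.getRuntime().availableProcessors();
        static final long INCREMENTS = 20_000_000L;

        static long time(Runnable body) throws InterruptedException {
            Thread[] ts = new Thread[THREADS];
            long start = System.nanoTime();
            for (int i = 0; i < THREADS; i++) { ts[i] = new Thread(body); ts[i].start(); }
            for (Thread t : ts) t.join();
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) throws InterruptedException {
            AtomicLong shared = new AtomicLong();
            long contended = time(() -> {
                // all threads hammer the same cache line
                for (long i = 0; i < INCREMENTS; i++) shared.incrementAndGet();
            });

            long independent = time(() -> {
                AtomicLong local = new AtomicLong();          // one counter per thread
                for (long i = 0; i < INCREMENTS; i++) local.incrementAndGet();
                if (local.get() == -1) System.out.println(local);  // keep it from being optimized away
            });

            System.out.printf("shared counter: %d ms, independent counters: %d ms%n",
                    contended, independent);
        }
    }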
I think there is no relation, or at least not an easy one. It depends on the jobs the threads are doing. A program with one thread can consume 100% of the CPU, and a program with lots of threads can consume less.
If you are looking for an optimal relation between threads and work done, you must study your case, and possibly find an empirical solution.
As the other answers already state, "it depends". In an ideal world, for n cores, you would get a throughput of factor n, given that you do the same job in a separate thread on each core (which already contains a false assumption, since you need to somehow synchronize the threads when they read from the same queue).
Understanding the Disruptor, a Beginner's Guide to Hardcore Concurrency gives some nice examples of what you need to consider when parallelizing tasks, and also shows some cases where the attempt to parallelize leads to a longer execution time.

Selecting number of threads in a multiprocess multiprocessor environment

I went through a few questions such as POSIX Threads on a Multiprocessor System and Concurrency of posix threads in multiprocessor machine and Threads & Processes Vs MultiThreading & Multi-Core/MultiProcessor : How they are mapped?
Based on these and a few other Wiki articles, I believe that for a system doing three basic kinds of work, viz. Input, Processing and Output:
For CPU-bound processing, the number of CPU-intensive threads (number of applications * threads per application) should be approximately 1 to 1.5 times the number of processor cores.
The number of Input and Output threads must be sufficiently large to remove any bottlenecks. For example, for a communication system based on a query/query-ack and response/response-ack model, time must not be wasted in I/O waiting states.
If there is a large requirement for dynamic memory, it's better to go with a greater number of processes than threads (to avoid memory sync-ups).
Are these arguments fairly sound for determining the number of threads to have in our application? Do we need to look into any other parameters?
'1 to 1.5 times the number of cores' - this appears to be OS/language dependent. On Windows/C++, for example, with large numbers of CPU-intensive tasks, the optimum seems to be much more than twice the number of cores, with the performance spread very small. In such environments, it seems you may as well just allocate 64 threads to a pool and not bother with the number of cores.
'query/query-ack and response/response-ack model, time must not be wasted in I/O waiting states' - this is unavoidable with such protocols, given the high latency of most networks. The delay is enforced by the 'ping-pong' protocol, so there will inevitably be an I/O wait. Async I/O just moves this wait into the kernel - it's still there!
'large requirement for dynamic memory, its better to go with greater number of processes than threads' - not really. 'large requirement for dynamic memory' usually means that large data buffers are going to be moved about. Large buffers can only be efficiently moved around by reference. This is very easy and quick between threads because of the shared memory space. With processes, you are stuck with awkward and slow inter-process comms.
'Determining number of threads to have in our application' - well, this is difficult on several fronts. Given an unknown architecture, design, language and OS, the only advice I have is to make everything as flexible and configurable as you reasonably can. If you have a thread pool, make its size a run-time parameter you can tweak. If you have an object pool, try to design it so that you can change its depth. Have some default values that work on your test boxes and then, at installation or while running, make any specific changes and tweaks for a particular system.
The other advantage of flexible/configurable designs is that, at test time, you can tweak away and fix many of the incorrect decisions, assumptions and guesstimates made by architects, designers, developers and, most of all, customers.
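As a minimal sketch of the "make the pool size a run-time parameter" advice (the property name myapp.poolSize is invented for illustration):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Minimal sketch of a thread pool whose size is configurable at run time.
    public class ConfigurablePool {
        public static ExecutorService createWorkerPool() {
            int defaultSize = Runtime.getRuntime().availableProcessors();
            // e.g. java -Dmyapp.poolSize=64 ConfigurablePool
            int size = Integer.getInteger("myapp.poolSize", defaultSize);
            return Executors.newFixedThreadPool(size);
        }

        public static void main(String[] args) {
            ExecutorService pool = createWorkerPool();
            pool.submit(() -> System.out.println("running on " + Thread.currentThread().getName()));
            pool.shutdown();
        }
    }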

How many simultaneous threads in an application is a lot?

5, 100, 1000?
I guess, "it depends", but on what?
What is common in applications that run as server daemons / services?
What are hard limits?
Given that the machine can handle the overall workload, how do I determine at how many threads the overhead starts to have an impact on performance?
What are important differences between OS's?
What else should be considered?
I'm asking because I would like to employ threads in an application to organize subcomponents of my application that do not share data and are designed to do their work in parallel. As the application would also use thread pools for parallelizing some tasks, I was wondering at what point I should start to think about the number of threads that's going to run in total.
I know the n+1 rule as a guideline for determining the number of threads that simultaneously work on the same task to gain performance. However, I want to use threads like one might use processes in a larger scope, i. e. to organize independent tasks that should not interfere with each other.
In this related question, some people advise minimising the number of threads because of the added complexity. To me it seems that threads can also help to keep things more orderly and actually reduce interference. Isn't that correct?
I can't answer your question about "how much is many" but I agree that you should not use threads for every task possible.
The optimal number of threads for the performance of an application is (n+1), where n is the number of processors/cores your computer/cluster has.
The more your actual thread count differs from n+1, the less optimal it gets, and the more of your system's resources are wasted on managing the threads.
So usually you use 1 thread for the UI, 1 thread for some generic tasks, and (n+1) threads for some huge-calculation tasks.
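In code, that rule of thumb is usually just (a minimal sketch):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of the "n + 1 threads for CPU-heavy work" rule of thumb.
    public class NPlusOnePool {
        public static void main(String[] args) {
            int n = Runtime.getRuntime().availableProcessors();
            ExecutorService computePool = Executors.newFixedThreadPool(n + 1);
            System.out.println("compute pool size: " + (n + 1));
            computePool.shutdown();
        }
    }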
Actually Ajmastrean is a little out of date. Quoting from his own link
The thread pool has a default size of 250 worker threads per available processor, and 1000 I/O completion threads. The number of threads in the thread pool can be changed by using the SetMaxThreads method.
But generally I think 25 is really where the law of diminishing returns (and programmers' ability to keep track of what is going on) starts coming into effect. Although Max is right that, as long as all of the threads are performing non-blocking calculations, n+1 is the optimal number, in the real world most of the threading tasks I perform tend to be done on stuff with some kind of IO.
It also depends on your architecture. E.g. with NVIDIA's GPGPU library CUDA you can run 512 threads simultaneously on a multiprocessor that has 8 scalar processors. You may ask why assign 64 threads to each of the scalar processors? The answer is easy: if the computation is not compute-bound but memory-IO-bound, you can hide the memory latencies by executing other threads. Something similar applies to normal CPUs. I can remember that a recommendation for make's parallel option "-j" is to use approximately 1.5 times the number of cores you have. Many of the compilation tasks are heavy on I/O, and if a task has to wait for the hard disk, memory... whatever, the CPU can work on a different thread.
Next you have to consider how expensive a task/thread switch is. It does not come for free; the CPU has to perform some work for each context switch. So in general you have to estimate whether the penalty for two task switches is longer than the time the thread would block (which depends heavily on your application).
Microsoft's ThreadPool class limits you to 25 threads per processor. The limit is based on context switching between threads and the memory consumed by each thread. So, that's a good guideline if you're on the Windows platform.