I have the following workflow in Flink running on a GCP cluster of 3 machines with 4 cores each.
HDFS-Scan -> Filter -> Aggregate
I initially set the parallelism of these operators to 12, so that each operator has 12 subtasks (I disabled chaining). I am trying to study the effect of the number of subtasks on resource usage. Here is what I did:
Run 1: I made the filter logic expensive so that it induces backpressure. The total execution time was 212 seconds.
Run 2: I kept the filter operator expensive. Since the scan operator was being backpressured anyway, I reduced its parallelism all the way down to 4. Less parallelism meant that scan produced data more slowly, but Filter was still the bottleneck. The execution time was still around 212 seconds.
Run 3: I kept the filter operator expensive and reduced the scan operator's parallelism to 2. At this point, scan became the bottleneck and the execution time increased.
My question is about Run 2. I was expecting that reducing the parallelism of scan would have an effect on the CPU usage of the VMs. I expected one of two outcomes: 1) CPU usage goes down, because more CPU capacity is left idle, or 2) the Filter subtasks take up the CPU freed by scan's subtasks, in which case CPU usage wouldn't go down but the execution time should. Neither of these happened.
Can someone help me understand this? Is there some other way to reason about what is happening?
I would expect the aggregate CPU effort expended across the collection of HDFS-Scan subtasks to be roughly the same in both Run 1 and Run 2. Whether there are 4 or 12 subtasks for this HDFS-Scan operator doesn't make much difference, since they are spending most of their time blocked, doing nothing while waiting for Flink's credit-based flow control to allocate buffers for them to work with.
Just to make up some numbers, perhaps with 12 instances they are each blocked 75% of the time, while with 4 they are each blocked 25% of the time. While there is somewhat more overhead to have 12 compared to 4, the overall performance is probably dominated by ser/de plus whatever the filter is doing.
A (sub)task can be in one of three states:
idle, meaning it has nothing to do
backpressured, meaning it cannot do anything because it has no available output buffers
busy, meaning it is actively processing events
A task (an instance of an operator chain) corresponds to a JVM thread. All of the tasks in all of the slots in a single task manager are competing with each other for the resources (CPU, memory, etc) available to that task manager's JVM.
While idle or backpressured, a task isn't consuming any (significant amount of) CPU time. Because there is a small, fixed amount of buffering in any Flink pipeline, any backpressure quickly propagates upstream and ends up throttling the sources.
So in your case, whether there are 12 source tasks that are mostly all doing nothing while backpressured, or 4 that are kind of busy, collectively those source tasks are producing the same volume of events (however many the downstream bottleneck can handle) and expending (approximately) the same amount of CPU in aggregate to get that done.
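For concreteness, here is a minimal sketch of how such a pipeline might be wired up with Flink's DataStream API, with chaining disabled and per-operator parallelism set explicitly. The HDFS path, the filter logic, and the aggregation key are placeholders, not the setup from the question:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Minimal sketch, not the asker's job: HDFS-Scan -> Filter -> Aggregate with
// explicit per-operator parallelism and operator chaining disabled.
public class ScanFilterAggregate {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.disableOperatorChaining();                       // each operator gets its own task

        env.readTextFile("hdfs:///path/to/input")            // HDFS-Scan
           .setParallelism(4)                                // Run 2: scan reduced to 4 subtasks
           .filter(line -> expensiveCheck(line))             // the artificially expensive filter
           .setParallelism(12)
           .map(new MapFunction<String, Tuple2<String, Integer>>() {  // prepare for aggregation
               @Override
               public Tuple2<String, Integer> map(String line) {
                   return Tuple2.of(line.isEmpty() ? "" : line.substring(0, 1), 1); // made-up key
               }
           })
           .setParallelism(12)
           .keyBy(value -> value.f0)
           .sum(1)                                           // Aggregate
           .setParallelism(12)
           .print();

        env.execute("scan-filter-aggregate");
    }

    // Placeholder standing in for whatever makes the filter expensive.
    private static boolean expensiveCheck(String line) {
        return line.hashCode() % 2 == 0;
    }
}
```

Whether the source runs with 12 or 4 subtasks, credit-based flow control throttles it to whatever rate the 12 filter subtasks can absorb, which is why the wall-clock time and aggregate CPU stay roughly the same.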
Related
I want to understand how a certain worst case scenario of context switch happens. Say I have 10 CPU cores running a single process. Everything is CPU intensive, no thread is sleeping (waiting for I/O).
(I am mainly concerned with mainstream modern personal computer architectures and systems, typically x64 with Windows, Linux...)
Correct me if I'm wrong: running 10 CPU/RAM-intensive independent threads is most often a near-optimal situation. The amount of time spent on context switches is rather negligible. While the system may sometimes decide to move threads to different cores in a round-robin fashion, causing the CPU caches to go cold, this has a minor effect and it works almost as if each thread were running on a single fixed core.
Only the main RAM bus may be a limitation since all threads share it, but it's not the point I'm interested in here. Reducing the number of threads will not increase the throughput anyway.
Now assume you still have 10 cores but run 1000 threads. The scheduler could theoretically decide to switch rarely (say every second) running 10 threads for a second, then 10 others... and the whole thing would still be close to optimal performance (throughput).
But it does not seem to be the case and it looks like threads are switched intensively causing a strongly suboptimal performance (throughput). Am I right about it? What is the main cause for this suboptimal performance? A few numbers would be nice if you have any idea of orders of magnitude of (for example): switches per second, performance loss caused by switching...
I'm going to answer my own question (after some search).
On Windows, the number of context switches can be measured with performance counters: https://technet.microsoft.com/en-us/library/cc938606.aspx
I measured it on my machine (Core i7 / Windows 10) and the order of magnitude is around 1000 switches per second per core when the number of running threads is greater than the number of cores (and those threads are fully CPU-bound).
The time needed for a context switch varies quite a bit depending on:
what registers need to be saved
if FPU registers need to be saved
the processor model (of course)
You can read: https://www.quora.com/How-long-does-a-context-switch-take or http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
A slightly pessimistic average order of magnitude seems to be 1000 ns. So the total time spent on context switches on each core is about 1 ms per second, i.e. 0.1%.
This does not depend on the number of threads: whether you run 100 or 1000 threads, the number of switches per second does not change. The conclusion is that the time spent on context switching itself is essentially negligible.
This reasoning holds as long as the threads are pure CPU with only small memory reads/writes, like a few local variables. I ran a test with fully CPU-bound threads and the difference between a few threads and 1000 threads was not noticeable.
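A hedged sketch of that kind of test (class name and iteration counts are invented): the same total amount of pure-CPU work is split across either 10 or 1000 threads, so any gap in wall-clock time comes from scheduling rather than from the work itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical micro-benchmark: the same total amount of pure-CPU work,
// split across either a few or many threads.
public class ContextSwitchTest {
    static final long TOTAL_ITERATIONS = 2_000_000_000L;

    static long run(int threadCount) throws InterruptedException {
        long perThread = TOTAL_ITERATIONS / threadCount;
        List<Thread> threads = new ArrayList<>();
        long start = System.nanoTime();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> {
                long x = 0;
                for (long j = 0; j < perThread; j++) {
                    x += j * 31 + 7;                     // pure CPU, tiny memory footprint
                }
                if (x == 42) System.out.println(x);      // keep the JIT from dropping the loop
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) t.join();
        return (System.nanoTime() - start) / 1_000_000;  // wall-clock milliseconds
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("10 threads:   " + run(10) + " ms");
        System.out.println("1000 threads: " + run(1000) + " ms");
    }
}
```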
But the situation changes when RAM is involved and switching makes the CPU (memory) caches less efficient. A worst case is when:
computation can be split into 1000 independent "data" parts
each part of the data fits just into the memory cache (say L1 or L2) of a core
each part needs to be read many times
In this situation, running 10 threads to completion, then ten others, and so on, would take full advantage of the cache, while running 1000 threads at a time would cause the cache to be useful for only about 1 ms at a time.
But if the data of several threads can fit into the cache together, or if the threads share some common data, or if each thread reads its data just once, then it is possible that running 1000 threads at once vs. running 10 threads a hundred times will give similar throughput.
It is more a matter of adapting the parallelism to the memory access pattern, and it depends very much on the way memory needs to be accessed.
The time spent on context switching is negligible; the time lost because of poor cache usage may sometimes be a problem and sometimes not, depending on how the memory is accessed and shared.
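And a sketch of the cache-bound worst case described above, with made-up sizes (part size, pass count, and batch sizes are assumptions): each part is sized to roughly fit a per-core cache and is read many times, run either 10 threads at a time or all 1000 at once.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the cache worst case described above (sizes are assumptions):
// 1000 independent parts, each roughly cache-sized, each read many times.
public class CacheFriendlyBatching {
    static final int PARTS = 1000;
    static final int PART_INTS = 64 * 1024;      // 256 KB of ints per part (assumed ~L2-sized)
    static final int PASSES = 50;                // each part is read many times
    static volatile long sink;                   // keeps the JIT from discarding the loops

    static long crunch(int[] part) {
        long sum = 0;
        for (int p = 0; p < PASSES; p++)
            for (int v : part) sum += v;
        return sum;
    }

    static long runInBatches(int[][] parts, int batchSize) throws InterruptedException {
        long start = System.nanoTime();
        for (int from = 0; from < parts.length; from += batchSize) {
            List<Thread> batch = new ArrayList<>();
            for (int i = from; i < Math.min(from + batchSize, parts.length); i++) {
                final int[] part = parts[i];
                Thread t = new Thread(() -> sink = crunch(part));
                t.start();
                batch.add(t);
            }
            for (Thread t : batch) t.join();     // run each batch to completion before the next
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        int[][] parts = new int[PARTS][PART_INTS];
        System.out.println("10 at a time:  " + runInBatches(parts, 10) + " ms");
        System.out.println("all 1000:      " + runInBatches(parts, PARTS) + " ms");
    }
}
```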
The following is an excerpt from the book Java Concurrency in Practice, Chapter 12.2 (Testing for Performance), where the author talks about the throughput of a bounded buffer implementation.
Figure 12.1 shows some sample results on a 4-way machine, using buffer
capacities of 1, 10, 100, and 1000. We see immediately that a buffer
size of one causes very poor throughput; this is because each thread
can make only a tiny bit of progress before blocking and waiting for
another thread. Increasing buffer size to ten helps dramatically, but
increases past ten offer diminishing returns.
It may be somewhat puzzling at first that adding a lot more threads
degrades performance only slightly. The reason is hard to see from the
data, but easy to see on a CPU performance meter such as perfbar while
the test is running: even with many threads, not much computation is
going on, and most of it is spent blocking and unblocking threads. So
there is plenty of CPU slack for more threads to do the same thing
without hurting performance very much.
However, be careful about concluding from this data that you can
always add more threads to a producer-consumer program that uses a
bounded buffer. This test is fairly artificial in how it simulates the
application; the producers do almost no work to generate the item
placed on the queue, and the consumers do almost no work with the item
retrieved. If the worker threads in a real producer-consumer
application do some nontrivial work to produce and consume items (as
is generally the case), then this slack would disappear and the
effects of having too many threads could be very noticeable. The
primary purpose of this test is to measure what constraints the
producer-consumer handoff via the bounded buffer imposes on overall
throughput.
What does the author mean by CPU slack here? Why does the throughput not degrade more and more as more threads are added? I am not following the author's reasoning about performance degrading only slightly when adding more and more threads, assuming that the bound on the buffer size is kept constant.
Edit: I can think of one reason: since no real work is being done by the threads in this case, the classic problems of increased traffic on the shared memory bus and cache misses caused by thread context switches do not play a major role as more and more threads are added. The situation will change once the threads start doing more work. Is that what the author is trying to convey in the third paragraph?
There is no formal term such as CPU slack. The author simply means that the CPU is not fully utilised doing meaningful work, because most of the time is spent waiting to acquire a mutually exclusive lock. The author is calling the unused capacity of the CPU the CPU slack.
NOTE: The associated code tests a multiple producer / multiple consumer scenario, with an equal number of producers and consumers.
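This is not the book's test code, but a rough sketch of the same shape of experiment (class name, buffer capacity, and counts are made up): equal numbers of producers and consumers hand integers through a bounded buffer while doing essentially no work per item.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Rough sketch of a multiple-producer / multiple-consumer throughput test:
// each pair moves items through a bounded buffer with almost no per-item work.
public class BoundedBufferThroughput {
    public static long run(int pairs, int capacity, int itemsPerThread) throws Exception {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        Thread[] threads = new Thread[2 * pairs];
        for (int i = 0; i < pairs; i++) {
            threads[2 * i] = new Thread(() -> {                 // producer
                try { for (int j = 0; j < itemsPerThread; j++) buffer.put(j); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            threads[2 * i + 1] = new Thread(() -> {             // consumer
                try { for (int j = 0; j < itemsPerThread; j++) buffer.take(); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        for (int pairs : new int[]{1, 2, 4, 8, 16}) {
            System.out.println(pairs + " pairs: " + run(pairs, 10, 100_000) / 1_000_000 + " ms");
        }
    }
}
```

With such trivial per-item work, adding pairs mostly adds threads that block on put/take, which is exactly the "plenty of CPU slack" situation the excerpt describes.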
EDIT: In the later discussion they talk about the effect of adding more threads if a) the threads do almost no work, and b) the threads occupy the CPU substantially for every produced or consumed item. I will try to explain the difference with some slightly artificial scenarios.
Suppose that locking takes 1 time unit actively, and 8 time units passively by waiting. Passive waiting does not occupy the CPU.
Case 1: Producer-Consumer cost is 1 time unit.
So we currently account for 2 time units of CPU time, with an additional 8 time units of passive waiting time. So we have 8/10 available CPU time units.
If we now want to double the number of threads, we need to accommodate an additional 2 time units (1 for producer-consumer stuff, and 1 for active locking time). That would eat into our supply of available CPU time, but we have enough.
Case 2: Producer-Consumer cost is 11 time units.
So we currently account for 11+1=12 time units of CPU time, with an additional 8 time units of passive waiting time. So we have 8/20 available CPU time units.
If we now want to double the number of threads, we need to accommodate an additional 12 time units (11 for producer-consumer stuff, and 1 for active locking time). That goes beyond the available CPU time units. Something has to give -- so waiting time will increase, and throughput will suffer.
So in case 2, the amount of real work reduces the amount of time available for new threads, thereby increasing the observed effect of lock contention on the throughput. It would have been nice if the book had also included figures for these imagined scenarios; it would have made the hand-wavy argument easier to follow.
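Condensed into the made-up time units of the two cases above:

```latex
% Requires amsmath. All numbers are the invented time units from the two cases above.
\begin{align*}
\text{Case 1:}\quad & \underbrace{1}_{\text{locking}} + \underbrace{1}_{\text{work}} = 2 \ \text{busy of}\ 10, &
  2 \times 2 &= 4 \le 10 && \Rightarrow\ \text{doubling the threads still fits;}\\
\text{Case 2:}\quad & 1 + 11 = 12 \ \text{busy of}\ 20, &
  2 \times 12 &= 24 > 20 && \Rightarrow\ \text{doubling exceeds the CPU budget.}
\end{align*}
```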
I think CPU slack is the spare resource. According to Wikipedia, slack time refers to the amount of time left over after a job if the job were started now.
Plenty of CPU slack means plenty of spare computation capacity. When the consumers/producers do something nontrivial, the CPU slack decreases and the throughput is impacted.
In general, what is the relation between CPU usage and the number of threads in a program?
Assumptions:
Multi-core CPU
Threads do the exact same job (assume they fetch identical work items from a queue and process them)
It depends on the nature of the application.
An application that mostly does calculations: a ratio of 1 thread per core is a reasonable choice, since you don't want to spawn too many threads (due to the overhead) and you want to take advantage of all your cores.
An application that mostly does IO operations (like HTTP requests) can spawn many more threads than the number of cores and still gain efficiency, since the bottleneck is the waiting time per IO request and you want to keep as much work in flight as possible while waiting.
That said, the CPU usage you are going to get still depends on many factors (IO, synchronization, non-parallel parts of your program).
If you are interested in how fast the application will run, always remember Amdahl's law, which gives a strict bound on the speed-up your application can achieve, even with an infinite number of cores.
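For reference, Amdahl's law: if a fraction p of the program can be parallelized and n workers are used, the achievable speed-up is bounded by

```latex
% Amdahl's law: p = parallelizable fraction, n = number of workers
\[
  S(n) \;=\; \frac{1}{(1 - p) + \frac{p}{n}},
  \qquad
  \lim_{n \to \infty} S(n) \;=\; \frac{1}{1 - p}.
\]
```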
There is no such general relationship, except for the obvious ones:
an application can't use more CPU time (CPU seconds) than the number of available cores multiplied by the number of (wall clock) seconds that it runs, and
a single thread can't use more than one CPU second per second.
The actual amount of CPU that a multi-threaded application uses depends mostly on the nature of the application and on the way you've implemented it:
If the computation performed by each thread does not generate contention with other threads for locks, memory access and so on, then you should be able to approach the theoretical limit of available CPU resources.
Contention is liable to reduce effective CPU usage, sometimes dramatically.
But there are no general formulae that will tell you how much speed-up you can get.
I think there is no relation, or at least not a simple one. It depends on the jobs the threads are doing. A program with one thread can consume 100% of a CPU, and a program with lots of threads can consume less.
If you are looking for an optimal relation between threads and work done, you must study your particular case and will probably end up with an empirical solution.
As the other answers already state, "it depends". In an ideal world, for n cores, you would get a throughput of factor n, given that you do the same job in a separate thread on each core (which already contains a false assumption, since you need to somehow synchronize the threads when they read from the same queue).
Understanding the Disruptor, a Beginner's Guide to Hardcore Concurrency gives some nice examples of what you need to consider when parallelizing tasks, and also shows some cases where the attempt to parallelize leads to a longer execution time.
I have a multi-threaded application which uses a thread pool of 10 threads. Each thread takes 5 minutes to process an input. Is there a law/formula which governs the total time taken to process n inputs?
In other words, is it right to say that every 5 minutes, 10 inputs can be processed, so to process 100 inputs, it will take 50 minutes?
In addition to the computing power (processors/cores) and hardware resource dependencies (hard disk, I/O competition, etc.), the data dependency should also be considered. For example, if the processing of each input includes updating a shared data by all the other threads, which requires locking (mutex), then the total throughput will be less than 10 times, even if it is a multi-core processor with more than 10 cores. The maximum speed-up depends on the proportion of the critical section. If you need a formula, refer to the famous Amdahl's law: en.wikipedia.org/wiki/Amdahl's_law
Not really; you have to consider the total computing power required. If, for example, a thread takes 5 minutes to do the work and the processor is completely consumed during that time, then additional threads will not help you. At the other extreme, if processor utilization is near zero (all of the time is spent waiting for I/O, for example), then your proposed calculation would work. So you have to consider the actual resources consumed by the computation.
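In other words, the simple calculation from the question is an idealisation: it assumes the inputs are independent and equally sized, and that whatever limits the work (cores, I/O, locks) is not saturated. Under those assumptions:

```latex
% Idealised completion time for n equal, independent inputs of duration t = 5 minutes:
\[
  T(n) \;\approx\; \left\lceil \frac{n}{\min(\text{threads},\ \text{effective parallelism})} \right\rceil \times t
\]
% e.g. n = 100, 10 threads, at least 10 cores' worth of capacity: ceil(100/10) * 5 min = 50 min.
```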
I'm performing an operation, lets call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and that cannot be controlled. However, a portion of that variability is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per-generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.
Any pointers, pseudocode, etc. are much appreciated.
How about an evolutionary algorithm?
Start with a guess: 1 thread per CPU core seems good, but it depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the tasks in the most recent generation averaged a better time than the one before, continue to change the number of threads in the same direction as you did in the last step (so if the last generation had more threads than the generation before it, then add a thread for the new generation, but if it had fewer, then use one fewer, obviously with a lower limit of 1 thread).
If the most recent generation tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself onto the central value if you have a large number of generations to deal with.
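A minimal sketch of that hill-climbing loop (class and method names are invented):

```java
// Minimal sketch of the hill-climbing scheme described above (names are invented).
public class ThreadCountTuner {
    private int threads = Runtime.getRuntime().availableProcessors(); // initial guess: 1 per core
    private int direction = +1;              // +1 = last change added a thread, -1 = removed one
    private double previousTime = Double.MAX_VALUE;  // "generation 0" took effectively infinite time

    /** Thread count to use for the next generation. */
    public int currentThreadCount() {
        return threads;
    }

    /** Report the measured time of the generation that just finished. */
    public void reportGenerationTime(double generationTime) {
        if (generationTime > previousTime) {
            direction = -direction;          // it got worse: reverse direction
        }                                    // it got better: keep moving the same way
        previousTime = generationTime;
        threads = Math.max(1, threads + direction);
    }
}
```

The caller would ask for currentThreadCount() before each generation and call reportGenerationTime() when it finishes; as noted above, the count will tend to oscillate around the optimum, so you may want to detect the oscillation and pin the middle value.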
If the calculations are completely CPU-bound, the number of threads should equal the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution, you must find the limiting resource and measure its utilization. Monitor the utilization and slowly add more threads until it gets close to 100%. You should have as few threads as possible while still saturating your limiting resource.
You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).
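A rough sketch of that layout (the task list is a stand-in for however a generation gets partitioned into small pieces):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: one thread per core, many more tasks than threads, run a generation to completion.
public class GenerationRunner {
    public static void runGeneration(List<Runnable> tasks) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores); // one thread per core
        for (Runnable task : tasks) {
            pool.submit(task);                                      // lots of small tasks queued
        }
        pool.shutdown();                                            // no new tasks; drain the queue
        pool.awaitTermination(1, TimeUnit.DAYS);                    // wait for the generation to finish
    }
}
```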