How reasonable is it to estimate CPU cycles using bogomips? - linux

If I've got a long-running process that uses, on average, 10% of the CPU as it does its job, and I run two copies of it in parallel, I can expect them to collectively use 20% of the CPU. Or if I run one copy on a different CPU that's twice as fast, I can expect it to use 5% of that CPU.
What I'd like to do is get a handle on the CPU requirements of such a process, but in a reasonably CPU independent way, i.e. not as a percentage.
I'm wondering how meaningful it might be to assign a simple "cycles per second needed" number to such an app, by multiplying its CPU percentage by the bogomips number of the machine where I measured it. That is, a process that uses 10% of the CPU on a machine with a bogomips value of 1000 could be said to require 100,000,000 (bogus) instructions per second.
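For what it's worth, here is a minimal sketch (in Python) of the arithmetic I have in mind. It assumes a Linux /proc/cpuinfo with a bogomips line, and the 10% figure is just a made-up measurement.

```python
# Rough sketch only: read the first bogomips value from /proc/cpuinfo and
# scale a measured CPU fraction by it. The 0.10 below is a made-up measurement.

def read_bogomips(path="/proc/cpuinfo"):
    """Return the bogomips value reported for the first CPU."""
    with open(path) as f:
        for line in f:
            if line.lower().startswith("bogomips"):
                return float(line.split(":")[1])
    raise RuntimeError("no bogomips line found")

cpu_fraction = 0.10                       # hypothetical: process uses 10% of one CPU
bogo_per_sec = read_bogomips() * 1e6      # bogomips is (roughly) millions of bogo-loops/sec
print("bogo-instructions/sec needed:", cpu_fraction * bogo_per_sec)
```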
(Disclaimers: Of course I know that bogomips are bogus, that instructions do not equal cycles, and that cycle and instruction timings are not at all comparable between disparate processor families. I'm looking for rough, linear comparisons here, not precise counts.)
In some more detail: Suppose I've got a system with an assortment of long-running processes running on a possibly CPU-constrained machine. I might want to predict, in advance, whether they'll all run without overloading the CPU. Or I might want to implement checks (simple ones) that no process is using more CPU than it's supposed to. I'm willing to empirically measure the performance of each process in advance, to help me make these predictions and implement these checks. What I'm exploring here is, what's the right unit to measure the performance in?
For example, today I might be running processes A, B, D, E, and H on processor X. I might observe that the percent of CPU used by the processes is 10, 5, 1, 5, and 20%, respectively. 10+5+1+5+20 is 41, and 41 is comfortably less than 100, so I'm fine.
But tomorrow I might want to run processes A, B, C, H, and J on a different processor Y running at half the clock rate. Even if I also know something about the performance of processes C and J, it just seems unnecessarily messy to try to do the math based on percentages, when the CPU (that the percentages are of) is a moving target.
As mentioned, I might also want to assign an explicit CPU "budget" to each long-running process, and for sanity's sake I might want that budget to stay reasonably valid over time. That is, I might want to say that process A is only allowed to use 100,000,000 cycles per second. If it ever uses 150,000,000, something is wrong. But if I move everything to a 2x faster CPU tomorrow, I do not want process A to be able to get away with using twice as much, because I might have in mind to use the extra CPU power for other processes.
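Concretely, the budget check I'm imagining would look something like this sketch. It uses psutil to sample the process's CPU share; the pid and budget numbers are hypothetical, and read_bogomips() is the helper from the sketch above.

```python
import psutil  # third-party; samples per-process CPU usage

BUDGET_BOGO_PER_SEC = 100_000_000   # hypothetical budget for process A
PID = 1234                          # hypothetical pid of process A

usage_fraction = psutil.Process(PID).cpu_percent(interval=5.0) / 100.0
used = usage_fraction * read_bogomips() * 1e6
if used > BUDGET_BOGO_PER_SEC:
    print(f"process {PID} is over budget: {used:.0f} > {BUDGET_BOGO_PER_SEC}")
```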
Finally, if this has made any sense, and if multiplying by bogomips is not a good way of doing what I'm trying to do, does anyone have any better ideas?
(Oh, and one more disclaimer: the question obviously gets more complicated for multi-processor machines, and multi-core processors. I'll worry about those additional wrinkles later, not today.)

Completely unreasonable. No, seriously. Removing bogomips was in fact proposed, precisely because the number is meaningless, but the change was backed out because it broke some users' functionality: https://lwn.net/Articles/627930/ (search for 'bogomips').

Related

Multiprocessing: why doesn't a single thread just use more cpu?

I'm learning about multiprocessing and it seems to be applicable in one of two scenarios:
our program is waiting for some I/O, so it makes sense to go do something else while waiting;
we break our program up so that individual parts of it can run "in parallel", in an attempt to take full advantage of the CPU.
My confusion is about the second case. I'm probably just lacking in my understanding of how CPUs really work: but if our single-threaded process is only using 1% of the CPU, and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more CPU and finishes faster?
but if our single-threaded process is only using 1% of the CPU, and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more CPU and finishes faster?
We don't know how to. There seem to be fundamental limitations to how fast we can do things that we haven't quite figured out how to get around. So instead, we do more than one thing at a time.
It takes a woman 9 months to make a baby. So if you want lots of babies, you get lots of women. You don't try to get one woman to go faster.
Say you want to raise 7 to the twenty-millionth power and also raise 11 to the twenty-millionth power. You can reduce the number of steps each of these operations takes, but only up to a point. Say each operation takes N sequential steps (each requiring the output of the previous step as its input) and the fastest we can do a single step is Q nanoseconds. With one thread, it will take at least 2NQ nanoseconds to perform both operations. With two threads, we can do one step from each of the two operations at the same time, reducing the minimum time to NQ nanoseconds.
That's a big win.
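A minimal sketch of that example in Python, using processes rather than threads (CPython's GIL would otherwise serialize the CPU-bound work); the exponent is much smaller than twenty million so the example finishes quickly:

```python
from concurrent.futures import ProcessPoolExecutor

EXPONENT = 200_000   # illustrative; the text above uses twenty million

def big_power(base):
    # Return only the bit length so we don't ship a huge integer between processes.
    return pow(base, EXPONENT).bit_length()

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        # Both exponentiations run at the same time, one per worker process.
        for base, bits in zip([7, 11], pool.map(big_power, [7, 11])):
            print(f"{base}^{EXPONENT} has {bits} bits")
```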
I might be wrong, but when we split things into threads, we want to make use of the multi-core architecture of our CPUs.
We mostly think of the CPU as a single unit, but you've probably heard that an i5 is a quad-core processor, meaning it has 4 cores, while an i3 is a dual-core processor, i.e. it only has two.
So the aggregate CPU utilization for a quad-core would be 100% split into 4x25. There's a difference between concurrency and parallelism. Parallel means each thread runs on a separate core, making full use of it. Now you have 4 people doing one job, or a better analogy: there are 4 printers in the office, and 4 people can go ahead and get the copies they want. This is parallelism.
Using the same analogy, suppose there is just one copier and 4 people who want to make copies. What we do is use concurrency: we print 25% of one person's job, then switch to the next person, then the next, and the next, so it takes 4 rounds for all the copies to get printed. Even though we utilized 100% of the copier's capacity, our people still had to wait, and the waiting time depends on how long each document is. So we use something like preemption: a job can only execute/print for a certain amount of time before we start printing for the next person.
Speeding up a single process by allocating it 100% of the CPU is not really the problem (we also want to run a bunch of other things like the GUI, music, system services, etc., but 85% is doable); the point is that the execution time becomes roughly a quarter when the work is distributed across the cores. Imagine you have to print a 400-page book and you have 4 copiers: you use the 4 copiers to print 100 pages each. That will be faster, right?
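A small sketch of the book/copiers idea in Python: split one big job into equal chunks and hand each chunk to a separate worker process (process_pages() is just a stand-in for the real per-page work):

```python
from concurrent.futures import ProcessPoolExecutor

def process_pages(pages):
    # Stand-in for whatever work is actually done per page.
    return [p * p for p in pages]

if __name__ == "__main__":
    book = range(1, 401)                            # a 400-page book
    chunks = [list(book[i::4]) for i in range(4)]   # 4 chunks of 100 pages each
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_pages, chunks))   # one chunk per "copier"
```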
I hope that made some sense. Going to sleep.

Splitting a computational workload: where is it possible or impossible

I program, although I am not a computer scientist, so I would like to check whether I have correctly understood the challenge in splitting a workload. Is the following the correct way to think about it?
Specifically, is statement (1) below correct?
(1) If A(X_a) + A(X_b) + A(X_c) + ... = B(X_a,X_b,X_c, ...) = Y is an equation that is being computed
then whether or not it can be computed more rapidly, from the computer's perspective, by assigning parts of the equation to individual threads running at the same time depends on the following:
if X_m changes when A(X_n) changes, for m not equal to n, then dividing the workload for that particular computation gives less of a performance gain, and if this is true for every combination of m and n in the system, then no performance gain from multithreading over single threading is possible.
Or, in other words, do I understand correctly that the presence of linked variables reduces the ability to multithread successfully? Because X_b and X_c depend on A(X_a), it bottlenecks the process: the other threads know A but have to wait for the first thread to produce an output before they have instructions to execute, so the parts of an instruction that is easily broken into parts cannot be worked on simultaneously, and the computation takes as much time as one thread doing each part of the calculation one after the other, even with several threads working at once and another thread summing the results in the order they complete.
(2) Or is there a way around the above bottleneck? For example, if this bottleneck is known in advance, the first thread can start early and store in memory the results of A(X_n) for all n that bottleneck the operation, and then split the workload efficiently, one A(X_i) to the i-th thread. But to do this, the first thread would have to predict in some way when the calculation B(X_a,X_b,X_c, ...) must be executed BEFORE B(X_a,X_b,X_c, ...) is actually executed; otherwise it would run into the bottleneck.
[EDIT: To clarify, in the context of NWP's answer. If the clarification is too long or unclear, please leave a comment and I'll make a few graphics in LaTeX to shorten the question writeup.]
Suppose the longest path in the program "compute I" is 5 units of time in the example. If you know this longest path, and the running system can anticipate (based on past frequency of execution) when this program "compute I" will be run in the future, subprogram "compute B->E" (which does not depend on anything else but is a proper subset of the longest path of program "compute I") may be executed in advance. The result is stored in memory prior to the user requesting "compute I".
If so, is the max speedup considered to be 9/4? B->E is ready, so other threads do not have to wait for it. Or is the max speedup for "compute I" still considered to be 9/5?
The anticipation program run beforehand has a cost, but this cost may be spread over each execution of "compute I". If the anticipation program has 15 steps, but "compute I" is typically run 100 times for each execution of the anticipation program, and all steps cost equally, do we simply say the max speedup possible in "compute I" is therefore 9/(5 - 1 + 15/100)?
The speedup possible now appears to depend not only on the number of threads and the longest path, but also on the memory available to store precalculations and on how far in advance another program can anticipate that "compute I" will be run and precalculate proper subprograms of it. Another program "compute X" may have the same longest path as "compute I", but the system may not be able to anticipate "compute X" as far in advance as "compute I". How do we weigh the speedup achieved (i) at the expense of more memory for storing precalculations against (ii) the fact that the execution of some programs can be anticipated further in advance than others, allowing the bottleneck to be precalculated and the longest path cut down?
But if the longest path can be cut down dynamically by improving predictive precalculation of subprograms and adding memory for storing the results, can bottlenecks be considered at all as determining the ultimate upper bound on the speedup from splitting a computational workload?
From the linked-variables dependency bottleneck perspective (the graph-bottleneck perspective), the ultimate upper bound on the speedup from multithreading a program "compute I" appears to be determined by its longest subprogram (the other subprograms depend on it / wait for it). But from the dynamic perspective, where the whole system is running before and after "compute I" is executed as a part of it, sufficient predictability of the timing of future executions of "compute I" and the ability to store more and more precalculations of its independent subprograms can cut the length of all subprograms of "compute I" down to 1 unit, meaning it could possibly achieve a speedup of 9/1 = 9, given sufficient predictability and memory.
Which perspective is the correct one for estimating the upper bound on the speedup from multithreading? (A program run in a system that has been running a long time with sufficient memory seems to have no limit on the multithreading speedup, whereas if it is looked at by itself, there is a very definite fixed limit.)
Or is the question of cutting down the longest path by anticipation and partial precalculation moot, because the speedup in that case depends on the user choosing to execute the program in a predictable way, so the upper bound on multithreading speedup due to anticipation cannot be known to a program writer or system designer and should be ignored / not relied upon to exist?
I do not quite understand which things depend on what from your description, but I can give you some theory. There is Amdahl's law, which gives you an upper bound on the speedup you can achieve based on how parallelizable a given algorithm is, assuming you have enough processors. If you can parallelize 50% of the calculation you can get a maximum speedup of 2x; 95% parallelization gives you a maximum speedup of 20x. To figure out how much speedup you can get, you need to know how much of your problem can be parallelized. This can be done by drawing a graph of the things you need to do and which depends on what, and figuring out the longest path. Example:
In this example the longest path would be B->E->F->H->I. All blocks are assumed to take the same time to execute. So there are 9 blocks and the longest path is 5 blocks, so your maximum achievable speedup is 9/5 = 1.8x. In practice you need to consider that your computer can only run a limited number of threads in parallel, that some blocks take longer than others, and that there is a cost involved in creating threads and using appropriate locking mechanisms to prevent data races. Those can be added to the graph by giving each block a cost and finding the longest path based on the summed costs, including the cost of the threading mechanisms. Although this method only gives you an upper bound, it tends to be very humbling. I hope this allows you to draw a graph and find the answer.
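If it helps, here is a small sketch of the "find the longest path" step in Python. The dependency graph below is only a guess at the example's shape (nine equal-cost blocks with critical path B->E->F->H->I); substitute your own blocks, dependencies and costs.

```python
deps = {                       # block -> blocks whose results it needs
    "A": [], "B": [], "C": [], "D": [],
    "E": ["B"], "F": ["E"], "G": ["A", "C", "D"],
    "H": ["F", "G"], "I": ["H"],
}
cost = {block: 1 for block in deps}    # all blocks take one unit of time here

finish = {}                            # memoized earliest finish times
def finish_time(block):
    if block not in finish:
        finish[block] = cost[block] + max(
            (finish_time(d) for d in deps[block]), default=0)
    return finish[block]

longest_path = max(finish_time(b) for b in deps)        # 5
total_work = sum(cost.values())                         # 9
print("maximum speedup ~", total_work / longest_path)   # 1.8
```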
EDIT:
I forgot to say that Amdahl's law compares executing the code with a single thread to executing the code with an infinite number of threads with no overhead. If you make the multithreaded version execute different code than the single-threaded version, you are no longer bound by Amdahl's law.
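For completeness, here is a quick sketch of the usual statement of Amdahl's law; the 2x and 20x figures quoted above fall out of the infinite-processor limit.

```python
def amdahl_speedup(p, n):
    """Speedup with parallel fraction p on n processors: 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

def max_speedup(p):
    """Limit as n goes to infinity: 1 / (1 - p)."""
    return 1.0 / (1.0 - p)

print(max_speedup(0.50))          # 2.0 -> "50% parallelizable gives at most 2x"
print(max_speedup(0.95))          # ~20 -> "95% gives at most 20x"
print(amdahl_speedup(0.95, 8))    # ~5.9 with only 8 processors
```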
With enough memory and time you could calculate the results for all possible inputs and then just do a lookup on a given input to find the result. Such a system would get a higher speedup because it does not actually calculate anything and is not bound by Amdahl's law. If you manage to optimize B->E to take zero units of time, the longest path becomes 3 and there are only 8 nodes, giving you a maximum speedup of 8/3 = 2.66x, which is better than the 1.8x from before. That is only the speedup achievable by multithreading, though; in fact the first version takes 4 time units and the second version 3 time units. Optimizing code can give you more speedup than multithreading. The graph can still be useful, though. Assuming you do not run out of cores, the graph can tell you which parts of your program are worth optimizing and which are not. Assuming you do run out of cores, the graph can tell you which paths should be prioritized. In my example I calculate A, B, C and D simultaneously and therefore need a quad core to make it work. If I move C down in time to execute in parallel with E and make D run in parallel with H, a dual core will suffice for the same speedup of 1.8x.

HLSL operator/functions cycle count

I'm modeling some algorithms to be run on GPUs. Is there a reference or something as to how many cycles the various intrinsics and calculations take on modern hardware? (NVIDIA 5xx+ series, AMD 6xxx+ series) I can't seem to find any official word on this, even though there are some mentions of the raised costs of normalization, square root and other functions throughout their documentation. Thanks.
Unfortunately, the cycle count documentation you're looking for either doesn't exist, or (if it does) it probably won't be as useful as you would expect. You're correct that some of the more complex GPU instructions take more time to execute than the simpler ones, but cycle counts are only important when instruction execution time is main performance bottleneck; GPUs are designed such that this is very rarely the case.
The way GPU shader programs achieve such high performance is by running many (potentially thousands) of shader threads in parallel. Each shader thread generally executes no more than a single instruction before being swapped out for a different thread. In perfect conditions, there are enough threads in flight that some of them are always ready to execute their next instruction, so the GPU never has to stall; this hides the latency of any operation executed by a single thread. If the GPU is doing useful work every cycle, then it's as if every shader instruction executes in a single cycle. In this case, the only way to make your program faster is to make it shorter (fewer instructions = fewer cycles of work overall).
Under more realistic conditions, when there isn't enough work to keep the GPU fully loaded, the bottleneck is virtually guaranteed to be memory accesses rather than ALU operations. A single texture fetch can take thousands of cycles to return in the worst case; with unpredictable stalls like that, it's generally not worth worrying about whether sqrt() takes more cycles than dot().
So, the key to maximizing GPU performance isn't to use faster instructions. It's about maximizing occupancy -- that is, making sure there's enough work to keep the GPU sufficiently busy to hide instruction / memory latencies. It's about being smart about your memory accesses, to minimize those agonizing round-trips to DRAM. And sometimes, when you're really lucky, it's about using fewer instructions.
http://books.google.ee/books?id=5FAWBK9g-wAC&lpg=PA274&ots=UWQi5qznrv&dq=instruction%20slot%20cost%20hlsl&pg=PA210#v=onepage&q=table%20a-8&f=false
This is the closest thing I've found so far; it is outdated (SM3), but I guess it's better than nothing.
Do operators/functions have cycle counts? I know assembly instructions have cycle counts; that's the low-level time measurement, and it mostly depends on the CPU. Since operators and functions are high-level programming constructs, I don't think they have such a measurement.

Statistics for a randomized task on multiple cores

Consider that the time to complete a task on a processor core follows a distribution with mean m and standard deviation s. If the same task runs on n cores, what is the mean and standard deviation of the time it takes to complete the task? (The task is finished when any one of the cores finishes it.)
This is more of a statistics question than anything else. Without information on the distribution function of the time t a single task needs to complete, I can only give you a hint: you need to calculate the distribution function of the minimum of t over n of your tasks, as seen here. Using that, you can then calculate the mean and the standard deviation.
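To make that concrete, here is a quick Monte Carlo sketch, assuming purely for illustration that the single-core completion time is normally distributed; with n cores racing, the finishing time is the minimum of n independent draws (m, s and n below are made-up numbers):

```python
import numpy as np

m, s, n = 100.0, 15.0, 4                 # hypothetical mean, std dev, core count
rng = np.random.default_rng(0)

samples = rng.normal(m, s, size=(1_000_000, n))   # one row per simulated run
t_finish = samples.min(axis=1)                    # the first core to finish wins

print("mean of the minimum:", t_finish.mean())    # noticeably below m
print("std of the minimum: ", t_finish.std())     # noticeably below s
```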
PS: Is this homework?
EDIT:
Whether - and how much - it's worth to use multiple cores, depends on several things:
What you need to do. If you have to run the same program with different inputs, launching multiple instances makes a lot of sense. It might not cut the overall time down to 1/n, and each experiment will still need at least as much time as before, but the time needed for the whole series will be significantly less.
If on the other hand, you are hoping to run the same task with e.g. a different seed and keep the one that converges the fastest, you will probably gain far less, as estimated by the first part of my answer.
How well you have parallelized your tasks. n completely independent tasks is the ideal scenario; n threads with multiple synchronization points etc. are not going to be nearly as efficient.
How well your hardware can handle multiple tasks. For example if each of these tasks needs a lot of memory, it will probably be faster to use a single core only, than forcing the system to use the swap space/pagefile/whatever your OS calls it by running multiple instances at once.

Algorithm to optimize # threads used in a calculation

I'm performing an operation, let's call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and cannot be controlled. However, a portion of it is due to things like hardware capacity, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight: maybe 1 to 16 threads on a dual quad-core machine.
any pointers, pseudocode, etc. are much appreciated.
How about an evolutionary algorithm?
Start with a guess. 1 thread per CPU core seems good, but depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the most recent generation's tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation had more threads than the one before, add a thread for the new generation; if it had fewer, use one fewer, obviously with a lower limit of 1 thread).
If the most recent generation tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.
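A bare-bones sketch of that scheme in Python; run_generation(threads) is a hypothetical stand-in for running CalculateSomeData for one generation and returning the elapsed time:

```python
def tune_threads(run_generation, num_generations, start_threads=8, max_threads=16):
    threads, direction = start_threads, +1
    prev_time = float("inf")             # "generation 0": infinite time, effectively 0 threads
    for _ in range(num_generations):
        elapsed = run_generation(threads)
        if elapsed > prev_time:          # got worse, so reverse direction
            direction = -direction
        prev_time = elapsed
        threads = min(max_threads, max(1, threads + direction))   # lower limit of 1 thread
    return threads
```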
If the calculations are completely CPU bound the number of threads should be equal to the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution you must find the limiting resource and measure the utilization. You need to monitor the utilization and slowly add more threads until the utilization gets close to 100%. You should have as few threads as possible to saturate your limiting resource.
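A rough sketch of that idea when the limiting resource is the CPU itself, using psutil (a real third-party library) to measure overall utilization; start_worker() is a hypothetical callback that launches one more worker thread:

```python
import psutil

def grow_until_saturated(start_worker, target_percent=95.0, max_threads=16):
    threads = 0
    while threads < max_threads and psutil.cpu_percent(interval=2.0) < target_percent:
        start_worker()        # add one more worker, then re-measure utilization
        threads += 1
    return threads
```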
You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).
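A minimal sketch of the queue-of-small-tasks approach, using a pool sized to the core count (processes rather than threads here, to sidestep CPython's GIL for CPU-bound work); do_task() is a placeholder for the real per-task work:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def do_task(task):
    # Placeholder for the real, reasonably small, CPU-bound unit of work.
    return sum(i * i for i in range(task))

if __name__ == "__main__":
    tasks = list(range(10_000, 10_400))            # many more tasks than cores
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        results = list(pool.map(do_task, tasks))   # idle workers pull the next task
```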
