How does branch prediction speed up anything? [duplicate] - branch-prediction

If I have the following structure:
if (condition()) {
    doA();
} else {
    doB();
}
then how does branch prediction help me? Even if branch A is predicted correctly, I still need to evaluate both condition() and doA() - just not in that order. Or is the predicted branch executed in parallel with the condition? In that case, does it compete with other threads for CPU time? And in general, what is the maximum expected speed-up from branch prediction?

Due to the pipelined nature of modern CPUs, new instructions begin to be processed before previous instructions have finished processing. The exact number in flight varies with the CPU architecture and the type of instruction. The reason for pipelining is to make the CPU use its components more efficiently, which improves instruction throughput. For example, without pipelining, the circuitry designed to fetch the next instruction would lie idle for at least a few cycles while the previous instruction carries out its stages (things like source register read, data cache access, arithmetic execution, etc.).
Pipelining introduces its own challenges though: one example is how the instruction-fetch stage should know which instruction to fetch next when there is a conditional jump in the pipeline. A conditional jump (such as the one necessitated by your if above) requires the evaluation of a condition to determine which instruction to fetch next - but that evaluation happens several pipeline stages later. While the jump makes its way through those stages, the pipeline must keep going and new instructions must keep being fetched - otherwise efficiency is lost waiting until the outcome of the condition is known (a pipeline stall: a condition CPUs try to avoid). Without knowing for sure where the next instructions should come from, the CPU has to guess: this is known as branch prediction. If it guesses correctly, the pipeline can keep going at full tilt once the condition has been evaluated and the target jump address confirmed. If it guesses wrong, the pipeline must be cleared of all instructions started after the conditional jump and restarted from the correct target address: an expensive event that efficient branch prediction algorithms try to minimize.
Applying this to your example: if branch prediction correctly guesses the outcome of condition() a large percentage of the time, the following execution (of either doA() or doB()) continues without a pipeline flush; otherwise the conditional statement imposes a performance hit. This happens if the outcome of condition() is effectively random from call to call, or otherwise follows a pattern that the branch prediction algorithm finds hard to predict.
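To get a concrete feel for the effect, here is a small self-contained C++ sketch (not from the original answer) of the classic sorted-versus-unsorted benchmark: the same loop over the same values typically runs noticeably faster once the data is sorted, because the branch outcome becomes almost perfectly predictable. Be aware that an optimizing compiler may replace the branch with a conditional move, which hides the effect, so treat this as an illustration rather than a guaranteed measurement.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Sum all elements >= 128: the `if` below is the conditional branch whose
// predictability we vary by sorting (predictable) or not (random outcome).
static std::uint64_t sum_above_threshold(const std::vector<int>& data) {
    std::uint64_t sum = 0;
    for (int x : data) {
        if (x >= 128) {           // ~50/50 outcome on uniformly random bytes
            sum += x;
        }
    }
    return sum;
}

static double time_ms(const std::vector<int>& data) {
    auto t0 = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = 0;
    for (int rep = 0; rep < 100; ++rep) sink = sink + sum_above_threshold(data);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<int> data(1 << 20);
    for (int& x : data) x = dist(gen);

    double unsorted_ms = time_ms(data);   // branch is hard to predict here
    std::sort(data.begin(), data.end());
    double sorted_ms = time_ms(data);     // branch is predicted almost perfectly

    std::printf("unsorted: %.1f ms, sorted: %.1f ms\n", unsorted_ms, sorted_ms);
    return 0;
}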

Related

How does [branch] get executed in HLSL?

The [branch] attribute can mark an if statement in HLSL to make it execute only one branch, instead of executing all branches and discarding the unwanted results as happens with [flatten].
My question is how this can actually work when a branch diverges within a warp/wavefront. As far as I know, in this case all threads must execute all branches taken by any of the threads in the warp (as with [flatten]), which is a consequence of the fact that they all belong to the same SIMD block and must execute the same instruction.
GPUs since the GeForce 6xx series do actually support branching, though in a limited form and with a performance cost. The [branch] and [flatten] tags are just hints to the compiler to prefer one or the other where supported and possible. It ultimately depends on the hardware and on the driver, so different hardware or different driver versions might in the end produce a different execution from what you specified with the tag.
You can find more info online, for example check this link

Estimating WCET of a task on Linux

I want to approximate the Worst Case Execution Time (WCET) for a set of tasks on Linux. Most professional tools are either expensive (thousands of dollars) or don't support my processor architecture.
Since I don't need a tight bound, my line of thought is that I:
disable frequency scaling
disable unnecessary background services and tasks
set the program's affinity so it runs on a specified core
run the program 50,000 times with various inputs
profile it and store the total number of cycles it took to execute
Given the largest clock cycle count and the core frequency, I can get an estimate (a sketch of such a measurement loop is given below).
Is this a sound, practical approach?
Secondly, to account for interference from other tasks, I will run the whole task set (40 tasks) in parallel, with each task randomly assigned a core, and do the same thing 50,000 times.
Once I get the estimate, a 10% safety margin will be added to account for unforeseeable interference and untested paths. This 10% margin has been suggested in the paper "Approximation of Worst Case Execution Time in Preemptive Multitasking Systems" by Corti, Brega and Gross.
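A minimal sketch of such a measurement loop might look like the following (Linux-specific; task_under_test() is a hypothetical placeholder standing in for the real task, and wall-clock nanoseconds from clock_gettime are used instead of raw cycle counts, which is equivalent up to the fixed core frequency once frequency scaling is disabled):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for sched_setaffinity on Linux
#endif
#include <sched.h>             // sched_setaffinity, cpu_set_t
#include <time.h>              // clock_gettime, CLOCK_MONOTONIC
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical placeholder for the real task whose WCET is being estimated.
static void task_under_test(int input) {
    volatile std::uint64_t acc = 0;
    for (int i = 0; i < 1000 + (input % 100); ++i) acc = acc + i;
}

static std::uint64_t now_ns() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return std::uint64_t(ts.tv_sec) * 1000000000ull + ts.tv_nsec;
}

int main() {
    // Pin the process to one core so every run executes on the same core.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }

    const int runs = 50000;
    std::vector<std::uint64_t> samples;
    samples.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        std::uint64_t t0 = now_ns();
        task_under_test(i);                   // vary the input from run to run
        samples.push_back(now_ns() - t0);
    }

    std::uint64_t worst = *std::max_element(samples.begin(), samples.end());
    std::printf("max observed: %llu ns, with 10%% margin: %llu ns\n",
                (unsigned long long)worst,
                (unsigned long long)(worst + worst / 10));
    return 0;
}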
Some comments:
1) Even attempting to compute worst case bounds in this way means making assumptions that there aren't uncommon inputs that cause tasks to take much more or even much less time. An extreme example would be a bug that causes one of the tasks to go into an infinite loop, or that causes the whole thing to deadlock. You need something like a code review to establish that the time taken will always be pretty much the same, regardless of input.
2) It is possible that the input data does influence the time taken to some extent. Even if this isn't apparent to you, it could happen because of the details of the implementation of some library function that you call. So you need to run your tests on a representative selection of real life data.
3) When you have got your 50K test results, I would draw some sort of probability plot - see e.g. http://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm and links off it. I would be looking for isolated points showing that a few runs were suspiciously slow or suspiciously fast, because the code review from (1) said there shouldn't be runs like this. I would also want to check that adding 10% to the maximum seen takes me a good distance away from the points I have plotted. You could also plot time taken against different parameters from the input data to check that there isn't any pattern there (a small sketch for producing such plot data is given after these comments).
4) If you want to try a very sophisticated approach, you could try fitting a statistical distribution to the values you have found - see e.g. https://en.wikipedia.org/wiki/Generalized_Pareto_distribution. But plotting the data and looking at it is probably the most important thing to do.
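For the probability plot in (3), one simple sketch (assuming the 50K per-run times have been saved as one value per line) is to sort the samples and print empirical plotting positions; the resulting pairs can be plotted directly, or fed to a normal probability plot or a distribution fit in a statistics package:

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Reads one per-run time per line from stdin, sorts the values, and prints
// "plotting_position value" pairs for an empirical probability plot.
int main() {
    std::vector<double> samples;
    double v;
    while (std::scanf("%lf", &v) == 1) samples.push_back(v);
    std::sort(samples.begin(), samples.end());

    const std::size_t n = samples.size();
    for (std::size_t i = 0; i < n; ++i) {
        double p = (i + 0.5) / n;          // median plotting position
        std::printf("%.6f %.0f\n", p, samples[i]);
    }
    // Suspiciously slow runs show up as isolated points at the far right;
    // check that max + 10% lies comfortably beyond them.
    return 0;
}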

Parallel Monte Carlo: reproducibility or real randomness?

I'm preparing a college exam in parallel computing.
The main purpose is to speed up, as much as possible, a Monte Carlo simulation of electron drift in the Earth's magnetic field.
I've already developed something with two layers of parallelization:
MPI to make the code run on several machines
OpenMP to run the simulation in parallel within each machine
Now comes the question: I would like to keep task execution on-demand.
The fastest computers must be able to execute more work than the slower ones.
The problem is partitioned via a master-worker cycle, so there is no real difficulty in achieving this.
Since the number of tasks (each a block of n electrons to simulate) executed by a worker is not defined in advance, I have two roads to follow:
every thread in every worker has its own RNG, initialized with a randomly generated seed (using a different generation method). The load imbalance across the cluster will change the results, but with this approach the result is as random as possible.
every electron has its own seed, guaranteeing reproducibility of the simulation regardless of which worker runs the single task. This requires a better RNG.
Let's poll on this. What's your suggestion?
Have fun
gf
What to poll about here?
Clearly, only approach #2 is feasible. Each source particle starts with its own, stable seed. That makes the result reproducible AND debuggable (for lack of a better word).
The well-known Monte Carlo code MCNP5+ uses this scheme to good effect and runs on multi-core machines and MPI. To implement it you'll need an RNG with a fast skip-ahead (a.k.a. leapfrog or discard) feature, and there are quite a few of them. They are based on fast exponentiation; see the paper by F. Brown, "Random Number Generation with Arbitrary Stride", Trans. Am. Nucl. Soc. (Nov. 1994). Basically, skip-ahead is O(log N) with Brown's approach.
The simplest version, which is about the same as the MCNP5 one, is here: https://github.com/Iwan-Zotow/LCG-PLE63
A more complicated (and slower, but higher-quality) RNG is here: http://www.pcg-random.org/
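To make the skip-ahead idea concrete, here is a minimal C++ sketch of approach #2: a 64-bit LCG with Brown-style O(log N) skip-ahead, so each electron gets its own reproducible sub-stream regardless of which worker simulates it. The constants are Knuth's MMIX LCG parameters, used here only for illustration (they are not the MCNP5 or LCG-PLE63 constants), and the per-electron stride is an assumed value:

#include <cstdint>
#include <cstdio>

// One LCG step: x -> a*x + c (mod 2^64, via unsigned overflow).
// Knuth's MMIX constants, for illustration only.
constexpr std::uint64_t LCG_A = 6364136223846793005ULL;
constexpr std::uint64_t LCG_C = 1442695040888963407ULL;

struct Lcg {
    std::uint64_t state;
    std::uint64_t next() { state = LCG_A * state + LCG_C; return state; }
};

// Brown-style O(log n) skip-ahead: the state reached after advancing `state`
// by n steps, computed without generating the intermediate values.
std::uint64_t skip_ahead(std::uint64_t state, std::uint64_t n) {
    std::uint64_t G = 1, C = 0;          // accumulated affine map: x -> G*x + C
    std::uint64_t h = LCG_A, f = LCG_C;  // affine map for 2^i steps at bit i
    while (n > 0) {
        if (n & 1) {                     // compose accumulated map with the 2^i-step map
            G = G * h;
            C = C * h + f;
        }
        f = f * (h + 1);                 // square the 2^i-step map
        h = h * h;
        n >>= 1;
    }
    return G * state + C;
}

int main() {
    const std::uint64_t base_seed = 20240101;   // assumed global seed
    const std::uint64_t stride = 1ULL << 20;    // assumed numbers reserved per electron

    // Each electron's sub-stream depends only on its index, so results are
    // reproducible no matter which worker/thread picks up the task.
    for (std::uint64_t electron = 0; electron < 4; ++electron) {
        Lcg rng{skip_ahead(base_seed, electron * stride)};
        std::printf("electron %llu, first draw: %llu\n",
                    (unsigned long long)electron, (unsigned long long)rng.next());
    }
    return 0;
}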

Bi-Threaded processing in Matlab

I have a Large-Scale Gradient Descent optimization problem that I am running using Matlab. The code has got two parts:
A Sequential update part that fires every iteration that updates the parameter vector.
A validation error computation part that fires every 10 iterations or so, using the parameter value at the end of the iteration in which it fires.
The way I am running this now is to do (1) and (2) sequentially. But (2) takes a lot of time and it's not the core part of my routine - I added it just to check the progress and plot the error of my model. Is it possible in Matlab to run (2) in parallel with (1)? Please note that (1) cannot be run in parallel by itself since it performs a sequential update, so a simple 'parfor' is not a solution unless there is a really smart way of doing that.
I don't think Matlab has any way of multi-threading outside of the (rather restricted) Parallel Computing Toolbox. There is a workaround which may help you though:
Open 2 sessions of Matlab, sessions A and B (or instances, or workspaces, whatever you call them).
Matlab session A:
Calculates 10 iterations of your sequential process (1)
Saves the result in a file (adequately and uniquely named)
Goes on to calculate the next 10 iterations (back to the top of this loop basically)
In parallel:
Matlab session B:
Checks periodically for the existence of the file written by session A (define a timer that does this at whatever interval makes sense for your process: a few seconds or a few minutes...)
If the file exists => loads it, then does the validation computation (your process (2)) and displays/reports the results.
Note: this only works if process (1) doesn't need the result of process (2) to run its iterations; if it did, I don't know how you could parallelise it anyway.
If you have multiple cores on your machine this should run smoothly; if you have a single core, the 2 sessions will have to share it and you will see a performance impact.

Greedy Scheduling in Multi-threading programming in cilk

I am having trouble understanding the complete step and incomplete step in greedy scheduling in multi-threaded programming in Cilk.
Here is the PowerPoint presentation for reference.
Cilk ++ Multi-threaded Programming
What I have trouble understanding is on slides 32-37.
Can someone please explain, in particular, what is meant by:
complete step: >= P threads ready to run
incomplete step: < P threads ready to run
Thanks for your time and help
First, note that the "threads" mentioned in the slides are not OS threads, as one may think. Their definition of a thread is given on slide 10: "a maximal sequence of instructions not containing parallel control (spawn, sync, return)". To avoid further confusion, let me call it a task instead.
On slides 32-35, a circle represents a task ("thread"), and edges represent dependencies between tasks. And the sentences you ask about are in fact definitions: when P or more tasks are ready to run (and so all P processors can be busy doing some work) the situation is called a complete step, while if less than P tasks are ready, the situation is called an incomplete step. To simplify the analysis, it is (implicitly) assumed that all tasks contain equal work (of size 1).
Then the theorem on slide 35 provides an upper bound on the time required for a greedy scheduler to run a program. Since the entire execution is a sequence of complete and incomplete steps, the execution time is the sum of all steps. Since each complete step performs exactly P units of work, the number of complete steps cannot be bigger than T1 (total work) divided by P. Then, each incomplete step must execute a task belonging to the critical path (because at every step at least one critical-path task must be ready, and incomplete steps execute all ready tasks); so the overall number of incomplete steps does not exceed the span T_inf (critical-path length). Thus the sum of T1/P and T_inf gives an upper bound on the execution time.
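In symbols, with T_1 the total work, T_inf the span (critical-path length), and T_P the running time on P processors under a greedy scheduler, the argument above gives:
\[
T_P \;=\; \#(\text{complete steps}) \;+\; \#(\text{incomplete steps})
\;\le\; \frac{T_1}{P} \;+\; T_\infty .
\]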
The rest of slides in the "Scheduling Theory" section are rather straightforward.

Resources