I've often read that branching, at the assembly-instruction level, is bad from a performance perspective, but I haven't really seen why that is. So, why?
Most modern processors prefetch instructions and even speculatively execute them before control flow has reached them. Having a branch means that there are suddenly two different candidates for the next instruction. There are at least three ways this can interact with prefetching:
Instructions after branches aren't pre-fetched. The instruction pipeline becomes empty and the processor must wait as the next instruction is fetched at the last moment, giving worse performance.
The processor can guess which branch will be taken (branch prediction) and prefetch and execute the appropriate instruction. If it guesses the wrong branch it will have to discard the work done though, and wait for the correct instruction to be fetched.
The processor can fetch and execute both branches then discard the results from the branch that wasn't taken.
Depending on the processor and the specific code, the branch may or may not have a significant performance impact compared to equivalent code without a branch. If the processor executing the code uses branch prediction (most do) and mostly guesses correctly for a specific piece of code, the branch may not cause a significant performance impact. On the other hand, if it mostly guesses incorrectly, it can cause a huge slowdown.
It can be hard to predict for a specific piece of code whether removing the branch will significantly speed up the code. When micro-optimizing it is best to measure the performance of both approaches rather than guess.
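As a concrete (and purely hypothetical) illustration of the kind of pair you might measure, here is a branchy sum next to an arithmetically branchless equivalent in C. Whether the compiler really emits a conditional jump for the first and a cmov or masked add for the second depends on the compiler and optimization flags, so inspect the generated assembly before drawing conclusions.

```c
#include <stdint.h>
#include <stddef.h>

// Branchy version: easy on the branch predictor only if the data is predictable.
int64_t sum_below_branchy(const int32_t *a, size_t n, int32_t limit)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] < limit)          // conditional branch on data
            sum += a[i];
    }
    return sum;
}

// Branchless version: the condition becomes a data dependency (mask of 0 or all ones),
// so there is nothing to mispredict.
int64_t sum_below_branchless(const int32_t *a, size_t n, int32_t limit)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(a[i] < limit);   // 0 or 0xFFFF...FFFF
        sum += (int64_t)a[i] & mask;
    }
    return sum;
}
```

With random data the branchy version tends to suffer mispredictions; with sorted or uniform data its branch becomes highly predictable and it may well win, which is exactly why measuring both is the right call.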
It's bad because it interferes with instruction prefetch. Modern processors can start loading the bytes of the next instructions while still processing the current one, in order to run faster. When a branch is taken, the prefetched "next instructions" may have to be thrown away, which wastes time. Inside a tight loop or the like, those wasted prefetches can add up.
Because the processor doesn't know which instructions it is supposed to prefetch for execution if you give it two possibilities. If the branch goes the other way than expected, it has to flush the instruction pipeline, since the instructions it loaded are now wrong, and that makes it a couple of cycles slower...
In addition to the prefetch issues, if you're jumping, you're not doing other work...
If you think about an automobile assembly line, you hear things like "X cars come off the line in a day." That doesn't mean the raw materials started at the beginning of the line and X cars completed the whole run in a day; each car may take several days from beginning to end. That's the point of the assembly line. Imagine, though, that for some reason you had a manufacturing change and basically had to flush all the cars on the line, scrapping them or salvaging their parts for other cars at some other time. It would take a while to fill that assembly line back up and get back to X cars per day.
The instruction pipeline in a processor works the same way. There aren't hundreds of steps in the pipeline, but the concept is the same: to maintain an execution rate of one or more instructions per clock cycle (X cars per day) you need to keep that pipeline running smoothly. So you prefetch, which burns a memory cycle (usually slow, though layers of caching help); you decode, which takes another clock; and you execute, which can take many clocks, especially on a CISC like x86. When you take a branch, on most processors you have to throw away the instructions in the execute and prefetch stages, basically two thirds of your pipeline if you think in terms of a simplified, generic pipeline. Then you have to wait those clocks for the fetch and decode before you get back into smooth execution. On top of that, the fetch target, by definition not being the next sequential instruction, is some percentage of the time more than a cache line away, and some percentage of the time that means a fetch from memory or a higher-level cache, which costs even more clock cycles than executing linearly.

Another common solution is that some processors specify that, no matter what, the instruction after a branch instruction (or sometimes the two instructions after it) is always executed. This way you execute something while you flush the pipe, and a good compiler will arrange the instructions so that something other than a nop sits after each branch. The lazy way is to just put a nop or two after every branch, creating another performance hit, but folks on that platform will be used to it. A third way is what ARM does: conditional execution. For short forward branches, which are not at all uncommon, instead of saying "branch if condition", you mark the few instructions you are trying to branch over with "execute if not condition"; they go through decode and execute, behave as nops when the condition doesn't hold, and the pipe keeps moving. ARM relies on the traditional flush-and-refill for longer or backward branches.
Older x86 manuals (for the 8088/86), other equally old processor manuals, and microcontroller manuals (new and old) still publish clock cycles per instruction, and for branch instructions they say to add X clocks if the branch is taken. Manuals for your modern x86, and even ARM and other processors intended to run Windows, Linux, or other (bulky and slow) operating systems, don't bother; they often just say the part runs one instruction per clock, or talk about MIPS-to-megahertz ratios, and don't necessarily include a table of clocks per instruction. You just assume one, and remember that, like one car per day, that is only the last execution clock, not all the clocks it took to get there. Microcontroller folks in particular deal with more than one clock per instruction and have to be more aware of execution times than the average desktop application developer. Look at the specs for a few of them: the Microchip PIC (not the PIC32, which is MIPS), the MSP430, and definitely the 8051, although 8051s are or were made by so many different companies that their timing specs vary wildly.
Bottom line: for desktop applications, or even kernel drivers on an operating system, the compiler is inefficient enough and the operating system adds enough overhead that you will hardly notice the clock savings. Switch to a microcontroller and put in too many branches and your code can be two or three times slower, even with a compiler rather than assembler. Granted, using a compiler (rather than writing in assembler) can itself make your code two to three times slower; you have to balance development, maintenance, and portability against performance.
Related
I am trying to profile a CUDA application. I have a basic question about performance analysis and workload characterization of HPC programs. Let's say I want to analyse the wall-clock time (the end-to-end execution time of a program). How many times should one run the same experiment to account for the variation in the wall-clock time measurement?
How many times should one run the same experiment to account for the variation in the wall clock time measurement?
The question statement assumes that there will be a variation in execution time. Had the question been
How many times should one run CUDA code for performance analysis and workload characterization?
then I would have answered
Once.
Let me explain why ... and give you some reasons for disagreeing with me ...
Fundamentally, computers are deterministic and the execution of a program is deterministic. (Though, and see below, some programs can provide an impression of non-determinism but they do so deterministically unless equipped with exotic peripherals.)
So what might be the causes of a difference in execution times between two runs of the same program?
Physics
Do the bits move faster between RAM and CPU as the temperature of the components varies? I haven't a clue but if they do I'm quite sure that within the usual temperature ranges at which computers operate the relative difference is going to be down in the nano- range. I think any other differences arising from the physics of computation are going to be similarly utterly negligible. Only lesson here, perhaps, is don't do performance analysis on a program which only takes a microsecond or two to execute.
Note that I ignore, for the purposes of this answer, the capability of some processors to adjust their clock rates in response to their temperature. This would have some (possibly large) impact on a program's execution time, but all you'd learn is how to use it as a thermometer.
Contention for System Resources
By which I mean matters such as other processes (including the operating system) running on the same CPU / core, other traffic on the memory bus, other processes using I/O, etc. Sure, yes, these may have a major impact on a program's execution time. But what do variations in run times between runs of your program tell you in these cases? They tell you how busy the system was doing other work at the same time. And make it very difficult to analyse your program's performance.
A lesson here is to run your program on an otherwise quiet machine. Indeed one of the characteristics of the management of HPC systems in general is that they aim to provide a quiet platform to provide a reliable run time to user codes.
Another lesson is to avoid including in your measurement of execution time the time taken for operations, such as disk reads and writes or network communications, over which you have no control.
If your program is a heavy user of, say, disks, then you should probably be measuring i/o rates using one of the standard benchmarking codes for the purpose to get a clear idea of the potential impact on your program.
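As a small, hedged C sketch of that lesson (the load_input/compute/write_output split is invented for illustration), time only the compute phase so disk and network work can't pollute the measurement:

```c
#define _POSIX_C_SOURCE 199309L   // for clock_gettime under strict C standards
#include <stdio.h>
#include <time.h>

// Stub phases standing in for a real program's work.
static void load_input(void)   { /* disk/network reads: deliberately not timed */ }
static void compute(void)      { /* the computation whose time we care about */ }
static void write_output(void) { /* disk/network writes: deliberately not timed */ }

int main(void)
{
    struct timespec t0, t1;

    load_input();                              // excluded from the measurement

    clock_gettime(CLOCK_MONOTONIC, &t0);
    compute();                                 // timed region only
    clock_gettime(CLOCK_MONOTONIC, &t1);

    write_output();                            // excluded from the measurement

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("compute time: %.6f s\n", secs);
    return 0;
}
```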
Program Features
There may be aspects of your program which can reasonably be expected to produce different times from one run to the next. For example, if your program relies on randomness then different rolls of the dice might have some impact on execution time. (In this case you might want to run the program more than once to see how sensitive it is to the operations of the RNG.)
However, I exclude from this third source of variability the running of the code with different inputs or parameters. If you want to measure the scalability of program execution time wrt input size then you surely will have to run the program a number of times.
In conclusion
There is very little of interest to be learned, about a program, by running it more than once with no differences in the work it is doing from one run to the next.
And yes, in my early days I was guilty of running the same program multiple times to see how the execution time varied. I learned that it didn't, and that's where I got this answer from.
This kind of test demonstrates how well the compiled application interacts with the OS/computing environment where it will be used, as opposed to the efficiency of a specific algorithm or architecture. I do this kind of test by running the application three times in a row after a clean reboot/spinup. I'm looking for any differences caused by the OS loading and caching libraries or runtime environments on the first execution; and I expect the next two runtimes to be similar to each other (and faster than the first one). If they are not, then more investigation is needed.
Two further comments: it is difficult to be certain that you know what libraries and runtimes your application requires, and how a given computing environment will handle them, if you have a complex application with lots of dependencies.
Also, I recommend avoiding specifying the application runtime for a customer, because it is very hard to control the customer's computing environment. Focus on the things you can control in your application: architecture, algorithms, library version.
For example, on x86, two CPU cores are running different software threads.
At some moment, these two threads need to run on their CPU cores at the same time.
Is there a way to sync up these two CPU cores/threads, or something similar, to make them start running at (almost) the same time (at the instruction level)?
Use a shared variable to communicate an rdtsc-based deadline between the two threads, e.g., set a deadline of, say, the current rdtsc value plus 10,000.
Then have both threads spin on rdtsc, waiting until the gap between the current rdtsc value and the deadline is less than a threshold value T (T = 100 should be fine). Finally, use the final gap value (that is, the deadline rdtsc value minus the last-read rdtsc value) to jump into a sequence of dependent add instructions such that the number of adds executed equals the gap.
This final step compensates for the fact that each chip will generally not be "in phase" with respect to their rdtsc spin loop. E.g., assuming a 30-cycle back-to-back throughput for rdtsc readings, one chip may get readings of 890, 920, 950 etc, while the other may read 880, 910, 940 so there will be a 10 or 20 cycle error if rdtsc alone is used. Using the add slide compensation, if the deadline was 1,000, and with a threshold of 100, the first thread would trigger at rdtsc == 920 and execute 80 additions, while the second would trigger at rdtsc == 910 and execute 90 additions. In principle both cores are then approximately synced up.
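A rough C sketch of that scheme, assuming the core clock runs at the TSC frequency and that roughly one dependent add retires per cycle; a real implementation would jump into an unrolled assembly "add slide" rather than loop, since the loop itself adds per-iteration overhead:

```c
#include <stdint.h>
#include <x86intrin.h>          // __rdtsc()

#define SYNC_THRESHOLD 100      // must exceed back-to-back rdtsc latency

// Spin until we are within SYNC_THRESHOLD ticks of `deadline`, then burn the
// remaining gap with dependent adds (approximately one cycle each).
static void sync_to_tsc_deadline(uint64_t deadline)
{
    int64_t gap;
    do {
        gap = (int64_t)(deadline - __rdtsc());
    } while (gap > SYNC_THRESHOLD);

    if (gap < 0)
        gap = 0;                           // overshot the deadline, nothing to burn

    volatile uint64_t acc = 0;             // volatile keeps the adds from being optimized out
    while (gap--)
        acc += 1;                          // stand-in for the unrolled add slide
}
```

One thread would publish `__rdtsc() + 10000` through a shared variable as the deadline, and both threads would then call this function right before the code they want to start together.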
Some notes:
The above assumes the CPU frequency is equal to the nominal rdtsc frequency - if that's not the case you'll have to apply a compensation factor based on the nominal-to-true frequency ratio when calculating where to jump into the add slide.
Don't expect your CPUs to stay synced for long: anything like an interrupt, a variable-latency operation like a cache miss, or a lot of other things can make them get out of sync.
You want all your payload code, and the addition slide to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
You want T to be large enough that the gap is always positive, so somewhat larger than the back-to-back rdtsc latency, but not so large as to increase the chance of events like interrupts during the add slide.
You can check the effectiveness of the "sync" by issuing a rdtsc or rdtscp at various points in the "payload" code following the sync up and seeing how close the recorded values are across threads.
A totally different option would be to use Intel TSX: transactional extensions. Arrange for the two threads that want to coordinate to both read a shared line inside a transactional region and then spin, and have a third thread write to the shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation, and hence the subsequent TSX abort, at nearly the same time. Call the code you want to run "in sync" from the abort handler.
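A hedged sketch of that TSX idea using the RTM intrinsics (compile with -mrtm on a CPU with RTM enabled). Note that transactions also abort for unrelated reasons such as interrupts, so the waiters can be released early; treat it as best-effort:

```c
#include <immintrin.h>          // _xbegin / _XBEGIN_STARTED / _xend

volatile int release_flag = 0;  // written by the third (releasing) thread

// Each waiting thread calls this; `payload` is the code to run "in sync".
static void wait_then_run(void (*payload)(void))
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // The flag's cache line is now in the transaction's read set; the
        // releasing thread's write will abort both waiters almost together.
        while (release_flag == 0)
            ;
        _xend();                // normally never reached
    }
    // Abort-handler path: run the synchronized code here.
    payload();
}
```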
Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with execute1, because that's when an instruction like rdtsc samples the timestamp counter. Thus it's the one you can actually record the timing of and then compare later.
footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms, and early Xeon Phi: Knight's Corner. The x86-64 CPUs of today are all out-of-order, and outside of low-power Silvermont-family are aggressively so with large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm monitor / mwait to have both CPUs monitor and wait for a write to a given memory location could work. I don't know how synchronized the wakeup is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), a user-space-usable version of these is available: umonitor / umwait. But in the kernel you can probably just use monitor/mwait.)
If umonitor/umwait are available, that means you have the WAITPKG CPU feature which also includes tpause: like pause but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
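If the toolchain supports it, a hedged sketch using GCC/Clang's _tpause intrinsic (WAITPKG, compiled with -mwaitpkg) might look like this; the OS can cap how long a single tpause is allowed to wait, hence the loop:

```c
#include <stdint.h>
#include <x86intrin.h>          // __rdtsc()
#include <immintrin.h>          // _tpause (WAITPKG)

// Lightly sleep until the TSC reaches `deadline_tsc`.
static inline void wait_until_tsc(uint64_t deadline_tsc)
{
    while ((int64_t)(deadline_tsc - __rdtsc()) > 0)
        _tpause(1, deadline_tsc);   // control 1 = C0.1, the faster-waking state
}
```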
Otherwise you could spin-wait on a rdtsc deadline and probably get within ~25 cycles at worst on Skylake.
rdtsc has a throughput of one per 25 cycles on Skylake (https://agner.org/optimize/), so you can expect each thread to be on average 12.5 cycles late leaving the spin-wait loop, ±12.5. I'm assuming the branch-mispredict cost is the same for both threads. These are core clock cycles, not the reference cycles that rdtsc counts; RDTSC typically ticks close to the max non-turbo clock. See How to get the CPU cycle count in x86_64 from C++? for more about RDTSC from C.
See How much delay is generated by this assembly code in linux for an asm function that spins on rdtsc waiting for a deadline. You could write this in C easily enough.
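For example, a minimal C sketch of such a spin-wait (assuming the __rdtsc() intrinsic) that also reports how late the thread left the loop, so the two threads can compare afterwards:

```c
#include <stdint.h>
#include <x86intrin.h>          // __rdtsc()

// Spin until the TSC reaches `deadline`; return how many reference ticks late
// we left the loop (bounded roughly by rdtsc's throughput).
static inline uint64_t spin_until_tsc(uint64_t deadline)
{
    uint64_t now;
    do {
        now = __rdtsc();
    } while ((int64_t)(deadline - now) > 0);
    return now - deadline;
}
```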
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. And it can even lead to the scheduler doing a context switch to another thread. Or a CPU migration for the task, obviously costing a lot of cycles.
I'm running some benchmarks and I'm wondering whether using a "tickless" (a.k.a CONFIG_NO_HZ_FULL_ALL) Linux kernel would be useful or detrimental to benchmarking.
The benchmarks I am running will be repeated many times using a new process each time. I want to control as many sources of variation as possible.
I did some reading on the internet:
https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
https://lwn.net/Articles/549580/
From these sources I have learned that:
In the default configuration (CONFIG_NO_HZ=y), only non-idle CPUs receive ticks. Therefore under this mode my benchmarks would always receive ticks.
In "tickless" mode (CONFIG_NO_HZ_FULL_ALL), all CPUs but one (the boot processor) are in "adaptive-tick" mode. When a CPU is in adaptive-tick mode, ticks are only received if there is more than a single job in the schedule queue for the CPU. The idea being that if there is a sole process in the queue, a context switch cannot happen, so sending ticks is not necessary.
On one hand, not having benchmarks receive ticks seems like a great idea, since we rule out the tick routine as a source of variation (we do not know how long the tick routines take). On the other hand, I think tickless mode could introduce variations in benchmark timings.
Consider my benchmarking scenario running on a tickless kernel. Suppose we repeat a benchmark twice.
Suppose the first run is lucky, and gets scheduled onto an adaptive-tick CPU which was previously idle. This benchmark will therefore not be interrupted by ticks.
When the benchmark is run a second time, perhaps it is not so lucky and gets put on a CPU which already has some processes scheduled. This run will be interrupted by ticks at regular intervals in order to decide if one of the other processes should be switched in.
We know that ticks impose a performance hit (context switch plus the time taken to run the routine). Therefore the first benchmark run had an unfair advantage, and would appear to run faster.
Note also that a benchmark which initially has an adaptive-tick CPU to itself may find that mid-benchmark another process gets thrown on to the same CPU. In this case the benchmark is initially not receiving ticks, then later starts receiving them. This means benchmark performance can change over time.
So I think tickless mode (under my benchmarking scenario at-least) introduces timing variations. Is my reasoning correct?
One solution would be to use an isolated adaptive-tick CPU for benchmarking (isolcpus + taskset), however we have already ruled out isolated CPUs since this introduces artificial slowdowns in our multi-threaded benchmarks.
For your "unlucky" scenario above, there has to be an active job scheduled on the same processor. This is not likely to be the case on an otherwise generally idle system, assuming that you have multiple processors. Even if this happens on one or two occasions, that means your benchmark might see the effect of one or two ticks - which hardly seems problematic.
On the other hand if it happens on many more occasions, this would be a general indication of high processor load - not an ideal scenario for running benchmarks anyway.
I would suggest, though, that "ticks" are not likely to be a significant source of variation in your benchmark timings. The scheduler is supposed to be O(1). I doubt you will see much difference in variation between tickless and non-tickless mode.
I'm learning about Concurrency and how the OS handles interrupts such as moving your cursor across the screen while a program is doing some important computation like large matrix multiplications.
My question is: say you're on one of those old computers with a single core, wouldn't that single core need to constantly context-switch to handle your interrupts because of all the cursor movement, so that the important computation needs more time? But I assume it's not that huge a delay, because perhaps the OS prioritizes my calculation above my interrupts? Maybe skipping a few frames of the movement.
And if we move to a multi-core system, is this generally less likely to happen as the cursor moving will probably be processed by another core? So my calculations will not really be that delayed?
While I am at this, am I right to assume that the single-core computer probably goes through like hundreds of processes and it context-switches throughout all of them? So quite literally, your computer is doing one instruction at a time for a certain amount of time (a time slice) and then it needs to switch to another process with a different set of instructions. If so, it's amazing how the core does so much.... Jump, get the context, do a few steps, save the context onto stack, jump yet again. Rinse and repeat. There's obviously no parallelism. So no two instructions are EVER running at the same time. It only gives that illusion.
Your last paragraph is correct: it's the job of the operating system's scheduler to generate the feeling of parallelism by letting each process execute some instructions and then continuing with the next. This does not only affect single-core CPUs, by the way; typically your computer will be running many more processes than you have CPUs. (Use Task Manager on Windows or top on Linux to see how many processes are currently running.)
In terms of your mouse question: that interrupt will most likely just change the current mouse coordinates in a variable and not cause a repaint. Therefore it is going to be extremely fast and should not cause any measurable delay in program execution. Maybe it would if you could move your mouse at the speed of light ;)
I'm modeling some algorithms to run on GPUs. Is there a reference for how many cycles the various intrinsics and calculations take on modern hardware (NVIDIA 5xx+ series, AMD 6xxx+ series)? I can't seem to find any official word on this, even though there are mentions of the raised costs of normalization, square root, and other functions throughout their documentation. Thanks.
Unfortunately, the cycle count documentation you're looking for either doesn't exist, or (if it does) it probably won't be as useful as you would expect. You're correct that some of the more complex GPU instructions take more time to execute than the simpler ones, but cycle counts are only important when instruction execution time is the main performance bottleneck; GPUs are designed such that this is very rarely the case.
The way GPU shader programs achieve such high performance is by running many (potentially thousands) of shader threads in parallel. Each shader thread generally executes no more than a single instruction before being swapped out for a different thread. In perfect conditions, there are enough threads in flight that some of them are always ready to execute their next instruction, so the GPU never has to stall; this hides the latency of any operation executed by a single thread. If the GPU is doing useful work every cycle, then it's as if every shader instruction executes in a single cycle. In this case, the only way to make your program faster is to make it shorter (fewer instructions = fewer cycles of work overall).
Under more realistic conditions, when there isn't enough work to keep the GPU fully loaded, the bottleneck is virtually guaranteed to be memory accesses rather than ALU operations. A single texture fetch can take thousands of cycles to return in the worst case; with unpredictable stalls like that, it's generally not worth worrying about whether sqrt() takes more cycles than dot().
So, the key to maximizing GPU performance isn't to use faster instructions. It's about maximizing occupancy -- that is, making sure there's enough work to keep the GPU sufficiently busy to hide instruction / memory latencies. It's about being smart about your memory accesses, to minimize those agonizing round-trips to DRAM. And sometimes, when you're really lucky, it's about using fewer instructions.
http://books.google.ee/books?id=5FAWBK9g-wAC&lpg=PA274&ots=UWQi5qznrv&dq=instruction%20slot%20cost%20hlsl&pg=PA210#v=onepage&q=table%20a-8&f=false
This is the closest thing I've found so far; it is outdated (SM3), but I guess it's better than nothing.
Do operators/functions have a cycle count? I know assembly instructions have cycle counts; that's the low-level time measurement, and it mostly depends on the CPU. Since operators and functions are high-level programming constructs, I don't think they have such a measurement.