What is the difference between gcc optimization levels? - linux

What is the difference between the different optimization levels in GCC? Assuming I don't care to have any debug hooks, why wouldn't I just use the highest level of optimization available to me? Does a higher level of optimization necessarily (i.e. provably) generate a faster program?

Yes, a higher level can sometimes mean a better performing program. However, it can cause problems depending on your code. For example, the instruction reordering and hoisting of memory accesses that the optimizer performs at -O1 and up can break poorly written multithreaded programs by exposing latent race conditions. The optimizer assumes your code follows the language rules and rewrites it into something "better" than what you wrote, which in some cases might not behave the way you expected.
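A minimal sketch (hypothetical code, not from the question) of the kind of program that breaks: a plain int flag shared between threads with no synchronization. At -O1/-O2 the compiler is allowed to hoist the load out of the spin loop, turning the wait into an infinite loop; the correct fix is to make the code well-defined (an atomic or a mutex), not to turn the optimizer off.

/* build: gcc -O2 -pthread race.c
 * "done" should be _Atomic int (or protected by a mutex); as written,
 * the read in main() may be hoisted out of the loop by the optimizer. */
#include <pthread.h>
#include <stdio.h>

static int done = 0;                  /* unsynchronized shared flag */

static void *worker(void *arg)
{
    (void)arg;
    done = 1;                         /* unsynchronized write */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    while (!done)                     /* unsynchronized read; may spin forever at -O2 */
        ;
    pthread_join(t, NULL);
    puts("done");
    return 0;
}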
And sometimes the higher optimization levels (-O3) add little benefit but a lot of extra code size. Only your own testing can determine whether that size tradeoff yields a worthwhile performance gain on your system.
As a final note, the GNU project compiles all of their programs at -O2 by default, and -O2 is fairly common elsewhere.

Generally optimization levels higher than -O2 (just -O3 for gcc but other compilers have higher ones) include optimizations that can increase the size of your code. This includes things like loop unrolling, lots of inlining, padding for alignment regardless of size, etc. Other compilers offer vectorization and inter-procedural optimization at levels higher than -O3, as well as certain optimizations that can improve speed a lot at the cost of correctness (e.g., using faster, less accurate math routines). Check the docs before you use these things.
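As a hedged illustration of the correctness-for-speed tradeoff (in GCC the relevant switch is -ffast-math, which no plain -O level enables; other compilers have similar "fast" math modes): under its finite-math assumptions a NaN/infinity check may be compiled away entirely, so code relying on such checks can silently change meaning. The example below is illustrative, not taken from any real library.

/* build and compare (behavior may vary by GCC version):
 *   gcc -O2 nan_check.c              -> check behaves as written
 *   gcc -O2 -ffast-math nan_check.c  -> check may be folded to "false"   */
#include <math.h>
#include <stdio.h>

int is_bad(double x)
{
    return isnan(x) || isinf(x);      /* may become "return 0;" under -ffast-math */
}

int main(void)
{
    double nan = 0.0 / 0.0;           /* produces a NaN under standard semantics */
    printf("%d\n", is_bad(nan));      /* normally prints 1; may print 0 with -ffast-math */
    return 0;
}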
As for performance, it's a tradeoff. In general, compiler designers tune these things so that they don't decrease the performance of your code, so -O3 will usually help (at least in my experience), but your mileage may vary. It's not always the case that really aggressive size-increasing optimizations improve performance (e.g., very aggressive inlining can cause instruction-cache pollution).

I found a web page containing some information about the different optimization levels. One thing I remember hearing somewhere is that optimization might actually break your program, and that can be an issue. But I'm not sure how much of an issue that is any longer; perhaps today's compilers are smart enough to handle those problems.

Sidenote:
It's quite hard to predict exactly which flags are turned on by the global -O options on the gcc command line, since they vary across versions and platforms, and the documentation on the GCC site tends to go out of date quickly or not cover the compiler internals in enough detail.
Here is an easy way to check exactly what happens on your particular setup when you use one of the -O flags and other -f flags and/or combinations thereof:
Create an empty source file somewhere: touch dummy.c
Run it through the compiler just as you normally would, with all the -O, -f and/or -m flags you would normally use, but adding -Q -v to the command line: gcc -c -Q -v dummy.c
Inspect the generated output, perhaps saving it for comparison with a different run.
Change the command line to your liking, remove the generated object file via rm -f dummy.o, and re-run.
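On reasonably recent GCC versions there is also a more direct way to ask which optimizer flags a given level enables (the exact output format varies by release); diffing two such listings shows exactly what one level adds over another on your setup:
gcc -Q -O2 --help=optimizers > O2.txt
gcc -Q -O3 --help=optimizers > O3.txt
diff O2.txt O3.txt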
Also, always keep in mind that, from a purist point of view, most non-trivial optimizations generate "broken" code (where broken is defined as deviating from the optimal path in corner cases), so choosing whether or not to enable a certain set of optimization mechanisms sometimes boils down to choosing the level of correctness for the compiler output. There always have been (and currently are) bugs in any compiler's optimizer - just check the GCC mailing list and Bugzilla for some samples. Compiler optimization should only be used after actually performing measurements, since: gains from using a better algorithm will dwarf any gains from compiler optimization; there is no point in optimizing code that runs only once in a blue moon; and if the optimizer introduces bugs, it's immaterial how fast your code runs.

Related

Which nodejs v8 flags for benchmarking?

For comparison of different libraries with the same functionality, we compare their execution time. This works great. However, there are v8 flags that impact execution time and skew results.
Some flags that are relevant are: --predictable, --always-opt, --no-opt, --minimal.
Question: Which v8 flags should typically be set for running meaningful benchmarks? What are the tradeoffs?
Edit: The problem is that a benchmark typically runs the same code over and over to get a good average. This might lead to v8 optimizing code, which it would typically not optimize.
V8 developer here. You should definitely run benchmarks with the default configuration. It is the responsibility of the benchmark to be realistic. An unrealistic benchmark cannot be made meaningful with engine flags. (And yes, there are many many unrealistic and/or otherwise meaningless snippets of code out there that people call "benchmarks". Remember, if you can't measure a difference with a realistic benchmark, then any unmeasurable difference that might exist is irrelevant.)
In particular:
--predictable
Absolutely not. Detrimental to performance. Changes behavior in unrealistic ways. Meant for debugging certain things, and for helping fuzzers find reproducible test cases (at the expense of being somewhat unrealistic), not for anything related to performance testing.
--always-opt
Absolutely not. Contrary to what a naive reader of this flag's name might think, this does not improve performance; on the contrary, it mostly causes V8 to waste a bunch of CPU cycles on useless work. This flag is barely ever useful at all; it can sometimes flush out weird corner case bugs in the compilation pipeline, but most of the time it just creates pointless work for V8 developers by creating artificial situations that never occur in practice.
--no-opt
Absolutely not. Turns off all optimizations. Totally unrealistic.
--minimal
That's not a V8 flag I've ever heard of. So yeah, sure, pass it along, it won't do anything (beyond printing an "unknown flag" warning), so at least it won't break anything.
Using default flags seems like the best way to me, since that's what most people will use.

Why is "cabal build" so slow compared with "make"?

Suppose I have a package with several executables, which I initially build using cabal build. Now I change one file that affects just one executable; cabal seems to take about a second or two to examine each executable to see whether it's affected or not. On the other hand, make, given an equivalent number of executables and source files, will determine in a fraction of a second what needs to be recompiled. Why the huge difference? Is there a reason cabal can't just build its own version of a makefile and go from there?
Disclaimer: I'm not familiar enough with Haskell or make internals to give technical specifics, but some web searching offers insight that lines up with my proposal (I'm providing references to avoid this being pure opinion). Also, I'm assuming your makefile is calling ghc, as cabal apparently would.
Proposal: I believe there could be several key reasons, but the main one is that make is written in C, whereas cabal is written in Haskell. This would be coupled with superior dependency checking from make (although I'm not sure how to prove this without looking at the source code). Other supporting reasons, as found on the web:
cabal tries to do a lot more than simply compiling, e.g. appears to take steps with regard to packaging (https://www.haskell.org/cabal/)
cabal is written in Haskell, although the GHC runtime is written in C (https://en.wikipedia.org/wiki/Glasgow_Haskell_Compiler)
Again, without being overly familiar with make internals, make may simply have a faster dependency-checking mechanism and therefore track these changes better. I point this out because, from the OP, the difference sounds significant enough that cabal may be doing a blanket check against all dependencies. I suspect this would be the primary reason for the speed difference, if true.
At any rate, these are open source and can be downloaded from their respective sites (haskell.org/cabal/ and savannah.gnu.org/projects/make/), allowing anyone to examine the specifics of the implementations.
It is also likely one could see a lot of variance in speed based upon the switches passed to the compilers in use.
Hope this at least points you in the right direction.

Can -pthreads (gcc) or -mt (sun studio) or similar options cause problems?

I'm working on a build for an old project that wasn't maintained well at all; it's more or less a hodgepodge of hundreds of independent projects that get cobbled together. Naturally, that means there's a lot of inappropriate things going on.
There are probably 50-100 executables, and around 300 shared/static libraries. Some of the libraries are built with the -mt flag (Sun Studio; -pthreads appears to be the gcc equivalent), others aren't.
This seems potentially problematic to me. Suppose I have libA.so and libB.so -- A was built with -mt, but not B. I expect if the application linked against A & B is single-threaded there won't be any problems (feel free to correct me on that). However, if the app is threaded then this situation opens a fun can of worms.
I'm inclined to just build everything with -mt and have done with it. Most of the office agrees with this plan, but there's one dissenter. My expectation is that this will just create a potential degradation in performance, but at the moment performance is already abysmal because of the poor state of this project; so I'm not worried about that.
In short: are there any potential pitfalls with doing this?
I would expect problems for libraries that actively care about whether they are compiled with thread support (#ifdef _REENTRANT is a sign of it). When some code has not been compiled for ages and suddenly becomes active, its problems may become visible. (It's not even so much about threading; it applies to any code that was ifdef'd out for a long time.)
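A minimal sketch of what that looks like (the strings and structure are made up for illustration): -mt on Sun Studio and -pthread/-pthreads on gcc typically define _REENTRANT, so #ifdef branches that have been dead for years suddenly start compiling and running.

/* build: cc file.c            -> single-threaded path
 *        cc -mt file.c        -> thread-aware path (Sun Studio)
 *        gcc -pthread file.c  -> thread-aware path (gcc)               */
#include <stdio.h>

#ifdef _REENTRANT
static const char *build_mode = "thread-aware path (locking, per-thread state)";
#else
static const char *build_mode = "single-threaded path (no locking at all)";
#endif

int main(void)
{
    printf("compiled for: %s\n", build_mode);
    return 0;
}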
A special case of the above is a library that tries to use pthread_atfork in the manner described in the RATIONALE section here: these things are far from well-defined as far as POSIX is concerned, because creating threads, releasing locks, etc. is not async-signal-safe. Your platform may provide some guarantees about what actually happens (e.g. it might even specify whether atfork handlers are invoked for a fork happening in a signal handler). This should not matter if fork is not actually used, but otherwise it has a real chance to misbehave.
Summary: don't expect many problems from libraries that don't care about threading. Expect them from libraries that do care and are (possibly, in some circumstances) doing it wrong.

The compilation of the compiler could affect the compiled programs?

My question probably sounds weird, but here is my point: I have to compile a program using GCC. If I compile GCC itself from source, will I get a slight performance edge in software compiled with the freshly built GCC? What should I expect?
You won't get faster programs out of a compiler that was itself built with optimization flags. Since a program is the compiler's output, and optimizations don't change the output of a correct program, the programs it produces stay the same (the compiler itself may run faster, though).
You might, however, profit from newly available options if your distribution ships an incomplete compiler. Look through the GCC manual for any options you want to enable (like certain target architecture variants), and if you can't enable them in your current compiler build, there might be potential in a custom-built compiler. However, it is unlikely to be worth it.
Not unless you're building a newer version of gcc, or enabling cloog, graphite, etc.
The performance difference is usually nonexistent or negligible.
In very rare, really very rare cases you can see a noticeable difference, but it is not always an improvement; degradation is possible too.

How to benchmark a crypto library?

What are good tests to benchmark a crypto library?
Which unit (time, CPU cycles, ...) should we use to compare the different crypto libraries?
Are there any tools, procedures....?
Any ideas or comments are welcome!
Thank you for your input!
I assume you mean performance benchmarks. I would say that both time and cycles are valid benchmarks, as some code may execute differently on different architectures (perhaps wildly differently if they're different enough).
If it is extremely important to you, I would do the testing myself. You can use some timer (almost all languages have one) or you can use some profiler (almost all languages have one of these too) to figure out the exact performance for the algorithms you are looking for on your target platform.
If you are looking at one algorithm vs. another one, you can look for data that others have already gathered and that will give you a rough idea. For instance, here are some benchmarks from Crypto++:
http://www.cryptopp.com/benchmarks.html
Note that they use MB/Second and Cycles/Byte as metrics. I think those are very good choices.
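If you do roll your own measurement along those lines, here is a minimal sketch in C; encrypt_buffer() is a placeholder for whatever library call you actually want to measure, and the buffer size, iteration count, and CPU frequency are arbitrary assumptions you would adjust for your machine.

/* build: gcc -O2 bench_sketch.c && ./a.out */
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE  (1024 * 1024)       /* 1 MiB per call              */
#define ITERS     256                 /* 256 MiB processed in total  */
#define CPU_GHZ   2.0                 /* your measured clock, for an approximate cpb */

static unsigned char buf[BUF_SIZE];

static void encrypt_buffer(unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)    /* placeholder "cipher" so the sketch is self-contained */
        p[i] ^= (unsigned char)i;
}

int main(void)
{
    struct timespec t0, t1;
    memset(buf, 0xAB, sizeof buf);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        encrypt_buffer(buf, sizeof buf);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)BUF_SIZE * ITERS;

    printf("%.1f MiB/s, ~%.2f cycles/byte at %.1f GHz\n",
           bytes / secs / (1024 * 1024),
           (CPU_GHZ * 1e9 * secs) / bytes,
           CPU_GHZ);
    return 0;
}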
Some very good answers before me, but keep in mind that optimizations are a very good way to leak key material through timing attacks (for example, see how devastating this can be for AES). If there is any chance an attacker can time your operations, you want not the fastest but the most constant-time library available (and possibly the most constant power usage, if there is any chance someone can monitor that too). OpenSSL does a great job of keeping on top of current attacks; I can't necessarily say the same of other libraries.
What are good tests to benchmark a crypto library?
The answers below are in the context of Crypto++. I don't know about other libraries, like OpenSSL, Botan, BouncyCastle, etc.
The Crypto++ library has a built-in benchmarking suite.
Which unit (time, CPU cycles, ...) should we use to compare the different crypto libraries?
You typically measure performance in cycles-per-byte. Cycles-per-byte depends upon the CPU frequency. Another related metric is throughput measured in MB/s. It also depends upon CPU frequency.
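To convert between the two: cycles per byte = (clock rate in cycles/s) / (throughput in bytes/s). For example, at a 2.0 GHz clock, 500 MB/s corresponds to roughly 2.0e9 / 500e6 = 4 cycles per byte, and conversely 4 cycles/byte at 2.0 GHz is about 500 MB/s.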
Are there any tools, procedures....?
git clone https://github.com/weidai11/cryptopp.git
cd cryptopp
make static cryptest.exe
# CPU speed in GHz -- here a 1.86 GHz machine (use KB=1024; not 1000)
make bench CRYPTOPP_CPU_SPEED=1.8626
make bench will create a file called benchmark.html.
If you want to manually run the tests, then:
./cryptest.exe b <time in seconds> <cpu speed in GHz>
It will output an HTML-like table without <HEAD> and <BODY> tags. You will still be able to view it in a web browser.
You can also check the Crypto++ benchmark page at Crypto++ Benchmarks. The information is dated; updating it is on our TODO list.
You also need some acumen for what looks right. For example, SSE4.2 and ARMv8 have a CRC32 instruction. Cycles-per-byte should go from about 3 to 5 cpb (software only) to about 1 to 1.5 cpb (hardware acceleration). That should equate to a change from roughly 300 to 500 MB/s (software only) to roughly 1.5 GB/s (hardware acceleration) on modern hardware running around 2 GHz.
Other technologies, like SSE2 and NEON, are trickier to work with. There's a theoretical cycles-per-byte and throughput you should see, but you may not know what it is. You may need to contact the authors of the algorithm to find out. For example, we contacted the authors of BLAKE2 to learn if our ARMv7/ARMv8 NEON implementation was performing as expected because it was missing benchmark results on the author's homepage.
I've also found that GCC 4.6 (and above) with -O3 can make a big difference in software-only implementations. That's because GCC auto-vectorizes heavily at -O3, and you might see a 2x to 2.5x speedup. For example, the compiler may generate code that runs at 40 cpb at -O2; at -O3 it may run at 15 or 19 cpb. A good SSE2 or NEON implementation should outperform the software-only implementation by at least a few cycles per byte. In the same example, the SSE2 or NEON implementation may run at 8 to 13 cpb.
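As a hedged illustration (a made-up function, not taken from Crypto++) of the kind of byte-oriented loop the auto-vectorizer targets at -O3; on newer GCC you can ask the compiler to report what it vectorized with -fopt-info-vec.

/* build and compare:
 *   gcc -O3 -fopt-info-vec xor_demo.c   (newer GCC; reports vectorized loops)
 *   gcc -O2 xor_demo.c                  (little or no SIMD, depending on GCC version) */
#include <stddef.h>
#include <stdio.h>

void xor_buffers(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)        /* candidate for SSE2/NEON auto-vectorization */
        dst[i] ^= src[i];
}

int main(void)
{
    unsigned char a[64] = {1}, b[64] = {2};
    xor_buffers(a, b, sizeof a);
    printf("%d\n", a[0]);                 /* prints 3 */
    return 0;
}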
There are also sites like OpenBenchmarking.org that may be able to provide some metrics for you.
My comments above aside, the US government has the FIPS program that you might want to look at. It's not perfect (by a long shot) but it's a start -- you can get an idea of the things they were looking at when evaluating cryptography.
I also suggest looking at the Computer Security Division of the NIST.
Also, on a side note ... reviewing what the master has to say (Bruce Schneier) on the subject of Security Pitfalls in Cryptography is always good. Also: Security is harder than it looks.
