When I compile a large project (for example, Bitcoin) with both GCC (via MinGW) and MSVC (Visual Studio) using comparable optimization settings, the GCC binary is 6 MB and the MSVC binary is 4 MB.
Now I am wondering: does this mean that MSVC produces better binaries (better as in smaller and faster)? Or does it mean nothing, and the difference is just symbol information or something else unrelated to performance?
I expect a lot of comments: just benchmark it. But I'm more interested in the reason for the difference, not in the exact size/performance difference itself.
It is possible that with -O2 alone, MinGW may produce slower binaries than MSVC; I haven't tested and do not know. However, I do know that with -march=native enabled, in my own benchmarks (http://plflib.org/colony.htm#benchmarks) MinGW outperforms MSVC (with the appropriate target optimisations) by about 20%.
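To make the distinction concrete, this is the kind of flag difference being referred to (a sketch; the file name is a placeholder):

    g++ -O2 example.cpp                   # generic code for the default target architecture
    g++ -O2 -march=native example.cpp     # also uses whatever instruction set extensions the build machine has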
The main reason, I would imagine, is better customisation for individual CPU targets as opposed to MSVC's more scattershot approach. However it may be that GCC's code gen is simply better.
However, on other benchmarks MSVC might show a performance improvement. My own results are in no way definitive, but they are indicative.
Lastly, I will note that yes, MSVC does produce smaller binaries in general - but watch what you #include. Including <iostream> in GCC/libstdc++ drags in a ton of code, whereas in MSVC it drags in very little. And as others have said, smaller != faster necessarily.
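A quick way to see the <iostream> effect for yourself is to build two trivial programs and compare the stripped file sizes (a rough sketch; the exact numbers depend on the toolchain, the flags, and how libstdc++ is linked):

    // with_iostream.cpp
    #include <iostream>
    int main() { std::cout << "hello\n"; }

    // with_stdio.cpp
    #include <cstdio>
    int main() { std::puts("hello"); }

    g++ -O2 -s with_iostream.cpp -o with_iostream.exe
    g++ -O2 -s with_stdio.cpp -o with_stdio.exe

Stripping with -s keeps symbol information out of the comparison, so whatever size difference remains is actual code and data.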
According to this wxWidgets page on reducing executable size, Visual C++ is known to produce a smaller, faster executable, at least on Windows:
Use Microsoft Visual C++ instead of gcc (Cygwin or Mingw) on Windows.
It does produce smaller and faster executables.
Smaller is not necessarily faster. My latest builds make extensive use of SIMD, where the compiler can emit more than one instruction sequence for a single line of code: one path for AVX SIMD, one for SSE SIMD, and one for SSE scalar (SISD). On top of that there can be significant loop unrolling (to keep the pipeline full), with numerous repetitive instruction sequences.
Some builds may be following the same approach as on Android via Eclipse, where a single build parameter, APP_ABI := all, generates code for arm64-v8a, armeabi, armeabi-v7a, mips, mips64, x86 and x86_64, with the appropriate version selected automatically at run time.
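For reference, that NDK setting looks roughly like this in Application.mk (a sketch; the exact list of ABIs covered by "all" depends on the NDK version):

    # Application.mk
    APP_ABI := all    # build one native library per supported ABI; the matching one is loaded at run time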
Related
With GCC/Clang/ICC/etc I can use
-march=skylake etc to generate code optimized for a specific microarchitecture, and
-march=native to generate code optimized for the local machine.
How do I do these with MSVC?
Microsoft's compiler splits this into two separate areas. One is generating code specific to a particular instruction set, which won't work on a CPU that doesn't support that instruction set. This falls under its -arch: flag. The x64 compiler only supports two variants here: AVX and AVX2 (or no flag, which supports up to SSE2). The x86 version of the compiler adds a couple more flags for older instruction set extensions (e.g., SSE), but I doubt you care about that any more.
The other category is generating code that will work on any of a number of architectures, but favors one over another. This is supported by the -favor switch, which supports targets of ATOM, AMD64, INTEL64, and "blend" (which basically means to try not to favor one at the expense of others).
It doesn't have any (documented) flags for something like favoring Skylake vs. (say) Haswell or Broadwell though.
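Putting those two halves together, the closest MSVC approximation to a GCC-style tuned build looks something like this (a sketch; main.cpp is a placeholder, and flag availability depends on the compiler version):

    rem MSVC: require AVX2 support, favor instruction selection for Intel CPUs
    cl /O2 /arch:AVX2 /favor:INTEL64 main.cpp

    rem rough GCC counterpart, for comparison
    g++ -O2 -march=skylake main.cpp

There is no MSVC equivalent of -march=native; the closest you can get is picking the highest /arch your own machine supports.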
I have to install some software (the gromacs simulation package) without root access on a cluster server, on which jobs can be submitted through slurm. I only have direct access to the front-end machine, and the home directory is shared among all the servers and the front-end. I had to manually build and install locally:
gcc 4.8
automake, autoconf, cmake
openmpi
lapack libs
gromacs
Right now, I have installed all of this only on the front-end, which is an older Intel Xeon machine. The production servers have newer AMD processors instead. This is my question: in order to achieve optimal performance, which parts of the aforementioned stack should be recompiled on the production servers? I guess it would make sense to rebuild the final software (gromacs) and maybe the lapack libs, because of the different instruction sets and processor architecture, but I'm not exactly sure whether it would make any sense to rebuild the compiler or other parts of the system. Hence the question: does using a compiler (and the associated libraries) which have been built on a different machine result in higher execution times for the generated binaries?
In general, I'd expect a compiler to produce the same binaries if given the same input, so the answer would be no; but what about the libraries (such as libstdc++) which have been compiled together with the compiler on the other machine?
Thank you.
In order to optimize gromacs (a parallel molecular dynamics code), you can forget about recompiling the compiler or the compilation tools: that's useless.
You should go after optimized components instead. For Intel CPUs, using the Intel C compiler makes a difference; it's possible you would see some gains on AMD as well.
Another alternative is the Portland Group compiler.
Regarding MPI, you need to be sure it's built for your interconnect (for example, if you have InfiniBand, avoid using the standard TCP transport).
Regarding the lapack libraries, you need to install an optimized build (ACML for AMD, MKL for Intel; GotoBLAS or ATLAS BLAS also give very good performance, and they are included in many Linux distros).
You have not mentioned FFTs: they really matter for the electrostatics (Ewald summations) in the simulations, and FFTW is a good choice here. You need to install the right build for the processor, or compile it on the target processor, because it performs a sort of "auto-tuning" for the hardware it is built for.
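As a hedged sketch of what "compile it on the target processor" could look like here (the FFTW configure options vary between versions, and the Slurm usage and install prefix are only examples):

    srun --pty bash                        # get an interactive shell on a production (AMD) node via Slurm
    cd fftw-3.x
    ./configure --enable-float --enable-sse2 --prefix=$HOME/opt/fftw
    make && make install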
Going lower than this (tools, compilers) makes no difference to the produced executables.
Building GCC already involves a multi-stage bootstrap process, one purpose of which is to QA the compiler by checking that the last two stages produce identical output. So there is no reason to believe that yet another stage would have any effect at all.
My question probably sounds weird, but here is my point: I have to compile a program using GCC. If I compile GCC itself from source, will software compiled with that freshly built GCC get a slight performance edge? What should I expect?
You won't get any faster programs out of a compiler that was merely built with optimizing flags. Those flags make the compiler itself run faster; since the program is the compiler's output, and optimizations don't change the output for a correct program, the generated programs stay the same.
You might, however, profit from newly available options if your distributor ships an incomplete compiler. Look through the GCC manual for any options you want to enable (such as certain target architecture variants), and if you can't enable them in your current compiler build, there might be some potential in a custom-built compiler. However, it is unlikely to be worth it.
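For what it's worth, the flags being discussed are the ones applied to the later bootstrap stages, along these lines (a sketch; the source and prefix paths are just examples):

    ../gcc-src/configure --prefix=$HOME/opt/gcc
    make BOOT_CFLAGS='-O3' bootstrap    # optimizes the compiler binary itself, not the code it will generate
    make install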
Not unless you're building a newer version of GCC, or enabling things like CLooG, Graphite, etc.
The performance difference is usually nonexistent or negligible.
In very rare cases you can see a noticeable difference, but it isn't always an improvement: degradation is possible too.
I am currently using a GCC 3.3.3 based cross compiler to compile for an XScale PXA270 development board. However, I was wondering if there are other XScale compilers out there that run on Linux (or Windows for that matter)? The cross compiler setup I am using has horrendous performance on the target device, with certain programs that do a decent amount of math operations performing 10 to 20 times worse on the XScale processor than on a similarly clocked Pentium 2. Are there any other options for compilers, or specific compiler flags I should be setting with my GCC-based compiler, that may help with the performance?
Thanks,
Ben
Unlike the Pentium 2, the XScale architecture doesn't have native floating point instructions. That means floating point math has to be emulated using integer instructions - a 10 to 20 times slowdown sounds about right.
To improve performance, you can try a few things:
Where possible, minimise the use of floating point - in some places you may be able to substitute plain integer or fixed-point calculations (see the sketch after this list);
Trade off memory for speed by precalculating tables of values where possible;
Use floats instead of doubles in calculations where you do not need the precision of the latter (including using the C99 float versions of math.h functions);
Minimise conversions between integers and floating point types.
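As an illustration of the first and third points above, here is a minimal sketch; the function names and the 16.16 fixed-point format are made up for the example, not taken from the question:

    #include <cmath>
    #include <stdint.h>

    // Stays entirely in single precision: the float overload of std::sin is used
    // (sinf is the C99 equivalent) and the constant is 0.5f, so nothing is promoted to double.
    inline float half_sine(float x) {
        return std::sin(x) * 0.5f;
    }

    // A 16.16 fixed-point multiply as a plain-integer substitute for a float multiply.
    // Illustrative only: range and rounding need checking before real use.
    typedef int32_t fix16;
    inline fix16 fix16_mul(fix16 a, fix16 b) {
        return static_cast<fix16>((static_cast<int64_t>(a) * b) >> 16);
    }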
Yes, you don't have an FPU so floating point needs to be done in integer math. However, there are two mechanisms for doing this, and one is 11 times faster than the other.
The arm-linux-gnu GCC target normally emits real floating-point instructions for ARM's first FPU, the "FPA", which is now so rare it is effectively nonexistent. On hardware without it, these instructions cause illegal-instruction traps, which are caught and emulated in the kernel. This is extremely slow because of the context switch.
-msoft-float instead inserts calls to library functions (in libgcc.a). This avoids the switch into kernel space and is 11 times faster than the emulated FPA instructions.
You don't say what floating point model you are using - it may be that you are already building the whole userland with -msoft-float - but it might be worth checking that your object files contain no FPA instructions. You can check with:
objdump -d file | grep '<space><tab>f' | less
where file is any object file, executable or library that your compiler outputs. All FPA instructions start with f, while no other ARM instructions do. Those are actual space and tab characters there, and you might need to say <control-V><tab> to get the tab character past your shell.
If it is using FPA insns, you need to compile your entire userland using -msoft-float.
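The recompile itself is just a matter of adding that flag everywhere in your userland build, something along the lines of (the toolchain prefix and file name here are only examples):

    arm-linux-gnu-gcc -msoft-float -O2 -c foo.c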
The most comprehensive further reading on these issues is http://wiki.debian.org/ArmEabiPort, which is primarily concerned with a third alternative: using an arm-linux-gnueabi compiler, which targets a newer alternative ABI available from gcc-4.1.1 onwards and which has different characteristics. See the document for further details.
"Other xscale compilers"
Open source: llvm and pcc, of which llvm is the most linux-friendly and functional, and also has a gcc front-end; pcc, a descendant of the venerable Portable C Compiler, seems more bsd-oriented.
Commercial: The Keil compiler (owned by ARM Ltd) seems to produce faster code than GCC, but is not going to impact your lack of an FPU significantly.
Is a program shipped in assembler format portable between Linux distributions (modulo CPU architecture differences)?
Here's the background to my question: I'm working on a new programming language (named Aklo), whose modus operandi will be the classic one of compiling to .s and feeding the result to the GNU assembler.
Obviously it would be nice ultimately to have the implementation written in itself, but I had resigned myself to maintaining it in C++ to solve the chicken-and-egg problem: suppose you download the compiler for the first time and it is itself written in Aklo - how do you compile it? As I understand it, different Linux distributions and other UNIX-like systems have different conventions for binary formats.
But it's just occurred to me, a solution might be to ship the .s file (well, one per CPU architecture): it's fair to assume you have or can install the GNU assembler. Of course I'd still need a bootstrap compiler, but that doesn't need to be fast; I can write it in Python.
Is assembler portable in the way that binaries are not? Are there any other stumbling blocks I haven't thought of?
Added in response to one answer:
I had looked wistfully at LLVM, there is certainly a lot of good stuff there and it would make my life easier -- except that it would incur a dependency on the correct version of LLVM being installed. It wouldn't be so bad having that dependency on development machines, but in a world where it's common to ship programs as source, the same dependency would be incurred for every user of every program ever written in Aklo, and I decided that was too high a price to pay.
But if the solution of shipping compiled programs as assembler works... then that solves that problem, and I can use LLVM after all, which would be a big win.
So the question about the portability of assembler is considerably more important than I had first realized.
Conclusion: from answers here and on the LLVM mailing list http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-January/028991.html it seems the bad news is the problem is unsolvable, but the good news is that means using LLVM makes it no worse, so I'm free to do so and obtain all the advantages thereof.
You might want to check out LLVM before going down this particular path. It might make your life a lot easier, as it provides a low level virtual machine that makes a lot of hard stuff just work and has been very popular.
At a very high level, the ABI consists of { instruction set, system calls, binary format, libraries }.
Distribution as .s may free you from the binary format. This is still rather pointless, because you are fixed to a particular ISA and still need to use libraries and/or make system calls. Libraries vary from distribution to distribution (although this isn't really that bad, especially if you just use libc) and syscalls vary from OS to OS.
It's basically 20 years since I last bootstrapped a C compiler. At the level of compilers, the differences between Linux distributions are minimal.
The much more important reason for going LLVM is cross-platform; if you're not writing some intermediate language, your compiler will be extremely difficult to retarget for different processors. And seeing as, on my laptop, I have compilers for x86, x86_64, two kinds of MIPS, PowerPC, ARM and AVR... you see where I'm going? I can compile multiple languages for most of those targets too (only C for AVR).