I am currently using a GCC 3.3.3 based cross compiler to compile for an XScale PXA270 development board. However, I was wondering whether there are other XScale compilers that run on Linux (or Windows, for that matter). The cross compiler setup I am using gives horrendous performance on the target device: certain programs that do a decent amount of math perform 10 to 20 times worse on the XScale processor than on a similarly clocked Pentium 2. Are there any other compilers I could try, or specific compiler flags I should be setting with my GCC-based compiler, that may help with the performance?
Thanks,
Ben
Unlike the Pentium 2, the XScale architecture doesn't have native floating point instructions. That means floating point math has to be emulated using integer instructions - a 10 to 20 times slowdown sounds about right.
To improve performance, you can try a few things:
Where possible, minimise the use of floating point - in some places, you may be able to substitute plain integer or fixed point calculations;
Trade-off memory for speed, by precalculating tables of values where possible;
Use floats instead of doubles in calculations where you do not need the precision of the latter (including using the C99 float versions of math.h functions);
Minimise conversions between integers and floating point types.
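To make the fixed point suggestion above concrete, here is a minimal sketch in C. The Q16.16 format and the fix16 names are assumptions chosen for illustration, not something from the original answer; pick whatever integer/fraction split suits your data range.

```c
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits and 16 fractional bits.
   The format and all fix16_* names are illustrative assumptions. */
typedef int32_t fix16;

#define FIX16_ONE ((fix16)1 << 16)

/* Convert a small integer to fixed point (overflows if |n| >= 32768). */
static fix16 fix16_from_int(int32_t n)
{
    return (fix16)(n * FIX16_ONE);
}

/* Multiply: widen to 64 bits so the intermediate product cannot
   overflow, then shift the extra 16 fractional bits back out. */
static fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * (int64_t)b) >> 16);
}

/* Divide: pre-scale the dividend so the quotient keeps 16 fractional
   bits. The caller must make sure b is not zero. */
static fix16 fix16_div(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * FIX16_ONE) / b);
}

/* Drop the fractional bits. Right-shifting a negative value is
   implementation-defined in C; GCC uses an arithmetic shift. */
static int32_t fix16_to_int(fix16 a)
{
    return a >> 16;
}
```

For example, fix16_mul(fix16_from_int(3), FIX16_ONE / 2) computes 3 * 0.5 using only integer instructions, which is the kind of work the XScale does natively.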
Yes, you don't have an FPU so floating point needs to be done in integer math. However, there are two mechanisms for doing this, and one is 11 times faster than the other.
GCC target arm-linux-gnu normally includes real floating point instructions in the code for ARM's first FPU, the "FPA", now so rare it is nonexistent. These cause illegal instruction traps which are then caught and emulated in the kernel. This is extremely slow due to the context switch.
-msoft-float instead inserts calls to library functions (in libgcc.a). This avoids the switch into kernel space and is 11 times faster than the emulated FPA instructions.
You don't say what floating point model you are using - it may be that you are already building the whole userland with -msoft-float - but it might be worth checking that your object files contain no FPA instructions. You can check with:
objdump -d file | grep '<space><tab>f' | less
where file is any object file, executable or library that your compiler outputs. All FPA instructions start with f, while no other ARM instructions do. Those are actual space and tab characters there, and you might need to say <control-V><tab> to get the tab character past your shell.
If it is using FPA insns, you need to compile your entire userland using -msoft-float.
The most comprehensive further reading on these issues is http://wiki.debian.org/ArmEabiPort which is primarily concerned with a third alternative: using an arm-linux-gnueabi compiler, a newer alternative ABI that is available from gcc-4.1.1 onwards and which has different characteristics. See the document for further details.
"Other xscale compilers"
Open source: llvm and pcc, of which llvm is the most linux-friendly and functional, and also has a gcc front-end; pcc, a descendant of the venerable Portable C Compiler, seems more bsd-oriented.
Commercial: The Keil compiler (owned by ARM Ltd) seems to produce faster code than GCC, but is not going to impact your lack of an FPU significantly.
With GCC/Clang/ICC/etc I can use
-march=skylake etc to generate code optimized for a specific microarchitecture, and
-march=native to generate code optimized for the local machine.
How do I do these with MSVC?
Microsoft's compiler splits this into two separate areas. One is generating code specific to a particular instruction set, which won't work on a CPU that doesn't support that instruction set. This falls under its -arch: flag. The x64 compiler only supports two variants here: AVX and AVX2 (or no flag, which supports up to SSE2). The x86 version of the compiler adds a couple more flags for older instruction set extensions (e.g., SSE), but I doubt you care about that any more.
The other category is generating code that will work on any of a number of architectures, but favors one over another. This is supported by the -favor switch, which supports targets of ATOM, AMD64, INTEL64, and "blend" (which basically means to try not to favor one at the expense of others).
It doesn't have any (documented) flags for something like favoring Skylake vs. (say) Haswell or Broadwell though.
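One way to confirm which instruction set level a given build actually targets is to look at the compiler's predefined macros. The sketch below is only an illustration: the function name compiled_simd_level is made up, and it reports the level the code was compiled for, not what the machine running it supports.

```c
/* Report which x86 SIMD level this translation unit was compiled for.
   MSVC defines __AVX__ and __AVX2__ under the corresponding arch
   options; GCC and Clang define the same macros under -mavx / -mavx2
   or a matching -march value. The function name is hypothetical. */
static const char *compiled_simd_level(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__) || defined(_M_X64) || \
      (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    /* x64 MSVC always has SSE2; 32-bit MSVC signals it via _M_IX86_FP. */
    return "SSE2";
#else
    return "baseline";
#endif
}
```

Printing the result from a test program built with and without the arch option is a quick check that the flag reached the compiler.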
While attempting to solve linear programming problems using GLPK's GLPSOL we've come upon a snag: in very specific cases, glpsol executables created with different compilers produce different results.
The situation is that we have a problem with several valid solutions. To put it simply, we have a table where each row (X) can be assigned only one column (Y), and vice versa. As such, all combinations that assign unique column/row pairs are valid.
Example, for a 2x2 table, these are valid:
{(X0,Y0),(X1,Y1)} {(X0,Y1),(X1,Y0)}
Now, the original glpsol binary we used under windows, returned the results in order, something like this:
{(X0,Y0),(X1,Y1)...(Xn,Yn)}
We noticed an issue with the Linux binary, in that it returned the solution in a different order, something like this:
{(X0,Y0),(Xn,Y1),(X1,Y2) ....}
Note that the order is not random, every execution follows the same pattern.
After much investigation I discovered that the issue lies in which compiler was used to create each binary. In our example above, the Windows binary was compiled using Visual C++, while the Linux binary used GCC.
I've verified this by recompiling the Windows binary using GCC, resulting in the same pattern. Compiling with Borland results in a different pattern.
So the question is, mainly, why is this happening?
I'm guessing it might be the result of how each compiler optimizes the binary, but I'm not sure, and my objective is to obtain the same results we had with the original executable (the one compiled with Visual C++) both for Windows and Linux. And I am suspecting cross-compiling with the Visual C++ toolchain won't be an option.
Note: I managed to determine the compiler used by each binary by opening them as text and locating text strings within the executable referencing Visual C++ and GNU GCC respectively.
Thanks!
Versions of the solver built with different compilers can take different paths during the optimization process, which can result in the behavior you observe. Things that can affect this are: differences in floating-point semantics (possibly caused by -ffast-math); different implementations of sort (qsort is not guaranteed to be a stable sort), as mentioned by Ben Voigt; and different implementations of random number generators in the standard libraries.
If both solutions are optimal, I wouldn't be too concerned about this.
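The qsort point can be illustrated with a small sketch. Whether GLPK sorts anything this way is not established here; the struct and function names are hypothetical. The idea is only that if the comparison never reports two elements as equal, every conforming qsort implementation has to produce the same order, whichever C library it comes from.

```c
#include <stdlib.h>

/* Hypothetical record: cost is the sort key, index is the element's
   original position and is used only to break ties. */
struct candidate {
    double cost;
    int index;
};

/* Compare by cost first, then by original index. Because distinct
   elements never compare equal, the result does not depend on how a
   particular qsort implementation happens to order ties. */
static int compare_candidates(const void *pa, const void *pb)
{
    const struct candidate *a = pa;
    const struct candidate *b = pb;

    if (a->cost < b->cost) return -1;
    if (a->cost > b->cost) return 1;
    return (a->index > b->index) - (a->index < b->index);
}

static void sort_candidates(struct candidate *items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_candidates);
}
```

This removes only one source of variation. Differences in floating-point evaluation between compilers can still change which equal-cost solution the solver reaches.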
When I compile a large project (for example, Bitcoin) in both GCC (using MinGW) and in MSVC (using Visual Studio) using comparable optimization settings, the GCC binary is 6 MB and the MSVC binary is 4 MB.
Now I am wondering, does this say that MSVC produces better binaries (and I mean better as in smaller and faster)? Or doesn't this mean anything, and it's just symbol information or something unrelated to performance?
I expect a lot of comments: just benchmark it. But I'm more interested in the reason for the difference, not in the exact size/performance difference itself.
It is possible that with -O2 only, MinGW may produce slower binaries than MSVC. I haven't tested this and do not know. However, I do know that with -march=native enabled, in my own benchmarks (http://plflib.org/colony.htm#benchmarks) MinGW outperforms MSVC (with the appropriate target optimisations) by about 20%.
The main reason, I would imagine, is better customisation for individual CPU targets as opposed to MSVC's more scattershot approach. However it may be that GCC's code gen is simply better.
However, on other benchmarks MSVC might show a performance improvement. My own results are in no way definitive, but they are indicative.
Lastly I will note that yes, MSVC does produce smaller binaries in general - but watch what you #include. Including iostream in GCC/libstdc++ drags in a ton of code, whereas in MSVC it drags in very little. And as others have said, smaller is not necessarily faster.
According to this wxwidgets page on reducing executable size, Visual C++ is known to produce a smaller, faster executable, at least on Windows.
Use Microsoft Visual C++ instead of gcc (Cygwin or Mingw) on Windows.
It does produce smaller and faster executables.
Smaller is not necessarily faster. My latest compilations make extensive use of SIMD instructions that can have more than one set of instructions for one line of code, like some for AVX SIMD, some for SSE SIMD and some for SSE SISD. Then there can be significant loop unrolling (to maintain pipeline flow), with numerous repetitive instruction sequences.
Some might be following the same procedures as on Android via Eclipse, where a compile parameter, APP_ABI := all, generates code for arm64-v8a, armeabi, armeabi-v7a, mips, mips64, x86 and x86-64, selected automatically at run time.
Well, they bring (or at least should bring) a great increase in performance, don't they?
So, I haven't seen any Linux kernel sources, but I'd love to ask: are they used somehow? (In that case, there must be some special "code-cap" for systems that have no such instructions?)
The SSE and MMX instruction sets have limited value outside of audio/video and gaming work. You might find a few explicit uses in dark corners of the kernel, but I wouldn't count on it. The answer in the general case is "no, they are not used", nor are they used in most non-kernel/userspace applications.
The kernel does sometimes optionally use certain x86 instructions that are specific to certain CPUs (e.g. present on some AMD or Intel models but not all, nor vice-versa), such as syscall, but these are different from the SIMD instruction sets you're referring to, and are not part of some wider set of similarly-themed extensions.
After Mark's answer, I went looking. The only place I could easily identify them being used is in the RAID 6 library (which also has support for AltiVec, which is the PowerPC SIMD instruction set).
(Be wary of just grepping the tree: there are a lot of spots where the kernel "knows" about SSE/MMX in order to support user-space applications, but isn't actually using it. There are also a couple of cases of unfortunate variable names that have absolutely nothing to do with SSE, e.g. in the SCTP implementation.)
There are severe restrictions on using vector registers and floating point registers in kernel code. See e.g. chapter 6.3 of "Calling conventions for different C++ compilers and operating systems". http://www.agner.org/optimize/#manuals
They are used in the kernel for a few things, such as
Software RAID
Encryption (possibly)
However, I believe it always checks their presence first.
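The "check for presence first" pattern looks roughly like the sketch below. This is a user-space illustration only, assuming GCC or Clang on x86; it is not kernel code, and kernel code is subject to the vector register restrictions mentioned above. All function names are made up, and the SSE2 path accumulates in 32-bit lanes, so it only matches the scalar path while the partial sums fit in an int.

```c
#include <stddef.h>

/* Portable fallback that works on any CPU. */
static long sum_scalar(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#include <emmintrin.h>

/* SSE2 variant: adds four ints per step in 32-bit lanes. */
__attribute__((target("sse2")))
static long sum_sse2(const int *v, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    int lanes[4];
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc,
                            _mm_loadu_si128((const __m128i *)(v + i)));
    _mm_storeu_si128((__m128i *)lanes, acc);

    long total = (long)lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)          /* leftover elements */
        total += v[i];
    return total;
}

/* Ask the running CPU, via compiler builtins, whether it has SSE2. */
static int cpu_has_sse2(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
}
#endif

/* Use the SIMD path only when the CPU reports support for it. */
static long sum_dispatch(const int *v, size_t n)
{
#ifdef HAVE_X86_DISPATCH
    if (cpu_has_sse2())
        return sum_sse2(v, n);
#endif
    return sum_scalar(v, n);
}
```

On other architectures or compilers the sketch simply falls back to the scalar loop.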
"cpu simd instructions use FPU"
erm, no, not as I understand it. They're in part a modern and (much) more efficient replacement for FPU instructions, but a large part of the SIMD instruction set deals with integer operations.
I've never looked very hard at it, but I suppose (ok, hope) that SIMD code generated by a recent gcc version will not clobber any registers or state.
Is a program shipped in assembler format portable between Linux distributions (modulo CPU architecture differences)?
Here's the background to my question: I'm working on a new programming language (named Aklo), whose modus operandi will be the classic compiling to .s and feeding the result to the GNU assembler.
Obviously it would be nice ultimately to have the implementation written in itself, but I had resigned myself to maintaining it in C++ to solve the chicken and egg problem: suppose you download the compiler for the first time and it is itself written in Aklo, how do you compile it? As I understand it, different Linux distributions and other UNIX like systems have different conventions for binary formats.
But it's just occurred to me, a solution might be to ship the .s file (well, one per CPU architecture): it's fair to assume you have or can install the GNU assembler. Of course I'd still need a bootstrap compiler, but that doesn't need to be fast; I can write it in Python.
Is assembler portable in the way that binaries are not? Are there any other stumbling blocks I haven't thought of?
Added in response to one answer:
I had looked wistfully at LLVM, there is certainly a lot of good stuff there and it would make my life easier -- except that it would incur a dependency on the correct version of LLVM being installed. It wouldn't be so bad having that dependency on development machines, but in a world where it's common to ship programs as source, the same dependency would be incurred for every user of every program ever written in Aklo, and I decided that was too high a price to pay.
But if the solution of shipping compiled programs as assembler works... then that solves that problem, and I can use LLVM after all, which would be a big win.
So the question about portability of assembler is considerably more important than I had first realized.
Conclusion: from answers here and on the LLVM mailing list http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-January/028991.html it seems the bad news is the problem is unsolvable, but the good news is that means using LLVM makes it no worse, so I'm free to do so and obtain all the advantages thereof.
You might want to check out LLVM before going down this particular path. It might make your life a lot easier, as it provides a low level virtual machine that makes a lot of hard stuff just work and has been very popular.
At a very high level, the ABI consists of { instruction set, system calls, binary format, libraries }.
Distribution as .s may free you from the binary format. This is still rather pointless, because you are fixed to a particular ISA and still need to use libraries and/or make system calls. Libraries vary from distribution to distribution (although this isn't really that bad, especially if you just use libc) and syscalls vary from OS to OS.
It's basically 20 years since I last bootstrapped a C compiler. At the level of compilers, the differences between Linux distributions are minimal.
The much more important reason for going LLVM is cross-platform; if you're not writing some intermediate language, your compiler will be extremely difficult to retarget for different processors. And seeing as, on my laptop, I have compilers for x86, x86_64, two kinds of MIPS, PowerPC, ARM and AVR... you see where I'm going? I can compile multiple languages for most of those targets too (only C for AVR).