Make MSVC optimize for a specific microarch

With GCC/Clang/ICC etc. I can use -march=skylake and similar to generate code optimized for a specific microarchitecture, and -march=native to generate code optimized for the local machine. How do I do the same with MSVC?

Microsoft's compiler splits this into two separate areas. One is generating code specific to a particular instruction set, which won't run on a CPU that doesn't support that instruction set. This falls under its -arch: flag. The x64 compiler only supports two variants here: AVX and AVX2 (or no flag, which targets up to SSE2, the x64 baseline). The x86 version of the compiler adds a couple more flags for older instruction set extensions (e.g., SSE), but you probably don't care about those anymore.
The other category is generating code that will run on any of a number of architectures but favors one over another. This is handled by the -favor switch, which accepts targets of ATOM, AMD64, INTEL64, and "blend" (which basically means trying not to favor one at the expense of the others).
It doesn't have any (documented) flags for something like favoring Skylake vs. (say) Haswell or Broadwell though.
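For example (myprog.cpp is a placeholder file name), the two switches combine on one command line; a rough GCC analogue is shown for comparison:

    cl -O2 -arch:AVX2 -favor:INTEL64 myprog.cpp    (require AVX2, schedule for Intel chips)
    g++ -O2 -march=haswell myprog.cpp              (require and tune for Haswell in one flag)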

Related

Difference between GCC binaries and MSVC binaries?

When I compile a large project (for example, Bitcoin) with both GCC (using MinGW) and MSVC (using Visual Studio) at comparable optimization settings, the GCC binary is 6 MB and the MSVC binary is 4 MB.
Now I am wondering: does this mean MSVC produces better binaries (better as in smaller and faster)? Or does it mean nothing, and the difference is just symbol information or something else unrelated to performance?
I expect a lot of comments saying "just benchmark it", but I'm more interested in the reason for the difference than in the exact size/performance numbers.
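One quick way to test the symbol-information theory (a suggestion of mine, not from the answers below): MSVC keeps debug data in separate .pdb files, while MinGW embeds symbol tables in the executable itself, so strip the MinGW binary before comparing sizes. The file name is a placeholder:

    strip --strip-all bitcoin.exe    # remove symbol tables from the MinGW build
    size bitcoin.exe                 # report code/data section sizes, ignoring symbols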
It is possible that with -O2 alone MinGW produces slower binaries than MSVC; I haven't tested that and don't know. However, I do know that with -march=native enabled, in my own benchmarks (http://plflib.org/colony.htm#benchmarks) MinGW outperforms MSVC (with the appropriate target optimisations) by about 20%.
The main reason, I would imagine, is better tuning for individual CPU targets, as opposed to MSVC's more scattershot approach. It may also be that GCC's code generation is simply better.
On other benchmarks, though, MSVC might show a performance improvement. My own results are in no way definitive, but they are indicative.
Lastly, I will note that yes, MSVC does produce smaller binaries in general - but watch what you #include. Including iostream in GCC/libstdc++ drags in a ton of code, whereas in MSVC it drags in very little. And as others have said, smaller does not necessarily mean faster.
According to this wxwidgets page on reducing executable size, Visual C++ is known to produce a smaller, faster executable, at least on Windows.
Use Microsoft Visual C++ instead of gcc (Cygwin or MinGW) on Windows. It does produce smaller and faster executables.
Smaller is not necessarily faster. My latest builds make extensive use of SIMD, where the compiler can emit more than one instruction sequence for a single line of source code: one path for AVX SIMD, one for SSE SIMD, one for SSE SISD (scalar). Add significant loop unrolling (to keep the pipeline full) and you get numerous repetitive instruction sequences.
Some projects take the same approach as Android builds via Eclipse, where the build parameter APP_ABI := all generates code for arm64-v8a, armeabi, armeabi-v7a, mips, mips64, x86 and x86_64, with the right version selected automatically at run time.
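As a sketch of how such run-time selection can look in ordinary user code (the work_* function names are hypothetical; __builtin_cpu_init and __builtin_cpu_supports are GCC/Clang builtins on x86):

    #include <stdio.h>

    /* Placeholder variants; real code would compile each with matching
       -m flags or target attributes. */
    static void work_avx(void)   { puts("AVX path"); }
    static void work_sse2(void)  { puts("SSE2 path"); }
    static void work_plain(void) { puts("scalar path"); }

    int main(void) {
        __builtin_cpu_init();                    /* populate CPU feature data */
        if (__builtin_cpu_supports("avx"))
            work_avx();
        else if (__builtin_cpu_supports("sse2"))
            work_sse2();
        else
            work_plain();
        return 0;
    }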

Are extended instruction sets (SSE, MMX) used in the Linux kernel?

Well, they should bring (at least in theory) a great increase in performance, shouldn't they?
I haven't looked at the Linux kernel sources, but I'd love to ask: are they used somehow? (If so, there must presumably be some special fallback code path for systems that lack these instructions?)
The SSE and MMX instruction sets have limited value outside of audio/video and gaming work. You might find a few explicit uses in dark corners of the kernel, but I wouldn't count on it. The answer in the general case is "no, they are not used", nor are they used in most non-kernel/userspace applications.
The kernel does sometimes optionally use certain x86 instructions that are specific to certain CPUs (e.g. present on some AMD or Intel models but not others), such as syscall, but these are different from the SIMD instruction sets you're referring to, and are not part of some wider set of similarly-themed extensions.
After Mark's answer, I went looking. The only place I could easily identify them being used is in the RAID 6 library (which also has support for AltiVec, the PowerPC SIMD instruction set).
(Be wary of just grepping the tree: there are a lot of spots where the kernel "knows" about SSE/MMX in order to support user-space applications but isn't actually using them itself. There are also a couple of cases of unfortunate variable names that have absolutely nothing to do with SSE, e.g. in the SCTP implementation.)
There are severe restrictions on using vector registers and floating point registers in kernel code. See e.g. chapter 6.3 of "Calling conventions for different C++ compilers and operating systems". http://www.agner.org/optimize/#manuals
They are used in the kernel for a few things, such as
Software RAID
Encryption (possibly)
However, I believe the kernel always checks for their presence first.
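A minimal sketch (mine, not actual kernel source; the xor_blocks_* names are hypothetical) of the pattern such in-kernel users follow on x86: test the CPU feature bit, then bracket the SIMD work with kernel_fpu_begin()/kernel_fpu_end() so the user-space FPU/SSE state is saved and restored:

    #include <asm/fpu/api.h>     /* kernel_fpu_begin/end (older kernels: asm/i387.h) */
    #include <asm/cpufeature.h>  /* boot_cpu_has, X86_FEATURE_* */

    static void xor_blocks_sse2(void *dst, const void *src, unsigned len);    /* hypothetical */
    static void xor_blocks_generic(void *dst, const void *src, unsigned len); /* hypothetical */

    static void xor_blocks(void *dst, const void *src, unsigned len)
    {
        if (boot_cpu_has(X86_FEATURE_XMM2)) {  /* SSE2 available? */
            kernel_fpu_begin();                /* save user FPU/SSE state */
            xor_blocks_sse2(dst, src, len);
            kernel_fpu_end();                  /* restore it */
        } else {
            xor_blocks_generic(dst, src, len);
        }
    }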
"cpu simd instructions use FPU"
erm, no, not as I understand it. They're in part a modern and (much) more efficient replacement for FPU instructions, but a large part of the SIMD instruction set deals with integer operations.
I've never looked very hard at it, but I suppose (ok, hope) that SIMD code generated by a recent gcc version will not clobber any registers or state.
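To see the integer side concretely, here is a small example of mine using the standard SSE2 intrinsics from emmintrin.h - no x87 FPU involved at all:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        __m128i a = _mm_set_epi32(1, 2, 3, 4);
        __m128i b = _mm_set_epi32(10, 20, 30, 40);
        __m128i c = _mm_add_epi32(a, b);       /* four 32-bit integer adds at once */
        int out[4];
        _mm_storeu_si128((__m128i *)out, c);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 44 33 22 11 */
        return 0;
    }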

AMD 6 Core and compiling open-source projects

I have been wanting to build my own box around an AMD 6-core. I have always used Intel-based machines and, frankly, have not worked on open-source projects. I want to get into that, along with systems programming, but I am worried: will open-source projects (mainly Linux-based) be a problem to compile on AMD?
How difficult is porting (if it is needed at all) from AMD to Intel and vice versa? Thanks.
Both AMD and Intel processors use the x86 ISA. You don't generally compile for a specific processor, you compile for the ISA.
Unless you turn on very specific flags (such as -march) while compiling, a binary built on one processor will run on another.
To say it again, there is no problem.
This does not mean the processors are the same. They have different performance characteristics, support different motherboard chipsets, and have different feature sets (for instance, IOMMUs or other advanced virtualization features). But you won't usually be accessing things like the processor-internal performance registers in your everyday life, so don't worry about it and get whichever CPU is right for your desired system configuration and price/performance point.
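For example (prog.c is a placeholder), the distinction only appears when you opt into CPU-specific code generation:

    gcc -O2 -o prog prog.c                # generic x86: runs on Intel and AMD alike
    gcc -O2 -march=native -o prog prog.c  # tuned to the build machine's CPU; may die
                                          # with SIGILL on CPUs lacking its extensions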
It's unlikely that it'll be any more complex than compiling anywhere else. I'm pretty sure these kinds of very minor architectural differences are almost a non-issue these days.

Does developing applications for SPARC, IBM POWER CPUs require separate compilers, other than x86, x86-64 targets?

Does developing applications for SPARC or IBM PowerPC require separate compilers, other than for x86 and x86-64 targets?
If so, how easily could x86 and x64 Linux binaries be ported to SPARC and PowerPC? Is there a way to simulate these environments using virtualization?
First answer: yes, to develop compiled code for Power Architecture or SPARC you need compilers that generate code for those processors. A compiler that generates x86 or x86_64 code will not generate code that runs on Power Architecture or SPARC. You might, however, find cross compilers that run on x86 (32- or 64-bit) and generate Power or SPARC code. The other thing to be aware of is the object file format (ELF, XCOFF, and so on); the instruction set is just part of the picture. You might get clearer answers if you provide more details of your particular starting point and goals.
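A quick way to see which target a given compiler emits code for (the cross-compiler name is an example of the Debian-style prefix convention):

    gcc -dumpmachine                    # e.g. x86_64-linux-gnu
    powerpc-linux-gnu-gcc -dumpmachine  # e.g. powerpc-linux-gnu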
Second, one normally doesn't talk of porting binaries. We port source code, which may include assembly language as well as C or other languages. The process for doing this includes compiler selection, after which you can begin an iterative process of compiling, porting, compiling, and linking the code for the new hardware. I'm omitting many details. Again, if you provide more specifics in your question, you might get more specific answers.
Third, as others have said, no, you can't use virtualization in the scenarios you allude to. You might find acceptable emulation solutions. Again, please provide more specifics if you can.
No, virtualization is not the answer. Virtualization takes your hardware platform and creates an independent "virtual" machine of the same hardware. So when running on x86, you use virtualization to create a second x86 machine.
To simulate a completely different hardware architecture, you would want to look into emulation.
How easy or hard it is to port software from one architecture to another depends entirely on how the software was written. If it uses something particular to one architecture but not the other (for example, x86 tolerates unaligned memory accesses while SPARC does not), you will need to fix things like that. Software that assumes a specific endianness of the hardware is another thing that can make porting difficult.
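A small illustration of such an endianness assumption (my example, not from the question):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t value = 0x11223344;
        /* The lowest-addressed byte is 0x44 on little-endian x86 but 0x11 on
           big-endian SPARC or PowerPC; code that serialises structs by writing
           raw bytes inherits this difference. */
        unsigned char first = *(unsigned char *)&value;
        printf("first byte: 0x%02x\n", first);
        return 0;
    }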
SPARC, IBM PowerPC require separate compilers, other than x86 and x86-64 targets?
I hate to be really snippy, but given that IBM PowerPC and SPARC do not support the x86 or x86-64 instruction sets (i.e. they speak totally separate machine languages), where did you even get the idea they would be compatible?
Is there a way to simulate these environments using virtualization?
Possibly yes, but it would be REALLY slow, because you would have to either translate the machine code or interpret it. Hardware virtualization would not work, given that the CPU architectures are different. SPARC and PowerPC are not just "different labels for the same thing"; they really are different internally.
Use Java or LLVM, or try QEMU to test other CPUs.
It's easy if your code was written to be portable; it's not if it wasn't. Varying sizes of data types per platform (and code that depends on them), inline assembly, etc. will make it harder.
Home page for LLVM and QEMU:
http://llvm.org/
http://wiki.qemu.org/Main_Page
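For instance, with a Debian-style PowerPC cross toolchain and QEMU's user-mode emulation installed (file names are examples):

    powerpc-linux-gnu-gcc -static -o hello hello.c  # cross-compile for PowerPC
    qemu-ppc ./hello                                # run the PowerPC binary on an x86 host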

Xscale compilers for Linux? (also Xscale compile flags question)

I am currently using a GCC 3.3.3-based cross compiler to compile for an XScale PXA270 development board. However, I was wondering if there are other XScale compilers out there that run on Linux (or Windows, for that matter)? The cross-compiler setup I am using yields horrendous performance on the target device: certain programs that do a decent amount of math run 10 to 20 times slower on the XScale processor than on a similarly clocked Pentium 2. Are there any other compilers out there, or specific flags I should be setting with my GCC-based compiler, that might help with performance?
Thanks,
Ben
Unlike the Pentium 2, the XScale architecture doesn't have native floating point instructions. That means floating point math has to be emulated using integer instructions - a 10 to 20 times slowdown sounds about right.
To improve performance, you can try a few things:
Where possible, minimise the use of floating point - in some places you may be able to substitute plain integer or fixed-point calculations (see the sketch after this list);
Trade off memory for speed by precalculating tables of values where possible;
Use floats instead of doubles in calculations where you do not need the precision of the latter (including using the C99 float versions of the math.h functions);
Minimise conversions between integer and floating point types.
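Here is the fixed-point sketch promised above (Q16.16 format; my own example, not from the answer) - one floating-point multiply replaced by integer operations:

    #include <stdint.h>
    #include <stdio.h>

    typedef int32_t q16_16;                    /* 16 integer bits, 16 fraction bits */
    #define TO_Q(x) ((q16_16)((x) * 65536.0))  /* for compile-time constants */
    #define TO_F(x) ((double)(x) / 65536.0)

    static q16_16 q_mul(q16_16 a, q16_16 b) {
        /* widen to 64 bits, multiply, then shift back into Q16.16 */
        return (q16_16)(((int64_t)a * b) >> 16);
    }

    int main(void) {
        q16_16 a = TO_Q(3.25), b = TO_Q(0.5);
        printf("%f\n", TO_F(q_mul(a, b)));     /* prints 1.625000 */
        return 0;
    }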
Yes, you don't have an FPU so floating point needs to be done in integer math. However, there are two mechanisms for doing this, and one is 11 times faster than the other.
GCC target arm-linux-gnu normally includes real floating point instructions in the code for ARM's first FPU, the "FPA", now so rare it is nonexistent. These cause illegal instruction traps which are then caught and emulated in the kernel. This is extremely slow due to the context switch.
-msoft-float instead inserts calls to library functions (in libgcc.a). This avoids the switch into kernel space and is 11 times faster than the emulated FPA instructions.
You don't say what floating point model you are using - it may be that you are already building the whole userland with -msoft-float - but it might be worth checking that your object files contain no FPA instructions. You can check with:
objdump -d file | grep '<space><tab>f' | less
where file is any object file, executable or library that your compiler outputs. All FPA instructions start with f, while no other ARM instructions do. Those are actual space and tab characters there, and you might need to say <control-V><tab> to get the tab character past your shell.
If it is using FPA insns, you need to compile your entire userland using -msoft-float.
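For example (the file name is hypothetical), rebuilding one translation unit with soft-float and re-checking it:

    arm-linux-gnu-gcc -O2 -msoft-float -c mathheavy.c
    objdump -d mathheavy.o | less    # scan for mnemonics beginning with 'f'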
The most comprehensive further reading on these issues is http://wiki.debian.org/ArmEabiPort which is primarily concerned with a third alternative: using an arm-linux-gnueabi compiler, a newer alternative ABI that is available from gcc-4.1.1 onwards and which has different characteristics. See the document for further details.
"Other xscale compilers"
Open source: LLVM and PCC, of which LLVM is the more Linux-friendly and functional, and also has a GCC front-end; PCC, a descendant of the venerable Portable C Compiler, seems more BSD-oriented.
Commercial: the Keil compiler (owned by ARM Ltd) seems to produce faster code than GCC, but it is not going to do much about your lack of an FPU.
