MSVC and ICC both support the intrinsics _addcarry_u64 and _addcarryx_u64.
According to Intel's Intrinsic Guide and white paper these should map to adcx and adox respectively. However, by looking at the generated assembly it's clear they map to adc and adcx respectively and there is no intrinsic which maps to adox.
Additionally, telling the compiler to enable AVX2 with /arch:AVX2 in MSVC or -march=core-avx2 with ICC on Linux makes no difference.
I'm not sure how to enable ADX with MSVC and ICC.
The documentation for MSVC lists _addcarryx_u64 with the technology of ADX whereas _addcarry_u64 has no listed technology. However, the link in MSVC's documentation for these intrinsics goes directly to the Intel Intrinsic guide which contradicts MSVC's own documentation and the generated assembly.
From this I conclude that Intel's Intrinsic guide and white paper are wrong.
This makes some sense for MSVC: since it does not allow inline assembly in 64-bit mode, it should provide a way to use adc, which it does with _addcarry_u64.
One of the big advantages of adcx and adox is that they operate on different flags (carry CF and overflow OF) and this allows two independent parallel carry chains. However, since there is no intrinsic for adox how is this possible? With ICC at least one can use inline assembly but this is not possible with MSVC in 64-bit mode.
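For reference, here is the kind of inline assembly I mean - a minimal sketch, assuming an ADX-capable CPU and the GCC/ICC extended-asm dialect (the function and variable names are mine, not from any vendor documentation):

#include <stdint.h>

/* One step of two independent carry chains. adcx reads and writes only CF,
   and adox reads and writes only OF, so the two additions do not serialize
   on a single flag. The xor clears both flags, so this sketch restarts the
   chains on every call and only illustrates a single step. */
static inline void add_two_chains(uint64_t *s1, uint64_t a,
                                  uint64_t *s2, uint64_t b)
{
    __asm__ ("xorl %%eax, %%eax\n\t"  /* clear CF and OF */
             "adcx %2, %0\n\t"        /* *s1 += a + CF, writes only CF */
             "adox %3, %1"            /* *s2 += b + OF, writes only OF */
             : "+r"(*s1), "+r"(*s2)
             : "r"(a), "r"(b)
             : "rax", "cc");
}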
Microsoft's and Intel's documentation (both the white paper and the online intrinsic guide) now agree.
The documentation says _addcarry_u64 produces only adc, while _addcarryx_u64 can produce either adcx or adox. With MSVC 2013 and 2015, however, _addcarryx_u64 only produces adcx. ICC produces both.
They map to adc, adcx AND adox. The compiler decides which instructions to use, based on how you use them. If you perform two big-int additions in parallel the compiler will use adcx and adox, for higher throughput. For example:
unsigned char c1 = 0, c2 = 0;
for (int i = 0; i < 100; i++) {
    /* The two carry chains c1 and c2 never mix, so the compiler can
       keep one in CF (adcx) and the other in OF (adox). */
    c1 = _addcarry_u64(c1, res[i], a[i], &res[i]);
    c2 = _addcarry_u64(c2, res[i], b[i], &res[i]);
}
Related, GCC does not support ADOX and ADCX at the moment. "At the moment" includes GCC 6.4 (Fedora 25) and GCC 7.1 (Fedora 26). GCC effectively disabled the intrinsics, but it still advertises support by defining __ADX__ in the preprocessor. Also see Issue 67317, Silly code generation for _addcarry_u32/_addcarry_u64. Many thanks to Xi Ruoyao for finding the issue.
According to Uros Bizjak on the GCC Help mailing list, GCC may never support the intrinsics. Also see GCC does not generate ADCX or ADOX for _addcarryx_u64.
Clang has its own set of issues with respect to ADOX and ADCX. Clang 3.9 and 4.0 crash when attempting to use them. Also see Issue 34249, Panic when using _addcarryx_u64 with Clang 3.9. According to Craig Topper, it should be fixed in Clang 5.0.
My apologies for posting the information under an MSVC question. This is one of the few hits when searching for information about using the intrinsics.
Related
Do we have a compiler for RISC-V vector instructions now? I have searched online and it seems we still don't have one.
It seems some work has already been done for various RISC-V cores. For example, the PULP project from ETH Zurich and the Università di Bologna designs SIMD-like extensions and also maintains a correspondingly modified GCC.
There is some preliminary work done on LLVM: https://github.com/rkruppe/rvv-llvm and there are multiple custom extensions that do similar things but do not follow the (not yet frozen) standard. Most notably, the RI5CY core from the PULP project has been used not only in academia but also in commercial ASICs like the gapuino (GAP8) and VEGABoard (RV32M1), which can be used with a GCC port.
Also, see some pointers regarding upstream and SiFive GCC support for the V extension here.
With GCC/Clang/ICC/etc I can use
-march=skylake etc to generate code optimized for a specific microarchitecture, and
-march=native to generate code optimized for the local machine.
How do I do these with MSVC?
Microsoft's compiler splits this into two separate areas. One is generating code specific to a particular instruction set, which won't work on a CPU that doesn't support that instruction set. This falls under its -arch: flag. The x64 compiler only supports two variants here: AVX and AVX2 (or no flag, which supports up to SSE2). The x86 version of the compiler adds a couple more flags for older instruction set extensions (e.g., SSE), but I doubt you care about that any more.
The other category is generating code that will work on any of a number of architectures, but favors one over another. This is supported by the -favor switch, which supports targets of ATOM, AMD64, INTEL64, and "blend" (which basically means to try not to favor one at the expense of others).
It doesn't have any (documented) flags for something like favoring Skylake vs. (say) Haswell or Broadwell though.
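To illustrate (the file names here are made up; the flags themselves are documented):

> cl /O2 /arch:AVX2 /favor:INTEL64 bignum.c
> cl /O2 /favor:blend portable.c

The first binary requires an AVX2-capable CPU and is tuned for Intel 64; the second runs on any x64 CPU (the SSE2 baseline) with neutral tuning.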
Summary:
I am having trouble with one library dynamically loading another, and I'm wondering whether a difference in the compilers is the root cause.
Problem Details:
My application links against libgbm.so, which dynamically loads libpvrGBMWSEGL.so and then requests the gbm_backend function.
/* libgbm.so */
void *module = dlopen("/usr/lib/libpvrGBMWSEGL.so", RTLD_NOW | RTLD_GLOBAL);
void *symbol = dlsym(module, entrypoint);
When I try to use the symbol provided, it throws a segmentation fault.
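For what it's worth, a defensive version of that sequence - a standalone sketch of my own with a guessed entry-point name taken from above, not the real libgbm source - checks dlerror() at each step before calling through the symbol:

/* build with: gcc sketch.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *module = dlopen("/usr/lib/libpvrGBMWSEGL.so", RTLD_NOW | RTLD_GLOBAL);
    if (!module) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    dlerror();  /* clear any stale error state */
    void *sym = dlsym(module, "gbm_backend");
    const char *err = dlerror();
    if (err) {
        fprintf(stderr, "dlsym failed: %s\n", err);
        dlclose(module);
        return 1;
    }
    /* Only now is it safe to cast sym to the expected function type and
       call it; a crash past this point means the symbol resolved but its
       ABI does not match what the caller expects. */
    return 0;
}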
Analysis:
libpvrGBMWSEGL.so is provided as a proprietary binary blob. A quick analysis shows that it was built with Linaro GCC 5.3-2016.02:
> strings libpvrGBMWSEGL.so | grep GCC
GCC: (Linaro GCC 5.3-2016.02) 5.3.1 20160113
Meanwhile, the library libgbm which dynamically loads it was built with Buildroot GCC 6.4.0:
> strings libgbm.so | grep GCC
GCC: (Buildroot 2017.11-git-00884-g7af8140-dirty) 6.4.0
Question:
Should I expect these two libraries to be compatible in the manner in which I am using them?
For many platforms, there is a published ABI document to which compilers are expected to adhere. For C++ and on top of those platform ABIs, there is the Itanium C++ ABI (which has nothing to do with Itanium anymore and will be Itanium's lasting contribution to computing, I assume).
This does not extend to libraries, though. There are many libcs for Linux, and something compiled and linked against glibc will not run on Bionic libc (Android) and vice versa, even if the architectures match. Essentially the same thing is true for the C++ standard library (and even the implementation that comes with GCC comes with slightly different ABIs as option).
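One practical check (my suggestion, not part of the original answer): list the versioned glibc symbols a binary actually requires and compare them with what the target's libc provides:

> objdump -T libgbm.so | grep -o 'GLIBC_[0-9.]*' | sort -u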
With ARM, there is also a considerable amount of sub-architecture variation.
The summary is: When everyone makes an effort, then what you are trying to do will work. If not, probably not. Getting this right for C++ is more difficult than for C.
When I compile a large project (for example, Bitcoin) in both GCC (using MinGW) and in MSVC (using Visual Studio) using comparable optimization settings, the GCC binary is 6 mb and the MSVC binary is 4 mb.
Now I am wondering: does this say that MSVC produces better binaries (and I mean better as in smaller and faster)? Or doesn't this mean anything, and it's just symbol information or something else unrelated to performance?
I expect a lot of comments: just benchmark it. But I'm more interested in the reason for the difference, not in the exact size/performance difference itself.
It is possible that with -O2 only, MinGW may produce slower binaries than MSVC. I haven't tested and do not know. However, I do know that with -march=native enabled, in my own benchmarks (http://plflib.org/colony.htm#benchmarks) MinGW outperforms MSVC (with the appropriate target optimisations) by about 20%.
The main reason, I would imagine, is better customisation for individual CPU targets as opposed to MSVC's more scattershot approach. However it may be that GCC's code gen is simply better.
However, on other benchmarks MSVC might show a performance improvement. My own results are in no way definitive, but they are indicative.
Lastly, I will note that yes, MSVC does produce smaller binaries in general - but watch what you #include. Including <iostream> in GCC/libstdc++ drags in a ton of code, whereas in MSVC it drags in very little. And as others have said, smaller != faster necessarily.
According to this wxwidgets page on reducing executable size, Visual C++ is known to produce a smaller, faster executable, at least on Windows.
Use Microsoft Visual C++ instead of gcc (Cygwin or Mingw) on Windows.
It does produce smaller and faster executables.
Smaller is not necessarily faster. My latest compilations make extensive use of SIMD, where a single line of code can expand into more than one instruction sequence: some for AVX SIMD, some for SSE SIMD and some for SSE SISD. Then there can be significant loop unrolling (to maintain pipeline flow), with numerous repetitive instruction sequences.
A similar approach is used on Android via Eclipse, where a build parameter, APP_ABI := all, generates code for arm64-v8a, armeabi, armeabi-v7a, mips, mips64, x86 and x86-64, with the appropriate variant selected automatically at run time.
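For reference, that parameter lives in the NDK project's Application.mk; a minimal sketch:

# Application.mk (Android NDK)
APP_ABI := all   # build every supported ABI; the matching one is used at run time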
I am currently using a GCC 3.3.3 based cross compiler to compile for a Xscale PXA270 development board. However, I was wondering if there are other Xscale compilers out there that run on Linux (or Windows for that matter)? The cross compiler setup I am using has horrendous performance on the target device, with certain programs that do a decent amount of math operations performing 10 to 20 times worse on the Xscale processor than on a similarly clocked Pentium 2. Any other options for compilers out there or specific compiler flags I should be setting with my GCC-based compiler that may help with the performance?
Thanks,
Ben
Unlike the Pentium 2, the XScale architecture doesn't have native floating point instructions. That means floating point math has to be emulated using integer instructions - a 10 to 20 times slowdown sounds about right.
To improve performance, you can try a few things:
Where possible, minimise the use of floating point - in some places, you may be able to substitute plain integer or fixed-point calculations (see the sketch after this list);
Trade off memory for speed by precalculating tables of values where possible;
Use floats instead of doubles in calculations where you do not need the precision of the latter (including using the C99 float versions of math.h functions);
Minimise conversions between integers and floating point types.
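To make the first and third points concrete, here is a minimal hypothetical sketch in C (the Q24.8 format and all names are illustrative, not from the original answer):

#include <math.h>
#include <stdint.h>

/* Fixed point: represent values as integer multiples of 1/256 (Q24.8),
   e.g. fix8 half = TO_FIX8(0.5); */
typedef int32_t fix8;
#define TO_FIX8(x) ((fix8)((x) * 256.0))   /* use with constants only */

static inline fix8 fix8_mul(fix8 a, fix8 b)
{
    return (fix8)(((int64_t)a * b) >> 8);  /* widen first to avoid overflow */
}

/* Single precision: sinf() keeps the work in the cheaper float routines
   instead of promoting everything to double. */
static inline float attenuate(float gain, float phase)
{
    return gain * sinf(phase);             /* C99 float version of sin() */
}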
Yes, you don't have an FPU so floating point needs to be done in integer math. However, there are two mechanisms for doing this, and one is 11 times faster than the other.
GCC target arm-linux-gnu normally includes real floating point instructions in the code for ARM's first FPU, the "FPA", now so rare it is nonexistent. These cause illegal instruction traps which are then caught and emulated in the kernel. This is extremely slow due to the context switch.
-msoft-float instead inserts calls to library functions (in libgcc.a). This avoids the switch into kernel space and is 11 times faster than the emulated FPA instructions.
You don't say what floating point model you are using - it may be that you are already building the whole userland with -msoft-float - but it might be worth checking that your object files contain no FPA instructions. You can check with:
objdump -d file | grep '<space><tab>f' | less
where file is any object file, executable or library that your compiler outputs. All FPA instructions start with f, while no other ARM instructions do. Those are actual space and tab characters there, and you might need to say <control-V><tab> to get the tab character past your shell.
If it is using FPA insns, you need to compile your entire userland using -msoft-float.
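For example (the toolchain prefix is illustrative; substitute the name of your own cross compiler):

> arm-linux-gnu-gcc -msoft-float -O2 -c foo.c -o foo.o

Note that -msoft-float changes the floating-point calling convention, so every object and library linked into the program must be built the same way.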
The most comprehensive further reading on these issues is http://wiki.debian.org/ArmEabiPort, which is primarily concerned with a third alternative: using an arm-linux-gnueabi compiler, which targets a newer alternative ABI available from gcc-4.1.1 onwards and has different characteristics. See the document for further details.
"Other xscale compilers"
Open source: llvm and pcc, of which llvm is the more Linux-friendly and functional, and also has a GCC front-end; pcc, a descendant of the venerable Portable C Compiler, seems more BSD-oriented.
Commercial: The Keil compiler (owned by ARM Ltd) seems to produce faster code than GCC, but it will not significantly mitigate your lack of an FPU.