We recently started seeing unit tests fail on our build machine (certain numerical calculations fell out of tolerance). Upon investigation we found that some of our developers could not reproduce the test failure. To cut a long story short, we eventually tracked the problem down to what appeared to be a rounding error, but that error was only occurring with x64 builds on the latest Haswell chips (to which our build server was recently upgraded). We narrowed it down and pulled out a single calculation from one of our tests:
#include "stdafx.h"
#include <cmath>
int _tmain(int argc, _TCHAR* argv[])
{
double rate = 0.0021627412080263146;
double T = 4.0246575342465754;
double res = exp(-rate * T);
printf("%0.20e\n", res);
return 0;
}
When we compile this as x64 in VS2013 (with the default compiler switches, including /fp:precise), it gives different results on the older Sandy Bridge chip and the newer Haswell chip. The difference is in the 15th significant digit, which I think is outside the machine epsilon for double on both machines.
If we compile the same code in VS2010 or VS2012 (or, incidentally, VS2013 x86) we get the exact same answer on both chips.
In the past several years, we've gone through many versions of Visual Studio and many different Intel chips for testing, and no-one can recall us ever having to adjust our regression test expectations based on different rounding errors between chips.
This obviously led to a game of whack-a-mole between developers on the older and newer hardware over what the expected values for the tests should be...
Is there a compiler option in VS2013 that we need to be using to somehow mitigate the discrepancy?
Update:
Results on Sandy Bridge developer PC:
VS2010-compiled-x64: 9.91333479983898980000e-001
VS2012-compiled-x64: 9.91333479983898980000e-001
VS2013-compiled-x64: 9.91333479983898980000e-001
Results on Haswell build server:
VS2010-compiled-x64: 9.91333479983898980000e-001
VS2012-compiled-x64: 9.91333479983898980000e-001
VS2013-compiled-x64: 9.91333479983899090000e-001
Update:
I used procexp to capture the list of DLLs loaded into the test program.
Sandy Bridge developer PC:
apisetschema.dll
ConsoleApplication8.exe
kernel32.dll
KernelBase.dll
locale.nls
msvcr120.dll
ntdll.dll
Haswell build server:
ConsoleApplication8.exe
kernel32.dll
KernelBase.dll
locale.nls
msvcr120.dll
ntdll.dll
The results you documented are affected by the value of the MXCSR register; the two bits that select the rounding mode are what matter here. To get the "happy" number you like, you need to force the processor to round down, like this:
#include "stdafx.h"
#include <cmath>
#include <float.h>
int _tmain(int argc, _TCHAR* argv[]) {
unsigned prev;
_controlfp_s(&prev, _RC_DOWN, _MCW_RC);
double rate = 0.0021627412080263146;
double T = 4.0246575342465754;
double res = exp(-rate * T);
printf("%0.20f\n", res);
return 0;
}
Output: 0.99133347998389898000
Change _RC_DOWN to _RC_NEAR to put MXCSR back in the normal round-to-nearest mode, the way the operating system programs it before it starts your program; that produces 0.99133347998389909000. In other words, your Haswell machines are in fact producing the expected value.
Exactly how this happened can be very hard to diagnose; the control register is the worst possible global variable you can think of. The usual cause is an injected DLL that reprograms the FPU. A debugger can show the loaded DLLs; compare the lists between the two machines to find a candidate.
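If you want to confirm which rounding mode a process is actually running in before hunting for the culprit DLL, a minimal diagnostic sketch using the same _controlfp_s API (passing a zero mask only reads the current state) could look like this:

#include <float.h>
#include <cstdio>

int main() {
    unsigned int state;
    _controlfp_s(&state, 0, 0);            // mask 0: read the control state without changing it
    switch (state & _MCW_RC) {             // isolate the rounding-control bits
    case _RC_NEAR: printf("round to nearest\n"); break;
    case _RC_DOWN: printf("round toward -infinity\n"); break;
    case _RC_UP:   printf("round toward +infinity\n"); break;
    case _RC_CHOP: printf("round toward zero\n"); break;
    }
    return 0;
}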
This is due to a bug in the Microsoft Visual C++ 2013 x64 CRT code that improperly detects AVX and FMA3 support.
It is fixed in an updated 2013 runtime, by using a newer MSVC version, or by simply disabling the feature support at runtime with a call to _set_FMA3_enable(0);.
See:
https://support.microsoft.com/en-us/help/3174417/fix-programs-that-are-built-in-visual-c-2013-crash-with-illegal-instruction-exception
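As a sketch of the runtime workaround (assuming the VS2013-or-later x64 CRT, where _set_FMA3_enable is exposed through <math.h>), the call goes at the top of main, before any of the affected transcendental functions run:

#include <math.h>
#include <stdio.h>

int main() {
#if defined(_M_X64)
    _set_FMA3_enable(0);   // tell the x64 CRT to use the non-FMA3 code paths for exp, log, etc.
#endif
    double rate = 0.0021627412080263146;
    double T = 4.0246575342465754;
    printf("%0.20e\n", exp(-rate * T));
    return 0;
}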
Related
I'm using Windows 10, Visual Studio 2019, Platform: x64 and have the following test script in a single-file Visual Studio Solution:
#include <iostream>
#include <cstdlib>
#include <intrin.h>
using namespace std;

int main() {
    unsigned __int64 mask = 0x0fffffffffffffff; // 1152921504606846975; bits 0..59 set
    unsigned long index;
    _BitScanReverse64(&index, mask);
    if (index != 59) {
        cout << "Fails!" << endl;
        return EXIT_FAILURE;
    }
    else {
        cout << "Success!" << endl;
        return EXIT_SUCCESS;
    }
}
In my solution's properties I've set 'Enable Enhanced Instruction Set' to 'Advanced Vector Extensions 2 (/arch:AVX2)'.
When compiling with msvc (setting 'Platform Toolset' to 'Visual Studio 2019 (v142)') the code returns EXIT_SUCCESS, but when compiling with clang-cl (setting 'Platform Toolset' to 'LLVM (clang-cl)') I get EXIT_FAILURE. When debugging the clang-cl run, the value of index is 4, when it should be 59. This suggests to me that clang-cl is reading the bits in the opposite direction of MSVC.
This isn't the case when I set 'Enable Enhanced Instruction Set' to 'Not Set'. In this scenario, both MSVC and clang-cl return EXIT_SUCCESS.
All of the DLLs loaded (as shown in the Debug Output window) come from C:\Windows\System32 in all cases.
Does anyone understand this behavior? I would appreciate any insight here.
EDIT: I failed to mention earlier: I compiled and ran this on an Intel Core i7-3930K CPU @ 3.20GHz.
Getting 4 instead of 59 sounds like clang implemented _BitScanReverse64 as 63 - lzcnt. Actual bsr is slow on AMD, so yes, there are reasons why a compiler would want to compile a BSR intrinsic to a different instruction.
But then you ran the executable on a computer that doesn't actually support BMI, so the lzcnt opcode decoded as rep bsr = bsr. The instruction therefore produced the bit-index of the highest set bit instead of the leading-zero count, and clang's 63 - lzcnt expansion turned that into 63 - 59 = 4.
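To make the arithmetic concrete, here is a small standalone check; it is only a sketch that uses a portable leading-zero count instead of any intrinsic, so it behaves the same on every CPU:

#include <cstdint>
#include <cstdio>

// Portable leading-zero count (returns 64 for x == 0).
static int lzcnt64_portable(uint64_t x) {
    int n = 0;
    for (uint64_t bit = 1ULL << 63; bit != 0 && (x & bit) == 0; bit >>= 1)
        ++n;
    return n;
}

int main() {
    uint64_t mask = 0x0fffffffffffffff;     // bits 0..59 set
    int lz  = lzcnt64_portable(mask);       // 4  = what lzcnt returns
    int bsr = 63 - lz;                      // 59 = what bsr returns
    printf("intended: 63 - lzcnt = %d\n", 63 - lz);   // 59
    printf("observed: 63 - bsr   = %d\n", 63 - bsr);  // 4, when lzcnt silently runs as bsr
    return 0;
}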
AFAIK, all CPUs that have AVX2 also have BMI. If your CPU doesn't have that, you shouldn't expect your executables built with /arch:AVX2 to run correctly on your CPU. And in this case the failure mode wasn't an illegal instruction; it was lzcnt running as bsr.
MSVC doesn't generally optimize intrinsics, apparently including this case, so it just uses bsr directly.
Update: i7-3930K is SandyBridge-E. It doesn't have AVX2, so that explains your results.
clang-cl doesn't error when you tell it to build an AVX2 executable on a non-AVX2 computer. The use-case for that would be compiling on one machine to create an executable to run on different machines.
It also doesn't add CPUID-checking code to your executable for you. If you want that, write it yourself; this is C++, it doesn't hold your hand.
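If you do want such a guard, a minimal sketch using MSVC's __cpuid/__cpuidex intrinsics could look like the following (it checks only the AVX2, BMI1, and LZCNT feature bits; a complete check would also verify OS support for YMM state via xgetbv):

#include <intrin.h>
#include <cstdio>
#include <cstdlib>

// Returns true if the CPU reports AVX2, BMI1 and LZCNT support via CPUID.
static bool cpu_has_avx2_bmi_lzcnt() {
    int regs[4];

    __cpuid(regs, 0);                           // highest standard leaf
    if (regs[0] < 7)
        return false;
    __cpuidex(regs, 7, 0);                      // leaf 7, subleaf 0
    bool bmi1 = (regs[1] & (1 << 3)) != 0;      // EBX bit 3
    bool avx2 = (regs[1] & (1 << 5)) != 0;      // EBX bit 5

    __cpuid(regs, 0x80000000);                  // highest extended leaf
    if ((unsigned)regs[0] < 0x80000001u)
        return false;
    __cpuid(regs, 0x80000001);
    bool lzcnt = (regs[2] & (1 << 5)) != 0;     // ECX bit 5 (LZCNT/ABM)

    return avx2 && bmi1 && lzcnt;
}

int main() {
    if (!cpu_has_avx2_bmi_lzcnt()) {
        std::fprintf(stderr, "This build requires AVX2/BMI/LZCNT support.\n");
        return EXIT_FAILURE;
    }
    std::puts("CPU supports the required extensions.");
    return EXIT_SUCCESS;
}

Both MSVC and clang-cl accept these intrinsics, so the same guard works with either toolset.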
target CPU options
MSVC-style /arch options are much more limited than normal GCC/clang style. There aren't any for different levels of SSE like SSE4.1; it jumps straight to AVX.
Also, /arch:AVX2 apparently implies BMI1/2, even though those are different instruction-sets with different CPUID feature bits. In kernel code for example you might want integer BMI instructions but not SIMD instructions that touch XMM/YMM registers.
clang -O3 -mavx2 would not also enable -mbmi. You normally would want that, but if you failed to also enable BMI then clang would have been stuck using bsr (which is actually better for Intel CPUs than 63 - lzcnt). I think MSVC's /arch:AVX2 is something like -march=haswell, if it also enables FMA instructions.
And nothing in MSVC has any support for making binaries optimized to run on the computer you build them on. That makes sense, it's designed for a closed-source binary-distribution model of software development.
But GCC and clang have -march=native to enable all the instruction sets your computer supports. And also importantly, set tuning options appropriate for your computer. e.g. don't worry about making code that would be slow on an AMD CPU, or on older Intel, just make asm that's good for your CPU.
TL;DR: CPU selection options in clang-cl are very coarse, lumping non-SIMD extensions in with some level of AVX. That's why /arch:AVX2 enabled the integer BMI extensions, while clang -mavx2 would not have.
I ran the same code compiled with gcc 4.3.2 on Linux and with Visual Studio Express 2010 on Windows.
The execution time on Linux was around 54 seconds, while on the Windows system it was around 1207 seconds.
Why is this so ?
The code uses the C++ STL map, set, and vector.
The same code when executed on ideone took 9 seconds.
http://ideone.com/MxGogf
Are the STL implementations different?
To measure time I used the following:
#include <cstdio>
#include <ctime>

int main() {
    clock_t tStart = clock();
    // ... code being timed ...
    printf("\n%.4f\n", float(clock() - tStart) / CLOCKS_PER_SEC);
    return 0;
}
I know this method of measuring time is not accurate, but 54 and 1207 are too far apart.
Probably because you're comparing debug builds.
Don't do that.
If you want to know how fast your code is, compile it with optimizations.
MSVC++ does a lot of additional asserts and debug checks in debug builds.
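One way to be sure you are timing what you think you are timing is to have the benchmark report its own build configuration. This is only a sketch; NDEBUG and _DEBUG are the standard CRT macros, and _ITERATOR_DEBUG_LEVEL is specific to the Microsoft STL (0 in Release, 2 in Debug by default):

#include <cstdio>
#include <vector>   // pulls in the Microsoft STL headers that define _ITERATOR_DEBUG_LEVEL

int main() {
#ifdef NDEBUG
    std::printf("NDEBUG defined: asserts are disabled\n");
#else
    std::printf("NDEBUG not defined: this is probably a debug build\n");
#endif
#ifdef _DEBUG
    std::printf("_DEBUG defined: the debug CRT is in use\n");
#endif
#ifdef _ITERATOR_DEBUG_LEVEL
    std::printf("_ITERATOR_DEBUG_LEVEL = %d\n", _ITERATOR_DEBUG_LEVEL);
#endif
    return 0;
}

In a default VS2010 Debug build, _DEBUG is defined and _ITERATOR_DEBUG_LEVEL is 2, which is where much of the slowdown for map/set/vector-heavy code tends to come from.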
I wrote a very simple program on Linux using C++, which downloads images from a website over HTTP (basically an HTTP client request), using the cURL library: http://curl.haxx.se/libcurl/c/allfuncs.html
#define CURL_STATICLIB
#include <stdio.h>
#include <stdlib.h>
#include </usr/include/curl/curl.h>
#include </usr/include/curl/stdcheaders.h>
#include </usr/include/curl/easy.h>

size_t write_data(void *ptr, size_t size, size_t nmemb, FILE *stream) {
    size_t written = fwrite(ptr, size, nmemb, stream);
    return written;
}

int main(void) {
    CURL *curl;
    FILE *fp;
    CURLcode res;
    const char *url = "http://www.example.com/test_img.png";
    char outfilename[FILENAME_MAX] = "/home/c++_proj/output/web_req_img.png";
    curl = curl_easy_init();
    if (curl) {
        fp = fopen(outfilename, "wb");
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_data);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
        res = curl_easy_perform(curl);
        /* always cleanup */
        curl_easy_cleanup(curl);
        fclose(fp);
    }
    return 0;
}
I verified the code, and it works fine. I can see that the image is downloaded and that I can view it (with no errors or warnings). Since I plan to expand my code, I tried to install ddd and use the debugger, but the debugger doesn't work: my program exits with some sort of signal error when I try to run it under ddd.
This is the error:
(Thread debugging using libthread_db enabled)
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1"
Program received signal SIGILL, illegal instruction.
0xb6a5c4C0 in ?? () from /usr/lib/arm-linux-gnueabihf/libcrypto.so.1.0.0
First I thought that I hadn't properly installed ddd, so I went back to gdb, but I get the exact same errors when I run the program. (And I believe I am using the latest versions of gdb and ddd.)
Then I tried to use ddd on another simple program that doesn't involve the cURL library, and it worked fine!
Does anyone know why this is the case, and what the solution is? Do I somehow need to point to the cURL libraries while ddd is running? In the past I don't recall having to do this with other sets of libraries. Maybe it is something about cURL that ddd doesn't like? But the program runs fine by itself without the debugger! I would appreciate some help.
I am guessing it may be part of some instruction-set detection code. Just let the program continue and see if it handles the signal by itself (since it runs fine outside of gdb, it probably does). Alternatively, you can tell gdb not to bother you with SIGILL at all before you run the program: handle SIGILL pass nostop noprint.
It's only a problem if the program dies, which was not clear from your question.
Program received signal SIGILL, illegal instruction.
0xb6a5c4C0 in ?? () from /usr/lib/arm-linux-gnueabihf/libcrypto.so.1.0.0
Does anyone know why this is the case, and what is the solution?
Jester gave you the solution. Here's the reason why it happens.
libcrypto.so is OpenSSL's crypto library. OpenSSL performs CPU feature probes by executing an instruction to see if it is available. If a SIGILL is generated, then the feature is not available and an appropriate fallback function is used instead.
The reason you see them on ARM and not IA-32 is that on Intel's IA-32 the cpuid instruction is non-privileged. Any program can execute cpuid to detect CPU features, so there's no need for SIGILL-based feature probing.
In contrast to IA-32, ARM's equivalent of cpuid is a privileged instruction. Your program would need Exception Level 1 (EL1), but it runs at EL0. To sidestep the need for privileges, ARM programs set up a jmp_buf and install a SIGILL handler. They then try the instruction in question, and the SIGILL handler indicates whether the instruction or feature is available.
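Here is a rough illustration of that jmp_buf-plus-SIGILL-handler pattern (a sketch, not OpenSSL's actual code; the instruction being probed is left as a placeholder because it depends on the feature under test):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf probe_env;

static void sigill_handler(int) {
    siglongjmp(probe_env, 1);   // jump back to the probe: instruction not supported
}

static bool feature_supported() {
    struct sigaction sa, old_sa;
    sa.sa_handler = sigill_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGILL, &sa, &old_sa);

    bool ok;
    if (sigsetjmp(probe_env, 1) == 0) {
        // Placeholder: execute the instruction being probed here (e.g. via inline
        // asm). If the CPU does not implement it, SIGILL is raised and the handler
        // longjmps to the else branch.
        ok = true;
    } else {
        ok = false;
    }
    sigaction(SIGILL, &old_sa, nullptr);    // restore the previous handler
    return ok;
}

int main() {
    printf("feature %s\n", feature_supported() ? "available" : "not available");
    return 0;
}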
OpenSSL recently changed to SIGILL-free feature detection on some Apple platforms because Apple corrupts things. Also see PR 3108, SIGILL-free processor capabilities detection on MacOS X. Other libraries are doing something similar. Also see How to determine ARMv8 features at runtime?
OpenSSL also documents the SIGILL behavior in their FAQ. See item 17 in the OpenSSL FAQ for more details: When debugging I observe SIGILL during OpenSSL initialization: why? Also see SSL_library_init cause SIGILL when running under gdb on Stack Overflow.
For Android developers you can disable SIGILL in Android Studio:
https://developer.oculus.com/documentation/native/android/mobile-studio-debug/#troubleshooting
For Intel+NVIDIA dual-GPU "Optimus" setups, an application can export NvOptimusEnablement as explained in OptimusRenderingPolicies.pdf. This option allows an application to ensure the use of the high-performance discrete GPU without needing profile updates or user interaction, which is usually desired for certain classes of applications.
Is there an equivalent trick for systems with AMD GPUs (Windows-only is fine), and if so, what is it? I have not been able to find any concrete information via Googling; only a lot of people asking the same question on various forums with no answers, or SO articles on the NVIDIA trick with a "maybe AMD has something similar, I don't know" comment.
According to https://community.amd.com/thread/169965
extern "C"
{
__declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}
This will select the high performance GPU as long as no profile exists that assigns the application to another GPU.
Please make sure to use a 13.35 or newer driver. Older drivers do not support this.
This code will be ignored when you compile on non-Windows machines:
#ifdef _WIN32
#include <windows.h>
extern "C" __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
extern "C" __declspec(dllexport) DWORD AmdPowerXpressRequestHighPerformance = 0x00000001;
#endif
I recently found an interesting issue. When using SetEnvironmentVariable, I can use Process Explorer to see the newly created environment variable. However, when the process itself is 32-bit and the OS is 64-bit, Process Explorer (at least v10 through the latest v11.33) cannot find the new variables. If the program is native 64-bit then everything works fine, and the same is true for a 32-bit process running on a 32-bit OS.
The SetEnvironmentVariable API call should be successful, because the return value is TRUE and calling GetEnvironmentVariable returns the correct value. Also, if you create a child process, you can see in Process Explorer that the variable was correctly set in the new process.
I'm not sure if this is a limitation of SysWOW64 or a bug in Process Explorer. Does anyone know?
And is there any way to read the 32-bit environment variables correctly (for example, by forcing Process Explorer to run in 32-bit mode, or with some other tool)?
Sample source to reproduce:
#include <stdio.h>
#include <windows.h>

int main(int argc, char *argv[])
{
    printf("setting variable... %s\n",
           SetEnvironmentVariable("a_new_var", "1.0") ? "OK" : "FAILED");
    printf("press anykey to continue...\n");
    getchar();
    // system(argv[0]); // uncomment to inspect the child process
    return 0;
}
I'm not sure how WOW64 works, but I'm pretty (99%) sure there are two PEBs (Process Environment Blocks) created: a 32-bit one and a 64-bit one. The process parameter structures (RTL_USER_PROCESS_PARAMETERS) are probably duplicated as well. So when you call SetEnvironmentVariable it is only modifying the 32-bit environment block. Process Explorer runs as a native 64-bit program, which means it only knows about the 64-bit PEB and the 64-bit environment block (which hasn't changed).
Update (2010-07-10):
Just some new info on this old topic: You can find the 32-bit PEB by calling NtQueryInformationProcess with ProcessWow64Information. It gives you a PVOID with the address of the PEB.
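A minimal sketch of that query (run against the current process here; NtQueryInformationProcess is resolved from ntdll.dll at runtime, as Microsoft recommends for this semi-documented API):

#include <windows.h>
#include <winternl.h>
#include <cstdio>

typedef NTSTATUS (NTAPI *NtQueryInformationProcess_t)(
    HANDLE, PROCESSINFOCLASS, PVOID, ULONG, PULONG);

int main() {
    // Resolve NtQueryInformationProcess from ntdll.dll at runtime.
    HMODULE ntdll = GetModuleHandleW(L"ntdll.dll");
    NtQueryInformationProcess_t NtQueryInformationProcess_fn =
        (NtQueryInformationProcess_t)GetProcAddress(ntdll, "NtQueryInformationProcess");
    if (!NtQueryInformationProcess_fn)
        return 1;

    // Query the current process; for another process, open a handle with OpenProcess instead.
    PVOID peb32 = nullptr;
    ULONG len = 0;
    NTSTATUS status = NtQueryInformationProcess_fn(
        GetCurrentProcess(), ProcessWow64Information, &peb32, sizeof(peb32), &len);

    if (status == 0 && peb32 != nullptr)
        printf("WOW64 process, 32-bit PEB at %p\n", peb32);
    else
        printf("not a WOW64 process (or query failed: 0x%08lX)\n", (unsigned long)status);
    return 0;
}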