I ran the same code on Linux (gcc 4.3.2) and on Windows (Visual Studio Express 2010).
The execution time on Linux was around 54 seconds, while on Windows it was around 1207 seconds.
Why is this so?
The code uses the C++ STL map, set and vector.
The same code took 9 seconds when executed on ideone.
Are the STL implementations different?
To measure time I used the following:
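int main() {
    clock_t tStart = clock();
    /* ... code being timed ... */
    printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
    return 0;
}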
I know this method of measuring time is not accurate, but 54 and 1207 are too far apart.

Probably because you're comparing debug builds.
Don't do that.
If you want to know how fast your code is, compile it with optimizations.
MSVC++ does a lot of additional asserts and debug checks in debug builds, such as iterator debugging in the standard containers.


Are there compatibility issues with clang-cl and arch:avx2?

I'm using Windows 10, Visual Studio 2019, Platform: x64 and have the following test script in a single-file Visual Studio Solution:
#include <cstdlib>
#include <iostream>
#include <intrin.h>
using namespace std;
int main() {
    unsigned __int64 mask = 0x0fffffffffffffff; // 1152921504606846975
    unsigned long index;
    _BitScanReverse64(&index, mask);
    if (index != 59) {
        cout << "Fails!" << endl;
        return EXIT_FAILURE;
    }
    else {
        cout << "Success!" << endl;
        return EXIT_SUCCESS;
    }
}
In my project properties I've set 'Enable Enhanced Instruction Set' to 'Advanced Vector Extensions 2 (/arch:AVX2)'.
When compiling with msvc (setting 'Platform Toolset' to 'Visual Studio 2019 (v142)') the code returns EXIT_SUCCESS, but when compiling with clang-cl (setting 'Platform Toolset' to 'LLVM (clang-cl)') I get EXIT_FAILURE. When debugging the clang-cl run, the value of index is 4, when it should be 59. This suggests to me that clang-cl is reading the bits in the opposite direction of MSVC.
This isn't the case when I set 'Enable Enhanced Instruction Set' to 'Not Set'. In this scenario, both MSVC and clang-cl return EXIT_SUCCESS.
All of the DLLs that are loaded and shown in the Debug Output window come from C:\Windows\System32###.dll in all cases.
Does anyone understand this behavior? I would appreciate any insight here.
EDIT: I failed to mention earlier: I compiled this on an Intel Core i7-3930K CPU @ 3.20GHz.
Getting 4 instead of 59 sounds like clang implemented _BitScanReverse64 as 63 - lzcnt. Actual bsr is slow on AMD, so yes, there are reasons why a compiler would want to compile a BSR intrinsic to a different instruction.
But then you ran the executable on a computer that doesn't actually support BMI, so lzcnt decoded as rep bsr = bsr. That returns the bit-index of the highest set bit (59), and subtracting it from 63 leaves 4, the leading-zero count, instead of the bit-index you wanted.
AFAIK, all CPUs that have AVX2 also have BMI. If your CPU doesn't have that, you shouldn't expect executables built with /arch:AVX2 to run correctly on your CPU. And in this case the failure mode wasn't an illegal instruction, it was lzcnt running as bsr.
MSVC doesn't generally optimize intrinsics, apparently including this case, so it just uses bsr directly.
Update: i7-3930K is Sandy Bridge-E. It doesn't have AVX2, so that explains your results.
clang-cl doesn't error when you tell it to build an AVX2 executable on a non-AVX2 computer. The use-case for that would be compiling on one machine to create an executable to run on different machines.
It also doesn't add CPUID-checking code to your executable for you. If you want that, write it yourself. This is C++, it doesn't hold your hand.
target CPU options
MSVC-style /arch options are much more limited than normal GCC/clang style. There aren't any for different levels of SSE like SSE4.1; it jumps straight to AVX.
Also, /arch:AVX2 apparently implies BMI1/2, even though those are different instruction-sets with different CPUID feature bits. In kernel code for example you might want integer BMI instructions but not SIMD instructions that touch XMM/YMM registers.
clang -O3 -mavx2 would not also enable -mbmi. You normally would want that, but if you failed to also enable BMI then clang would have been stuck using bsr (which is actually better for Intel CPUs than 63-lzcnt). I think MSVC's /arch:AVX2 is something like -march=haswell, if it also enables FMA instructions.
And nothing in MSVC has any support for making binaries optimized to run on the computer you build them on. That makes sense, it's designed for a closed-source binary-distribution model of software development.
But GCC and clang have -march=native to enable all the instruction sets your computer supports. And also importantly, set tuning options appropriate for your computer. e.g. don't worry about making code that would be slow on an AMD CPU, or on older Intel, just make asm that's good for your CPU.
TL;DR: CPU selection options in clang-cl are very coarse, lumping non-SIMD extensions in with some level of AVX. That's why /arch:AVX2 enabled the integer BMI extensions, while clang -mavx2 would not have.

Under Chisel 3, it takes 10 min to compile the Verilator generated C++ of Rocket Chip. Are there any ways to speed this up?

We are modifying Rocket Chip code. After each modification, we need to run the assembly programs, to be sure everything still runs correctly.
To do this, the steps are:
1) Run Chisel, to generate Verilog
2) Run the verilog through Verilator, to generate C++
3) Compile generated C++
4) Run tests
Step 3 is about 10 times longer than it was under Chisel 2. It takes about 10 minutes, which slows development.
Is there any way to speed this up?
I have found a non-trivial amount of build and run time is spent on not-really-synthesizable constructs that are used for verification support.
For example, I disable the TLMonitors through the Config options. You can find an example in the subsystem Configs.
class WithoutTLMonitors extends Config((site, here, up) => {
  case MonitorsEnabled => false
})

Numerical regression in x64 with the VS2013 compiler with latest Haswell chips?

We recently started seeing unit tests fail on our build machine (certain numerical calculations fell out of tolerance). Upon investigation we found that some of our developers could not reproduce the test failure. To cut a long story short, we eventually tracked the problem down to what appeared to be a rounding error, but that error was only occurring with x64 builds on the latest Haswell chips (to which our build server was recently upgraded). We narrowed it down and pulled out a single calculation from one of our tests:
#include "stdafx.h"
#include <cmath>
#include <cstdio>
int _tmain(int argc, _TCHAR* argv[])
{
    double rate = 0.0021627412080263146;
    double T = 4.0246575342465754;
    double res = exp(-rate * T);
    printf("%0.20e\n", res);
    return 0;
}
When we compile this for x64 in VS2013 (with the default compiler switches, including /fp:precise), it gives different results on the older Sandy Bridge chip and the newer Haswell chip. The difference is in the 15th significant digit (which I think is outside the machine epsilon for double on both machines).
If we compile the same code in VS2010 or VS2012 (or, incidentally, VS2013 x86) we get the exact same answer on both chips.
In the past several years, we've gone through many versions of Visual Studio and many different Intel chips for testing, and no-one can recall us ever having to adjust our regression test expectations based on different rounding errors between chips.
This obviously led to a game of whack-a-mole between developers with the older and newer hardware as to what should be the expectation for the tests...
Is there a compiler option in VS2013 that we need to be using to somehow mitigate the discrepancy?
Results on Sandy Bridge developer PC:
VS2010-compiled-x64: 9.91333479983898980000e-001
VS2012-compiled-x64: 9.91333479983898980000e-001
VS2013-compiled-x64: 9.91333479983898980000e-001
Results on Haswell build server:
VS2010-compiled-x64: 9.91333479983898980000e-001
VS2012-compiled-x64: 9.91333479983898980000e-001
VS2013-compiled-x64: 9.91333479983899090000e-001
I used procexp to capture the list of DLLs loaded into the test program.
Sandy Bridge developer PC:
Haswell build server:
The results you documented are affected by the value of the MXCSR register; the two bits that select the rounding mode are what matter here. To get the "happy" number you like, you need to force the processor to round down, like this:
#include "stdafx.h"
#include <cmath>
#include <cstdio>
#include <float.h>
int _tmain(int argc, _TCHAR* argv[]) {
    unsigned prev;
    _controlfp_s(&prev, _RC_DOWN, _MCW_RC);  // force round-down
    double rate = 0.0021627412080263146;
    double T = 4.0246575342465754;
    double res = exp(-rate * T);
    printf("%0.20f\n", res);
    return 0;
}
Output: 0.99133347998389898000
Change _RC_DOWN to _RC_NEAR to put MXCSR in the normal rounding mode, the way the operating system programs it before it starts your program. That produces 0.99133347998389909000. In other words, your Haswell machines are in fact producing the expected value.
Exactly how this happened can be very hard to diagnose; the control register is the worst possible global variable you can think of. The usual cause is an injected DLL that reprograms the FPU. A debugger can show the loaded DLLs; compare the lists between the two machines to find a candidate.
This is due to a bug in the x64 code of the MSVC 2013 CRT, which improperly detects AVX and FMA3 support.
It is fixed in an updated 2013 runtime or in a newer MSVC version, or you can disable the feature at runtime by calling "_set_FMA3_enable(0);".

Detecting when in power-save mode

I'm trying to detect when the computer enters power-save mode. Problem is, this program has to run on both Windows XP and 7. RegisterPowerSettingNotification only works for Vista and newer, so that's not an option. I also tried using SystemParametersInfo with the SPI_GETSCREENSAVERRUNNING but that doesn't work for the power-save mode, which is what the computer is actually set for. Any other suggestions?
To answer my own question, grabbing the screensaver timeout and the last user input, and comparing the two appears to be the best way:
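int screenTimeout = 0;                                   // screensaver timeout, in seconds
LASTINPUTINFO lastInput;
lastInput.cbSize = sizeof(LASTINPUTINFO);
SystemParametersInfo(SPI_GETSCREENSAVETIMEOUT, 0, &screenTimeout, 0);
GetLastInputInfo(&lastInput);                            // fills lastInput.dwTime
DWORD ticks = GetTickCount();
int lastInputTime = (ticks - lastInput.dwTime) / 1000;   // seconds since last input
bool timedOut = lastInputTime > screenTimeout;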
GetLastInputInfo reports the tick count at the time of the last user input. According to MSDN, the tick count has a resolution of between 10 and 16 ms, so this isn't a precise way of measuring time, but it's good enough for my purposes.

What is the best way to test the performance of a program in Linux?

Suppose I write a program, then I make some "optimization" to the code.
In my case I want to test how much the new C++11 feature std::move can improve the performance of a program.
I would like to check whether the "optimization" makes sense.
Currently I test it with the following steps:
1) Write a program (without std::move), compile it, and get binary file m1
2) Optimize it (using std::move), compile it, and get binary file m2
3) Use the "time" command to compare the running times:
time ./m1 ; time ./m2
To get a statistically meaningful result, the test needs to be run thousands of times.
Are there better ways to do that, or are there tools that can help with it?
In general, measuring performance using a simple time comparison, e.g. endTime-beginTime, is always a good start for a rough estimate.
Later on you can use a profiler, like Valgrind, to measure how different parts of your program are performing.
With profiling you can measure space (memory) or time complexity of a program, usage of particular instructions or frequency/duration of function calls.
There's also AMD CodeAnalyst if you want more advanced profiling functionality using a GUI. It's free/open source.
There are several tools (among others) that can do profiling for you:
GNU gprof
Google gperftools
Intel VTune Amplifier (part of the Intel XE compiler package)
kernel perf
AMD CodeXL (successor of AMD CodeAnalyst)
Some of them require a specific way of compilation or a specific compiler. Some of them are specifically good at profiling for a given processor-architecture (AMD/Intel...).
Since you seem to have access to C++11 and if you just want to measure some timings you can use std::chrono.
#include <chrono>
#include <iostream>

class high_resolution_timer
{
    typedef std::chrono::high_resolution_clock clock;
    clock::time_point m_time_point;

public:
    high_resolution_timer (void)
        : m_time_point(clock::now()) { }

    // Reset the reference point to "now".
    void restart (void)
    {
        m_time_point = clock::now();
    }

    // Time elapsed since construction or the last restart().
    template<class Duration>
    Duration stopover (void)
    {
        return std::chrono::duration_cast<Duration>
            (clock::now() - m_time_point);
    }
};

int main (void)
{
    using std::chrono::microseconds;
    high_resolution_timer timer;
    // do stuff here
    microseconds first_result = timer.stopover<microseconds>();
    timer.restart();
    // do other stuff here
    microseconds second_result = timer.stopover<microseconds>();
    std::cout << "First took " << first_result.count() << " x 10^-6;";
    std::cout << " second took " << second_result.count() << " x 10^-6.";
    std::cout << std::endl;
}
But you should be aware that there's almost no sense in optimizing several milliseconds of overall runtime (if your program runtime will be >= 1s). You should instead time highly repetitive events in your code (if there are any, or at least those which are the bottlenecks). If those improve significantly (and this can be in terms of milli or microseconds) your overall performance will likely increase, too.
