The C++ code compiled by Clang runs a lot faster than the same code compiled by MSVC. I checked the assembly and found that Clang automatically uses SIMD instructions for speed. So I rewrote the main calculation using AVX intrinsics. Even then, the Clang-compiled program is still about 10% faster.
Is it generally accepted that Clang performs better than MSVC on Windows? Or have I missed some important MSVC optimization settings?
I tested this code:
static __inline int RGBToY(unsigned char r, unsigned char g, unsigned char b) {
  return (66 * r + 129 * g + 25 * b + 0x1080) >> 8;
}
void ToYRow_C(const unsigned char* src_argb0, unsigned char* dst_y, int width) {
  int x;
  for (x = 0; x < width; ++x) {
    dst_y[0] = RGBToY(src_argb0[2], src_argb0[1], src_argb0[0]);
    src_argb0 += 3;
    dst_y += 1;
  }
}
The compiler flags for Clang are -O2 -mavx2; the flags for MSVC are /O2 /arch:AVX2.
Processing a 2560x1440 image takes 1.2 ms with the Clang-compiled program and 4.2 ms with the MSVC-compiled program.
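For anyone who wants to reproduce the comparison, here is a minimal sketch of a scalar reference plus a timing helper that can be built unchanged with both compilers. The names ToYRow_Ref and TimeImageMs are hypothetical, and the index-based loop is only a restatement of the pointer-bumping loop above; whether it changes what MSVC vectorizes is an assumption to be measured, not a known result.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same fixed-point formula as RGBToY in the question.
static inline int RGBToY(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
  return (66 * r + 129 * g + 25 * b + 0x1080) >> 8;
}

// Hypothetical index-based variant of ToYRow_C (3 bytes per pixel: B, G, R).
void ToYRow_Ref(const std::uint8_t* src, std::uint8_t* dst_y, int width) {
  assert(width >= 0);
  for (int x = 0; x < width; ++x) {
    dst_y[x] = static_cast<std::uint8_t>(
        RGBToY(src[3 * x + 2], src[3 * x + 1], src[3 * x + 0]));
  }
}

// Hypothetical helper: average milliseconds per full-image pass.
double TimeImageMs(int width, int height, int repeats) {
  assert(width > 0 && height > 0 && repeats > 0);
  const std::size_t w = static_cast<std::size_t>(width);
  const std::size_t h = static_cast<std::size_t>(height);
  std::vector<std::uint8_t> src(w * h * 3, 128);
  std::vector<std::uint8_t> dst(w * h);
  volatile unsigned sink = 0;  // keeps the optimizer from discarding the work
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < repeats; ++r) {
    for (std::size_t y = 0; y < h; ++y) {
      ToYRow_Ref(&src[y * w * 3], &dst[y * w], width);
    }
    sink = sink + dst[dst.size() - 1];
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count() /
         repeats;
}
```

Calling TimeImageMs(2560, 1440, 100) under each compiler gives a like-for-like number; the absolute values will depend on the machine.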


DLL Floating point results differ according to caller

This is a follow-up question to my earlier one, asked yesterday.
The problems were occurring in an MSVS 2008 C++ DLL that has over 4000 lines of code, but I have managed to produce a simple case that demonstrates the problem as it occurs on my CPU (an AMD Phenom II X6 1050T).
Will it show the problem occurring on another system? I'd really like to know!
Here is a simple class (Point.cpp); it needs to be compiled as a DLL:
#include <math.h>
#define EXPORT extern "C" __declspec(dllexport)
namespace Test {
    struct Point {
        double x;
        double y;
        /* Constructor for a Point object */
        Point(double xx, double yy) : x(xx), y(yy) {}
        /* Copy constructor */
        Point(const Point &rhs) : x(rhs.x), y(rhs.y) {}
        double mag() const;
        Point norm() const;
    };
    double Point::mag() const {return sqrt(x*x + y*y);}
    Point Point::norm() const {
        double m = mag();
        return Point(x/m, y/m);
    }
    EXPORT void __stdcall GetNorm(double x, double y, double *nx, double *ny)
    {
        Point P = Point(x, y);
        Point N = P.norm();
        *nx = N.x;
        *ny = N.y;
    }
} // namespace Test
Here is the test program (TestPoint.c), which needs to be linked to the lib created for the DLL:
#include <stdio.h>
#define IMPORT extern __declspec(dllimport)
IMPORT void __stdcall GetNorm(double x, double y, double *nx, double *ny);
void dhex(double x) { // double to hex
    union {
        unsigned long n[2]; // assumes a 32-bit unsigned long, as on Windows
        double d;
    } value;
    value.d = x;
    // %08lx matches unsigned long and keeps leading zeros of each half
    printf("(0x%08lx%08lx)\n", value.n[1], value.n[0]);
}
double i64tod(unsigned long long n) { // hex to double
    double *DP = (double *) &n;
    return *DP;
}
int main(int argc, char **argv) {
    double vx, vy;
    double ux, uy;
    vx = i64tod(0xbfc7a30f3a53d351);
    vy = i64tod(0xc01b578b34e3ce1d);
    GetNorm(vx, vy, &ux, &uy);
    printf(" vx = %20.18f ", vx); dhex(vx);
    printf(" vy = %20.18f ", vy); dhex(vy);
    printf(" ux = %20.18f ", ux); dhex(ux);
    printf(" uy = %20.18f ", uy); dhex(uy);
    return 0;
}
On my system, with TestPoint compiled with VC++, the output is:
vx = -0.18466368053455054 (0xbfc7a30f3a53d351)
vy = -6.8354919685403077 (0xc01b578b34e3ce1d)
ux = -0.027005566159023012 (0xbf9ba758ddda1454)
uy = -0.99963528318903927 (0xbfeffd032227301b)
However, if the same code is compiled with gcc, or indeed, it seems, ANY equivalent program (e.g. VB6, PowerBasic), the results (ux and uy) are subtly but definitely different (in the last hex digit):
vx = -0.184663680534550540 (0xbfc7a30f3a53d351)
vy = -6.835491968540307700 (0xc01b578b34e3ce1d)
ux = -0.027005566159023008 (0xbf9ba758ddda1453)
uy = -0.999635283189039160 (0xbfeffd032227301a)
This might seem an insignificant difference, but when it occurs in a physics engine, these differences accumulate at an alarming rate.
If the engine is going to produce different results depending on who calls it, I might have to abandon VC++ altogether and try g++ instead.
OK, I think I know how this happens. Looking at a disassembly listing of Point.dll, I noticed that the GetNorm function was pretty much what you'd expect: a couple of FMULs and FDIVs. What was not present was an FLDCW instruction.
There weren't any FLDCWs in the MSVC calling program either, but I found FLDCWs in both the gcc and the PowerBasic versions of the calling program.
So I tweaked one of the executables (the PowerBasic EXE was the easiest one in which to find the right place to tweak), and hey presto, I then got answers that matched MSVC. Presumably the FLDCW had changed the FPU rounding mode, hence the difference in the least significant bits.
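To see how caller-controlled floating-point state can move the last bit of a result, here is a minimal sketch that uses the standard <cfenv> rounding-mode functions. It is an illustration under assumptions, not a reproduction of the DLL case: it does not issue FLDCW itself, it only changes the rounding mode (the x87 control word also holds a precision field), and the helper names BitsOf and DivideWithRounding are hypothetical.

```cpp
#include <cassert>
#include <cfenv>
#include <cstdint>
#include <cstring>

// Returns the raw IEEE-754 bit pattern of a double (same idea as dhex above).
std::uint64_t BitsOf(double d) {
  static_assert(sizeof(std::uint64_t) == sizeof(double),
                "double is expected to be 64 bits");
  std::uint64_t u;
  std::memcpy(&u, &d, sizeof u);
  return u;
}

// Hypothetical helper: computes num / den under the given rounding mode,
// then restores the mode that was active on entry.
double DivideWithRounding(double num, double den, int mode) {
  volatile double n = num;  // volatile keeps the division at run time
  volatile double d = den;
  const int saved = std::fegetround();
  const int rc = std::fesetround(mode);
  assert(rc == 0);  // the requested mode must be supported
  (void)rc;
  volatile double q = n / d;
  std::fesetround(saved);
  return q;
}
```

Comparing BitsOf(DivideWithRounding(1.0, 3.0, FE_TONEAREST)) with the FE_UPWARD result shows the same inputs producing bit patterns that differ in the last place, which is the kind of difference seen in ux and uy.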

Existence of "simd reduction(:)" In GCC and MSVC?

The simd pragma can be used with the icc compiler to perform a reduction:
#pragma simd
#pragma simd reduction(+:acc)
#pragma ivdep
for(int i( 0 ); i < N; ++i )
acc += x[i];
Is there any equivalent solution in MSVC and/or GCC?
For Visual Studio 2012:
With options /O1, /O2 or /GL, use /Qvec-report:1 or /Qvec-report:2 to report vectorization:
int s = 0;
for ( int i = 0; i < 1000; ++i )
s += A[i]; // vectorizable
In the case of reductions over "float" or "double" types, vectorization requires that the /fp:fast switch is thrown. This is because vectorizing the reduction operation depends upon "floating point reassociation". Reassociation is only allowed when /fp:fast is thrown.
Ref(associated doc;p12)
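The reassociation point in the quote above is easy to check by hand: floating-point addition is not associative, so summing in a different order (which is what a vectorized reduction does) can change the result. A minimal sketch, with hypothetical function names and values chosen only to make the effect obvious:

```cpp
#include <cassert>

// Computes (a + b) + c, rounding the intermediate sum to float.
float SumLeftToRight(float a, float b, float c) {
  volatile float t = a + b;  // volatile forces the intermediate to be stored
  return t + c;
}

// Computes a + (b + c), rounding the intermediate sum to float.
float SumRightToLeft(float a, float b, float c) {
  volatile float t = b + c;
  return a + t;
}

// With a large and a small term, the order decides whether the small
// term survives: 1.0f is absorbed when it is first added to -1e20f.
void CheckOrderMatters() {
  const float big = 1e20f;
  assert(SumLeftToRight(big, -big, 1.0f) == 1.0f);
  assert(SumRightToLeft(big, -big, 1.0f) == 0.0f);
}
```

This is why the compiler only reorders a float or double reduction when a switch such as /fp:fast gives it permission to.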
GCC definitely can vectorize. Suppose you have file reduc.c with contents:
int foo(int *x, int N)
{
    int acc = 0, i;

    for( i = 0; i < N; ++i )
        acc += x[i];
    return acc;
}
Compile it (I used gcc 4.7.2) with command line:
$ gcc -O3 -S reduc.c -ftree-vectorize -msse2
Now you can see the vectorized loop in the assembly output.
You can also switch on verbose vectorizer output, for example with:
$ gcc -O3 -S reduc.c -ftree-vectorize -msse2 -ftree-vectorizer-verbose=1
Now you will get a report on the console:
Analyzing loop at reduc.c:5
Vectorizing loop at reduc.c:5
reduc.c:1: note: vectorized 1 loops in function.
Look at the official docs to better understand cases where GCC can and cannot vectorize.
gcc requires -ffast-math to enable this optimization (as mentioned in the reference given above), regardless of use of #pragma omp simd reduction.
icc is becoming less reliant on pragma for this optimization (except that /fp:fast is needed in absence of pragma), but the extra ivdep and simd pragmas in the original post are undesirable. icc may do bad things when given a pragma simd which doesn't include all relevant reduction, firstprivate, and lastprivate clauses (and gcc may break with -ffast-math, particularly in combination with -march or -mavx).
msvc 2012/2013 are very limited in auto-vectorization. There are no simd reductions, no vectorization within OpenMP parallel regions, no vectorization of conditionals, and no advantage is taken of __restrict in vectorizations (there is some run-time check to vectorize less efficiently but safely without __restrict).
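For reference, the OpenMP spelling mentioned above looks like the sketch below. This is a hedged example, not a statement about what any particular compiler version will do with it: with GCC the pragma is honored when OpenMP SIMD support is enabled (for example -fopenmp or -fopenmp-simd) and ignored otherwise, in which case the loop is still a correct scalar sum. The function names are hypothetical.

```cpp
#include <cassert>

// Integer reduction: reordering the additions cannot change the result.
int SumInts(const int* x, int n) {
  assert(n >= 0);
  int acc = 0;
#pragma omp simd reduction(+ : acc)
  for (int i = 0; i < n; ++i) {
    acc += x[i];
  }
  return acc;
}

// Float reduction: the clause explicitly permits the reordering that a
// vectorized sum needs, so the result may differ in the last bits from
// a strictly sequential sum.
float SumFloats(const float* x, int n) {
  assert(n >= 0);
  float acc = 0.0f;
#pragma omp simd reduction(+ : acc)
  for (int i = 0; i < n; ++i) {
    acc += x[i];
  }
  return acc;
}
```

Checking the compiler's vectorization report is still the only way to know whether the loop was actually vectorized.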

Conditional Compilation of CUDA Function

I created a CUDA function for calculating the sum of an image using its histogram.
I'm trying to compile the kernel and the wrapper function for multiple compute capabilities.
__global__ void calc_hist(unsigned char* pSrc, int* hist, int width, int height, int pitch)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
#if __CUDA_ARCH__ > 110 //Shared Memory For Devices Above Compute 1.1
    __shared__ int shared_hist[256];
#endif
    int global_tid = yIndex * pitch + xIndex;
    int block_tid = threadIdx.y * blockDim.x + threadIdx.x;
    if(xIndex>=width || yIndex>=height) return;
#if __CUDA_ARCH__ == 110 //Calculate Histogram In Global Memory For Compute 1.1
    atomicAdd(&hist[pSrc[global_tid]],1); /*< Atomic Add In Global Memory */
#elif __CUDA_ARCH__ > 110 //Calculate Histogram In Shared Memory For Compute Above 1.1
    shared_hist[block_tid] = 0; /*< Clear Shared Memory */
    __syncthreads();
    atomicAdd(&shared_hist[pSrc[global_tid]],1); /*< Atomic Add In Shared Memory */
    __syncthreads();
    if(shared_hist[block_tid] > 0) /* Only Write Non Zero Bins Into Global Memory */
        atomicAdd(&hist[block_tid],shared_hist[block_tid]);
#else
    return; //Do Nothing For Devices Of Compute Capability 1.0
#endif
}
Wrapper Function:
int sum_8u_c1(unsigned char* pSrc, double* sum, int width, int height, int pitch, cudaStream_t stream = NULL)
{
#if __CUDA_ARCH__ == 100
    printf("Compute Capability Not Supported\n");
    return 0;
#endif
    int *hHist,*dHist;
    cudaMalloc(&dHist,256 * sizeof(int));
    cudaHostAlloc(&hHist,256 * sizeof(int),cudaHostAllocDefault);
    cudaMemsetAsync(dHist,0,256 * sizeof(int),stream);
    dim3 Block(16,16);
    dim3 Grid;
    Grid.x = (width + Block.x - 1)/Block.x;
    Grid.y = (height + Block.y - 1)/Block.y;
    calc_hist<<<Grid,Block,0,stream>>>(pSrc,dHist,width,height,pitch);
    cudaMemcpyAsync(hHist,dHist,256 * sizeof(int),cudaMemcpyDeviceToHost,stream);
    (*sum) = 0.0;
    for(int i=1; i<256; i++)
        (*sum) += (hHist[i] * i);
    printf("sum = %f\n",(*sum));
    return 1;
}
Question 1:
When compiling for sm_10, the wrapper and the kernel shouldn't execute. But that is not what happens. The whole wrapper function executes. The output shows sum = 0.0.
I expected the output to be Compute Capability Not Supported as I have added the printf statement in the start of the wrapper function.
How can I prevent the wrapper function from executing on sm_10? I don't want to add any run-time checks like if statements etc. Can it be achieved through template meta programming?
Question 2:
When compiling for greater than sm_10, the program executes correctly only if I add cudaStreamSynchronize after the kernel call. But if I do not synchronize, the output is sum = 0.0. Why is it happening? I want the function to be asynchronous w.r.t the host as much as possible. Is it possible to shift the only loop inside the kernel?
I am using GTX460M, CUDA 5.0, Visual Studio 2008 on Windows 8.
Ad. Question 1
As Robert already explained in the comments, __CUDA_ARCH__ is defined only when compiling device code. To clarify: when you invoke nvcc, the code is parsed and compiled twice, once for the CPU and once for the GPU. The existence of __CUDA_ARCH__ can be used to check which of those two passes is occurring, and then, for the device code (as you do in the kernel), its value tells you which GPU you are targeting.
However, on the host side not all is lost. While you don't have __CUDA_ARCH__, you can call the API function cudaGetDeviceProperties, which returns a lot of information about your GPU. In particular, you may be interested in the fields major and minor, which indicate the compute capability. Note that this is done at run time, not at the preprocessing stage, so the same CPU code will work on all GPUs.
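A minimal host-side sketch of that run-time decision follows. It is kept free of CUDA headers so it compiles anywhere; HistPath and SelectHistPath are hypothetical names, and the thresholds simply mirror the __CUDA_ARCH__ comparisons used in the kernel above (1.0 unsupported, 1.1 global atomics, above 1.1 shared atomics).

```cpp
#include <cassert>

// Which histogram strategy the wrapper should use on a given device.
enum class HistPath { Unsupported, GlobalAtomics, SharedAtomics };

// Maps the major/minor fields reported for a device to the same numeric
// scheme as __CUDA_ARCH__ (compute capability 1.1 -> 110, 2.1 -> 210).
HistPath SelectHistPath(int major, int minor) {
  assert(major >= 1 && minor >= 0);
  const int arch = major * 100 + minor * 10;
  if (arch < 110) return HistPath::Unsupported;     // compute 1.0
  if (arch == 110) return HistPath::GlobalAtomics;  // compute 1.1
  return HistPath::SharedAtomics;                   // above compute 1.1
}

// In the real wrapper the inputs would come from the runtime API, e.g.:
//   cudaDeviceProp prop;
//   cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
//   HistPath path = SelectHistPath(prop.major, prop.minor);
//   if (path == HistPath::Unsupported) { /* report and return 0 */ }
```

This is a run-time check, which the question wanted to avoid, but it costs one comparison per call and can be cached after the first query.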
Ad. Question 2
Kernel calls and cudaMemcpyAsync are asynchronous. This means that if you don't call cudaStreamSynchronize (or similar), the CPU code that follows will continue running even if your GPU hasn't finished its work. As a result, the data you copy from dHist to hHist might not be there yet when you begin operating on hHist in the loop. If you want to work on the output of a kernel, you have to wait until the kernel finishes.
Note that cudaMemcpy (without Async) has an implicit synchronization inside.

Generating a comprehensive callgraph using GCC & Egypt

I am trying to generate a comprehensive callgraph (complete with low level calls to Linux, runtime, the lot).
I have statically compiled my source files with "-fdump-rtl-expand" and created RTL files, which I passed to a Perl script called Egypt (which I believe uses Graphviz/Dot) to generate a PDF file of the callgraph. This works perfectly, no problems at all.
Except that calls made into some libraries are shown as < built-in >. Is there a way for the callgraph to show the real calls made into the libraries instead?
Please let me know if the question is unclear.
Basically, I am trying to keep the callgraph from showing < built-in >.
Is there a way to do that?
-------- CODE ---------
#include <cilk/cilk.h>
#include <stdio.h>
#include <stdlib.h>
unsigned long int t0, t5;
unsigned int NOSPAWN_THRESHOLD = 32;
int fib_nospawn(int n)
{
    if (n < 2)
        return n;
    int x = fib_nospawn(n-1);
    int y = fib_nospawn(n-2);
    return x + y;
}
// spawning fibonacci function
int fib(long int n)
{
    long int x, y;
    if (n < 2)
        return n;
    else if (n <= NOSPAWN_THRESHOLD)
    {
        x = fib_nospawn(n-1);
        y = fib_nospawn(n-2);
        return x + y;
    }
    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync; // wait for both spawned calls before reading x and y
    return x + y;
}
int main(int argc, char *argv[])
{
    int n;
    long int result;
    long int exec_time;
    n = atoi(argv[1]);
    NOSPAWN_THRESHOLD = atoi(argv[2]);
    result = fib(n);
    printf("%ld\n", result);
    return 0;
}
I compiled the Cilk Library from source.
I might have found a partial solution to the problem:
You need to pass the following option to egypt
This produced a slightly more comprehensive callgraph, although < built-in > is still visible.
Can anyone suggest how I can get more depth in the callgraph?
You can use the GCC VCG Plugin: A gcc plugin, which can be loaded when debugging gcc, to show internal structures graphically.
gcc -fplugin=/path/to/ -fplugin-arg-vcg_plugin-cgraph foo.c
Call-graph is place to store data needed for inter-procedural optimization. All datastructures are divided into three components: local_info that is produced while analyzing the function, global_info that is result of global walking of the call-graph on the end of compilation and rtl_info used by RTL back-end to propagate data from already compiled functions to their callers.

How do I make a multi-threaded app use all the cores on Ubuntu under VMWare?

I have a multi-threaded app that processes a very large data file. It works great on Windows 7; the code is all C++ and uses the pthreads library for cross-platform multi-threading. When I run it under Windows on my Intel i3, Task Manager shows all four cores pegged to the limit, which is what I want. I compiled the same code using g++ on Ubuntu under VMWare Workstation: the same number of threads are launched, but all threads run on one core (as far as I can tell; Task Manager only shows one core busy).
I'm going to dive into the pthreads calls, since perhaps I missed some default setting, but if anybody has any ideas, I'd like to hear them, and I can give more info.
Update: I did set up VMWare to see all four cores, and /proc/cpuinfo shows 4 cores.
Update 2: I just wrote a simple app to show the problem. Maybe it's VMWare only? Do any Linux natives out there want to try it and see if it actually loads down multiple cores? To run this on Windows you will need the pthreads library, which is easily downloadable. And if anyone can suggest something more CPU-intensive than printf, go ahead!
#ifdef _WIN32
#include "stdafx.h"
#endif
#include "stdio.h"
#include "stdlib.h"
#include "pthread.h"
void *Process(void *data)
{
    long id = (long)data;
    for (int i=0;i<100000;i++)
    {
        printf("Process %ld says Hello World\n",id);
    }
    return NULL;
}
#ifdef _WIN32
int _tmain(int argc, _TCHAR* argv[])
#else
int main(int argc, char* argv[])
#endif
{
    int numCores = 1;
    if (argc>1)
        numCores = strtol(&argv[1][2],NULL,10);
    pthread_t *thread_ids = (pthread_t *)malloc(numCores*sizeof(pthread_t));
    for (int i=0;i<numCores;i++)
    {
        pthread_create(&thread_ids[i],NULL,Process,(void *)i);
    }
    for (int i=0;i<numCores;i++)
    {
        pthread_join(thread_ids[i],NULL); // wait for every worker to finish
    }
    return 0;
}
I changed your code a bit. I changed numCores = strtol(&argv[1][2], NULL, 10); to numCores = strtol(&argv[1][0], NULL, 10); to make it work under Linux when called as ./core 4. Maybe you were passing something in front of the number of cores, or maybe it is because type _TCHAR is 3 bytes per char? I'm not that familiar with Windows. Furthermore, since I wasn't able to stress the CPU with only printf, I also changed Process a bit.
void *Process(void *data)
{
    long hdata = (long)data;
    long id = (long)data;
    for (int i=0;i<10000000;i++)
    {
        printf("Process %ld says Hello World\n",id);
        for (int j = 0; j < 100000; j++)
        {
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
        }
    }
    return (void*)hdata;
}
And now when I run gcc -O2 -lpthread -std=gnu99 core.c -o core && ./core 4, you can see that all 4 threads are running on 4 different cores. Well, they are probably swapped from core to core over time, but all 4 cores are working overtime.
core.c: In function ‘main’:
core.c:75:50: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
Starting 4 threads
Process 0 says Hello World
Process 1 says Hello World
Process 3 says Hello World
Process 2 says Hello World
I verified it with htop. Hope it helps. :) I run a dedicated Debian sid x86_64 machine with 4 cores, in case you're wondering.
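If you want a load test without printf (which serializes on the output stream and so hides the parallelism), here is a minimal sketch using std::thread instead of raw pthreads. Burn, RunWorkers and DefaultThreadCount are hypothetical names, and the mixing constants are arbitrary; the only point is that each thread does pure arithmetic with no shared state.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// CPU-bound work with no I/O: repeated multiply-add on a 64-bit state.
std::uint64_t Burn(std::uint64_t seed, std::uint64_t iterations) {
  std::uint64_t h = seed;
  for (std::uint64_t i = 0; i < iterations; ++i) {
    h = h * 6364136223846793005ULL + 1442695040888963407ULL;
  }
  return h;
}

// Runs one Burn per thread, joins them all, and returns each result.
std::vector<std::uint64_t> RunWorkers(unsigned thread_count,
                                      std::uint64_t iterations) {
  assert(thread_count > 0);
  std::vector<std::uint64_t> results(thread_count);
  std::vector<std::thread> threads;
  threads.reserve(thread_count);
  for (unsigned t = 0; t < thread_count; ++t) {
    // Each thread writes only its own slot, so no locking is needed.
    threads.emplace_back(
        [&results, t, iterations] { results[t] = Burn(t, iterations); });
  }
  for (std::thread& th : threads) {
    th.join();
  }
  return results;
}

// Number of hardware threads the runtime reports, or 1 if unknown.
unsigned DefaultThreadCount() {
  const unsigned n = std::thread::hardware_concurrency();
  return n == 0 ? 1 : n;
}
```

Running RunWorkers(DefaultThreadCount(), some large iteration count) while watching htop should show whether the guest is really scheduling threads across all cores; on GCC this typically needs -pthread at link time.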
