Existence of "simd reduction(:)" in GCC and MSVC?

With the Intel compiler (icc), the simd pragma can be used to get a vectorized reduction:
#pragma simd
#pragma simd reduction(+:acc)
#pragma ivdep
for (int i(0); i < N; ++i)
{
    acc += x[i];
}
Is there an equivalent solution in MSVC and/or GCC?
Ref(p28): http://d3f8ykwhia686p.cloudfront.net/1live/intel/CompilerAutovectorizationGuide.pdf

For Visual Studio 2012:
Auto-vectorization runs under the options /O1, /O2 or /GL; to get a report on what was vectorized, use /Qvec-report:1 or /Qvec-report:2.
int s = 0;
for (int i = 0; i < 1000; ++i)
{
    s += A[i]; // vectorizable
}
In the case of reductions over "float" or "double" types, vectorization requires the /fp:fast switch, because vectorizing the reduction depends upon "floating point reassociation", and reassociation is only allowed under /fp:fast.
Ref(associated doc;p12) http://blogs.msdn.com/b/nativeconcurrency/archive/2012/07/10/auto-vectorizer-in-visual-studio-11-cookbook.aspx
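As a concrete illustration, here is a minimal sketch of such a float reduction (my own example, not taken from the cookbook); the compile line is an assumption based on the switches named above:
// sum.c -- try: cl /O2 /fp:fast /Qvec-report:2 sum.c
float sum(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        s += a[i]; // only vectorized when reassociation (/fp:fast) is allowed
    }
    return s;
}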

GCC can definitely vectorize this. Suppose you have a file reduc.c with the contents:
int foo(int *x, int N)
{
    int acc = 0, i; /* acc must be initialized */
    for (i = 0; i < N; ++i)
    {
        acc += x[i];
    }
    return acc;
}
Compile it (I used gcc 4.7.2) with the command line:
$ gcc -O3 -S reduc.c -ftree-vectorize -msse2
Now you can see the vectorized loop in the generated assembly.
You can also switch on verbose vectorizer output, say with:
$ gcc -O3 -S reduc.c -ftree-vectorize -msse2 -ftree-vectorizer-verbose=1
Now you will get a console report:
Analyzing loop at reduc.c:5
Vectorizing loop at reduc.c:5
5: LOOP VECTORIZED.
reduc.c:1: note: vectorized 1 loops in function.
Look at the official docs to better understand cases where GCC can and cannot vectorize.
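Newer GCC releases (4.9 and later; an assumption, since the answer above used 4.7.2) also accept the OpenMP 4.0 spelling of the pragma, which is the closest equivalent to icc's simd reduction. A minimal sketch:
/* Compile with: gcc -O2 -std=c99 -fopenmp-simd reduc_simd.c
   (-fopenmp works too, but additionally enables OpenMP threading). */
int foo_simd(int *x, int N)
{
    int acc = 0;
    #pragma omp simd reduction(+:acc)
    for (int i = 0; i < N; ++i)
    {
        acc += x[i];
    }
    return acc;
}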

gcc requires -ffast-math to enable this optimization for floating-point types (as mentioned in the reference given above), regardless of whether #pragma omp simd reduction is used.
icc is becoming less reliant on pragmas for this optimization (except that /fp:fast is needed in the absence of a pragma), and the extra ivdep and simd pragmas in the original post are undesirable. icc may do bad things when given a #pragma simd that doesn't include all relevant reduction, firstprivate, and lastprivate clauses (and gcc may break with -ffast-math, particularly in combination with -march or -mavx).
msvc 2012/2013 are very limited in auto-vectorization: there are no simd reductions, no vectorization within OpenMP parallel regions, no vectorization of conditionals, and no advantage is taken of __restrict (instead there is a run-time check that vectorizes less efficiently but safely without __restrict).

Related

Clang performs better than MSVC on Windows

The C++ code compiled by Clang runs a lot faster than the same code compiled by MSVC. I checked the generated assembly and found that Clang automatically uses SIMD instructions for speed. So I rewrote the main calculation part using AVX intrinsics; still, the program compiled by Clang is about 10% faster.
Is it common knowledge that Clang performs better than MSVC on Windows, or did I miss some important optimization settings in MSVC?
I've tested these code:
static __inline int RGBToY(unsigned char r, unsigned char g, unsigned char b) {
    return (66 * r + 129 * g + 25 * b + 0x1080) >> 8;
}

void ToYRow_C(const unsigned char* src_argb0, unsigned char* dst_y, int width) {
    int x;
    for (x = 0; x < width; ++x) {
        dst_y[0] = RGBToY(src_argb0[2], src_argb0[1], src_argb0[0]);
        src_argb0 += 3;
        dst_y += 1;
    }
}
The compile flags for Clang were -O2 -mavx2, and for MSVC /O2 /arch:AVX2.
Processing a 2560x1440 image costs 1.2 ms with the Clang-compiled program and 4.2 ms with the MSVC-compiled one.
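One quick diagnostic, tying back to the reporting switch discussed earlier, is to ask MSVC what it vectorized (the file name here is illustrative):
cl /O2 /arch:AVX2 /Qvec-report:2 toyrow.c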

Parallel processing a prime finder with openMP

I am trying to construct a prime finder for a bit of C practice. I've got the algorithm down and have done a bunch of optimisations to make it faster; I then decided to try to parallelize it because, hey, why not! It turns out to be harder than I thought. I can either get all threads running the same process (with the same args), or only a single thread runs if I try to supply different args to each process. I really have no idea what I'm doing here, but you can see some experimental values I'm using in this code:
// gcc -std=c99 -o multithread multithread.c -fopenmp -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int pf(unsigned int start, unsigned int limit, unsigned int q);

int main(int argc, char *argv[])
{
    printf("prime finder\n");
    int j, primes = 0;
    int slimits[4] = {1, 10000000, 20000000, 30000000};
    int elimits[4] = {10000000, 20000000, 30000000, 40000000};
    double startTime = omp_get_wtime();
    #pragma omp parallel shared(slimits, elimits, primes)
    {
        #pragma omp for
        for (j = 0; j < 4; j++)
        {
            primes += pf(slimits[j], elimits[j], atoi(argv[2]));
        }
    }
    printf("%d prime numbers found in %.2f seconds.\n\n", primes, omp_get_wtime() - startTime);
    return 0;
}
I haven't included the pf function as it is quite large, but it works on its own and returns the number of primes found. I'm sure the issue is here somewhere.
Any help would be greatly appreciated!
You have made at least one obvious (to me) and serious mistake: you've declared primes shared and allowed all the threads in the program to update it. You have thereby programmed a data race. Nothing in OpenMP (nor in C, if I recall correctly) guarantees that += will be implemented atomically. You haven't actually specified what the problem, or problems, with your program are, but this must surely be one of them.
I'll tell you how to fix this later but I think there is a more serious underlying design problem you should address first. You seem to have decided that you would have 4 threads running and that you should divide the range of integers to test for primality into 4 and pass one chunk to each thread. Sure, you can make that work but it's not a smart approach to using OpenMP. Nor is it a smart approach to dividing the work of primality testing.
A smarter approach to OpenMP program design is to start off by making no assumptions about the number of threads that will be available to the executing program. Design for any number of threads; do not design a program whose behaviour depends on the number of threads it gets at run time. Use OpenMP's facilities, specifically the schedule clause, to distribute the workload at run time.
Turning to primality testing. Draw, or at least think about, a scatter plot of points (i,t(i)), where i is an integer and t(i) is the time it takes to determine whether or not i is prime. The pattern in this plot is about as difficult to discern as the pattern in the plot of the occurrence of primes in the integers. In other words, the time to determine the primality of an integer is very unpredictable. It does tend to rise as the integers increase (well, excluding large even integers which I'm sure your test doesn't consider anyway).
One implication of this unpredictability is that if you divide a range of integers into N sub-ranges and give one sub-range to each of N threads you are not giving the threads the same amount of work to do. Indeed, in the range of integers 1..m (any m) there is one integer which takes much longer to test than any other integer in the range, and this time is the irreducible minimum that your program will take. A naive distribution of the range will produce a seriously unbalanced workload.
Here's what I think you should do to fix your program.
First, write a function which tests the primality of a single integer; this will be the basic task for your computation. Call it is_prime. Next, study the schedule clause of the parallel for construct: OpenMP provides a number of task-scheduling options which I won't explain here; you will find plenty of good documentation online. Finally, study the reduction clause as well; it provides the solution to the data race you have programmed.
Applying all this, I suggest you change
#pragma omp parallel shared(slimits, elimits, primes)
{
    #pragma omp for
    for (j = 0; j < 4; j++)
    {
        primes += pf(slimits[j], elimits[j], atoi(argv[2]));
    }
}
to
#pragma omp parallel shared(slimits, elimits, max_int_to_test)
{
    #pragma omp for reduction(+:primes) schedule(dynamic, 10)
    for (j = 3; j < max_int_to_test; j += 2)
    {
        primes += is_prime(j);
    }
}
With any luck my rudimentary C hasn't screwed up the syntax too much.
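For completeness, here is a minimal trial-division is_prime; this is my own sketch rather than anything from the original post, and like the loop above it assumes it is only ever called with odd integers >= 3:
#include <math.h>

/* Returns 1 if n is prime, 0 otherwise; assumes n is odd and >= 3.
   Link with -lm, as the question's compile line already does. */
int is_prime(unsigned int n)
{
    unsigned int d;
    unsigned int limit = (unsigned int) sqrt((double) n);
    for (d = 3; d <= limit; d += 2)
    {
        if (n % d == 0)
            return 0;
    }
    return 1;
}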

Performance decrease with threaded implementation

I implemented a small program in C to calculate PI using a Monte Carlo method (mainly out of personal interest and for training). After implementing the basic code structure, I added a command-line option that allows the calculation to be executed threaded.
I expected major speed-ups, but was disappointed. The command-line synopsis should be clear: the final number of iterations made to approximate PI is the product of the -iterations and -threads values passed via the command line. Leaving -threads blank defaults it to 1 thread, resulting in execution in the main thread.
The tests below were run with 80 million iterations in total.
On Windows 7 64Bit (Intel Core2Duo Machine):
Compiled using Cygwin GCC 4.5.3: gcc-4 pi.c -o pi.exe -O3
On Ubuntu/Linaro 12.04 (8Core AMD):
Compiled using GCC 4.6.3: gcc pi.c -lm -lpthread -O3 -o pi
Performance
On Windows, the threaded version is a few milliseconds faster than the un-threaded one. I expected better performance, to be honest. On Linux: ew! What the heck? Why does it take 2000% longer? Of course this depends heavily on the implementation, so here it is. An excerpt from after the command-line argument parsing is done and the calculation starts:
// Begin computation.
int i;
clock_t t_start, t_delta;
double pi = 0;
if (args.threads == 1) {
    t_start = clock();
    pi = pi_mc(args.iterations);
    t_delta = clock() - t_start;
}
else {
    pthread_t* threads = malloc(sizeof(pthread_t) * args.threads);
    if (!threads) {
        return alloc_failed();
    }
    struct PIThreadData* values = malloc(sizeof(struct PIThreadData) * args.threads);
    if (!values) {
        free(threads);
        return alloc_failed();
    }
    t_start = clock();
    // Spawn one worker per requested thread.
    for (i = 0; i < args.threads; i++) {
        values[i].iterations = args.iterations;
        values[i].out = 0.0;
        pthread_create(threads + i, NULL, pi_mc_threaded, values + i);
    }
    // Join the workers and accumulate their partial results.
    for (i = 0; i < args.threads; i++) {
        pthread_join(threads[i], NULL);
        pi += values[i].out;
    }
    t_delta = clock() - t_start;
    free(threads);
    threads = NULL;
    free(values);
    values = NULL;
    pi /= (double) args.threads;
}
While pi_mc_threaded() is implemented as:
struct PIThreadData {
    int iterations;
    double out;
};

void* pi_mc_threaded(void* ptr) {
    struct PIThreadData* data = ptr;
    data->out = pi_mc(data->iterations);
    return NULL; /* a void* thread function must return a value */
}
You can find the full source code at http://pastebin.com/jptBTgwr.
Question
Why is this? Why this extreme difference on Linux? I expected the amount of time taken to calculate to be at least 3/4 of the original time. It is of course possible that I simply made wrong use of the pthread library. A clarification on how to do it correctly in this case would be very nice.
The problem is that, in glibc's implementation, rand() calls __random(), and that
long int
__random ()
{
    int32_t retval;
    __libc_lock_lock (lock);
    (void) __random_r (&unsafe_state, &retval);
    __libc_lock_unlock (lock);
    return retval;
}
locks around each call to the function __random_r that does the actual work.
Thus, as soon as you have more than one thread using rand(), you make each thread wait for the other(s) on almost every call to rand(). Directly using random_r() with your own buffers in each thread should be much faster.
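As a rough illustration of the per-thread approach, here is a sketch using the simpler POSIX rand_r rather than random_r (which needs its own state-buffer initialization); the function pi_mc_unlocked and the seed parameter are my own, not part of the original program:
#include <stdlib.h>

/* Monte Carlo kernel with thread-private generator state:
   no lock is taken, so threads no longer serialize on rand(). */
double pi_mc_unlocked(int iterations, unsigned int seed)
{
    long long inner = 0;
    int i;
    for (i = 0; i < iterations; i++) {
        double x = rand_r(&seed) / (double) RAND_MAX;
        double y = rand_r(&seed) / (double) RAND_MAX;
        if (x * x + y * y <= 1.0)
            inner++;
    }
    return 4.0 * (double) inner / (double) iterations;
}
Each thread would pass a distinct seed (its index, say), which also makes runs reproducible.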
Performance and threading is a black art. The answer depends on the specifics of the compiler and the libraries used to do threading, how well the kernel handles it, etc. Basically, if your libraries for *nix are not efficient at switching, moving objects around, etc., threading will in fact be slower. This is one of the reasons a lot of us doing thread work now use the JVM or JVM-like languages: we can trust the JVM runtime's behavior; its overall speed may vary with platform, but it is consistent on that platform. In addition, you may have some hidden wait/race conditions that you uncovered just due to timing, and that may not show up on Windows.
If you are in a position to change your language, consider Scala or D. Scala is the actor-model successor to Java, and D the successor to C. Both languages show their roots: if you can write in C, D should be no problem. Both languages, however, implement the actor model, so no more thread pools and no more race conditions!
For comparison, I just tried your app on Windows Vista, compiled with Borland C++, and the 2 thread version performed nearly twice as fast as the single thread.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 12.511000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.142397
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.584000 sec
Threads: 2
That's compiled against the thread-safe run-time library. Using the single thread library, both versions run at twice their thread-safe speed.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.458000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.141314
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 3.978000 sec
Threads: 2
So the 2 thread version is still twice as fast, but the 1 thread version with the single thread library is actually faster than the 2 thread version on the thread-safe library.
Looking at Borland's rand implementation, they use thread local storage for the seed in the thread-safe implementation, so it's not going to have the same negative impact on threaded code as glibc's lock, but the thread-safe implementation will obviously be slower than the single thread implementation.
The bottom line though, is that your compiler's rand implementation is probably the main performance issue in both cases.
Update
I've just tried replacing your rand_01 calls with inline implementations of Borland's rand function using a local variable for the seed, and the results are consistently twice as fast in the 2 thread case.
The updated code looks like this:
#define MULTIPLIER 0x015a4e35L
#define INCREMENT 1

double pi_mc(int iterations) {
    unsigned seed = 1;
    long long inner = 0;
    long long outer = 0;
    int i;
    for (i = 0; i < iterations; i++) {
        /* Linear congruential generator, inlined from Borland's rand(). */
        seed = MULTIPLIER * seed + INCREMENT;
        double x = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        seed = MULTIPLIER * seed + INCREMENT;
        double y = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        double d = sqrt(pow(x, 2.0) + pow(y, 2.0));
        if (d <= 1.0) {
            inner++;
        }
        else {
            outer++;
        }
    }
    return ((double) inner / (double) iterations) * 4;
}
I don't know how good that is as rand implementations go, but it's worth at least trying on Linux to see whether it makes a difference to the performance.

GCC error: 'for' loop initial declaration used outside C99 mode

I'm getting error: 'for' loop initial declaration used outside C99 mode when I try to compile with make. I found a wiki that says
Put -std=c99 in the compilation line: gcc -std=c99 foo.c -o foo
The problem is that I don't know how to specify this in make. I opened the Makefile, found CC = gcc, and changed it to CC = gcc -std=c99, with no results. Any ideas?
Put CFLAGS=-std=c99 at the top of your Makefile.
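For example, a minimal Makefile might look like this (a sketch; the file and target names are illustrative, and the indented recipe line must begin with a tab character):
CC = gcc
CFLAGS = -std=c99

foo: foo.c
	$(CC) $(CFLAGS) -o foo foo.c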
To remove the error without using C99, you just need to declare your iterator variable at the top of the block the for loop is inside.
Instead of:
for (int i = 0; i < count; i++)
{
}
Use:
int i;
//other code
for (i = 0; i < count; i++)
{
}
Update: running make CFLAGS=-std=c99 on the command line finally worked for me.
Originally: if you have added CFLAGS=-std=c99 to the Makefile and still get the error, running make clean before make may be a good idea, so that stale object files are rebuilt.

Parallelizing Loops: Where is the unsafe dependence?

I'm trying to parallelize the following loop using the automatic parallelization options in the Solaris Studio compiler.
int max = A->m;
complex** A_me2;
complex fred;
for (i = 0; i < max; i++)
{
    for (j = 0; j < i-1; j++)
    {
        A_me2[i][j] = fred;
        A_me2[i][j] = fred;
    }
}
However, when I run this loop through the compiler I get the message "not parallelized, unsafe dependence". Where exactly is the unsafe dependence? There is clearly no aliasing between the inputs and outputs of the two assignment statements, and i and j are private to each thread... I'm extremely stumped as to why this is happening. Any guidance would be greatly appreciated!
Since A_me2 is an array of pointers, the compiler doesn't know (for example) that A_me2[0] and A_me2[1] don't overlap, leading to multiple writes to the same location that need to be ordered correctly. There is often a compiler #pragma that will tell the compiler to assume that there are no dependencies, overriding the automatic safety mechanisms.
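One way to make the independence visible to the compiler (my own sketch, not specific to Solaris Studio) is to store the matrix in a single flat, restrict-qualified block, so that no two rows can overlap; double stands in here for the question's complex type:
/* Row i starts at a[i * n]; restrict promises the block is not
   aliased, so writes at different (i, j) cannot touch the same
   location and the dependence disappears. */
void fill_lower(double * restrict a, int n, double v)
{
    int i, j;
    for (i = 0; i < n; i++)
    {
        for (j = 0; j < i - 1; j++)
        {
            a[i * n + j] = v;
        }
    }
}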
