memcpy() performance - Ubuntu x86_64 - 64-bit

I am observing some weird behavior that I am not able to explain. Here are the details:
#include <sched.h>
#include <sys/resource.h>
#include <time.h>
#include <iostream>

void memcpy_test() {
    int size = 32*4;
    char* src  = new char[size];
    char* dest = new char[size];
    // general_utility::ProcessTimer tmr;   // project-specific timer, not needed for this test
    unsigned int num_cpy = 1024*1024*16;
    struct timespec start_time__, end_time__;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
    for(unsigned int i=0; i < num_cpy; ++i) {
        __builtin_memcpy(dest, src, size);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time__);   // fixed: record the end time, not the start time again
    double elapsed_ns = (end_time__.tv_sec - start_time__.tv_sec) * 1e9
                      + (end_time__.tv_nsec - start_time__.tv_nsec);
    std::cout << "time = " << elapsed_ns / num_cpy << std::endl;
    delete [] src;
    delete [] dest;
}
When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that? If anything, I would expect -march=native to produce optimized code. Are there other functions which could show this type of behavior?
EDIT 1:
Another interesting point is that if size > 32*4 then there is no difference between the run times of the binaries thus generated.
EDIT 2:
Here is the detailed performance analysis of __builtin_memcpy():
size = 32 * 4: without -march=native - 7.5 ns, with -march=native - 19.3 ns
size = 32 * 8: without -march=native - 26.3 ns, with -march=native - 26.5 ns
EDIT 3:
This observation does not change even if I allocate int64_t/int32_t.
EDIT 4 :
size = 8192: without -march=native ~ 2750 ns, with -march=native ~ 2750 ns (earlier there was an error in reporting this number; it was wrongly written as 26.5, now it is correct).
I have run these many times and numbers are consistent across each run.

I have replicated your findings with: g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, Linux 2.6.38-10-generic #46-Ubuntu x86_64 on my Core 2 Duo. Results will probably vary depending on your compiler version and CPU. I get ~26 and ~9.
When I specify -march=native in compiler options, generated binary runs 2.7 times slower. Why is that ?
Because the -march=native version gets compiled into (found using objdump -D; you could also use gcc -S -fverbose-asm):
rep movsq %ds:(%rsi),%es:(%rdi) ; where rcx = 128 / 8
And the version without gets compiled into 16 load/store pairs like:
mov 0x20(%rbp),%rdx
mov %rdx,0x20(%rbx)
Which apparently is faster on our computers.
If anything, I would expect -march=native to produce optimized code.
In this case it turned out to be a pessimization to favor rep movsq over a series of moves, but that might not always be the case. The first version is shorter, which might be better in some (most?) cases. Or it could be a bug in the optimizer.
Is there other functions which could show this type of behavior ?
Any function for which the generated code differs when you specify -march=native. Suspects include functions implemented as macros or as static functions in headers, and anything with a name beginning with __builtin. Possibly also (floating-point) math functions.
Another interesting point is that if size > 32*4 then there is no difference between the run time of the binaries thus generated
This is because then they both compile to rep movsq; 128 is probably the largest size for which GCC will generate a series of load/stores (it would be interesting to see whether this also holds on other platforms). BTW, when the compiler doesn't know the size at compile time (e.g. int size = atoi(argv[1]);), then it simply turns into a call to memcpy, with or without the switch.
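For illustration (a sketch, not from the original answer), here is a variant where the size is only known at run time, which should compile to a plain call to the library memcpy whether or not -march=native is used:
#include <cstring>
#include <cstdlib>

// Size only known at run time: GCC cannot expand this inline, so it emits a
// call to memcpy, with or without -march=native.
void copy_runtime(char* dest, const char* src, const char* size_arg) {
    int size = std::atoi(size_arg);   // e.g. argv[1], as in the answer above
    std::memcpy(dest, src, size);
}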

It's a well-known issue (and a really old one).
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
Look at one of the bottom comments in the bug report:
"Just FYI: mesa is now defaulting to -fno-builtin-memcmp to workaround this
problem"
Looks like glibc's memcpy is far better than the builtin...

Related

Are there compatibility issues with clang-cl and arch:avx2?

I'm using Windows 10, Visual Studio 2019, Platform: x64 and have the following test script in a single-file Visual Studio Solution:
#include <iostream>
#include <intrin.h>
using namespace std;

int main() {
    unsigned __int64 mask = 0x0fffffffffffffff; //1152921504606846975;
    unsigned long index;
    _BitScanReverse64(&index, mask);
    if (index != 59) {
        cout << "Fails!" << endl;
        return EXIT_FAILURE;
    }
    else {
        cout << "Success!" << endl;
        return EXIT_SUCCESS;
    }
}
In my solution properties I've set 'Enable Enhanced Instruction Set' to 'Advanced Vector Extensions 2 (/arch:AVX2)'.
When compiling with msvc (setting 'Platform Toolset' to 'Visual Studio 2019 (v142)') the code returns EXIT_SUCCESS, but when compiling with clang-cl (setting 'Platform Toolset' to 'LLVM (clang-cl)') I get EXIT_FAILURE. When debugging the clang-cl run, the value of index is 4, when it should be 59. This suggests to me that clang-cl is reading the bits in the opposite direction of MSVC.
This isn't the case when I set 'Enable Enhanced Instruction Set' to 'Not Set'. In this scenario, both MSVC and clang-cl return EXIT_SUCCESS.
All of the DLLs that are loaded and shown in the Debug Output window come from C:\Windows\System32###.dll in all cases.
Does anyone understand this behavior? I would appreciate any insight here.
EDIT: I failed to mention earlier: I compiled this on an Intel Core i7-3930K CPU @ 3.20GHz.
Getting 4 instead of 59 sounds like clang implemented _BitScanReverse64 as 63 - lzcnt. Actual bsr is slow on AMD, so yes, there are reasons why a compiler would want to compile a BSR intrinsic to a different instruction.
But then you ran the executable on a computer that doesn't actually support BMI so lzcnt decoded as rep bsr = bsr, giving the leading-zero count instead of the bit-index of the highest set bit.
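To make those numbers concrete, here is a small illustration in plain C++ (no intrinsics) of why 63 - lzcnt and bsr agree for the mask in the question:
#include <cstdint>
#include <iostream>

int main() {
    std::uint64_t mask = 0x0fffffffffffffff;
    // Count leading zeros by shifting until the top bit is set (mask is non-zero).
    int lz = 0;
    for (std::uint64_t m = mask; !(m & (1ULL << 63)); m <<= 1) ++lz;
    int highest = 63 - lz;   // index of the highest set bit, i.e. what bsr returns
    std::cout << "lzcnt = " << lz << ", bsr = " << highest << "\n";  // prints lzcnt = 4, bsr = 59
}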
AFAIK, all CPUs that have AVX2 also have BMI. If your CPU doesn't have that, you shouldn't expect your executables built with /arch:AVX2 to run correctly on your CPU. And in this case the failure mode wasn't an illegal instruction, it was lzcnt running as bsr.
MSVC doesn't generally optimize intrinsics, apparently including this case, so it just uses bsr directly.
Update: i7-3930K is SandyBridge-E. It doesn't have AVX2, so that explains your results.
clang-cl doesn't error when you tell it to build an AVX2 executable on a non-AVX2 computer. The use-case for that would be compiling on one machine to create an executable to run on different machines.
It also doesn't add CPUID-checking code to your executable for you. If you want that, write it yourself. This is C++, it doesn't hold your hand.
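If you do want such a check, here is a minimal sketch of the kind of thing you could write yourself using the documented __cpuid/__cpuidex intrinsics from <intrin.h> (the helper name is mine; a complete AVX2 check would also verify OS support for YMM state via OSXSAVE/XGETBV, which this sketch skips):
#include <intrin.h>
#include <iostream>

static bool has_avx2_and_bmi1()
{
    int regs[4] = {0};
    __cpuid(regs, 0);
    if (regs[0] < 7)            // max basic leaf must be at least 7
        return false;
    __cpuidex(regs, 7, 0);      // leaf 7, subleaf 0: EBX bit 5 = AVX2, bit 3 = BMI1
    bool avx2 = (regs[1] & (1 << 5)) != 0;
    bool bmi1 = (regs[1] & (1 << 3)) != 0;
    return avx2 && bmi1;
}

int main() {
    std::cout << (has_avx2_and_bmi1() ? "AVX2+BMI1 present\n" : "AVX2+BMI1 missing\n");
}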
target CPU options
MSVC-style /arch options are much more limited than normal GCC/clang style. There aren't any for different levels of SSE like SSE4.1; it jumps straight to AVX.
Also, /arch:AVX2 apparently implies BMI1/2, even though those are different instruction-sets with different CPUID feature bits. In kernel code for example you might want integer BMI instructions but not SIMD instructions that touch XMM/YMM registers.
clang -O3 -mavx2 would not also enable -mbmi. You normally would want that, but if you failed to also enable BMI then clang would have been stuck using bsr. (Which is actually better for Intel CPUs than 63-lzcnt). I think MSVC's /arch:AVX2 is something like -march=haswell, if it also enables FMA instructions.
And nothing in MSVC has any support for making binaries optimized to run on the computer you build them on. That makes sense, it's designed for a closed-source binary-distribution model of software development.
But GCC and clang have -march=native to enable all the instruction sets your computer supports. And also importantly, set tuning options appropriate for your computer. e.g. don't worry about making code that would be slow on an AMD CPU, or on older Intel, just make asm that's good for your CPU.
TL:DR: CPU selection options in clang-cl are very coarse, lumping non-SIMD extensions in with some level of AVX. That's why /arch:AVX2 enabled integer BMI extension, while clang -mavx2 would not have.

clang miss assembler error?

It seems to me that clang++ misses errors in assembler code that g++ picks up. Or am I missing some compiler flag for clang? I'm new to assembler code.
Using clang++ I have compiled and linked my application error- and warning-free, yet I have had nasty segmentation faults. Switching to g++, on the other hand, I got these errors:
GO_F_ImageColourConversion.cpp: Assembler messages:
GO_F_ImageColourConversion.cpp:4679: Error: `(%rsi,%edx,2)' is not a valid base/index expression
GO_F_ImageColourConversion.cpp:4682: Error: `(%rcx,%edx,1)' is not a valid base/index expression
I am using these compiler flags:
-DLINUX -g -Wno-deprecated -D_GNU_SOURCE -D_REENTRANT -D__STDC_CONSTANT_MACROS -fPIC -fPIE
I have the following code (omitting irrelevant parts):
Ipp8u * pSrc;
Ipp8u * pDst;
int x, y;

asm volatile
(
    "movl (%1, %0, 2), %%eax;\n"
    "shlw $8, %%ax;\n"
    "shrl $8, %%eax;\n"
    "movw %%ax, (%2, %0, 1);\n"
    : /* no output */
    : "r" (x), "r" (pSrc), "r" (pDst)
    : "eax", "memory");
From looking at this answer on SO, I realized I had a 32/64-bit issue (I am porting to 64-bit). The Ipp8u* is 8 bytes but an int is only 4 bytes on my machine.
Changing the int to uintptr_t x, y; seems to fix the issue. Why does clang not give an error on compile?
gcc and clang both choke on your code for me:
6 : error: base register is 64-bit, but index register is not
"movl (%1, %0, 2), %%eax\n"
^
<inline asm>:1:13: note: instantiated into assembly here
movl (%rdi, %edx, 2), %eax
From clang 3.8 on the godbolt compiler explorer, with a function wrapped around it so it's testable, which you failed to provide in the question. Are you sure your clang was building 64bit code? (-m64, not -m32 or -mx32).
Provide a link to your code on godbolt with some version of clang silently mis-compiling it, otherwise all I can say for your actual question is just "can't reproduce".
And yes, your problem is that x is an int, so you get mixed register sizes in the addressing mode. (%rsi,%edx,2) isn't encodable.
Using %q0 to get %rdx doesn't guarantee that there isn't garbage in the high 32 bits of the register (although it's highly unlikely). Instead, you could use "r" ((int64_t)x) to sign-extend x to 64 bits.
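A minimal sketch of that fix, wrapping the fragment from the question in a hypothetical helper function (Ipp8u is just unsigned char) and casting the index operand to int64_t so the addressing mode gets a 64-bit index register:
#include <stdint.h>

static void fragment_fixed(const unsigned char *pSrc, unsigned char *pDst, int x)
{
    asm volatile
    (
        "movl (%1, %0, 2), %%eax\n\t"
        "shlw $8, %%ax\n\t"
        "shrl $8, %%eax\n\t"
        "movw %%ax, (%2, %0, 1)\n\t"
        : /* no output */
        : "r" ((int64_t)x), "r" (pSrc), "r" (pDst)   // 64-bit index -> (%rsi,%rdx,2) is encodable
        : "eax", "memory"
    );
}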
Why do you need inline asm at all? How bad is the compiler output for your C version of this?
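For comparison, here is a plain-C sketch of what the fragment appears to compute, if I am reading the addressing mode and shifts correctly (byte indices assume little-endian x86):
#include <stdint.h>

// Hypothetical plain-C equivalent of the asm fragment: keep bytes pSrc[2*x] and
// pSrc[2*x + 2] and store them as a 16-bit pair at pDst[x] (i.e. keep every other byte).
static void fragment_in_c(const uint8_t *pSrc, uint8_t *pDst, int x)
{
    pDst[x]     = pSrc[2*x];
    pDst[x + 1] = pSrc[2*x + 2];
}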
If you do want to use inline asm, this is much better:
uint32_t asm_tmp = *(uint32_t *)(x*2 + (char*)pSrc); // I think I've reproduced the same pointer math as the addressing mode you used.
asm ( "shlw $8, %w[v]\n\t"   // e.g. ax
      "shrl $8, %k[v]\n\t"   // e.g. eax. potential partial-register slowdown from reading eax after writing ax on older CPUs
      : [v] "+&r" (asm_tmp)
    );
*(uint16_t *)(x + (char*)pDst) = asm_tmp; // store the low 16
This compiles nicely with clang, but gcc is kinda braindead about generating the address. Maybe with a different expression for the addresses?
Your code was defeating the purpose of constraints by starting with a load and ending with a store. Always let the compiler handle as much as possible. It's possible you'd get better code from this without inline asm, and the compiler would understand what it does and could potentially auto-vectorize or do other transformations. Removing the need for the asm statement to be volatile with a "memory" clobber is already a big improvement for the optimizer: Now it's a pure function that the compiler knows just transforms one register.
Also see the end of this answer for more guides to writing inline asm that doesn't suck.

How to write c++ code that the compiler can efficiently compile to SSE or AVX?

Let's say I have a function written in c++ that performs matrix vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions because it does not know the alignment of the passed pointer (requiring a 16 byte alignment for SSE or 32 byte alignment for AVX) at compile time? Or is the memory alignment of the data irrelevant for optimal SIMD code and the data alignment will only affect cache performance?
If alignment is important for the generated code, how can I let the (visual c++) compiler know that I intend to only pass values with a certain alignment to the function?
In theory alignment should not matter on Intel processors since Nehalem. Therefore, your compiler should be able to produce code in which a pointer being aligned or not is not an issue.
Unaligned load/store instructions have the same performance on Intel processors since Nehalem. However, until AVX arrived with Sandy Bridge unaligned loads could not be folded with another operation for micro-op fusion.
Additionally, even before AVX, having 16-byte-aligned memory could still be helpful to avoid the penalty of cache-line splits, so it would still be reasonable for a compiler to add code until the pointer is 16-byte aligned.
Since AVX there is no advantage to using aligned load/store instructions anymore, and there is no reason for a compiler to add code to make a pointer 16-byte or 32-byte aligned.
However, there is still a reason to use aligned memory to avoid cache-line splits with AVX. Therefore, it would be reasonable for a compiler to add code to make the pointer 32-byte aligned even if it still used an unaligned load instruction.
So in practice some compilers produce much simpler code when they are told to assume that a pointer is aligned.
I'm not aware of a method to tell MSVC that a pointer is aligned. With GCC and Clang (since 3.6) you can use the built-in __builtin_assume_aligned. With ICC and also GCC you can use #pragma omp simd aligned. With ICC you can also use __assume_aligned.
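For reference, here is a sketch of what the #pragma omp simd aligned form mentioned above would look like on a loop of this kind (hypothetical function name; it needs OpenMP 4.0 support, e.g. GCC's -fopenmp or -fopenmp-simd):
void foo_omp(float * __restrict a, float * __restrict b, int n)
{
    // Tell the vectorizer that a and b are 16-byte aligned (OpenMP 4.0 simd clause).
    #pragma omp simd aligned(a, b : 16)
    for (int i = 0; i < (n & (-4)); i++) {
        b[i] = 3.14159f * a[i];
    }
}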
For example with GCC compiling this simple loop
void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 16);
    //b = (float*)__builtin_assume_aligned (b, 16);
    for(int i=0; i<(n & (-4)); i++) {
        b[i] = 3.14159f*a[i];
    }
}
with gcc -O3 -march=nehalem -S test.c and then wc test.s gives 160 lines, whereas if I use __builtin_assume_aligned then wc test.s gives only 45 lines. When I did this with Clang, in both cases it returned 110 lines.
So with Clang, informing the compiler that the arrays were aligned made no difference (in this case), but with GCC it did. Counting lines of code is not a sufficient metric to gauge performance, but I'm not going to post all the assembly here; I just want to illustrate that your compiler may produce very different code when it is told the arrays are aligned.
Of course, the additional overhead that GCC has for not assuming the arrays are aligned may make no difference in practice. You have to test and see.
In any case, if you want to get the most from SIMD I would not rely on the compiler to do it correctly (especially with MSVC). Your example of matrix*vector is a poor one in general (but maybe not for some special cases) since it's memory-bandwidth bound. But if you choose matrix*matrix, no compiler is going to optimize that well without a lot of help which does not conform to the C++ standard. In these cases you will need intrinsics/built-ins/assembly, in which you have explicit control of the alignment anyway.
Edit:
The assembly from GCC contains a lot of extraneous lines which are not part of the text segment. Doing gcc -O3 -march=nehalem -S test.c and then using objdump -d and counting the lines in the text (code) segment gives 108 lines without using __builtin_assume_aligned and only 16 lines with it. This shows more clearly that GCC produces very different code when it assumes the arrays are aligned.
Edit:
I went ahead and tested the foo function above in MSVC 2013. It produces unaligned loads and the code is much shorter than GCC (I only show the main loop here):
$LL3@foo:
    movsxd  rax, r9d
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rax*4]
    vmovups XMMWORD PTR [r11+rax*4], xmm1
    lea     eax, DWORD PTR [r9+4]
    add     r9d, 8
    movsxd  rcx, eax
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rcx*4]
    vmovups XMMWORD PTR [r11+rcx*4], xmm1
    cmp     r9d, edx
    jl      SHORT $LL3@foo
This should be fine on processors since Nehalem (late 2008). But MSVC still has cleanup code for arrays that are not a multiple of four, even though I told the compiler that the count was a multiple of four ((n & (-4))). At least GCC gets that right.
Since AVX can fold unaligned loads I checked GCC with AVX to see if the code was the same.
void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 32);
    //b = (float*)__builtin_assume_aligned (b, 32);
    for(int i=0; i<(n & (-8)); i++) {
        b[i] = 3.14159f*a[i];
    }
}
Without __builtin_assume_aligned GCC produces 168 lines of assembly, and with it, it only produces 17 lines.
My original answer became too messy to edit so I am adding a new answer here and making my original answer community wiki.
I did some tests using aligned and unaligned memory on a pre Nehalem system and on a Haswell system with GCC, Clang, and MSVC.
The assembly shows that only GCC adds code to check and fix alignment. Due to this with __builtin_assume_aligned GCC produces much simpler code. But using __builtin_assume_aligned with Clang only changes unaligned instructions to aligned (the number of instructions stay the same). MSVC just uses unaligned instructions.
The performance result is that on pre-Nehalem systems, Clang and MSVC are much slower than GCC with auto-vectorization when the memory is not aligned.
But the penalty for cache-line splits is small since Nehalem. It turns out the extra code GCC adds to check and align the memory more than makes up for the small penalty due to cache-line splits. This explains why neither Clang nor MSVC worry about cache-line splits with vectorization.
So my original claim that auto-vectorization does not need to know about the alignment is more or less correct since Nehalem. That's not the same thing as saying that aligning memory is not useful since Nehalem.
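As a side note (not from the answer itself), if you want to reproduce such aligned-vs-unaligned tests, one way to obtain 32-byte-aligned buffers is C++11 alignas for static storage or posix_memalign for heap memory on Linux (MSVC has _aligned_malloc for the same purpose); a minimal sketch:
#include <cstdlib>
#include <cstdio>

int main() {
    const int n = 1024;
    alignas(32) static float a[1024];          // 32-byte-aligned static buffer
    float* b = nullptr;                        // 32-byte-aligned heap buffer
    if (posix_memalign(reinterpret_cast<void**>(&b), 32, n * sizeof(float)) != 0)
        return 1;
    std::printf("%p %p\n", static_cast<void*>(a), static_cast<void*>(b));
    std::free(b);
    return 0;
}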

Pthreads & Multicore compiler

I'm working with a kernel that supports SMP: Snapgear 2.6.21.
I have created 4 threads in my C application, and I am trying to set thread 1 to run on CPU 1, thread 2 on CPU 2, etc.
However, the compiler sparc-linux-gcc does not recognize these functions:
CPU_SET (int cpu, cpu_set_t * set);
CPU_ZERO (cpu_set_t * set);
and this type: cpu_set_t
It always gives me these errors:
implicit declaration of function 'CPU_ZERO'
implicit declaration of function 'CPU_SET'
'cpu_set_t' undeclared (first use in this function)
Here is my code to bind active thread to processor 0:
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);   // bind processor 0
sched_setaffinity(0, sizeof(mask), &mask);
I have included and defined at the top:
#define _GNU_SOURCE
#include <sched.h>
But I always get the same errors. Can you help me, please?
You should read sched_setaffinity(2) carefully and test its result (and display errno on failure, e.g. with perror).
Actually, I believe you should use pthread_setaffinity_np(3) instead (and of course test its failure, etc...)
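A minimal sketch of that approach (the helper name is mine; it assumes a glibc recent enough to provide pthread_setaffinity_np, which may not be true of a 2.6.21-era Snapgear toolchain):
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to CPU 0 and report any failure, as suggested above. */
static void pin_self_to_cpu0(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    int err = pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %s\n", strerror(err));
}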
Even more, I believe that you should not bother to explicitly set the affinity. Recent Linux kernels are often quite good at dispatching running threads on different CPUs.
So simply use pthreads and don't bother about affinity, unless you see actual issues when benchmarking.
BTW, passing the -H flag to your GCC (cross-)compiler could be helpful. It shows you the included files. Perhaps also look at the preprocessed form obtained with gcc -C -E; it looks like some header files are missing or not found (maybe a missing -I include-directory at compilation time, or some missing headers on your development system).
BTW, your kernel version looks ancient. Can't you upgrade your kernel to something newer (3.15.x or some 3.y)?

Unexpectedly good performance with openmp parallel for loop

I have edited my question after previous comments (especially @Zboson's) for better readability.
I have always acted on, and observed, the conventional wisdom that the number of openmp threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with Intel Core i7 4960HQ, 4 cores - 8 threads. (See Intel docs here)
Here is my test code:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 256*8192*100;
    double *A, *B;
    posix_memalign((void**)&A, 64, n*sizeof(double));
    posix_memalign((void**)&B, 64, n*sizeof(double));
    for (int i = 0; i < n; ++i) {
        A[i] = 0.1;
        B[i] = 0.0;
    }
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        B[i] = exp(A[i]) + sin(B[i]);
    }
    double end = omp_get_wtime();
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += B[i];
    }
    printf("%g %g\n", end - start, sum);
    return 0;
}
When I compile it using gcc 4.9-4.9-20140209, with the command: gcc -Ofast -march=native -std=c99 -fopenmp -Wa,-q I see the following performance as I change OMP_NUM_THREADS [the points are an average of 5 runs, the error bars (which are hardly visible) are the standard deviations]:
The plot is clearer when shown as the speed up with respect to OMP_NUM_THREADS=1:
The performance more or less monotonically increases with thread number, even when the number of OpenMP threads greatly exceeds the core and also the hyper-thread count! Usually the performance should drop off when too many threads are used (at least in my previous experience), due to the threading overhead. Especially as the calculation should be CPU (or at least memory) bound and not waiting on I/O.
Even more weirdly, the speed-up is 35 times!
Can anyone explain this?
I also tested this with much smaller arrays 8192*4, and see similar performance scaling.
In case it matters, I am on Mac OS 10.9, and the performance data were obtained by running (under bash):
for i in {1..128}; do
    for k in {1..5}; do
        export OMP_NUM_THREADS=$i;
        echo -ne $i $k "";
        ./a.out;
    done;
done > out
EDIT: Out of curiosity I decided to try much larger numbers of threads. My OS limits this to 2000. The odd results (both speed up and low thread overhead) speak for themselves!
EDIT: I tried @Zboson's latest suggestion in their answer, i.e. putting VZEROUPPER before each math function within the loop, and it did fix the scaling problem! (It also sped the single-threaded code up from 22 s to 2 s!):
The problem is likely due to the clock() function. It does not return the wall time on Linux. You should use the function omp_get_wtime(). It's more accurate than clock and works on GCC, ICC, and MSVC. In fact I use it for timing code even when I'm not using OpenMP.
I tested your code with it here
http://coliru.stacked-crooked.com/a/26f4e8c9fdae5cc2
Edit: Another thing to consider which may be causing your problem is that the exp and sin functions you are using are compiled WITHOUT AVX support, while your code is compiled with AVX support (actually AVX2). You can see this on GCC explorer with your code if you compile with -fopenmp -mavx2 -mfma. Whenever you call a function without AVX support from code with AVX, you need to zero the upper part of the YMM registers or pay a large penalty. You can do this with the intrinsic _mm256_zeroupper (VZEROUPPER). Clang does this for you, but last I checked GCC does not, so you have to do it yourself (see the comments to this question Math functions takes more cycles after running any intel AVX function and also the answer here Using AVX CPU instructions: Poor performance without "/arch:AVX"). So every iteration you have a large delay due to not calling VZEROUPPER. I'm not sure why this matters with multiple threads, but if GCC does this each time it starts a new thread then it could help explain what you are seeing.
#include <immintrin.h>

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    _mm256_zeroupper();
    B[i] = sin(B[i]);
    _mm256_zeroupper();
    B[i] += exp(A[i]);
}
Edit: A simpler way to test this is, instead of compiling with -march=native, to not set the arch at all (gcc -Ofast -std=c99 -fopenmp -Wa,-q) or to just use SSE2 (gcc -Ofast -msse2 -std=c99 -fopenmp -Wa,-q).
Edit: GCC 4.8 has an option, -mvzeroupper, which may be the most convenient solution.
This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.
