clang miss assembler error? - linux

It seems to me that clang++ misses errors in assembler code that g++ picks up. Or am I missing some compiler flag for clang? I'm new to assembler code.
Using clang++ I compiled and linked my application free of errors and warnings, yet I got nasty segmentation faults. Switching to g++, on the other hand, I got these errors:
GO_F_ImageColourConversion.cpp: Assembler messages:
GO_F_ImageColourConversion.cpp:4679: Error: `(%rsi,%edx,2)' is not a valid base/index expression
GO_F_ImageColourConversion.cpp:4682: Error: `(%rcx,%edx,1)' is not a valid base/index expression
I am using these compiler flags:
-DLINUX -g -Wno-deprecated -D_GNU_SOURCE -D_REENTRANT -D__STDC_CONSTANT_MACROS -fPIC -fPIE
I have the following code (omitting irrelevant parts):
Ipp8u * pSrc;
Ipp8u * pDst;
int x, y;
asm volatile
(
"movl (%1, %0, 2), %%eax;\n"
"shlw $8, %%ax;\n"
"shrl $8, %%eax;\n"
"movw %%ax, (%2, %0, 1);\n"
: /* no output */
: "r" (x), "r" (pSrc), "r" (pDst)
: "eax", "memory");
}
From looking at this answer on SO, I realized I had a 32/64-bit issue (I am porting to 64-bit). The Ipp8u* pointer is 8 bytes but int is only 4 bytes on my machine.
Changing the int to uintptr_t x, y; seems to fix the issue. Why does clang not give an error at compile time?

gcc and clang both choke on your code for me:
6 : error: base register is 64-bit, but index register is not
"movl (%1, %0, 2), %%eax\n"
^
<inline asm>:1:13: note: instantiated into assembly here
movl (%rdi, %edx, 2), %eax
From clang 3.8 on the Godbolt compiler explorer, with a function wrapped around it so it's testable, which you failed to provide in the question. Are you sure your clang was building 64-bit code? (-m64, not -m32 or -mx32).
Provide a link to your code on godbolt with some version of clang silently mis-compiling it, otherwise all I can say for your actual question is just "can't reproduce".
And yes, your problem is that x is an int: mixing register sizes in an addressing mode isn't allowed, so (%rsi,%edx,2) isn't encodable.
Using %q0 to get %rdx doesn't guarantee that there isn't garbage in the high 32 bits of the register (although it's highly unlikely). Instead, you could use "r" ((int64_t)x) to sign-extend x to 64 bits.
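For example, here's a minimal sketch of the question's asm with the index operand sign-extended. The wrapper function is my own addition for illustration, and it assumes Ipp8u is a plain unsigned 8-bit type:

#include <cstdint>

// Hypothetical wrapper, not from the original question: same asm, but the
// index is cast to int64_t so the addressing mode uses a 64-bit register.
void swap_and_pack(const uint8_t *pSrc, uint8_t *pDst, int x)
{
    asm volatile
    (
        "movl (%1, %0, 2), %%eax\n\t"
        "shlw $8, %%ax\n\t"
        "shrl $8, %%eax\n\t"
        "movw %%ax, (%2, %0, 1)\n\t"
        : /* no output */
        : "r" ((int64_t)x), "r" (pSrc), "r" (pDst)
        : "eax", "memory");
}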
Why do you need inline asm at all? How bad is the compiler output for your C version of this?
If you do want to use inline asm, this is much better:
uint32_t asm_tmp = *(uint32_t *)(x*2 + (char*)pSrc); // I think I've reproduced the same pointer math as the addressing mode you used.
asm ( "shlw $8, %w[v]\n\t" // e.g. ax
"shrl $8, %k[v]\n\t" // e.g. eax. potential partial-register slowdown from reading eax after writing ax on older CPUs
: [v] "+&r" (asm_tmp)
);
*(uint16_t *)(x + (char*)pDst) = asm_tmp; // store the low 16
This compiles nicely with clang, but gcc is kinda braindead about generating the address. Maybe with a different expression for the addresses?
Your code was defeating the purpose of constraints by starting with a load and ending with a store. Always let the compiler handle as much as possible. It's possible you'd get better code from this without inline asm, and the compiler would understand what it does and could potentially auto-vectorize or do other transformations. Removing the need for the asm statement to be volatile with a "memory" clobber is already a big improvement for the optimizer: Now it's a pure function that the compiler knows just transforms one register.
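As a rough illustration, here's a pure-C sketch of what the asm appears to compute (if I'm reading the byte shuffling right, it keeps the low byte of each of two adjacent 16-bit source elements). Treat it as an assumption to verify against your data, not the original author's code:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical pure-C equivalent (little-endian x86), for comparison only.
static inline void pack_low_bytes(const uint8_t *pSrc, uint8_t *pDst, size_t x)
{
    uint32_t v;
    std::memcpy(&v, pSrc + 2 * x, sizeof v);                      // movl (%1,%0,2), %eax
    uint16_t out = (uint16_t)((v & 0xFF) | ((v >> 8) & 0xFF00));  // shlw $8,%ax ; shrl $8,%eax
    std::memcpy(pDst + x, &out, sizeof out);                      // movw %ax, (%2,%0,1)
}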
Also see the end of this answer for more guides to writing inline asm that doesn't suck.

Related

Is there any way to compile Microsoft-style inline-assembly code on a Linux platform?

As mentioned in the title, I'm wondering whether there is any way to compile Microsoft-style inline-assembly code (as shown below) on a Linux OS (e.g. Ubuntu).
_asm{
mov edi, A;
....
EMMS;
}
The sample code is part of an inline-assembly routine which compiles successfully on Win10 with the cl.exe compiler. Is there any way to compile it on Linux? Do I have to rewrite it in GNU C/C++ style (i.e. __asm__{;;;})?
First of all, you should usually replace inline asm (with intrinsics or pure C) instead of porting it. https://gcc.gnu.org/wiki/DontUseInlineAsm
clang -fasm-blocks is mostly compatible with MSVC's inefficient inline asm syntax. But it doesn't support returning a value by leaving it in EAX and then falling off the end of a non-void function.
So you have to write inline asm that puts the value in a named C variable and return that, typically leading to an extra store/reload making MSVC syntax even worse. (Pretty bad unless you're writing a whole loop in asm that amortizes that store/reload overhead of getting data into / out of the asm block). See What is the difference between 'asm', '__asm' and '__asm__'? for a comparison of how inefficient MSVC inline-asm is when wrapping a single instruction. It's less dumb inside functions with stack args when those functions don't inline, but that only happens if you're already making things inefficient (e.g. using legacy 32-bit calling conventions and not using link-time optimization to inline small functions).
MSVC can substitute A with an immediate 1 when inlining into a caller, but clang can't. Both defeat constant-propagation but MSVC at least avoids bouncing constant inputs through a store/reload. (As long as you only use it with instructions that can support an immediate source operand.)
Clang accepts __asm, asm, or __asm__ to introduce an asm-block. MSVC accepts __asm (2 underscores like clang) or _asm (more commonly used, but clang doesn't accept it).
So for existing MSVC code you probably want #define _asm __asm so your code can compile with both MSVC and clang, unless you need to make separate versions anyway. Or use clang -D_asm=asm to set a CPP macro on the command line.
Example: compile with MSVC or with clang -fasm-blocks
(Don't forget to enable optimization: clang -fasm-blocks -O3 -march=native -flto -Wall. Omit or modify -march=native if you want a binary that can run on earlier/other CPUs than your compile host.)
int a_global;
inline
long foo(int A, int B, int *arr) {
int out;
// You can't assume A will be in RDI: after inlining it prob. won't be
__asm {
mov ecx, A // comment syntax
add dword ptr [a_global], 1
mov out, ecx
}
return out;
}
Compiling with x86-64 Linux clang 8.0 on Godbolt shows that clang can inline the wrapper function containing the inline-asm, and how much store/reload MSVC syntax entails (vs. GNU C inline asm which can take inputs and outputs in registers).
I'm using clang in Intel-syntax asm output mode, but it also compiles Intel-syntax asm blocks when it's outputting in AT&T syntax mode. (Normally clang compiles straight to machine-code anyway, which it also does correctly.)
## The x86-64 System V ABI passes args in rdi, rsi, rdx, ...
# clang -O3 -fasm-blocks -Wall
foo(int, int, int*):
mov dword ptr [rsp - 4], edi # compiler-generated store of register arg to the stack
mov ecx, dword ptr [rsp - 4] # start of inline asm
add dword ptr [rip + a_global], 1
mov dword ptr [rsp - 8], ecx # end of inline asm
movsxd rax, dword ptr [rsp - 8] # reload `out` with sign-extension to long (64-bit) : compiler-generated
ret
Notice how the compiler substituted [rsp - 4] and [rsp - 8] for the C local variables A and out in the asm source block, and that a variable in static storage gets RIP-relative addressing. GNU C inline asm doesn't do this; you need to declare %[name] operands and use constraints to tell the compiler where to put them.
We can even see clang inline that function twice into one caller, and optimize away the sign-extension to 64-bit because this function only returns int.
int caller() {
return foo(1, 2, nullptr) + foo(1, 2, nullptr);
}
caller(): # #caller()
mov dword ptr [rsp - 4], 1
mov ecx, dword ptr [rsp - 4] # first inline asm
add dword ptr [rip + a_global], 1
mov dword ptr [rsp - 8], ecx
mov eax, dword ptr [rsp - 8] # compiler-generated reload
mov dword ptr [rsp - 4], 1 # and store of A=1 again
mov ecx, dword ptr [rsp - 4] # second inline asm
add dword ptr [rip + a_global], 1
mov dword ptr [rsp - 8], ecx
add eax, dword ptr [rsp - 8] # compiler-generated reload
ret
So we can see that just reading A from inline asm creates a missed-optimization: the compiler stores a 1 again even though the asm only read that input without modifying it.
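For contrast, here's a hedged sketch (my own example, not part of this answer) of roughly the same operation in GNU C extended asm, where A arrives in a register and out leaves in a register, so no store/reload is needed:

int a_global;

static inline int foo_gnu(int A)
{
    int out;
    asm ("addl $1, %[glob]\n\t"     // increment the global in memory
         "movl %[a], %[out]"        // copy the input register to the output register
         : [out] "=r" (out), [glob] "+m" (a_global)
         : [a] "r" (A));
    return out;
}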
I haven't done tests like assigning to or reading a_global before/between/after the asm statements to make sure the compiler "knows" that variable is modified by the asm statement.
I also haven't tested passing a pointer into an asm block and looping over the pointed-to array, to see if it's like a "memory" clobber in GNU C inline asm. I'd assume it is.
My Godbolt link also includes an example of falling off the end of a non-void function with a value in EAX. That's supported by MSVC, but is UB like usual for clang and breaks when inlining into a caller. (Strangely with no warning, even at -Wall). You can see how x86 MSVC compiles it on my Godbolt link above.
https://gcc.gnu.org/wiki/DontUseInlineAsm
Porting MSVC asm to GNU C inline asm is almost certainly the wrong choice. Compiler support for optimizing intrinsics is very good, so you can usually get the compiler to generate good-quality efficient asm for you.
If you're going to do anything to existing hand-written asm, usually replacing it with pure C will be the most efficient, and certainly the most future-proof, path forward. Code that can auto-vectorize to wider vectors in the future is always good. But if you do need to manually vectorize for some tricky shuffling, then intrinsics are the way to go unless the compiler makes a mess of it somehow.
Look at the compiler-generated asm you get from intrinsics to make sure it's as good or better than the original.
If you're using MMX EMMS, now is probably a good time to replace your MMX code with SSE2 intrinsics. SSE2 is baseline for x86-64, and few Linux systems are running obsolete 32-bit kernels.
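For instance, a hedged sketch of what that migration typically looks like (my own example, not the question's code): an MMX paddw on __m64 becomes _mm_add_epi16 on __m128i, and no EMMS is needed afterwards.

#include <cstdint>
#include <emmintrin.h>   // SSE2

// Hypothetical example: add two arrays of eight 16-bit values.
static inline void add_u16x8(const uint16_t *a, const uint16_t *b, uint16_t *dst)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi16(va, vb));
}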
Is there any way to compile Microsoft-style inline-assembly code on a Linux platform?
Yes, it is possible. Kind of.
For GCC you have to use both Intel and AT&T syntax. It does not work with Clang due to Issue 24232, Inline assembly operands don't work with .intel_syntax and Issue 39895, Error: unknown token in expression using inline asm.
Here is the pattern. The assembler template uses .intel_syntax. Then, at the end of your asm template, you switch back to .att_syntax mode so it's in the right mode for the rest of the compiler-generated asm.
#include <cstddef>
int main(int argc, char* argv[])
{
size_t ret = 1, N = 0;
asm __volatile__
(
".intel_syntax noprefix ;\n"
"xor esi, esi ;\n" // zero RSI
"neg %1 ;\n" // %1 is replaced with the operand location chosen by the compiler, in this case RCX
"inc %1 ;\n"
"push %1 ;\n" // UNSAFE: steps on the red-zone
"pop rax ;\n"
".att_syntax prefix ;\n"
: "=a" (ret) // output-only operand in RAX
"+c" (N) // read-write operand in RCX
: // no read-only inputs
: "%rsi" // RSI is clobbered: input and output register constraints can't pick it
);
return (int)ret;
}
This won't work if you use any memory operands, because the compiler will substitute AT&T syntax 4(%rsp) into the template instead of [rsp + 4], for example.
This also only works if you don't compile with gcc -masm=intel. Otherwise you'll put the assembler into AT&T mode when GCC is emitting Intel syntax. So using .intel_syntax noprefix breaks your ability to use either syntax with GCC.
mov edi, A;
The code I help with does not use variables in the assembler the way you show. I don't know how well (or poorly) that works with Intel-style ASM. I know a MASM-style grammar is not supported.
You may be able to do it using asmSymbolicNames. See the GCC Extended ASM HowTo for details.
However, to convert to something GCC can consume, you only need to use positional arguments:
__asm__ __volatile__
(
".intel_syntax noprefix ;\n"
"mov edi, %0 \n"; // inefficient: use a "D" constraint instead of a mov
...
".att_syntax prefix ;\n"
: : "r" (A) : "%edi"
);
Or better, use a "D" constraint to ask for the variable in EDI / RDI in the first place. If a GNU C inline asm statement ever starts or ends with a mov, that's usually a sign you're doing it wrong.
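A hedged sketch of that suggestion (the function and constant are my own, purely for illustration): with the "D" constraint the compiler materializes A in EDI before the template runs, so the template can use edi directly without a leading mov.

// Assumes GCC without -masm=intel, as in the pattern above.
unsigned add_ten(unsigned A)
{
    unsigned out;
    __asm__
    (
        ".intel_syntax noprefix \n\t"
        "lea eax, [edi + 10]    \n\t"   // EDI already holds A thanks to "D"
        ".att_syntax prefix"
        : "=a" (out)
        : "D" (A)
    );
    return out;
}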
Regarding asmSymbolicNames, here is what the GCC Extended ASM HowTo has to say about them:
This code makes no use of the optional asmSymbolicName. Therefore it
references the first output operand as %0 (were there a second, it
would be %1, etc). The number of the first input operand is one
greater than that of the last output operand. In this i386 example,
that makes Mask referenced as %1:
uint32_t Mask = 1234;
uint32_t Index;
asm ("bsfl %1, %0"
: "=r" (Index)
: "r" (Mask)
: "cc");
That code overwrites the variable Index (‘=’), placing the value in a
register (‘r’). Using the generic ‘r’ constraint instead of a
constraint for a specific register allows the compiler to pick the
register to use, which can result in more efficient code. This may not
be possible if an assembler instruction requires a specific register.
The following i386 example uses the asmSymbolicName syntax. It
produces the same result as the code above, but some may consider it
more readable or more maintainable since reordering index numbers is
not necessary when adding or removing operands. The names aIndex and
aMask are only used in this example to emphasize which names get used
where. It is acceptable to reuse the names Index and Mask.
uint32_t Mask = 1234;
uint32_t Index;
asm ("bsfl %[aMask], %[aIndex]"
: [aIndex] "=r" (Index)
: [aMask] "r" (Mask)
: "cc");
The sample code is part of an inline-assembly routine which compiles successfully on Win10 with the cl.exe compiler...
Stepping back to 10,000 feet, if you are looking for something easy to use to integrate inline ASM like in Microsoft environments, then you don't have it on Linux. GCC inline ASM absolutely sucks. The GCC inline assembler is an archaic, difficult to use tool that I despise interacting with.
(And you have not experienced the incomprehensible error messages with bogus line information, yet).
Peter's idea solved my problem. I just added a macro to my source file, in which all functions consist of a single big inline-asm block in Intel syntax. The macro is shown below:
#define _asm\
asm(".intel_syntax noprefix\n");\
asm\
After that I compiled it with the command:
clang++ -c -fasm-blocks source.cpp
Then everything is OK.

Why does calling the C abort() function from an x86_64 assembly function lead to segmentation fault (SIGSEGV) instead of an abort signal?

Consider the program:
main.c
#include <stdlib.h>
void my_asm_func(void);
__asm__(
".global my_asm_func;"
"my_asm_func:;"
"call abort;"
"ret;"
);
int main(int argc, char **argv) {
if (argv[1][0] == '0') {
abort();
} else if (argv[1][0] == '1') {
__asm__("call abort");
} else {
my_asm_func();
}
}
Which I compile as:
gcc -ggdb3 -O0 -o main.out main.c
Then I have:
$ ./main.out 0; echo $?
Aborted (core dumped)
134
$ ./main.out 1; echo $?
Aborted (core dumped)
134
$ ./main.out 2; echo $?
Segmentation fault (core dumped)
139
Why do I get the segmentation fault only for the last run, and not an abort signal as expected?
man 7 signal:
SIGABRT 6 Core Abort signal from abort(3)
SIGSEGV 11 Core Invalid memory reference
confirms the signals due to the 128 + SIGNUM rule.
As a sanity check I also tried to make other function calls from assembly as in:
#include <stdlib.h>
void my_asm_func(void);
__asm__(
".global my_asm_func;"
"my_asm_func:;"
"lea puts_message(%rip), %rdi;"
"call puts;"
"ret;"
"puts_message: .asciz \"hello puts\""
);
int main(void) {
my_asm_func();
}
and that did work and print:
hello puts
Tested in Ubuntu 19.04 amd64, GCC 8.3.0, glibc 2.29.
I also tried it in an Ubuntu 18.04 Docker container, and the results were the same, except that the program printed this when running:
./main.out: Symbol `abort' causes overflow in R_X86_64_PC32 relocation
./main.out: Symbol `abort' causes overflow in R_X86_64_PC32 relocation
which feels like a good clue.
In this code that defines a function at global scope (with basic assembly):
void my_asm_func(void);
__asm__(
".global my_asm_func;"
"my_asm_func:;"
"call abort;"
"ret;"
);
You violate one of the x86-64 (AMD64) System V ABI rules, which requires 16-byte stack alignment (it may be higher depending on the parameters) at the point just before a CALL is made.
3.2.2 The Stack Frame
In addition to registers, each function has a frame on the run-time stack. This stack grows downwards from high
addresses. Figure 3.3 shows the stack organization.
The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed
on stack) byte boundary. In other words, the value (%rsp + 8) is
always a multiple of 16 (32) when control is transferred to the
function entry point. The stack pointer, %rsp, always points to the
end of the latest allocated stack frame.
Upon entry to a function the stack will be misaligned by 8 because the 8 byte return address is now on the stack. To align the stack back on a 16 byte boundary subtract 8 from RSP at the beginning of the function and add 8 back to RSP when finished. You can also just push any register like RBP at the beginning and pop it after to get the same effect.
This version of the code should work:
void my_asm_func(void);
__asm__(
".global my_asm_func;"
"my_asm_func:;"
"push %rbp;"
"call abort;"
"pop %rbp;"
"ret;"
);
Regarding this code that happened to work:
__asm__("call abort");
The compiler likely generated the main function in such a way that the stack was aligned on a 16-byte boundary prior to this call, so it happened to work. You shouldn't rely on this behavior. There are other potential issues with this code, but they don't present as a failure in this case: the stack should be properly aligned before the call; you should be concerned in general about the red zone; and you should specify all the volatile (call-clobbered) registers of the calling convention as clobbers, including RAX/RCX/RDX/R8/R9/R10/R11, the FPU registers, and the SIMD registers. In this case abort never returns, so this isn't an issue related to your code.
The red-zone is defined in the ABI this way:
The 128-byte area beyond the location pointed to by %rsp is considered to
be reserved and shall not be modified by signal or interrupt handlers.8 Therefore,
functions may use this area for temporary data that is not needed across function
calls. In particular, leaf functions may use this area for their entire stack frame,
rather than adjusting the stack pointer in the prologue and epilogue. This area is
known as the red zone.
It is generally a bad idea to call a function in inline assembly. An example of calling printf can be found in this other Stackoverflow answer which shows the complexities of doing a CALL especially in 64-bit code with red-zone. David Wohlferd's Dont Use Inline Asm is always a good read.
This code happened to work:
void my_asm_func(void);
__asm__(
".global my_asm_func;"
"my_asm_func:;"
"lea puts_message(%rip), %rdi;"
"call puts;"
"ret;"
"puts_message: .asciz \"hello puts\""
);
but you were probably lucky that puts didn't need proper alignment and you happened to get no failure. You should be aligning the stack before calling puts as described earlier with the my_asm_func that called abort. Ensuring compliance with the ABI is the key to ensuring code will work as expected.
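For reference, a sketch of the puts wrapper with the same push/pop fix applied (the same idea as the abort version above, not a tested drop-in):

void my_asm_func(void);
__asm__(
    ".global my_asm_func;"
    "my_asm_func:;"
    "push %rbp;"                        /* re-align RSP to 16 bytes before the call */
    "lea puts_message(%rip), %rdi;"
    "call puts;"
    "pop %rbp;"
    "ret;"
    "puts_message: .asciz \"hello puts\""
);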
Regarding the relocation errors, that is probably because the version of Ubuntu being used is using Position Independent Code (PIC) by default for GCC code generation. You could fix the issue by making the C library calls though the Procedure Linkage Table by appending #plt to the function names you CALL. Peter Cordes wrote a related Stackoverflow answer on this topic.
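In other words, under a PIE/PIC default the abort wrapper would look something like this (a sketch, assuming the same toolchain as in the question):

void my_asm_func(void);
__asm__(
    ".global my_asm_func;"
    "my_asm_func:;"
    "push %rbp;"
    "call abort@plt;"   /* go through the PLT to satisfy the PIC relocation */
    "pop %rbp;"
    "ret;"
);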

How to write c++ code that the compiler can efficiently compile to SSE or AVX?

Let's say I have a function written in c++ that performs matrix vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions because it does not know the alignment of the passed pointer (requiring a 16 byte alignment for SSE or 32 byte alignment for AVX) at compile time? Or is the memory alignment of the data irrelevant for optimal SIMD code and the data alignment will only affect cache performance?
If alignment is important for the generated code, how can I let the (visual c++) compiler know that I intend to only pass values with a certain alignment to the function?
In theory alignment should not matter on Intel processors since Nehalem. Therefore, your compiler should be able to produce code in which a pointer being aligned or not is not an issue.
Unaligned load/store instructions have the same performance on Intel processors since Nehalem. However, until AVX arrived with Sandy Bridge unaligned loads could not be folded with another operation for micro-op fusion.
Additionally, even before AVX, 16-byte-aligned memory could still be helpful to avoid the penalty of cache-line splits, so it would still be reasonable for a compiler to add code until the pointer is 16-byte aligned.
Since AVX there is no advantage to using aligned load/store instructions anymore, and there is no reason for a compiler to add code to make a pointer 16-byte or 32-byte aligned.
However, there is still a reason to use aligned memory: to avoid cache-line splits with AVX. Therefore, it would be reasonable for a compiler to add code to make the pointer 32-byte aligned even if it still used an unaligned load instruction.
So in practice some compilers produce much simpler code when they are told to assume that a pointer is aligned.
I'm not aware of a method to tell MSVC that a pointer is aligned. With GCC and Clang (since 3.6) you can use the built-in __builtin_assume_aligned. With ICC and also GCC you can use #pragma omp simd aligned. With ICC you can also use __assume_aligned.
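As an aside, a quick sketch of the #pragma omp simd aligned form mentioned above (requires -fopenmp or -fopenmp-simd; the function itself is my own example):

void scale(float * __restrict a, float * __restrict b, int n)
{
    #pragma omp simd aligned(a, b : 16)   // promise 16-byte alignment to the vectorizer
    for (int i = 0; i < (n & (-4)); i++)
        b[i] = 3.14159f * a[i];
}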
For example with GCC compiling this simple loop
void foo(float * __restrict a, float * __restrict b, int n)
{
//a = (float*)__builtin_assume_aligned (a, 16);
//b = (float*)__builtin_assume_aligned (b, 16);
for(int i=0; i<(n & (-4)); i++) {
b[i] = 3.14159f*a[i];
}
}
Compiling with gcc -O3 -march=nehalem -S test.c and then running wc test.s gives 160 lines, whereas if I use __builtin_assume_aligned then wc test.s gives only 45 lines. When I did this with clang, it produced 110 lines in both cases.
So with clang, informing the compiler that the arrays were aligned made no difference (in this case), but with GCC it did. Counting lines of code is not a sufficient metric to gauge performance, and I'm not going to post all the assembly here; I just want to illustrate that your compiler may produce very different code when it is told the arrays are aligned.
Of course, the additional overhead that GCC has for not assuming the arrays are aligned may make no difference in practice. You have to test and see.
In any case, if you want to get the most from SIMD I would not rely on the compiler to do it correctly (especially with MSVC). Your example of matrix*vector is a poor one in general (though maybe not for some special cases) since it's memory-bandwidth bound. But if you choose matrix*matrix, no compiler is going to optimize that well without a lot of help, which does not conform to the C++ standard. In these cases you will need intrinsics/built-ins/assembly, in which you have explicit control of the alignment anyway.
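To make that concrete, here's a hedged SSE intrinsics sketch of the same scale-by-pi loop, assuming the caller guarantees 16-byte alignment and a length that's a multiple of 4 (my own example, not from the answer):

#include <xmmintrin.h>   // SSE

void foo_sse(const float *a, float *b, int n)
{
    const __m128 k = _mm_set1_ps(3.14159f);
    for (int i = 0; i < (n & (-4)); i += 4)
        _mm_store_ps(b + i, _mm_mul_ps(k, _mm_load_ps(a + i)));   // aligned load/store
}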
Edit:
The assembly from GCC contains a lot of extraneous lines which are not part of the text segment. Doing gcc -O3 -march=nehalem -S test.c and then using objdump -d and counting the lines in the text (code) segment gives 108 lines without using __builtin_assume_aligned and only 16 lines with it. This shows more clearly that GCC produces very different code when it assumes the arrays are aligned.
Edit:
I went ahead and tested the foo function above in MSVC 2013. It produces unaligned loads and the code is much shorter than GCC (I only show the main loop here):
$LL3#foo:
movsxd rax, r9d
vmulps xmm1, xmm0, XMMWORD PTR [r10+rax*4]
vmovups XMMWORD PTR [r11+rax*4], xmm1
lea eax, DWORD PTR [r9+4]
add r9d, 8
movsxd rcx, eax
vmulps xmm1, xmm0, XMMWORD PTR [r10+rcx*4]
vmovups XMMWORD PTR [r11+rcx*4], xmm1
cmp r9d, edx
jl SHORT $LL3#foo
This should be fine on processors since Nehalem (late 2008). But MSVC still has cleanup code for arrays whose length is not a multiple of four, even though I told the compiler that it was a multiple of four (n & (-4)). At least GCC gets that right.
Since AVX can fold unaligned loads, I checked GCC with AVX to see if the code was the same.
void foo(float * __restrict a, float * __restrict b, int n)
{
//a = (float*)__builtin_assume_aligned (a, 32);
//b = (float*)__builtin_assume_aligned (b, 32);
for(int i=0; i<(n & (-8)); i++) {
b[i] = 3.14159f*a[i];
}
}
without __builtin_assume_aligned GCC produces 168 lines of assembly and with it it only produces 17 lines.
My original answer became too messy to edit so I am adding a new answer here and making my original answer community wiki.
I did some tests using aligned and unaligned memory on a pre Nehalem system and on a Haswell system with GCC, Clang, and MSVC.
The assembly shows that only GCC adds code to check and fix alignment. Because of this, with __builtin_assume_aligned GCC produces much simpler code. Using __builtin_assume_aligned with Clang only changes unaligned instructions to aligned ones (the number of instructions stays the same). MSVC just uses unaligned instructions.
The performance result is that on pre-Nehalem systems, Clang and MSVC are much slower than GCC with auto-vectorization when the memory is not aligned.
But the penalty for cache-line splits is small since Nehalem. It turns out the extra code GCC adds to check and align the memory more than makes up for the small penalty due to cache-line splits. This explains why neither Clang nor MSVC worry about cache-line splits with vectorization.
So my original claim that auto-vectorization does not need to know about the alignment is more or less correct since Nehalem. That's not the same thing as saying that aligning memory is not useful since Nehalem.

What is the return value of the “inline assembly” code?

// gcc -g stack.c -o stack
//
#include <stdio.h>   /* for printf */

unsigned long sp(void){ __asm__("mov %esp, %eax");}
int main(int argc, char **argv)
{
unsigned long esp = sp();
printf("Stack pointer (ESP : 0x%lx)\n",esp);
return 0;
}
Please check the above code. In fact, sp() will return the esp register value via esp -> eax, I guess. But why? Is the default return value of sp() taken from eax?
Who could tell me more about it? Thanks!
The way a processor architecture organizes arguments, calls, and returns (and syscalls to the kernel), i.e. its calling conventions, is specified in the ABI (application binary interface). For Linux on x86-64 you should read the x86-64 ABI document. And yes, the return value of a function returning an integer type goes in %rax on x86-64 (%eax being its low half). (There is also the X32 ABI.)
Notice that this is mostly conventional, but if the convention changes, you'll need to change the compiler, perhaps the linker, the kernel, and all the libraries. Actually, it is so important that processor makers design the silicon with existing ABIs in mind (e.g. the importance of the %esp register, the SYSENTER instruction, ...).
Those are the rules!
The calling convention used by GCC for 32-bit code is for the return value of an integer-returning function to be left in %eax, and GCC keeps that convention for functions containing inline assembly as well.
See Wikipedia for all the details.
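A hedged sketch of a more robust way to write it, rather than relying on the value happening to be left in EAX: use an output operand so the compiler knows where the result is (32-bit code assumed; for 64-bit you'd read %rsp into a 64-bit register):

unsigned long sp(void)
{
    unsigned long v;
    __asm__("mov %%esp, %0" : "=r" (v));   // explicit output operand instead of implicit EAX
    return v;
}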
Note that the code uses AT&T syntax, where the source operand comes first, so "mov %esp, %eax" already copies ESP into EAX; the Intel-syntax equivalent would be "mov eax, esp". Swapping the operands, as below, would instead overwrite the stack pointer with EAX:
unsigned long sp(void){ __asm__("mov %eax, %esp");}

memcpy() performance- Ubuntu x86_64

I am observing some weird behavior which I am not being able to explain. Following are the details :-
#include <sched.h>
#include <sys/resource.h>
#include <time.h>
#include <iostream>
void memcpy_test() {
int size = 32*4;
char* src = new char[size];
char* dest = new char[size];
// general_utility::ProcessTimer tmr;  // project-specific timer class, not needed for this test
unsigned int num_cpy = 1024*1024*16;
struct timespec start_time__, end_time__;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
for(unsigned int i=0; i < num_cpy; ++i) {
__builtin_memcpy(dest, src, size);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time__);
std::cout << "time = " << (double)(end_time__.tv_nsec - start_time__.tv_nsec)/num_cpy << std::endl;
delete [] src;
delete [] dest;
}
When I specify -march=native in compiler options, the generated binary runs 2.7 times slower. Why is that? If anything, I would expect -march=native to produce optimized code. Are there other functions which could show this type of behavior?
EDIT 1:
Another interesting point is that if size > 32*4 then there is no difference between the run time of the binaries thus generated
EDIT 2:
Following are the detailed performance analysis (__builtin_memcpy()) :-
size = 32 * 4, without -march=native - 7.5 ns, with -march=native - 19.3
size = 32 * 8, without -march=native - 26.3 ns, with -march=native - 26.5
EDIT 3 :
This observation does not change even if I allocate int64_t/int32_t.
EDIT 4 :
size = 8192, without -march=native ~ 2750 ns, with -march=native ~ 2750 ns (earlier there was an error in reporting this number; it was wrongly written as 26.5, now it is correct)
I have run these many times and numbers are consistent across each run.
I have replicated your findings with: g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, Linux 2.6.38-10-generic #46-Ubuntu x86_64 on my Core 2 Duo. Results will probably vary depending on your compiler version and CPU. I get ~26 and ~9.
When I specify -march=native in compiler options, generated binary runs 2.7 times slower. Why is that ?
Because the -march=native version gets compiled into the following (found using objdump -D; you could also use gcc -S -fverbose-asm):
rep movsq %ds:(%rsi),%es:(%rdi) ; where rcx = 128 / 8
And the version without gets compiled into 16 load/store pairs like:
mov 0x20(%rbp),%rdx
mov %rdx,0x20(%rbx)
Which apparently is faster on our computers.
If anything, I would expect -march=native to produce optimized code.
In this case it turned out to be a pessimization to favor rep movsq over a series of moves, but that might not always be the case. The first version is shorter, which might be better in some (most?) cases. Or it could be a bug in the optimizer.
Is there other functions which could show this type of behavior ?
Any function for which the generated code differs when you specify -march=native. Suspects include functions implemented as macros or as static functions in headers, and anything whose name begins with __builtin. Possibly also (floating point) math functions.
Another interesting point is that if size > 32*4 then there is no difference between the run time of the binaries thus generated
This is because then they both compile to rep movsq; 128 is probably the largest size for which GCC will generate a series of loads/stores (it would be interesting to see whether this also holds on other platforms). BTW, when the compiler doesn't know the size at compile time (e.g. int size=atoi(argv[1]);), it simply turns into a call to memcpy with or without the switch.
It's quite a well-known issue (and a really old one).
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
Look at one of the later comments in the bug report:
"Just FYI: mesa is now defaulting to -fno-builtin-memcmp to workaround this
problem"
Looks like glibc's memcpy is far better than builtin...
