Implementing rint() in x86-64

Implementing rint() in x86-64 - visual-c++

MSVC 2012 doesn't have the rint() function. For 32-bit, I'm using the following:
double rint(double x) {
__asm {
fld x
frndint
}
}
This doesn't work in x64. There's _mm_round_sd() but that requires SSE4. What is an efficient preferrably branchless way of getting the same behavior?

rint 64-bit mode
#include <emmintrin.h>
static inline double rint (double const x) {
return (double)_mm_cvtsd_si32(_mm_load_sd(&x));
}
See Agner Fog's Optimizing C++ manual for lrint
32-bit mode
// Example 14.19
static inline int lrint (double const x) { // Round to nearest integer
int n;
#if defined(__unix__) || defined(__GNUC__)
// 32-bit Linux, Gnu/AT&T syntax:
__asm ("fldl %1 \n fistpl %0 " : "=m"(n) : "m"(x) : "memory" );
#else
// 32-bit Windows, Intel/MASM syntax:
__asm fld qword ptr x;
__asm fistp dword ptr n;
#endif
return n;
}
64-bit mode
// Example 14.21. // Only for SSE2 or x64
#include <emmintrin.h>
static inline int lrint (double const x) {
return _mm_cvtsd_si32(_mm_load_sd(&x));
}
Edit:
I just realized that this method will limit the values to to +/- 2^31. If you want a version with a larger range with SSE2 it's complicated (but easy with SSE4.1). See the round function in Agner Fog's Vector Class in the file vectorf128.h for an example.

Related

DLL Floating point results differ according to caller

This is a follow up question to my earlier one asked yesterday
The problems were occurring in a MSVS 2008 C++ DLL that has over 4000 lines of code, but I have managed to produce a simple case that demonstrates the problem as it occurs on my CPU (an AMD Phenom II X6 1050T).
Will it show the problem occurring on another system? I'd really like to know!
Here is a simple class (Point.cpp), it needs to be compiled as a DLL:
#include <math.h>
#define EXPORT extern "C" __declspec(dllexport)
namespace Test {
struct Point {
double x;
double y;
/* Constructor for a Point object */
Point(double xx, double yy) : x(xx), y(yy) {}
/* Copy constructor */
Point(const Point &rhs) : x(rhs.x), y(rhs.y) {}
double mag() const;
Point norm() const;
};
double Point::mag() const {return sqrt(x*x + y*y);}
Point Point::norm() const {
double m = mag();
return Point(x/m, y/m);
}
EXPORT void __stdcall GetNorm(double x, double y, double *nx, double *ny)
Point P = Point(x, y);
Point N = P.norm();
*nx = N.x;
*ny = N.y;
}
}
Here is the test program (TestPoint.c), which needs to be linked to the lib created for the DLL:
#include <stdio.h>
#define IMPORT extern __declspec(dllimport)
IMPORT void __stdcall GetNorm(double x, double y, double *nx, double *ny);
void dhex(double x) { // double to hex
union {
unsigned long n[2];
double d;
} value;
value.d = x;
printf("(0x%0x%0x)\n", value.n[1], value.n[0]);
}
double i64tod(unsigned long long n) { // hex to double
double *DP = (double *) &n;
return *DP;
}
int main(int argc, char **argv) {
double vx, vy;
double ux, uy;
vx = i64tod(0xbfc7a30f3a53d351);
vy = i64tod(0xc01b578b34e3ce1d);
GetNorm(vx, vy, &ux, &uy);
printf(" vx = %20.18f ", vx); dhex(vx);
printf(" vy = %20.18f ", vy); dhex(vy);
printf("\n");
printf(" ux = %20.18f ", ux); dhex(ux);
printf(" uy = %20.18f ", uy); dhex(uy);
return 0;
}
On my system, with TestPoint compiled with VC++, the output is:
vx = -0.18466368053455054 (0xbfc7a30f3a53d351)
vy = -6.8354919685403077 (0xc01b578b34e3ce1d)
ux = -0.027005566159023012 (0xbf9ba758ddda1454,
uy = -0.99963528318903927 (0xbfeffd032227301b)
However, if the same code is compiled with gcc, or indeed, it seems, ANY equivalent program (eg VB6, PowerBasic), the results (ux and uy) are subtly but definitely different (the last hex digit):
vx = -0.184663680534550540 (0xbfc7a30f3a53d351)
vy = -6.835491968540307700 (0xc01b578b34e3ce1d)
ux = -0.027005566159023008 (0xbf9ba758ddda1453)
uy = -0.999635283189039160 (0xbfeffd032227301a)
This might seem an insignificant difference, but when it occurs in a physics engine, these differences accumulate in an alarming fashion. .
If the engine is going to get different results depending on who calls it I might have to abandon the use of VC++ altogether and try g++ instead.

Ok, I think I know how this happens. Looking at a disassembler listing of Point.dll, I noticed that the GetNorm function was pretty much what you'd expect, a couple of FMUL's and FDIV's. What was not present was an FLDCW instruction.
There weren't any FLDCW's in the MSVC calling program either, but I found FLDCW's in both the gcc and a PowerBasic versions of the calling program.
So I tweaked one of the executables (the PowerBasic EXE was the easiest to find the right place to tweak), and hey presto, I then got answers that matched MSVC. Presumably the FLDCW had changed the FPU rounding mode, hence the difference in the least significant bits.

VC++ SSE code generation - is this a compiler bug?

A very particular code sequence in VC++ generated the following instruction (for Win32):
unpcklpd xmm0,xmmword ptr [ebp-40h]
2 questions arise:
(1) As far as I understand the intel manual, unpcklpd accepts as 2nd argument a 128-aligned memory address. If the address is relative to a stack frame alignment cannot be forced. Is this really a compiler bug?
(2) Exceptions are thrown from at the execution of this instruction only when run from the debugger, and even then not always. Even attaching to the process and executing this code does not throw. How can this be??
The particular exception thrown is access violation at 0xFFFFFFFF, but AFAIK that's just a code for misalignment.
[Edit:]
Here's some source that demonstrates the bad code generation - but typically doesn't cause a crash. (that's mostly what I'm wondering about)
[Edit 2:]
The code sample now reproduces the actual crash. This one also crashes outside the debugger - I suspect the difference occurs because the debugger launches the program at different typical base addresses.
// mock.cpp
#include <stdio.h>
struct mockVect2d
{
double x, y;
mockVect2d() {}
mockVect2d(double a, double b) : x(a), y(b) {}
mockVect2d operator + (const mockVect2d& u) {
return mockVect2d(x + u.x, y + u.y);
}
};
struct MockPoly
{
MockPoly() {}
mockVect2d* m_Vrts;
double m_Area;
int m_Convex;
bool m_ParClear;
void ClearPar() { m_Area = -1.; m_Convex = 0; m_ParClear = true; }
MockPoly(int len) { m_Vrts = new mockVect2d[len]; }
mockVect2d& Vrt(int i) {
if (!m_ParClear) ClearPar();
return m_Vrts[i];
}
const mockVect2d& GetCenter() { return m_Vrts[0]; }
};
struct MockItem
{
MockItem() : Contour(1) {}
MockPoly Contour;
};
struct Mock
{
Mock() {}
MockItem m_item;
virtual int GetCount() { return 2; }
virtual mockVect2d GetCenter() { return mockVect2d(1.0, 2.0); }
virtual MockItem GetItem(int i) { return m_item; }
};
void testInner(int a)
{
int c = 8;
printf("%d", c);
Mock* pMock = new Mock;
int Flag = true;
int nlr = pMock->GetCount();
if (nlr == 0)
return;
int flr = 1;
if (flr == nlr)
return;
if (Flag)
{
if (flr < nlr && flr>0) {
int c = 8;
printf("%d", c);
MockPoly pol(2);
mockVect2d ctr = pMock->GetItem(0).Contour.GetCenter();
// The mess happens here:
// ; 74 : pol.Vrt(1) = ctr + mockVect2d(0., 1.0);
//
// call ? Vrt#MockPoly##QAEAAUmockVect2d##H#Z; MockPoly::Vrt
// movdqa xmm0, XMMWORD PTR $T4[ebp]
// unpcklpd xmm0, QWORD PTR tv190[ebp] **** crash!
// movdqu XMMWORD PTR[eax], xmm0
pol.Vrt(0) = ctr + mockVect2d(1.0, 0.);
pol.Vrt(1) = ctr + mockVect2d(0., 1.0);
}
}
}
void main()
{
testInner(2);
return;
}
If you prefer, download a ready vcxproj with all the switches set from here. This includes the complete ASM too.

Update: this is now a confirmed VC++ compiler bug, hopefully to be resolved in VS2015 RTM.
Edit: The connect report, like many others, is now garbage. However the compiler bug seems to be resolved in VS2017 - not in 2015 update 3.

Since no one else has stepped up, I'm going to take a shot.
1) If the address is relative to a stack frame alignment cannot be forced. Is this really a compiler bug?
I'm not sure it is true that you cannot force alignment for stack variables. Consider this code:
struct foo
{
char a;
int b;
unsigned long long c;
};
int wmain(int argc, wchar_t* argv[])
{
foo moo;
moo.a = 1;
moo.b = 2;
moo.c = 3;
}
Looking at the startup code for main, we see:
00E31AB0 push ebp
00E31AB1 mov ebp,esp
00E31AB3 sub esp,0DCh
00E31AB9 push ebx
00E31ABA push esi
00E31ABB push edi
00E31ABC lea edi,[ebp-0DCh]
00E31AC2 mov ecx,37h
00E31AC7 mov eax,0CCCCCCCCh
00E31ACC rep stos dword ptr es:[edi]
00E31ACE mov eax,dword ptr [___security_cookie (0E440CCh)]
00E31AD3 xor eax,ebp
00E31AD5 mov dword ptr [ebp-4],eax
Adding __declspec(align(16)) to moo gives
01291AB0 push ebx
01291AB1 mov ebx,esp
01291AB3 sub esp,8
01291AB6 and esp,0FFFFFFF0h <------------------------
01291AB9 add esp,4
01291ABC push ebp
01291ABD mov ebp,dword ptr [ebx+4]
01291AC0 mov dword ptr [esp+4],ebp
01291AC4 mov ebp,esp
01291AC6 sub esp,0E8h
01291ACC push esi
01291ACD push edi
01291ACE lea edi,[ebp-0E8h]
01291AD4 mov ecx,3Ah
01291AD9 mov eax,0CCCCCCCCh
01291ADE rep stos dword ptr es:[edi]
01291AE0 mov eax,dword ptr [___security_cookie (12A40CCh)]
01291AE5 xor eax,ebp
01291AE7 mov dword ptr [ebp-4],eax
Apparently the compiler (VS2010 compiled debug for Win32), recognizing that we will need specific alignments for the code, takes steps to ensure it can provide that.
2) Exceptions are thrown from at the execution of this instruction only when run from the debugger, and even then not always. Even attaching to the process and executing this code does not throw. How can this be??
So, a couple of thoughts:
"and even then not always" - Not standing over your shoulder when you run this, I can't say for certain. However it seems plausible that just by random chance, stacks could get created with the alignment you need. By default, x86 uses 4byte stack alignment. If you need 16 byte alignment, you've got a 1 in 4 shot.
As for the rest (from https://msdn.microsoft.com/en-us/library/aa290049%28v=vs.71%29.aspx#ia64alignment_topic4):
On the x86 architecture, the operating system does not make the alignment fault visible to the application. ...you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
TLDR: Using __declspec(align(16)) should give you the alignment you want, even for stack variables. For unaligned accesses, the OS will catch the exception and handle it for you (at a cost of performance).
Edit1: Responding to the first 2 comments below:
Based on MS's docs, you are correct about the alignment of stack parameters, but they propose a solution as well:
You cannot specify alignment for function parameters. When data that
has an alignment attribute is passed by value on the stack, its
alignment is controlled by the calling convention. If data alignment
is important in the called function, copy the parameter into correctly
aligned memory before use.
Neither your sample on Microsoft connect nor the code about produce the same code for me (I'm only on vs2010), so I can't test this. But given this code from your sample:
struct mockVect2d
{
double x, y;
mockVect2d(double a, double b) : x(a), y(b) {}
It would seem that aligning either mockVect2d or the 2 doubles might help.

Reading x86 MSR from kernel module

My main aim is to get the address values of the last 16 branches maintained by the LBR registers when a program crashes. I tried two ways till now -
1) msr-tools
This allows me to read the msr values from the command line. I make system calls to it from the C program itself and try to read the values. But the register values seem no where related to the addresses in the program itself. Most probably the registers are getting polluted from the other branches in system code. I tried turning off recording of branches in ring 0 and far jumps. But that doesn't help. Still getting unrelated values.
2) accessing through kernel module
Ok I wrote a very simple module (I've never done this before) to access the msr registers directly and possibly avoid register pollution.
Here's what I have -
#define LBR 0x1d9 //IA32_DEBUGCTL MSR
//I first set this to some non 0 value using wrmsr (msr-tools)
static void __init do_rdmsr(unsigned msr, unsigned unused2)
{
uint64_t msr_value;
__asm__ __volatile__ (" rdmsr"
: "=A" (msr_value)
: "c" (msr)
);
printk(KERN_EMERG "%lu \n",msr_value);
}
static int hello_init(void)
{
printk(KERN_EMERG "Value is ");
do_rdmsr (LBR,0);
return 0;
}
static void hello_exit(void)
{
printk(KERN_EMERG "End\n");
}
module_init(hello_init);
module_exit(hello_exit);
But the problem is that every time I use dmesg to read the output I get just
Value is 0
(I have tried for other registers - it always comes as 0)
Is there something that I am forgetting here?
Any help? Thanks

Use the following:
unsigned long long x86_get_msr(int msr)
{
unsigned long msrl = 0, msrh = 0;
/* NOTE: rdmsr is always return EDX:EAX pair value */
asm volatile ("rdmsr" : "=a"(msrl), "=d"(msrh) : "c"(msr));
return ((unsigned long long)msrh << 32) | msrl;
}

You can use Ilya Matveychikov's answer... or... OR :
#include <asm/msr.h>
int err;
unsigned int msr, cpu;
unsigned long long val;
/* rdmsr without exception handling */
val = rdmsrl(msr);
/* rdmsr with exception handling */
err = rdmsrl_safe(msr, &val);
/* rdmsr on a given CPU (instead of current one) */
err = rdmsrl_safe_on_cpu(cpu, msr, &val);
And there are many more functions, such as :
int msr_set_bit(u32 msr, u8 bit)
int msr_clear_bit(u32 msr, u8 bit)
void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr *msrs)
int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8])
Have a look at /lib/modules/<uname -r>/build/arch/x86/include/asm/msr.h

Just out of curiosity: how come linux kernel "optimized" strcpy is much slower the libc imp?

I tried to benchmark optimized string operations under http://lxr.linux.no/#linux+v2.6.38/arch/x86/lib/string_32.c and compare to regular strcpy:
#include<stdio.h>
#include<stdlib.h>
char *_strcpy(char *dest, const char *src)
{
int d0, d1, d2;
asm volatile("1:\tlodsb\n\t"
"stosb\n\t"
"testb %%al,%%al\n\t"
"jne 1b"
: "=&S" (d0), "=&D" (d1), "=&a" (d2)
: "0" (src), "1" (dest) : "memory");
return dest;
}
int main(int argc, char **argv){
int times = 1;
if(argc >1)
{
times = atoi(argv[1]);
}
char a[100];
for(; times; times--)
_strcpy(a, "Hello _strcpy!");
return 0;
}
and timeing it using (time .. ) showed that it is about x10 slower than regular strcpy (under x64 linux)
Why?

If your string is constant, it's possible that the compiler is inlining the copy (for the plain strcpy call), making it into a series of unconditional MOV instructions.
since this is linear code without conditions, it would be faster than the linux variant.

Linux assembler error "impossible constraint in ‘asm’"

I'm starting with assembler under Linux. I have saved the following code as testasm.c
and compiled it with: gcc testasm.c -otestasm
The compiler replies: "impossible constraint in ‘asm’".
#include <stdio.h>
int main(void)
{
int foo=10,bar=15;
__asm__ __volatile__ ("addl %%ebx,%%eax"
: "=eax"(foo)
: "eax"(foo), "ebx"(bar)
: "eax"
);
printf("foo = %d", foo);
return 0;
}
How can I resolve this problem?
(I've copied the example from here.)
Debian Lenny, kernel 2.6.26-2-amd64
gcc version 4.3.2 (Debian 4.3.2-1.1)
Resolution:
See the accepted answer - it seems the 'modified' clause is not supported any more.

__asm__ __volatile__ ("addl %%ebx,%%eax" : "=a"(foo) : "a"(foo), "b"(bar));
seems to work. I believe that the syntax for register constraints changed at some point, but it's not terribly well documented. I find it easier to write raw assembly and avoid the hassle.

The constraints are single letters (possibly with extra decorations), and you can specify several alternatives (i.e., an inmediate operand or register is "ir"). So the constraint "eax" means constraints "e" (signed 32-bit integer constant), "a" (register eax), or "x" (any SSE register). That is a bit different that what OP meant... and output to an "e" clearly doesn't make any sense. Also, if some operand (in this case an input and an output) must be the same as another, you refer to it by a number constraint. There is no need to say eax will be clobbered, it is an output. You can refer to the arguments in the inline code by %0, %1, ..., no need to use explicit register names. So the correct version for the code as intended by OP would be:
#include <stdio.h>
int main(void)
{
int foo=10, bar=15;
__asm__ __volatile__ (
"addl %2, %0"
: "=a" (foo)
: "0" (foo), "b" (bar)
);
printf("foo = %d", foo);
return 0;
}
A better solution would be to allow %2 to be anything, and %0 a register (as x86 allows, but you'd have to check your machine manual):
#include <stdio.h>
int main(void)
{
int foo=10, bar=15;
__asm__ __volatile__ (
"addl %2, %0"
: "=r" (foo)
: "0" (foo), "g" (bar)
);
printf("foo = %d", foo);
return 0;
}

If one wants to use multiline, then this will also work..
__asm__ __volatile__ (
"addl %%ebx,%%eax; \
addl %%eax, %%eax;"
: "=a"(foo)
: "a"(foo), "b"(bar)
);
'\' should be added for the compiler to accept a multiline string (the instructions).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Implementing rint() in x86-64 - visual-c++

MSVC 2012 doesn't have the rint() function. For 32-bit, I'm using the following: double rint(double x) { __asm { fld x frndint } } This doesn't work in x64. There's _mm_round_sd() but that requires SSE4. What is an efficient preferrably branchless way of getting the same behavior?

Related

DLL Floating point results differ according to caller

VC++ SSE code generation - is this a compiler bug?

Reading x86 MSR from kernel module

Just out of curiosity: how come linux kernel "optimized" strcpy is much slower the libc imp?

Linux assembler error "impossible constraint in ‘asm’"

Categories

Resources