Why does _mm_mulhrs_epi16() always do biased rounding to positive infinity?

Does anyone know why the pmulhrsw instruction or
_mm_mulhrs_epi16(x, y) := RoundDown((x * y + 16384) / 32768)
always rounds towards positive infinity? To me, this is terribly biased for negative numbers, because then a sequence like -0.6, 0.6, -0.6, 0.6, ... won't add up to 0 on average.
Is this behavior intentional or unintentional? If it's intentional, what could be the use? Is there an easy way to make it less biased?
Lucky for me, I can just change the order of my operations to get a less biased result (my function is a signed geometric mean):
__m128i ChooseSign(__m128i x, __m128i sign)
{
    return _mm_sign_epi16(x, sign); // negate x where sign < 0, zero x where sign == 0
}
signsDifferent = _mm_srai_epi16(_mm_xor_si128(a, b), 15); // (a ^ b) >> 15
sign = _mm_andnot_si128(signsDifferent, a); // ~signsDifferent & a
//result = ChooseSign(sqrt(a * b), sign) * fraction; // biased
result = ChooseSign(sqrt(a * b) * fraction, sign);

A most serious mistake. I asked the same question on the Intel developer forums, and andysem corrected me, pointing out that the instruction rounds to the nearest integer.
I was misled into thinking it was biased because the formula from MSDN, https://msdn.microsoft.com/en-us/library/bb513995.aspx
was (x * y + 16384) >> 15. This looked very similar to the int(x + 0.5) method of rounding, which I know is biased for negative numbers and cringe at. But I didn't realize that an arithmetic right shift of a negative number rounds towards negative infinity, which isn't the same as dividing and casting to an int (that truncates towards zero).
Plus, it wasn't matching my non-SIMD reference implementation, which turned out to be the biased one, since I was calculating int(sum / 9.0f), which rounds towards zero.
I should've had more doubt before questioning the behavior of something implemented in hardware, which would've been rigorously vetted, since int(x + 0.5) would be a very expensive mistake.
_mm_mulhrs_epi16() still has some bias, because it always rounds x.5 towards +infinity. But that's not a big deal for my application.
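The difference between the two formulas can be sketched with a scalar model in Python. This is only an illustration, not the intrinsic: the names mulhrs_model and mulhrs_truncating are made up here, Python's >> on negative integers stands in for the arithmetic shift, and the saturating corner case -32768 * -32768 is not modeled.

```python
def mulhrs_model(x: int, y: int) -> int:
    """Scalar model of (x * y + 16384) >> 15 (hypothetical helper name)."""
    # Python's >> floors for negative values, like an arithmetic right shift.
    return (x * y + 16384) >> 15


def mulhrs_truncating(x: int, y: int) -> int:
    """The int(x + 0.5) style variant: divide, then truncate towards zero."""
    return int((x * y + 16384) / 32768)


q15_fraction = 6554  # roughly 0.2 in Q15, so 3 * 0.2 is roughly 0.6

for x in (-3, 3):
    # The shift version gives symmetric results; the truncating one does not.
    print(x, mulhrs_model(x, q15_fraction), mulhrs_truncating(x, q15_fraction))

# Exact ties (x.5) are the only inputs that still lean towards +infinity.
print(mulhrs_model(1, 16384), mulhrs_model(-1, 16384))
```

With the shift, about -0.6 and about +0.6 round to -1 and 1, so an alternating sequence averages out; with truncation the negative case collapses to 0.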

Related

Numerical differentiation using Cauchy (CIF)

I am trying to create a module with a mathematical class for Taylor series, to have it easily accessible for other projects. Hence I wish to optimize it as far as I can.
For those who are not too familiar with Taylor series: building one requires differentiating a function at a point many times. Because the usual limit definition of the derivative demands immense precision for higher-order derivatives, I've decided to use Cauchy's integral formula instead. With a little bit of work, I've managed to rearrange the formula, as you can see in this picture: Rearranged formula. This gave me much more accurate results for higher-order derivatives than the traditional definition of the derivative. Here is the function I am currently using to differentiate a function at a point:
import numpy as np
from math import factorial

def myDerivative(f, x, dTheta, degree):
    riemannSum = 0
    theta = 0
    # Riemann sum over the unit circle centred on x
    while theta < 2*np.pi:
        functionArgument = np.complex128(x + np.exp(1j*theta))
        secondFactor = np.complex128(np.exp(-1j * degree * theta))
        riemannSum += f(functionArgument) * secondFactor * dTheta
        theta += dTheta
    return factorial(degree)/(2*np.pi) * riemannSum.real
I've tested this function in my main function with a carefully chosen mathematical function whose derivatives I know, namely f(x) = sin(x).
def f(x):
    return np.sin(x)

def main():
    print(myDerivative(f, 0, 2*np.pi/(4*4096), 16))
These derivatives seem to freak out at around the derivative of degree 16. I've also tried to play around with dTheta, but with no luck. I would like to have higher orders as well, but I fear I've run into some kind of machine precision limit.
My question in its simplest form is: What can I do to improve this function in order to get higher-order derivatives?
I seem to have come up with a solution to the problem. I did this by rearranging Cauchy's integral formula in a different way, by exploiting that the initial contour integral can be an arbitrarily large circle around the point of differentiation. Be aware that it is very important that the function is analytic in the complex plane for this to be valid.
New formula
Also this gives a new function for differentiation:
def myDerivative(f, x, dTheta, degree, contourRadius):
    riemannSum = 0
    theta = 0
    while theta < 2*np.pi:
        functionArgument = np.complex128(x + contourRadius*np.exp(1j*theta))
        secondFactor = (1/contourRadius)**degree*np.complex128(np.exp(-1j * degree * theta))
        riemannSum += f(functionArgument) * secondFactor * dTheta
        theta += dTheta
    return factorial(degree) * riemannSum.real / (2*np.pi)
This gives me a very accurate differentiation of high orders. For instance I am able to differentiate f(x)=e^x 50 times without a problem.
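For comparison, here is a minimal standard-library sketch of the same contour idea. It is a variation, not the code above: the name cauchy_derivative and the samples parameter are assumptions made for illustration, and it replaces the floating-point while loop with a fixed number of equally spaced angles so the step count never depends on rounding in theta.

```python
import cmath
import math


def cauchy_derivative(f, x, degree, radius=1.0, samples=256):
    """Estimate the degree-th derivative of an analytic f at x (illustrative sketch).

    Uses n! / (2*pi*r**n) * integral of f(x + r*e^{i*theta}) * e^{-i*n*theta},
    approximated by the mean over `samples` equally spaced angles.
    """
    total = 0j
    for k in range(samples):
        theta = 2 * math.pi * k / samples
        z = x + radius * cmath.exp(1j * theta)
        total += f(z) * cmath.exp(-1j * degree * theta)
    mean = total / samples  # equals the integral divided by 2*pi
    return math.factorial(degree) * mean.real / radius ** degree


# Fifth derivative of exp at 0, and third derivative of sin at 0.
print(cauchy_derivative(cmath.exp, 0.0, 5))
print(cauchy_derivative(cmath.sin, 0.0, 3))
# A larger contour radius for a higher order, as in the rearranged formula.
print(cauchy_derivative(cmath.exp, 0.0, 20, radius=20.0))
```

As with the function above, this is only valid when f is analytic on and inside the chosen circle.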
Well, since you are working with a discrete approximation of the derivative (via dTheta), sooner or later you must run into trouble. I'm surprised you were able to get at least 15 accurate derivatives -- good work! But to get derivatives of all orders, either you have to put a limit on what you're willing to accept and say it's good enough, or else compute the derivatives symbolically. Take a look at SymPy for that. SymPy probably has some functions for computing Taylor series too.

Why are Float and Double different in the case of adding 0.1 and 0.2?

Can someone explain why
(0.1::Float) + (0.2::Float) == (0.3::Float)
while
(0.1::Double) + (0.2::Double) /= (0.3::Double)
To my knowledge Double is supposed to be more precise. Is there something about Float I should know about?
The first thing to realize is that when you enter 0.1::Double and ghci prints 0.1 back, it's only an "illusion:"
Prelude Data.Ratio> 0.1::Double
0.1
Why is that an illusion? Because the number 0.1 is actually not precisely representable as a floating point number! This is true for both Float and Double. Observe:
Prelude Data.Ratio> toRational (0.1::Float)
13421773 % 134217728
Prelude Data.Ratio> toRational (0.1::Double)
3602879701896397 % 36028797018963968
So, in reality, these numbers are indeed "close" to the actual real number 0.1, but neither is precisely 0.1. How close are they? Let's find out:
Prelude Data.Ratio> toRational (0.1::Float) - (1%10)
1 % 671088640
Prelude Data.Ratio> toRational (0.1::Double) - (1%10)
1 % 180143985094819840
As you see, Double is indeed a lot more precise than Float; the difference between the representation of 0.1 as a Double and the actual real-number 0.1 is a lot smaller. But neither is precise.
So, indeed the Double addition is a lot more precise, and should be preferred over the Float version. The confusing equality you see is nothing but the weird effect of rounding. The results of == should not be trusted in the floating-point land. In fact, there are many floating point numbers x such that x == x + 1 holds. Here's one example:
Prelude> let x = -2.1474836e9::Float
Prelude> x == x + 1
True
A good read on floating-point representation is the classic What Every Computer Scientist Should Know about Floating-Point Arithmetic, which explains many of these quirky aspects of floating-point arithmetic.
Also note that this behavior is not unique to Haskell. Any language that uses IEEE 754 floating-point arithmetic, the standard implemented by modern microprocessors, will behave this way.
Double is indeed more precise. And yes, there is something you should know about floating point representations in general: you have to be very careful about how you use them! == is generally unlikely to actually be useful. You're generally going to want to compare floating point representations using more specialized functions, or at least check if the representation lies within some range rather than whether it has a certain value according to the built-in approximation.
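The same effect can be reproduced outside Haskell. The following Python sketch is an illustration only: to_float32 is a hypothetical helper that emulates Float by rounding through a 4-byte IEEE 754 value with the struct module, and math.isclose is one possible tolerance-based comparison, with a tolerance picked arbitrarily.

```python
import math
import struct
from fractions import Fraction


def to_float32(value: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Single precision: the sum is rounded back to 24 significant bits.
single_sum = to_float32(to_float32(0.1) + to_float32(0.2))
print(single_sum == to_float32(0.3))

# Double precision: the rounded sum lands on a neighbour of 0.3.
print(0.1 + 0.2 == 0.3)

# Exact values of the stored numbers, comparable to the toRational output.
print(Fraction(to_float32(0.1)), Fraction(0.1))

# A tolerance-based comparison instead of ==.
print(math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9))
```

The single-precision equality holds only because the rounding errors happen to land on the same value, not because Float is more accurate.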

Floating point addition with LSB error

I'm implementing a hardware double precision adder with Verilog. During the verification phase when I compare my hardware output to MATLAB (or C) double precision addition outputs I found some weird cases where the LSB is not matching, taking into account that I'm using the same rounding mode (round to nearest even). My question is about the accuracy of the C calculation, is it truly accurate in doing the rounding or it's limited to some CPU architecture (32 or 64 bits)?
Here's an example,
A = 0x62a5a1c59bd10037 = 1.5944933396238637e+167
B = 0x62724bc40659bf0c = 1.685748657333889e+166 = 0.1685748657333889e+167
The correct output (just by doing the addition of the above real numbers manually)
= 1.7630682053572526e+167 = 0x62a7eb3e1c9c3819 (this matches my hardware)
When I try doing A+B in C, the result is equal to
= 1.7630682053572525e+167 = 0x62a7eb3e1c9c3818
When I try this application to check the intermediate operations
http://www.ecs.umass.edu/ece/koren/arith/simulator/FPAdd/
I can see from mantissa addition that C is not doing the rounding correctly (round to nearest even). In this case the mantissa should be rounded by adding one. Any idea why this is happening?
The operation of http://www.ecs.umass.edu/ece/koren/arith/simulator/FPAdd/ is correct. The final round to nearest even performs a downward rounding:
A+B = 1.0111111010110011111000011100100111000011100000011000|10 * 2^555
                                                           ^
                                                           |
When dropping the |10 part (exactly in the middle), the result keeps the last bit at 0 (even) instead of rounding it up to 1.
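The tie can be checked with exact rational arithmetic. The Python sketch below is only a verification aid under the assumption that the host follows IEEE 754 double precision with round to nearest even; from_bits and to_bits are hypothetical helper names.

```python
import struct
from fractions import Fraction


def from_bits(bits: int) -> float:
    """Reinterpret a 64-bit pattern as an IEEE 754 double."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]


def to_bits(value: float) -> int:
    """Return the 64-bit pattern of a double."""
    return int.from_bytes(struct.pack(">d", value), "big")


a = from_bits(0x62A5A1C59BD10037)
b = from_bits(0x62724BC40659BF0C)

# Exact sum of the two stored values, with no rounding at all.
exact_sum = Fraction(a) + Fraction(b)
lower = Fraction(from_bits(0x62A7EB3E1C9C3818))
upper = Fraction(from_bits(0x62A7EB3E1C9C3819))

print(hex(to_bits(a + b)))                        # what the CPU produces
print(exact_sum - lower == upper - exact_sum)     # is the exact sum a tie?
```

If the second line prints True, the exact sum sits halfway between the two candidates, and round to nearest even must pick the one whose last bit is 0.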

fixed point integer division ("fractional division") algorithm

The Honeywell DPS8 computer (and others) have/had a "divide fractional" instruction:
"This instruction divides a 71-bit fractional dividend (including sign) by a 36-bit
fractional divisor (including sign) to form a 36-bit fractional quotient (including
sign) and a 36-bit fractional remainder (including sign). Bit 35 of the remainder
corresponds to bit 70 of the dividend. The remainder sign is equal to the dividend
sign unless the remainder is zero."
So, as I understand it, this is integer division with the decimal point way over on the left.
.qqqqq / .ddddd
(I did scaled integer math in FORTH back in the day, but my memories of the techniques are lost in fog of time.)
To implement this instruction in a DPS8 emulator, I believe I need to start by creating two 70-bit numbers: the 71-bit dividend less its sign bit, and the 36-bit divisor less its sign bit and shifted 35 bits to the left so that the decimal points line up.
I think I can then form the remainder and quotient (in C) with '%' and '/', but I am unsure if those results need to be normalized (i.e. shifted).
I found an example of a "shift and subtract" algorithm ("Computer Arithmetic", slide 10), but I would prefer a more straightforward implementation.
Am I on the right track, or is the solution more nuanced? (Fixing up the signs and detecting errors have been elided here; those stages are well documented. The actual division is the issue.) Any pointers to C implementations of this kind of hardware emulation would be particularly helpful.
I do not have the definitive answer, but as a division is a division, you might find it helpful to look at some basic division routines.
Imagine that you have a 32-bit variable and you want an 8-bit fractional part.
You then have an integer part between 0 and 16777215, and a fractional part which is between 0 and 255.
0xiiiiiiff (where i is the integer part, f is the fractional part).
Imagine you have a 24-bit dividend (numerator), say the value 3, and a 24-bit divisor (denominator), say the value 13.
As we quickly will see, 3/13 is greater than zero and less than one. That means our fractional part is nonzero, but our integer part is filled completely with zeros.
So to do the above division using a standard divide function, we'll just bit-shift the dividend by N, thus we will get N bits of precision in our fractional part.
quotient_fp = (dividend_ip << 8) / divisor_ip
So far, so good.
But what if we want the divisor to have a fractional part, then ?
If we just shift the divisor up by 8, then we'll have a problem:
(dividend_ip << 8) / (divisor_ip << 8)
- because we'll obviously lose our fractional part of the quotient (result).
Instead, we'll need to shift the dividend up by as many bits as we shift the fractional part up...
((dividend_ip << 8) << 8) / (divisor_ip << 8)
...That makes it...
(dividend_ip << (dividend_precision + divisor_precision)) / (divisor_ip << divisor_precision)
Now, let's put our fractional part math into the picture...
(((dividend_ip << dividend_precision) | dividend_fp) << divisor_precision) / ((divisor_ip << divisor_precision) | divisor_fp)
Our quotient's precision will be the same as dividend_precision, which is 8 bits.
Unfortunately, this eats a lot of bits.
Fortunately, in your case, the integer part is not important, so you'll have a lot of room for the fractional part.
Let's increase the precision to 15 bits; this can be tested using normal 32-bit integers...
(((dividend_ip << 15) | dividend_fp) << 15) / ((divisor_ip << 15) | divisor_fp)
Our quotient will now have a 15-bit precision.
OK, but since you're supplying only the fractional parts and the integer part is always zero anyway, you should be able to just toss the integer part. That makes it....
(((dividend_ip << 16) | dividend_fp) << 16) / ((divisor_ip << 16) | divisor_fp)
... reduced to ...
(dividend_fp << 16) / divisor_fp
... now let's use a 64-bit integer instead, we can get 32 bits of precision in the quotient...
(dividend_fp << 32) / divisor_fp
... some compilers support a 128-bit integer type (GCC and Clang provide __int128 on 64-bit targets), so you might be able to use that type in order to get 128 bits easily. I have not tried it, but I've come across info on the Web earlier; search for int128_t, and you might find out how.
If you get the int128_t to work, you could make the dividend 128 bit, the divisor 64 bit and the quotient 64 bit...
quotient_fp = ((dividend_fp << 64) / divisor) >> (64 - 36)
... in order to get 36 bits precision.
Notice that since the result is in the top 36 bits of the quotient, the quotient needs to be shifted down (64 - 36) = 28 bits.
You could even go as high as (128 - 36) = 92 bits precision:
(dividend_fp << 92) / divisor
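The reduced form (dividend_fp << N) / divisor_fp can be tried out in Python, whose integers are arbitrarily wide and so stand in for a 128-bit type. This is a sketch under stated assumptions, not a DPS8 implementation: frac_divide is a hypothetical name, both inputs are unsigned and share the same scale, and sign handling is left out.

```python
def frac_divide(dividend_fp: int, divisor_fp: int, bits: int):
    """Divide two unsigned fractions that share one scale (illustrative sketch).

    Returns (quotient, remainder); the quotient carries `bits` fractional bits.
    """
    if divisor_fp == 0:
        raise ZeroDivisionError("fractional divide by zero")
    if dividend_fp >= divisor_fp:
        # The result would be >= 1.0 and no longer fit a pure fraction.
        raise OverflowError("quotient is not a pure fraction")
    # Shift the dividend up by the wanted precision, then do a plain integer divide.
    return divmod(dividend_fp << bits, divisor_fp)


# The 3 / 13 example with 8 and with 16 fractional bits.
q8, r8 = frac_divide(3, 13, 8)
q16, r16 = frac_divide(3, 13, 16)
print(q8, r8, q8 / 2 ** 8)
print(q16, r16, q16 / 2 ** 16)
```

No shift is needed after the divide in this form, because the quotient already has exactly `bits` fractional bits.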
Now, that you probably (hopefully) have a solution, I would like to recommend that you get familiar with low-level binary divide (again; since you've been there a while ago).
The best sources seem to be how hardware divides binary numbers; such as microcontrollers, CPUs and the like. Assembly language dividers are also good for getting to know the inner workings. Often 32-bit divide routines that use bit-shifting are very good sources.
Over time, I've come across a very clever implementation for ARM in ARM assembly language. Normally I wouldn't post references or assembly language examples, but considering that the code is very small, I think it would be alright.
Taken from A Fast Hi Precision Fixed Point Divide
@ r0 is the numerator (dividend)
@ r2 is the denominator (divisor)
    mov r1,#0
    adds r0,r0,r0
    .rept 32
    adcs r1,r2,r1,lsl#1
    subcc r1,r1,r2
    adcs r0,r0,r0
    .endr
@ r0 is the quotient (result)
@ r1 is the remainder (rest, modulo result)
The above routine contains the basics for an unsigned divide.
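For readers who prefer a high-level view, the shift-and-subtract idea can be sketched in Python. This is a plain restoring integer divide written for illustration, not a translation of the ARM routine; the function name and the bits parameter are assumptions.

```python
def shift_subtract_divide(numerator: int, denominator: int, bits: int = 32):
    """Unsigned restoring division, one quotient bit per iteration (sketch)."""
    if denominator == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        # Bring down the next numerator bit, most significant first.
        remainder = (remainder << 1) | ((numerator >> i) & 1)
        quotient <<= 1
        # If the divisor fits, subtract it and record a 1 bit.
        if remainder >= denominator:
            remainder -= denominator
            quotient |= 1
    return quotient, remainder


print(shift_subtract_divide(1000, 7))
print(shift_subtract_divide(3 << 16, 13))
```

Passing a pre-shifted numerator, as in the second call, gives a fixed-point quotient; the bits argument must cover the width of that shifted numerator.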
I hope this information will be useful. It may contain errors, as I have not tested any code or example mentioned. I'm confident, though, that it's not all wrong. ;)

Microsoft.DirectX.Vector3.Normalize() inconsistency

Two ways to normalize a Vector3 object; by calling Vector3.Normalize() and the other by normalizing from scratch:
class Tester {
static Vector3 NormalizeVector(Vector3 v)
{
float l = v.Length();
return new Vector3(v.X / l, v.Y / l, v.Z / l);
}
public static void Main(string[] args)
{
Vector3 v = new Vector3(0.0f, 0.0f, 7.0f);
Vector3 v2 = NormalizeVector(v);
Debug.WriteLine(v2.ToString());
v.Normalize();
Debug.WriteLine(v.ToString());
}
}
The code above produces this:
X: 0
Y: 0
Z: 1
X: 0
Y: 0
Z: 0.9999999
Why?
(Bonus points: Why Me?)
Look how they implemented it (e.g. in asm).
Maybe they wanted to be faster and produced something like:
l = 1 / v.length();
return new Vector3(v.X * l, v.Y * l, v.Z * l);
to trade 2 divisions for 3 multiplications (because they thought multiplications were faster than divisions, which often does not hold on modern FPUs). This introduces one more rounding step, and so less precision.
This would be the often cited "premature optimization".
Don't care about this. There's always some error involved when using floats. If you're curious, try changing to double and see if this still happens.
You should expect this when using floats, the basic reason being that the computer processes in binary and this doesn't map exactly to decimal.
For an intuitive example of issues between different bases, consider the fraction 1/3. It cannot be represented exactly in decimal (it's 0.333333.....) but can be in ternary (as 0.1).
Generally these issues are a lot less obvious with doubles, at the expense of computing costs (double the number of bits to manipulate). However in view of the fact that a float level of precision was enough to get man to the moon then you really shouldn't obsess :-)
These issues are sort of computer theory 101 (as opposed to programming 101 - which you're obviously well beyond), and if you're heading towards Direct X code, where similar things can come up regularly, I'd suggest it might be a good idea to pick up a basic computer theory book and read it quickly.
You have here an interesting discussion about String formatting of floats.
Just for reference:
Your number requires 24 bits to be represented, which means that you are using up the whole mantissa of a float (23bits + 1 implied bit).
Single.ToString () is ultimately implemented by a native function, so I cannot tell for sure what is going on, but my guess is that it uses the last digit to round the whole mantissa.
The reason behind this could be that you often get numbers that cannot be represented exactly in binary, so you would get a long mantissa; for instance, 0.01 is represented internally as 0.00999... as you can see by writing:
float f = 0.01f;
Console.WriteLine ("{0:G}", f);
Console.WriteLine ("{0:G}", (double) f);
by rounding at the seventh digit, you will get back "0.01", which is what you would have expected.
Given the above, numbers with only 7 digits will not show this problem, as you already saw.
Just to be clear: the rounding is taking place only when you convert your number to a string: your calculations, if any, will use all the available bits.
Floats have a precision of 7 digits externally (9 internally), so if you go above that then rounding (with potential quirks) is automatic.
If you drop the float down to 7 digits (for instance, 1 to the left, 6 to the right) then it will work out and the string conversion will as well.
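The 7-digit rounding described above can be imitated in Python. This is only a sketch: to_float32 is a hypothetical helper that emulates a C# float via the struct module, Python's ".7g" format stands in for the string conversion, and the value 1 - 2**-24 is an assumption about what Normalize() returned, chosen because it is the largest float below 1.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


stored = to_float32(0.01)
print(repr(stored))             # the stored value widened to double
print(format(stored, ".7g"))    # rounded to 7 significant digits

# Largest single-precision value below 1 (assumed result of Normalize()).
just_below_one = to_float32(1.0 - 2.0 ** -24)
print(format(just_below_one, ".7g"), format(just_below_one, ".9g"))
```

Rounding to 7 digits hides the error in 0.01 but cannot hide a value that uses all 24 mantissa bits, which is consistent with the 0.9999999 in the output.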
As for the bonus points:
Why you ? Because this code was 'eager to blow on you'.
(Vulcan... blow... ok.
Lamest.
Punt.
Ever)
If your code is broken by minute floating point rounding errors, then I'm afraid you need to fix it, as they're just a fact of life.

Resources