Adding floats yields Double - Groovy

This Groovy:
float a = 1;
float b = 2;
def r = a + b;
Creates this Java code when decompiled from the .class with IntelliJ:
float a = (float)1;
float b = (float)2;
Object r = null;
double var7 = (double)a + (double)b;
r = Double.valueOf(var7);
So r contains a Double.
If I do this:
float a = 1;
float b = 2;
float r = a + b;
It generates code that performs the addition with doubles and converts back to float:
float a = (float)1;
float b = (float)2;
float r = 0.0F;
double var7 = (double)a + (double)b;
r = (float)var7;
So should one abandon floats with Groovy, since it seems not to want to use them anyway?

Groovy decided to fall back to five standard result types for numeric operations: int, long, BigInteger, double and BigDecimal. Thus adding or multiplying two floats returns a double. Division and power are special cases.
From http://www.groovy-lang.org/syntax.html
Division and power binary operations aside,
binary operations between byte, char, short and int result in int
binary operations involving long with byte, char, short and int result in long
binary operations involving BigInteger and any other integral type result in BigInteger
binary operations between float, double and BigDecimal result in double
binary operations between two BigDecimal result in BigDecimal
As for whether you should abandon float: normally it is good enough to convert the double back to float, especially since Groovy is doing that automatically for you.

.NET (C#) does something similar with small integer types: adding two Bytes or Int16s yields an Int32, possibly to prevent overflows.
Operations on "smaller" data types may result in "bigger" data types, where bigger means more bits.
As illustrated in this example (more digits also means more bits):
15 (2 digits) x 15 (2 digits) = 225 (3 digits)
1.5 (2 digits) x 1.5 (2 digits) = 2.25 (3 digits)
However, adding two 32-bit integers just returns a 32-bit integer, and adding two doubles just returns a double. This is because the (virtual) machine is optimized for working with these sizes, which in turn is because physical processors used to be optimized for them. Some still are: 32-bit operations are often faster than 64-bit operations, even on 64-bit processors, whereas 16-bit operations are barely faster, if at all.
Your compiler attempts to protect you against overflows and allows you to check for them explicitly. So unless you have a good reason not to, I'd default to using these types, and optionally truncate to a more compact type when storing the data.
Good reasons not to include scenarios where you process large quantities (thousands) of numbers, e.g. for graphics processing.
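C's integer promotions behave the same way; a minimal sketch (plain C, offered as an assumed analogue of the C# behavior described above) showing that small operands widen to int before the arithmetic:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t a = 200, b = 100;
    /* Both operands are promoted to int before the addition,
       so the intermediate result 300 does not wrap. */
    int wide = a + b;
    /* Forcing the result back into 8 bits wraps: 300 mod 256 = 44. */
    uint8_t narrow = (uint8_t)(a + b);
    printf("%d %d\n", wide, narrow); /* prints: 300 44 */
    return 0;
}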


Float to Int type conversion in Python for large integers/numbers

Need some help with the piece of code below that I am working on. Why is the original number in "a" different from "c" after it goes through a type conversion? Is there any way we can make "a" and "c" the same across the float -> int type conversion?
a = '46700000000987654321'
b = float(a) => 4.670000000098765e+19
c = int(b) => 46700000000987652096
a == c => False
Please read this document about Floating Point Arithmetic: Issues and Limitations:
https://docs.python.org/3/tutorial/floatingpoint.html
for your example:
from decimal import Decimal
a='46700000000987654321'
b=Decimal(a)
print(b) #46700000000987654321
c=int(b)
print(c) #46700000000987654321
Modified version of my answer to another question (reasonably) duped to this one:
This happens because 46700000000987654321 is greater than the integer representational limits of a C double (which is what a Python float is implemented in terms of).
Typically, C doubles are IEEE 754 64-bit binary floating point values, which means they have 53 bits of integer precision (the last consecutive integer values float can represent are 2 ** 53 - 1 followed by 2 ** 53; it can't represent 2 ** 53 + 1). Problem is, 46700000000987654321 requires 66 bits of integer precision to store ((46700000000987654321).bit_length() will provide this information). When a value is too large for the significand (the integer component) alone, the exponent component of the floating point value is used to scale a smaller integer value by powers of 2 to be roughly in the ballpark of the original value. This means the representable integers start to skip: first by 2 (as you require >53 bits), then by 4 (for >54 bits), then 8 (>55 bits), then 16 (>56 bits), and so on, skipping twice as far between representable values for each additional bit of magnitude you have beyond 53 bits.
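That ceiling is easy to demonstrate; here is a quick check in C, since the limit comes from the underlying C double:

#include <stdio.h>

int main(void)
{
    double limit = 9007199254740992.0;    /* 2 ** 53 */
    /* 2 ** 53 + 1 is not representable, so the sum rounds back down. */
    printf("%d\n", limit + 1.0 == limit); /* prints 1 (true) */
    /* Above 2 ** 53, representable integers advance in steps of 2. */
    printf("%.1f\n", limit + 2.0);        /* prints 9007199254740994.0 */
    return 0;
}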
In your case, 46700000000987654321, converted to float, has an integer value of 46700000000987652096 (as you noted), having lost precision in the low digits.
If you need arbitrarily precise base-10 floating point math, replace your use of float with decimal.Decimal (conveniently, your initial value is already a string, so you don't risk loss of precision between how you type a float and the actual value stored); the default precision will handle these values, and you can increase it if you need larger values. If you do that (and convert a to an int for the comparison, since a str is never equal to any numeric type), you get the behavior you expected:
from decimal import Decimal as Dec, getcontext
a = "46700000000987654321"
b = Dec(a); print(b) # => 46700000000987654321
c = int(b); print(c) # => 46700000000987654321
print(int(a) == c) # => True
If you echo the Decimals in an interactive interpreter instead of using print, you'd see Decimal('46700000000987654321') instead, which is the repr form of Decimals, but it's numerically 46700000000987654321, and if converted to int, or stringified via any method that doesn't use the repr, e.g. print, it displays as just 46700000000987654321.

Robust linear interpolation

Given two segment endpoints A and B (in two dimensions), I would like to perform linear interpolation based on a value t, i.e.:
C = A + t(B-A)
In the ideal world, A, B and C should be collinear. However, we are operating with limited floating-point here, so there will be small deviations. To work around numerical issues with other operations I am using robust adaptive routines originally created by Jonathan Shewchuk. In particular, Shewchuk implements an orientation function orient2d that uses adaptive precision to exactly test the orientation of three points.
Here is my question: is there a known procedure for computing the interpolation in floating-point math so that C lies exactly on the line between A and B? Here I care less about the accuracy of the interpolation itself and more about the resulting collinearity. In other words, it's OK if C is shifted around a bit, as long as collinearity is satisfied.
The bad news
The request can't be satisfied. There are values of A and B for which there is NO value of t other than 0 and 1 for which lerp(A, B, t) is a float.
A trivial example in single precision is x1 = 12345678.f and x2 = 12345679.f. Regardless of the values of y1 and y2, the required result must have an x component between 12345678.f and 12345679.f, and there's no single-precision float between these two.
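You can confirm that there is nothing in between with nextafterf from math.h; a minimal check:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* The very next representable float above 12345678.f is 12345679.f,
       so no x component can land strictly between the two endpoints. */
    printf("%.1f\n", nextafterf(12345678.f, INFINITY)); /* 12345679.0 */
    return 0;
}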
The (sorta) good news
The exact interpolated value, however, can be represented as the sum of 5 floating-point values (vectors in the case of 2D): one for the formula's result, one for the error in each operation [1] and one for multiplying the error by t. I'm not sure if that will be useful to you. Here's a 1D C version of the algorithm in single precision that uses fused multiply-add to calculate the product error, for simplicity:
#include <math.h>

/* 2Sum: returns the rounded sum a + b and stores the exact
   rounding error in *err. */
float exact_sum(float a, float b, float *err)
{
    float sum = a + b;
    float z = sum - a;
    *err = a - (sum - z) + (b - z);
    return sum;
}

/* Returns the rounded product a * b; the fused multiply-add
   recovers the exact rounding error. */
float exact_mul(float a, float b, float *err)
{
    float prod = a * b;
    *err = fmaf(a, b, -prod);
    return prod;
}

float exact_lerp(float A, float B, float t,
                 float *err1, float *err2, float *err3, float *err4)
{
    float diff = exact_sum(B, -A, err1);
    float prod = exact_mul(diff, t, err2);
    *err1 = exact_mul(*err1, t, err4);
    return exact_sum(A, prod, err3);
}
In order for this algorithm to work, operations need to conform to IEEE-754 semantics in round-to-nearest mode. That's not guaranteed by the C standard, but the GNU gcc compiler can be instructed to do so, at least in processors supporting SSE2 [2][3].
It is guaranteed that the arithmetic addition of (result + err1 + err2 + err3 + err4) will be equal to the desired result; however, there is no guarantee that the floating-point addition of these quantities will be exact.
To use the above example, exact_lerp(12345678.f, 12345679.f, 0.300000011920928955078125f, &err1, &err2, &err3, &err4) returns a result of 12345678.f and err1, err2, err3 and err4 are 0.0f, 0.0f, 0.300000011920928955078125f and 0.0f respectively. Indeed, the correct result is 12345678.300000011920928955078125 which can't be represented as a single-precision float.
A more convoluted example: exact_lerp(0.23456789553165435791015625f, 7.345678806304931640625f, 0.300000011920928955078125f, &err1, &err2, &err3, &err4) returns 2.3679010868072509765625f and the errors are 6.7055225372314453125e-08f, 8.4771045294473879039287567138671875e-08f, 1.490116119384765625e-08f and 2.66453525910037569701671600341796875e-15f. These numbers add up to the exact result, which is 2.36790125353468550173374751466326415538787841796875 and can't be exactly stored in a single-precision float.
All numbers in the examples above are written using their exact values, rather than a number that approximates to them. For example, 0.3 can't be represented exactly as a single-precision float; the closest one has an exact value of 0.300000011920928955078125 which is the one I've used.
It might be possible that if you calculate err1 + err2 + err3 + err4 + result (in that order), you get an approximation that is considered collinear in your use case. Perhaps worth a try.
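As a starting point, here is a small driver (hypothetical; it assumes the functions above are in scope) that reproduces the first example and sums the terms in that order:

#include <stdio.h>

int main(void)
{
    float e1, e2, e3, e4;
    float r = exact_lerp(12345678.f, 12345679.f,
                         0.300000011920928955078125f,
                         &e1, &e2, &e3, &e4);
    /* Arithmetically, r + e1 + e2 + e3 + e4 is the exact result
       12345678.300000011920928955078125; summing the small error
       terms first loses as little of that as possible. */
    float approx = e1 + e2 + e3 + e4 + r;
    printf("result = %.1f, approximation = %.1f\n", r, approx);
    return 0;
}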
References
[1] Graillat, Stef (2007). Accurate Floating Point Product and Exponentiation.
[2] Enabling strict floating point mode in GCC
[3] Semantics of Floating Point Math in GCC

Does Haskell do a default conversion from double to integer?

I have a function of type Int -> Int -> Int -> Int. When I use div a b as a value for a variable in the function, the value seems to get rounded down to 0 whenever div a b would be 1/2 or anything double-like.
Is this correct? Does Haskell cut off values the way Java does when a double is forced into an integer?
div 1 2 doesn't return 0.5, which is then converted to the integer 0. It returns 0 in the first place. div performs integer division and as such always returns an integer (or other Integral type, depending on which type you used it with). There are no doubles involved.
When you do convert a double to an integer, the method of rounding depends on which method you used. For example floor would round the number down whereas round would round to the nearest integer. There are no implicit conversions in Haskell, so any conversion will happen through a function.
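For comparison, C's math library draws the same distinctions through explicit functions (note that C's round breaks ties away from zero, unlike Haskell's round, which ties to even):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = -2.5;
    printf("%.0f %.0f %.0f\n",
           floor(x),  /* -3: always rounds down */
           round(x),  /* -3: nearest, ties away from zero */
           trunc(x)); /* -2: rounds toward zero */
    return 0;
}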
Does Haskell cut off values like in Java?
No, it does not.
When doing integer division, Java rounds towards zero, whereas Haskell rounds downwards; so in Haskell
> (-9) `div` 10
-1
whereas in Java -9 / 10 is zero:
public class IntDiv {
    public static void main(String[] args) {
        double a = (-9) / 10;
        System.out.printf("%.2f\n", a); // would print 0.00
    }
}
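The same experiment in C, for what it's worth, sides with Java: integer division truncates toward zero.

#include <stdio.h>

int main(void)
{
    /* C truncates toward zero, so -9 / 10 is 0, not -1. */
    printf("%d\n", -9 / 10); /* prints 0 */
    return 0;
}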

Convert UInt32 to float without rounding?

I need to convert a UInt32 type to a float without having it rounded. Say I do
float num = 4278190335;
uint num1 = num;
The value instantly gets changed to 4278190336. Is there any way around this?
I need to convert a UInt32 type to a float without having it rounded.
That can't be done.
There are 2^32 possible uint values. There are fewer than 2^32 float values (there are 2^32 bit patterns, but that includes various NaN values). Add to that the fact that there are obviously a lot of float values which can't be represented as uint (e.g. 0.5) and it becomes clear that you can't represent every uint value exactly in a float. However, every uint (and every int) can be represented exactly as a double, so that might be a solution to your problem.
The problem you're seeing in your original source code is that 4278190335 isn't exactly representable as a float; the closest float value is 4278190336. This isn't a problem with the conversion from float to uint - it's a problem with the conversion from the exact value you've specified in your source code into a float; the float to uint conversion happens separately (and again, can easily lose information).
float has only 23 bits of mantissa. Along with the implicit 1 bit, it can only represent exactly those numbers that fit in 24 bits. For numbers larger than that it can only store the nearest representable value. 4278190335 = 0xFF0000FF > 2^24, so it'll be rounded to 4278190336 when converted to float.
Similarly, double has 52 bits of mantissa and can represent all integers in the range [-2^53, 2^53] exactly, so it can store any value that fits in a 32-bit int, including 4278190335. But again, double can't store all numbers in long's range, even though the two types have the same size (64 bits).
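Both points are easy to check; a minimal sketch in C:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t n = 4278190335u; /* 0xFF0000FF: needs 32 significant bits */
    float f = (float)n;       /* 24-bit significand: rounds to nearest */
    double d = (double)n;     /* 53-bit significand: exact */
    printf("%.1f\n%.1f\n", f, d); /* 4278190336.0 then 4278190335.0 */
    return 0;
}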
Aside from your question being worded backward, I think what you are saying is: you need to get the integer portion of a float value, i.e. its whole-number part rather than its fractional part. In that case you can simply cast the float to an int; the cast truncates, it does not round.
e.g.
float myFloat = 1.5f;
uint myInt = (uint)myFloat; // myInt == 1
Keep in mind, though, that this isn't always clear to others reading your code. To help there are Math.Floor and Math.Ceiling: Floor returns the whole number below the current value, Ceiling the whole number above it.
e.g.
float myFloat = 1.5f;
uint myFloorInt = (uint)Math.Floor(myFloat);     // myFloorInt == 1
uint myCeilingInt = (uint)Math.Ceiling(myFloat); // myCeilingInt == 2
You will need to cast or convert the value from float to uint, int, etc. as your needs dictate. Many frown on casting because the resulting value isn't always clear to readers; the Convert class has various methods that convert one value to another in a clearly understandable way.
There is no way to get the original value back with your method. I suggest copying byte by byte, so that the data can be retrieved later; a float typecast can change the original value. If your processor is 32-bit, this could help:
#include <stdint.h>
#include <string.h>

uint32_t x = 4278190335u;
float y;
memcpy(&y, &x, sizeof y); /* copies the raw bits; y is not numerically equal to x */

Fixed point to Floating point

I have the following code; I have just copied some data from external RAM into a buffer called "data" on the MCU:
double p32 = 4.294967296e+009; // equals 2^32 in decimal notation
int32_t longhigh;
uint32_t longlow;
offset = mapdata(); // Points to the data I want, 55-bit fixed point on HW
longhigh = data[2*offset+1]; // Gets upper part of data
longlow = data[2*offset]; // Gets lower part
double floating = (longhigh*p32 + longlow); // What is this doing? How does it work?
Can someone explain that last line of code for me? Why are we multiplying by p32? Thanks.
Multiplying by p32 is equivalent to a left shift by 32 bits. It also results in a type conversion for the product (from int to double), as well as for the sum. This way you can essentially keep 64-bit ints in the buffer and convert them to doubles when required.
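A minimal sketch of that reconstruction, with assumed values for the two halves since the original buffer isn't shown:

#include <stdint.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    int32_t longhigh = -2;          /* assumed upper 32 bits (signed) */
    uint32_t longlow = 0x80000000u; /* assumed lower 32 bits */

    double p32 = 4294967296.0;      /* 2^32 */
    double floating = longhigh * p32 + longlow;

    /* ldexp scales by a power of two explicitly -- same value. */
    double alt = ldexp((double)longhigh, 32) + (double)longlow;

    printf("%.1f\n%.1f\n", floating, alt); /* both print -6442450944.0 */
    return 0;
}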
