Understanding the Float conversion behaviour of pyspark - apache-spark

When I convert the python float 77422223.0 to a spark FloatType, I get 77422224. If I do so with DoubleType I get 77422223. How is this conversion working and is there a way to compute when such an error will happen?
from pyspark.sql.types import FloatType, DoubleType

df = spark.createDataFrame([77422223.0], FloatType())
display(df)
The output shows 77422224. As expected, running
df = spark.createDataFrame([77422223.0],DoubleType())
display(df)
yields 77422223.

How is this conversion working…
I assume the Spark FloatType is IEEE-754 binary32. This format uses a 24-bit significand and an exponent range from −126 to +127. Each number is represented as a sign and a 24-bit numeral with a “.” after its first digit, multiplied by two to the power of an exponent, such as +1.01001100001111100000000₂ • 2¹³.
In binary, 77,422,223 is 100100111010101111010001111₂. That is 27 bits, so it cannot be represented in the binary32 format. When it is converted to binary32, the conversion operation rounds it to the nearest representable value, which is 100100111010101111010010000₂ (77,422,224); that value has only 23 significant bits, counted from its first 1 to its last 1.
… and is there a way to compute when such an error will happen?
When the number is written in binary, if the number of bits from the first 1 to its last 1, including both of those, is more than 24, then it cannot be represented in the binary32 format.
Also, if the magnitude of the number is less than 2⁻¹²⁶, it cannot be represented in binary32 unless it is a multiple of 2⁻¹⁴⁹, including zero. Numbers in this range are subnormal: they have a fixed exponent of −126, and the lowest bit of the significand has a position value of 2⁻¹⁴⁹. And if the magnitude of the number is 2¹²⁸ or greater, it cannot be represented, unless it is +∞ or −∞.
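A quick way to apply these rules mechanically (a small Python sketch added here, not part of the original answer): round-trip the value through binary32 with the struct module, or count the bit span of the integer directly.

import struct

def fits_binary32(x):
    # Pack as IEEE-754 binary32 and unpack again; equality means no rounding occurred.
    return struct.unpack('f', struct.pack('f', x))[0] == x

def bit_span(n):
    # Bits from the first 1 to the last 1 of a positive integer, inclusive.
    trailing_zeros = (n & -n).bit_length() - 1
    return n.bit_length() - trailing_zeros

print(fits_binary32(77422223.0), bit_span(77422223))   # False 27 -> rounds to 77422224
print(fits_binary32(77422224.0), bit_span(77422224))   # True  23 -> fits in 24 significant bits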

Related

Excel internal expression of fractional format numbers

I have been trying to apply some conditional formatting to numbers which are formatted externally as fractions with the mask ???/???. In trying to test whether the fraction has a numerator of 1, I apply the formula =MOD(1/G62,1)<>0, which divides 1 by the fraction itself; that division ought to leave no remainder if the numerator is 1, so the MOD should return 0. If it returns something else, the numerator is not 1.
The rule is satisfied when it should not be. To test what is going on, I deconstruct the formula.
The fraction 1/28 is divided into 1 to give 28, and this is correctly displayed. I then populate another cell with the formula =MOD(H62,1) applied to that 28, and it gives 0, as it should. I do the same thing for 1/14 and the result is 1. In other words, the MOD of 14 and 1 is 1! When I look at the decimal representations of the two fractions (I imagine the fractions are actually representations of the decimal numbers, which themselves will be binary or hex numbers internally), I see the following.
1/28 0.0357142857142856
1/14 0.0714285714285716
When the decimal for 1/28 is subtracted from the decimal for 1/14, the result is 0.035714285714286.
As 1/28 can probably never be accurately represented in decimal, it looks like some rounding down has taken place. Most probably, when MOD is applied with 1 to the value obtained by dividing 1 by the stored representation of 1/14, that representation does not divide exactly into 1, and this discrepancy is disclosed by the subtraction above.
I am using excel 2016. Maybe this is no longer a problem.
What I am trying to do is test to see if the lowest numerator of a fractional number is 1. Perhaps there is another way to do this in Excel. If so, let me know.
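The failure mode described above (a remainder test tripping over binary rounding) is easy to reproduce outside Excel. The sketch below is only a Python illustration of the underlying idea, not Excel code: it recovers the nearest small-denominator fraction with fractions.Fraction.limit_denominator and checks whether its numerator is 1, which sidesteps the MOD comparison entirely. The denominator limit of 999 mirrors the ???/??? display mask and is an assumption.

from fractions import Fraction

def has_unit_numerator(x, max_den=999):
    # Return True if x is, to within rounding, a fraction 1/n with n <= max_den.
    frac = Fraction(x).limit_denominator(max_den)   # nearest fraction with a small denominator
    return frac.numerator == 1

print(has_unit_numerator(1/14))   # True  -> recovered as 1/14
print(has_unit_numerator(1/28))   # True  -> recovered as 1/28
print(has_unit_numerator(3/28))   # False -> recovered as 3/28
print((1/(1/14)) % 1)             # the MOD-style test; depending on rounding this can be ~0 or just under 1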

Restrict floats to allotted padding while parsing as string

I would like to print a series of floats with varying amounts of numbers to the left of the decimal place. I would like these numbers to exactly fill a padding with blank spaces, digits, and a decimal point.
Paraphrasing the data and code I have now
floats = [321.1234561, 21.1234561, 1.1234561, 0.123456, 0.02345, 0.0034, 0.0004567]
for number in floats:
    print('{:>8.6f}'.format(number))
This outputs
321.123456
21.123456
1.123456
0.123456
0.023450
0.003400
0.000457
I am looking for a way to print the following in a for loop, assuming I don't know in advance how many digits will fall to the left of the decimal place, and that the number of digits to the left never exceeds the padding, which is 8 for this example.
321.1234
21.12345
1.123456
0.123456
0.02345
0.0034
0.000457
Similar questions have been asked about printing floating points with a certain width, but the width they were talking about appeared to be the precision rather than the total number of characters used to print the number.
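To make that width-versus-precision distinction concrete (a small illustration added here, not part of the original question): in a spec like '{:>8.3f}', the 8 is the minimum total field width and the 3 is the count of digits after the decimal point.

print('{:>8.3f}'.format(3.14159))      # '   3.142' -> padded out to 8 characters
print('{:>8.3f}'.format(12345.14159))  # '12345.142' -> 9 characters; the width is only a minimum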
Edit:
I have added a number to the end of the list for the following reason. The use of the specifier 'g' with 7 significant figures was recommended by attdona. This prevents the padding from being exceeded for numbers greater than or equal to 1 but not for numbers less than 1 with precision greater than 6. Using {:>8.7g} instead gives
321.1235
21.12346
1.123456
0.123456
0.02345
0.0034
0.0004567
The only one that exceeds the padding is the newly added one.
Use the General format type specifier g:
'{:>8.7g}'.format(number)
reference: https://docs.python.org/3/library/string.html#format-specification-mini-language
Update: For small numbers this format fails to stay within the padding. In this case you may adopt a mixed approach, but keep in mind that very small numbers will round to zero:
for number in floats:
    fstr = '{:>8.7g}'.format(number)
    if len(fstr) > 8:
        fstr = '{:>8.6f}'.format(number)
    print(fstr)
for i in floats:
    # precision = total width (8) minus the digits left of the decimal point minus 1 for the point itself
    print('{:>8}'.format(f'{i:{8}.{8-len(str(int(i)))-1}f}'.rstrip('0')))
321.1235
21.12346
1.123456
0.123456
0.02345
0.0034
0.000457

How to obtain the hex value 0xffff corresponding to the decimal value -0.000061 in the table below?

Right at the beginning of the page The OpenType Font File you'll find a table with examples of the F2DOT14 format, a 16-bit signed fixed-point number whose low 14 bits hold the fraction.
I couldn't obtain the hex value 0xffff for the decimal -0.000061. By the way the mantissa -1 seems to be wrong and the value for the fraction should be 1/16384, instead of 16383/16384, unless I'm missing something related to the two's complement notation used to express a negative value in code.
The mantissa and fraction values listed are entirely correct: the F2DOT14 field encodes numbers as the arithmetic computation mantissa + fraction, not as "signed mantissa with unsigned concatenated fraction remainder".
As such, if you want -0.000061, you have to start with the signed integer -1 in the first two bits (11) and then add the positive value 16383/16384 in the last 14 bits (11111111111111), such that mantissa + fraction = -1 + 16383/16384 = -1/16384, which in turn is encoded as the 16-bit code 0xFFFF.
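A small Python sketch of that arithmetic (an illustration, not code from the spec): since mantissa + fraction is just the two's-complement 16-bit integer divided by 2^14, encoding and decoding reduce to a signed/unsigned reinterpretation and a division.

import struct

def f2dot14_to_float(raw16):
    # Reinterpret the 16-bit pattern as a signed integer, then divide by 2**14.
    (signed,) = struct.unpack('>h', struct.pack('>H', raw16 & 0xFFFF))
    return signed / 16384.0

def float_to_f2dot14(x):
    # Round to the nearest multiple of 1/16384 and reinterpret as an unsigned 16-bit pattern.
    (raw,) = struct.unpack('>H', struct.pack('>h', round(x * 16384)))
    return raw

print(hex(float_to_f2dot14(-1 / 16384)))   # 0xffff
print(f2dot14_to_float(0xFFFF))            # -6.103515625e-05, i.e. about -0.000061
print(f2dot14_to_float(0x7FFF))            # 1.99993896484375 = 1 + 16383/16384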

How to extract dyadic fraction from float

Now, floating-point and double-precision numbers, although they can approximate any sort of number (the same could be said of integers; floats are just more precise), are represented as binary fractions internally. For example, one tenth would be approximated as
0.00011001100110011... (the ... only goes to the computer's precision, not to infinity)
Now, any number in binary with finitely many bits has what is called a dyadic fraction representation in mathematics (this has nothing to do with p-adic numbers). This means you represent it as a fraction whose denominator is a power of 2. For example, let's say our computer approximates one tenth as 0.00011. The dyadic fraction for that is 3/32, or 3/(2^5), which is close to one tenth. Now for my technical question: what would be the simplest way to extract the dyadic fraction from a floating-point number?
Irrelevant Note: If you are wondering why I would want to do this, it is because I am creating a surreal number library in Haskell. Dyadic fractions are easily translated into Surreal numbers, which is why it is convenient that binary is easily translated into dyadic, (I'll sure have trouble with the rational numbers though.)
The decodeFloat function seems useful for this. Technically, you should also check that floatRadix is 2, but as far as I can see this is always the case in GHC.
Just be careful since it does not simplify mantissa and exponent. Here, if I evaluate decodeFloat (1.0 :: Double) I get an exponent of -52 and a mantissa of 2^52 which is not what I expected.
Also, toRational seems to generate a dyadic fraction. I am not sure this is always the case, though.
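For comparison (a Python aside added here, not part of the Haskell answer): the analogous operations in Python are float.as_integer_ratio(), which returns the exact dyadic fraction of a finite double, and math.frexp, which plays roughly the role of decodeFloat.

import math
from fractions import Fraction

x = 0.1
num, den = x.as_integer_ratio()   # exact dyadic fraction of the stored double
print(num, den)                   # 3602879701896397 36028797018963968 (denominator = 2**55)
print(Fraction(x))                # the same value as an exact Fraction
m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= abs(m) < 1
print(m, e)                       # 0.8 -3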
Hold your numbers in binary and convert to decimal for display.
Binary numbers are all dyadic. The number of digits after the binary point gives the power of two in the denominator, and the digits read without the point give the numerator. That's binary numbers for you.
There is an ideal representation for surreal numbers in binary. I call them "sinary". It's this:
0s is Not a number
1s is zero
10s is neg one
11s is one
100s is neg two
101s is neg half
110s is half
111s is two
... etc...
So you see that the standard binary count matches the surreal birth order of numeric values when evaluated in sinary. The way to determine the numeric value of a sinary is that the 1's are rights and the 0's are lefts. We start with ±1's and then 1/2, 1/4, 1/8, etc., with sign equal to + for 1 and − for 0.
ex: evaluating sinary
1011011s
-> is the 91st surreal number (because 64+16+8+2+1 = 91)
-> with a value of −0.28125, because...
1011011
NLRRLRR
+-++-++
+ 0 − 1 + 1/2 + 1/4 − 1/8 + 1/16 + 1/32
= 0 − 32/32 + 16/32 + 8/32 − 4/32 + 2/32 + 1/32
= − 9/32
The surreal numbers form a binary tree, so there is an ideal binary format matching their location on the tree according to the Left/Right pattern to reach the number. Assign 1 to right and 0 to left. Then the birth order of surreal number is equal to the binary count of this representation. ie: the 15th surreal number value represented in sinary is the 15th number representation in the standard binary count. The value of a sinary is the surreal label value. Strip the leading bit from the representation, and start adding +1's or -1's depending on if the number starts with 1 or 0 after the first one. Then once the bit flips, begin adding and subtracting halved values (1/2, 1/4, 1/8, etc) using + or - values according to the bit value 1/0.
I have tested this format and it seems to work well. And there are some other secrets... such as that the left and right options of any sinary representation are the same binary format with the tail clipped to the last 0 and the last 1 respectively. Conversion to a decimal or a dyadic fraction is NOT required in order to perform the recursive functions requested by Conway.
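Here is a small Python sketch of the evaluation rule just described (an illustration of the poster's "sinary" scheme, not an established format): strip the leading bit, add ±1 while the bits keep repeating the first remaining bit, then halve the step on every later bit, with + for 1 and − for 0.

def sinary_value(bits):
    # bits is a string such as '1011011'; it must start with the leading '1'.
    tail = bits[1:]
    value = 0.0
    i = 0
    # Phase 1: add +1 or -1 while the bits repeat the first bit after the leading 1.
    while i < len(tail) and tail[i] == tail[0]:
        value += 1.0 if tail[i] == '1' else -1.0
        i += 1
    # Phase 2: once the bit flips, add or subtract 1/2, 1/4, 1/8, ...
    step = 0.5
    while i < len(tail):
        value += step if tail[i] == '1' else -step
        step /= 2
        i += 1
    return value

print(sinary_value('1011011'))   # -0.28125, i.e. -9/32, matching the worked example above
print(sinary_value('110'))       # 0.5  ('110s is half')
print(sinary_value('101'))       # -0.5 ('101s is neg half')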

Conversion of numeric to string in MATLAB

Suppose I want to convert the number 0.011124325465476454 to a string in MATLAB.
If I hit
mat2str(0.011124325465476454,100)
I get 0.011124325465476453 which differs in the last digit.
If I hit num2str(0.011124325465476454,'%5.25f')
I get 0.0111243254654764530000000
which is padded with undesirable zeros and differs in the last digit (3 should be 4).
I need a way to convert numerics with random number of decimals to their EXACT string matches (no zeros padded, no final digit modification).
Is there such as way?
EDIT: Since I didn't have in mind the info about precision that Amro and nrz provided, I am adding some additional info about the problem. The numbers I actually need to convert come from a C++ program that outputs them to a txt file, and they are all of the C++ double type. [NOTE: The part that inputs the numbers from the txt file to MATLAB is not coded by me, and I'm not allowed to modify it to keep the numbers as strings without converting them to numerics. I only have access to this code's "output", which is the numerics I'd like to convert]. So far I haven't gotten numbers with more than 17 decimals (NOTE: consequently the example provided above, with 18 decimals, is not very indicative).
Now, if the number has 15 digits eg 0.280783055069002
then num2str(0.280783055069002,'%5.17f') or mat2str(0.280783055069002,17) returns
0.28078305506900197
which is not the exact number (see last digits).
But if I hit mat2str(0.280783055069002,15) I get
0.280783055069002 which is correct!!!
There are probably a million ways to "code around" the problem (e.g. create a routine that does the conversion), but isn't there some way, using MATLAB's standard built-ins, to get the desired result when I input a number with a random number of decimals (but no more than 17)?
My HPF toolbox also allows you to work with numbers of arbitrary precision in MATLAB.
In MATLAB, try this:
>> format long g
>> x = 0.280783054
x =
0.280783054
As you can see, MATLAB writes it out with the digits you have posed. But how does MATLAB really "feel" about that number? What does it store internally? See what sprintf says:
>> sprintf('%.60f',x)
ans =
0.280783053999999976380053112734458409249782562255859375000000
And this is what HPF sees, when it tries to extract that number from the double:
>> hpf(x,60)
ans =
0.280783053999999976380053112734458409249782562255859375000000
The fact is, almost all decimal numbers are NOT representable exactly in floating point arithmetic as a double. (0.5 or 0.375 are exceptions to that rule, for obvious reasons.)
However, when stored in a decimal form with 18 digits, we see that HPF did not need to store the number as a binary approximation to the decimal form.
x = hpf('0.280783054',[18 0])
x =
0.280783054
>> x.mantissa
ans =
2 8 0 7 8 3 0 5 4 0 0 0 0 0 0 0 0 0
What niels does not appreciate is that decimal numbers are not stored in decimal form as a double. For example what does 0.1 look like internally?
>> sprintf('%.60f',0.1)
ans =
0.100000000000000005551115123125782702118158340454101562500000
As you see, MATLAB does not store it as 0.1. In fact, MATLAB stores 0.1 as a binary number, here in effect...
1/16 + 1/32 + 1/256 + 1/512 + 1/4096 + 1/8192 + 1/65536 + ...
or if you prefer
2^-4 + 2^-5 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-16 + ...
To represent 0.1 exactly, this would take infinitely many such terms since 0.1 is a repeating number in binary. MATLAB stops at 52 bits. Just like 2/3 = 0.6666666666... as a decimal, 0.1 is stored only as an approximation as a double.
This is why your problem really is completely about precision and the binary form that a double comprises.
As a final edit after chat...
The point is that MATLAB uses a double to represent a number. So it will take in a number with up to 15 decimal digits and be able to spew them out with the proper format setting.
>> format long g
>> eps
ans =
2.22044604925031e-16
So for example...
>> x = 1.23456789012345
x =
1.23456789012345
And we see that MATLAB has gotten it right. But now add one more digit to the end.
>> x = 1.234567890123456
x =
1.23456789012346
In its full glory, look at x, as MATLAB sees it:
>> sprintf('%.60f',x)
ans =
1.234567890123456024298320699017494916915893554687500000000000
So always beware the last digit of any floating point number. MATLAB will try to round things intelligently, but 15 digits is just on the edge of where you are safe.
Is it necessary to use a tool like HPF or MP to solve such a problem? No, as long as you recognize the limitations of a double. However tools that offer arbitrary precision give you the ability to be more flexible when you need it. For example, HPF offers the use and control of guard digits down in that basement area. If you need them, they are there to save the digits you need from corruption.
You can use the Multiple Precision Toolkit from the MATLAB File Exchange for arbitrary-precision numbers. Floating-point numbers do not usually have an exact base-10 representation.
That's because your number is beyond the precision of the double numeric type (which gives you between 15 and 17 significant decimal digits). In your case, it is rounded to the nearest representable number as soon as the literal is evaluated.
If you need more precision than double-precision floating point provides, store the numbers in strings, or use arbitrary-precision libraries. For example, use the Symbolic Toolbox:
sym('0.0111243254654764549999999')
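As an aside (a Python sketch of the same 15-to-17-digit point added here, not MATLAB code): printing a double with 17 significant digits always produces a string that parses back to the same double, while 15 digits can lose the exact stored value.

x = 0.280783055069002            # a value from the question above
print(format(x, '.17g'), float(format(x, '.17g')) == x)   # 17 digits: always round-trips the double
print(format(x, '.15g'), float(format(x, '.15g')) == x)   # 15 digits: happens to round-trip here

y = 0.1 + 0.2                    # stored as 0.30000000000000004
print(format(y, '.15g'), float(format(y, '.15g')) == y)   # '0.3' -> False: 15 digits lose the stored value
print(format(y, '.17g'), float(format(y, '.17g')) == y)   # True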
You cannot get an EXACT string, since the number is stored in the double type, or even the long double type.
The number stored will be subtly more or less than the number you give.
A computer only knows the binary digits 0 and 1, and a number that terminates in one radix may not terminate in another. For example, 1/3 in radix 10 yields 0.33333333... (the ellipsis indicates that more digits would follow, all of them 3), and it will be truncated to something like 0.333333; in radix 3 it yields 0.1 exactly, no more and no less; in radix 2 it yields 0.01010101..., so it will likely be truncated to something like 0.01010101 in the computer, which is 85/256, slightly less than 1/3 because of the rounding, and the next time you fetch the number it won't be the one you wanted.
So from the beginning you should store the number as a string instead of a float type; otherwise it will lose precision.
Given this precision problem, MATLAB provides symbolic computation with arbitrary precision.
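As a closing aside (a Python sketch of the same phenomenon, not MATLAB code): the decimal and fractions modules expose the exact value the hardware actually stores for the question's literal, which is why no formatting option can print it back as exactly 0.011124325465476454.

from decimal import Decimal
from fractions import Fraction

x = 0.011124325465476454
print(Decimal(x))        # the exact decimal expansion of the stored double
                         # (it is not exactly 0.011124325465476454)
print(Fraction(x))       # the same value as an exact ratio with a power-of-two denominator
print(format(x, '.17g')) # 17 significant digits: enough to round-trip the double,
                         # but the string ends in ...453, as the question observed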
