Controlling Decimal Precision Overflow in Spark - apache-spark

We are using Spark 2.4.x.
We have a precision loss for one of our division operations (69362.86 / 111862.86). Both of these values are defined as decimal(38,3) on the table. When run through beeline, it produces 0.620070504187002, but when run through Spark it produces 0.6200710. As we can see, there is a decimal truncation in Spark's result.

Upon reading more, we stumbled upon the Spark story SPARK-29123. A comment there asks us to set the parameter spark.sql.decimalOperations.allowPrecisionLoss to false to avoid precision loss. However, there is another comment in the same story warning of NULL results when an exact representation of the decimal value is not possible. The Stack Overflow thread doesn't talk about the warning mentioned in the second comment. Setting spark.sql.decimalOperations.allowPrecisionLoss to false and running the computation (69362.86 / 111862.86) results in 0.620070504187002, which is good, but we are concerned about the warning in the second comment.
As per the rules laid out in the source code, the precision and scale of a division result are determined by the formula below.
Operation: e1 / e2
Result precision: p1 - s1 + s2 + max(6, s1 + p2 + 1)
Result scale: max(6, s1 + p2 + 1)
As per these rules, my precision is (38 - 3 + 3 + max(6, 3 + 38 + 1)) => 80 and my scale is max(6, 3 + 38 + 1) => 42. Since these exceed the default limit of 38 for both precision and scale, they are reduced to 38 and 6. One way to fix this decimal truncation is to use proper decimal precision and scale for the input columns. Based on the data in our table, we could easily set the input precision to 18 and scale to 5 for both of the columns involved in the division. In that case, the resultant precision and scale would be 38 and 24, which is good enough to represent our data without any noticeable truncation. But we can't do this manually for all the numeric columns in our space, so we are thinking of setting spark.sql.decimalOperations.allowPrecisionLoss to false at the cluster level. We are interested in learning in which situations the result will be NULL when this parameter is set to false, but would have been a value with precision loss had the parameter been left at its default.
Now my question is: in what situations will setting spark.sql.decimalOperations.allowPrecisionLoss to false result in NULL, where leaving it at the default (true) would have produced a value with precision loss? Can you provide any example that I can use to reproduce on my end? If we are not able to find such an example, can we set this parameter to false at the cluster level so that the arithmetic operations produce better results?
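For reference, the result type Spark infers for our division can be checked directly. A minimal sketch, assuming a PySpark shell where spark is available (the displayed value may differ slightly by Spark version):

df = spark.sql(
    "SELECT CAST(69362.86 AS DECIMAL(38,3)) / CAST(111862.86 AS DECIMAL(38,3)) AS r"
)
df.printSchema()          # with the default setting, r is decimal(38,6), per the rule above
df.show(truncate=False)   # the truncated quotient, ~0.620071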

I found some examples where setting spark.sql.decimalOperations.allowPrecisionLoss to true or false produces different results; two such examples are given below.
From this analysis, I understood that there is no tolerance on the fractional portion of the decimal value when this parameter is set to false, as the name suggests. However, if the scale of the resulting arithmetic operation exceeds the limit of 38, the scale is still reduced to 38. For the integer portion of the decimal value there is no such adjustment: if the integer part fits within (precision - scale) digits, the proper value is returned; otherwise the computation returns NULL.
With this, we have decided to leave this parameter at its default (true), to avoid situations where a decimal column that is not defined as tightly as possible around the actual values causes an arithmetic operation to return NULL.
Case 1:
Case 2:
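The difference can also be reproduced without the screenshots. A minimal sketch, assuming a PySpark shell where spark is available (exact behavior can vary by Spark version): a division whose unadjusted result scale overflows 38 and whose quotient needs at least one integer digit.

query = "SELECT CAST(10.0 AS DECIMAL(38,3)) / CAST(3.0 AS DECIMAL(38,3)) AS q"

# Default: precision loss allowed, the result type is adjusted down to decimal(38,6).
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "true")
spark.sql(query).show()   # 3.333333

# Precision loss disallowed: the result type becomes decimal(38,38), which has
# no room for the integer digit of 3.33..., so the result is NULL.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql(query).show()   # NULL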

Related

Why do VBA and Excel disagree on whether two cells are equal? [duplicate]

This question already has answers here:
VBA rounding problem
I am trying to compare two cells in a table:
The column "MR" is calculated using the formula =ABS([#Value]-A1) to determine the moving range of the column "Value". The values in the "Value" column are not rounded. The highlighted cells in the "MR" column (B3 and B4) are equal. I can enter the formula =B3=B4 into a cell and Excel says that B3 is equal to B4.
But when I compare them in VBA, VBA says that B4 is greater than B3. I can select cell B3 and enter the following into the Immediate Window ? selection.value = selection.offset(1).value. That statement evaluates to false.
I tried removing the absolute value from the formula thinking that might have had something to do with it, but VBA still says they aren't equal.
I tried adding another row where Value=1.78 so MR=0.18. Interestingly, the MR in the new row (B5) is equal to B3, but is not equal to B4.
I then tried increasing the decimal of A4 to match the other values, and now VBA says they are equal. But when I added the absolute value back into the formula, VBA again says they are not equal. I removed the absolute value again and now VBA is saying they are not equal.
Why is VBA telling me the cells are not equal when Excel says they are? How can I reliably handle this situation through VBA going forward?
The problem is that the IEEE 754 Standard for Floating-Point Arithmetic is imprecise by design. Virtually every programming language suffers because of this.
IEEE 754 is an extremely complex topic; even after you study it for months and believe you understand it fully, you are probably still fooling yourself!
Accurate floating point value comparisons are difficult and error prone. Think long and hard before attempting to compare floating point numbers!
The Excel program gets around the issue by cheating on the application side. VBA on the other hand follows the IEEE 754 spec for Double Precision (binary64) faithfully.
A Double value is represented in memory using 64 bits. These 64 bits are split into three distinct fields that are used in binary scientific notation:
The SIGN bit (1 bit to represent the sign of the value: pos/neg)
The EXPONENT (11 bits, biased in value by +1023)
The MANTISSA (53 bits, 52 bits stored + 1 bit implied)
The mantissa in this system leverages the fact that all normalized binary numbers begin with a leading 1, and so that 1 is not stored in the bit pattern. It is implied, increasing the mantissa precision to 53 bits for normal values.
The math works like this: Stored Value = SIGN VALUE * 2^UNBIASED EXPONENT * MANTISSA
Note that a stored value of 1 for the sign bit denotes a negative SIGN VALUE (-1) while a 0 denotes a positive SIGN VALUE (+1). The formula is SIGN VALUE = (-1) ^ (sign bit).
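As a quick illustration (a small Python sketch; any language that exposes the raw bits would do), the three fields can be pulled out of the stored 64-bit pattern like this:

import struct

def ieee754_fields(x: float):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]   # the raw 64-bit pattern
    sign = bits >> 63                                      # 1 bit
    exponent = ((bits >> 52) & 0x7FF) - 1023               # 11 bits, stored biased by +1023
    mantissa = bits & ((1 << 52) - 1)                      # 52 stored bits; leading 1 implied
    return sign, exponent, hex(mantissa)

print(ieee754_fields(1.24))   # (0, 0, '0x3d70a3d70a3d7')  ->  +1 * 2**0 * 1.<mantissa bits>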
The problem always boils down to the same thing.
The vast majority of real numbers cannot be expressed precisely within this system, which introduces small rounding errors that propagate like weeds.
It may help to think of this system as a grid of regularly spaced points. The system can represent ONLY the point-values and NONE of the real numbers between the points. All values assigned to a float will be rounded to one of the point-values (usually the closest point, but there are modes that enforce rounding upwards to the next highest point, or rounding downwards). Conducting any calculation on a floating-point value virtually guarantees the resulting value will require rounding.
To accent the obvious, there are an infinite number of real numbers between adjacent representable point-values on this grid; and all of them are rounded to the discrete grid-points.
To make matters worse, the gap size doubles at every power of two as the grid expands away from true zero (in both directions). For example, the gap length between grid points for values in the range of 2 to 4 is twice as large as it is for values in the range of 1 to 2. When representing values of large enough magnitude, the grid gap becomes massive, but closer to true zero it is minuscule.
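The doubling of the gap is easy to observe; a short Python sketch (math.ulp needs Python 3.9+):

import math

print(math.ulp(1.5))    # ~2.22e-16 : gap between neighbouring doubles in [1, 2)
print(math.ulp(3.0))    # ~4.44e-16 : twice as large in [2, 4)
print(math.ulp(1e16))   # 2.0       : at this magnitude not even every integer is representable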
With your example numbers...
1.24 is represented with the following binary:
Sign bit = 0
Exponent = 01111111111
Mantissa = 0011110101110000101000111101011100001010001111010111
The Hex pattern over the full 64 bits is precisely: 3FF3D70A3D70A3D7.
The precision is derived exclusively from the 53-bit mantissa and the exact decimal value from the binary is:
0.2399999999999999911182158029987476766109466552734375
In this instance a leading integer of 1 is implied by the hidden bit associated with the mantissa and so the complete decimal value is:
1.2399999999999999911182158029987476766109466552734375
Now notice that this is not precisely 1.24 and that is the entire problem.
Let's examine 1.42:
Sign bit = 0
Exponent = 01111111111
Mantissa = 0110101110000101000111101011100001010001111010111000
The Hex pattern over the full 64 bits is precisely: 3FF6B851EB851EB8.
With the implied 1 the complete decimal value is stored as:
1.4199999999999999289457264239899814128875732421875000
And again, not precisely 1.42.
Now, let's examine 1.6:
Sign bit = 0
Exponent = 01111111111
Mantissa = 1001100110011001100110011001100110011001100110011010
The Hex pattern over the full 64 bits is precisely: 3FF999999999999A.
Notice the repeating binary fraction in this case, which is truncated and rounded when the mantissa bits run out? Obviously 1.6 can never be represented precisely in binary (base 2), in the same way that 1/3 can never be accurately represented in decimal (base 10): 0.33333333333333333333333... ≠ 1/3.
With the implied 1 the complete decimal value is stored as:
1.6000000000000000888178419700125232338905334472656250
Not exactly 1.6 but closer than the others!
Now let's subtract the full stored double precision representations:
1.60 - 1.42 = 0.18000000000000015987
1.42 - 1.24 = 0.17999999999999993782
So as you can see, they are not equal at all.
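These expansions and the mismatch can be reproduced with nothing but the Python standard library, if you want to check them yourself (struct for the raw bit pattern, decimal for the exact stored value):

import struct
from decimal import Decimal

for x in (1.24, 1.42, 1.6):
    bits = struct.pack(">d", x).hex().upper()   # the 64-bit pattern, e.g. 3FF3D70A3D70A3D7
    print(x, bits, Decimal(x))                  # Decimal(x) is the exact value the double holds

# The two "equal-looking" differences are not the same double:
print(1.6 - 1.42)                     # 0.18000000000000016
print(1.42 - 1.24)                    # 0.17999999999999994
print((1.6 - 1.42) == (1.42 - 1.24))  # False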
The usual way to work around this is threshold testing, basically an inspection to see if two values are close enough... and that depends on you and your requirements. Be forewarned, effective threshold testing is way harder than it appears at first glance.
Here is a function to help you get started comparing two Double Precision numbers. It handles many situations well but not all because no function can.
Function Roughly(a#, b#, Optional within# = 0.00001) As Boolean
    Dim d#, x#, y#, z#
    Const TINY# = 1.17549435E-38 'SINGLE_MIN

    'Exactly equal (also catches both values being zero)
    If a = b Then Roughly = True: Exit Function

    x = Abs(a): y = Abs(b): d = Abs(a - b)

    'When both values are non-zero and their combined magnitude is not
    'vanishingly small, compare the difference relative to that magnitude.
    If a <> 0# Then
        If b <> 0# Then
            z = x + y
            If z > TINY Then
                Roughly = d / z < within
                Exit Function
            End If
        End If
    End If

    'Fallback for values at or near zero: absolute comparison scaled by TINY
    Roughly = d < within * TINY
End Function
The idea here is to have the function return True if the two Doubles are Roughly the same Within a certain margin:
MsgBox Roughly(3.14159, 3.141591) '<---displays True
The Within margin defaults to 0.00001, but you can pass whatever margin you need.
And while we know that:
MsgBox 1.60 - 1.42 = 1.42 - 1.24 '<---displays False
Consider the utility of this:
MsgBox Roughly(1.60 - 1.42, 1.42 - 1.24) '<---displays True
#chris neilsen linked to an interesting Microsoft page about Excel and IEEE 754.
And please read David Goldberg's seminal What Every Computer Scientist Should Know About Floating-Point Arithmetic. It changed the way I understood floating point numbers.

How to get equal results when doing arithmetic operations vba/excel [Double variable precision]

I am trying to get equal results from two identical calculations, one computed with a cell formula and the other with a UDF:
Function calc()
    Dim num As Double
    num = 30000000 * ((1 + 8 / 100 / 365) ^ 125)
    calc = num
End Function
The results of the calculation are different:
A1 = 30000000 * ((1 + 8 / 100 / 365) ^ 125) is not equal to A2 = calc()
We can test it with =IF(A1=A2, TRUE, FALSE), which returns FALSE. I understand that it has something to do with data types in VBA versus how the cell formula is evaluated. Do you know how to make calculations from VBA functions and Excel cell formulas render the same result?
So, the calculation in the Excel application and the calculation in VBA are presenting different outputs (what you've presented, with the format displaying 20 decimal places):
As such, you would see false when comparing them. You will need to round() or format() to truncate the calculation at a level that is appropriate. E.g.:
calc = round(num,4)
calc = format(num,"0.###0")
The reason this is occurring is because of the inherent math you're using, specifically, ((1 + 8 / 100 / 365) ^ 125), and how that is being truncated/rounded in the allocated memory to each part of the calculation, which differs in VBA and in-application Excel.
Edit: Final image with the VBA changes I'd suggested:
Explanation
The Double data type seems to have flaws in staying "precise" past the nth digit. This is stated in the documentation as well:
Precision. When you work with floating-point numbers, remember that they do not always have a precise representation in memory. This could lead to unexpected results from certain operations, such as value comparison and the Mod operator.
Troubleshooting
It seems that is the case here: I set up the value from the division in one cell and the division as a formula in another one. Although the Excel interface says there are no differences, when that value is computed again, the formula on the sheet seems to be more precise.
Actual result
Further thoughts
It seems this is limited by the data type itself. If precision is not an issue, you may try rounding the result. If it is critical to be as precise as possible, I would suggest connecting to something that is able to handle more precision; in this scenario, I would use xlwings to call Python.
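For illustration only, here is a rough sketch of that xlwings route (it assumes the xlwings add-in is installed and configured for UDFs; precise_calc is a name made up for this example), using Python's decimal module so every intermediate step carries far more digits than a Double:

import xlwings as xw
from decimal import Decimal, getcontext

@xw.func
def precise_calc():
    getcontext().prec = 50   # 50 significant digits, far beyond a Double's ~15-17
    rate = Decimal(8) / Decimal(100) / Decimal(365)
    result = Decimal(30000000) * (Decimal(1) + rate) ** 125
    # Excel still receives a Double in the end, but the intermediate math
    # was carried out with 50-digit precision before the final conversion.
    return float(result)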

What is the function of round() with this strange behavior? [duplicate]

This question already has answers here:
Python 3.x rounding behavior
I want to round some numbers to integers, but I came across the strange behavior of round(). For example:
round(2.1) = 2
round(2.5) = 2 #it rounds down (to floor)
round(2.7) = 3
It rounds differently with odd numbers, as follows:
round(5.1) = 5
round(5.5) = 6 #it rounds up (to ceil)
round(5.7) = 6
It rounds X.5 down (to the floor) when X is an even number, but rounds it up (to the ceiling) when X is odd.
I want to ask: what is the advantage of this kind of rounding? Where can I use it, and what is its purpose?
It looks like when the value is exactly halfway, it goes to the even option.
From the documentation https://docs.python.org/3/library/functions.html#round
Return number rounded to ndigits precision after the decimal point. If ndigits is omitted or is None, it returns the nearest integer to its input.
For the built-in types supporting round(), values are rounded to the closest multiple of 10 to the power minus ndigits; if two multiples are equally close, rounding is done toward the even choice (so, for example, both round(0.5) and round(-0.5) are 0, and round(1.5) is 2). Any integer value is valid for ndigits (positive, zero, or negative). The return value is an integer if ndigits is omitted or None. Otherwise the return value has the same type as number.
For a general Python object number, round delegates to number.__round__().
Note: The behavior of round() for floats can be surprising: for example, round(2.675, 2) gives 2.67 instead of the expected 2.68. This is not a bug: it’s a result of the fact that most decimal fractions can’t be represented exactly as a float. See Floating Point Arithmetic: Issues and Limitations for more information.
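If what you actually want is the "school" rule (halves always rounded up, away from zero) rather than banker's rounding, the decimal module lets you pick the rounding mode explicitly. A small sketch (round_half_up is a name chosen here, not a built-in):

from decimal import Decimal, ROUND_HALF_UP

def round_half_up(x, digits=0):
    exp = Decimal(1).scaleb(-digits)   # digits=0 -> Decimal('1'), digits=2 -> Decimal('0.01')
    return float(Decimal(str(x)).quantize(exp, rounding=ROUND_HALF_UP))

print(round(2.5), round_half_up(2.5))   # 2 3.0
print(round(5.5), round_half_up(5.5))   # 6 6.0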

NORMDIST function is not giving the correct output

I'm trying to use the NORMDIST function in Excel to create a bell curve, but the output is strange.
My mean is 0,0000583 and standard deviation is 0,0100323, so when I plug these into the function NORMDIST(0,0000583; 0,0000583; 0,0100323; FALSE) I expect to get something close to 0,5: since I'm using the same value as the mean, the probability of this value should be 50%. But the function gives an output of 39,77, which is clearly not correct.
Why is it like this?
A probability cannot have values greater than 1, but a density can.
The integral of a density function over its entire range is equal to 1, but the density can take values greater than one on specific intervals. For example, a uniform distribution on the interval [0, ½] has probability density f(x) = 2 for 0 ≤ x ≤ ½ and f(x) = 0 elsewhere. See below:
=NORMDIST(x, mean, dev, FALSE) returns the density function. Densities are probabilities per unit. It is almost the probability of a point, but over a very tiny interval (it is the derivative of the cumulative distribution at that point).
shg's answer here explains how to get a probability over a given interval with NORMDIST, and also on what occasions it can return a density greater than 1.
For a continuous variable, the probability of any particular value is zero, because there are an infinite number of values.
If you want to know the probability that a continuous random variable with a normal distribution falls in the range of a to b, use:
=NORMDIST(b, mean, dev, TRUE) - NORMDIST(a, mean, dev, TRUE)
The peak value of the density function occurs at the mean (i.e., =NORMDIST(mean, mean, dev, FALSE) ), and the value is:
=1/(SQRT(2*PI())*dev)
The peak value will exceed 1 when the deviation is less than 1 / sqrt(2*pi) ≈ 0.399, which was your case.
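That arithmetic is easy to verify; a small Python sketch with the asker's standard deviation:

import math

sigma = 0.0100323
peak = 1 / (math.sqrt(2 * math.pi) * sigma)
print(peak)                         # ~39.77 -- the value NORMDIST returned at the mean
print(1 / math.sqrt(2 * math.pi))   # ~0.3989 -- densities exceed 1 whenever sigma is below this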
This is an amazing answer on Cross Validated Stack Exchange (statistics) from a moderator (whuber), that addresses this issue very thoughtfully.
It is returning the probability density function whereas I think you want the cumulative distribution function (so try TRUE in place of FALSE) ref.

VBA rounding problem

I have this obscure rounding problem in VBA.
a = 61048.4599674847
b = 154553063.208822
c = a + b
debug.print c
Result:
154614111.66879
Here is the question: why did VBA round off variable c? I didn't issue any rounding function. The value I was expecting was 154614111.6687894847. Even if I round off or format variable c to 15 decimal places I still don't get my expected result.
Any explanation would be appreciated.
Edit:
Got the expected result using CDec. I read about this in Jonathan Allen's reply to Why does CLng produce different results?
Here is the result to the test:
a = cDec(61048.4599674847)
b = cDec(154553063.208822)
c = a + b
?c
154614111.6687894847
The reason is the limited precision that can be stored in a floating-point variable.
For a complete explanation you should read the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March 1991 issue of Computing Surveys.
Link to paper
In VBA the default floating-point type is Double, which is an IEEE 754 64-bit (8-byte) floating-point number.
There is another type available: Decimal, which is a 96-bit (12-byte) signed integer scaled by a variable power of 10.
Put simply, this provides floating-point numbers with up to 28 digits of precision.
To use in your example:
a = CDec(61048.4599674847)
b = CDec(154553063.208822)
c = a + b
debug.print c
Result:
154614111.6687894847
It's not obscure, but it's not necessarily obvious.
I think you've sort of answered it yourself, but the basic problem is one of the "size" of the values, that is, how much data can be stored in a variable of a given type.
If (and this is very crude) you count the number of digits in each of the numbers in your first example, you will see that you have 15. So whilst the range of values that a float (the default type, Double) can represent is huge, the precision is limited to about 15 digits (I'm sure someone will be along to correct this, I'll tick the wiki box...).
So when you add the two numbers together, it loses the least significant digits in order to remain within the allowable precision for a Double.
By doing a CDec you're converting to a different type of variable (Decimal) that is capable of greater precision.
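The same effect is easy to reproduce outside VBA. For example, a short Python sketch with the question's numbers, where a double drops the trailing digits of the exact sum but a decimal type keeps them:

from decimal import Decimal

a, b = 61048.4599674847, 154553063.208822
print(a + b)    # double arithmetic: the ...4847 tail is lost
print(Decimal("61048.4599674847") + Decimal("154553063.208822"))   # 154614111.6687894847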
