how to keep fraction value with fixed precision in spark sql - apache-spark

I am stuck on a very strange problem in a currency calculation. I need to keep the digits after the decimal point, but somehow Spark is showing a floored or rounded value.
For example, my expected output is 871.25, but Spark's output is 871.00.
I am getting a double value from an intermediate table, so I cast the double to a decimal with a fixed precision of 2 digits. My test code is:
spark.sql("select cast(SUM(TRANAMT) as DECIMAL(20,2)) as Expr1 from CMSDLG").show()
I am not sure which part I need to focus on here. Kindly help me. I am using PySpark 2.0.

You should sum after casting:
spark.sql("select SUM(cast(TRANAMT as DECIMAL(20,2)) ) as Expr1 from CMSDLG").show()
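The ordering matters because the double being summed is a binary float. A quick sketch in plain Python (illustrative only, not Spark itself) shows why converting each value to a decimal before summing preserves the cents:

```python
from decimal import Decimal

# Ten dimes: binary doubles cannot represent 0.10 exactly, so the
# error accumulates when you sum first and only fix the precision later.
dimes = ["0.10"] * 10

float_total = sum(float(d) for d in dimes)    # 0.9999999999999999
exact_total = sum(Decimal(d) for d in dimes)  # Decimal('1.00')
```

Casting each TRANAMT to DECIMAL(20,2) inside the SUM makes Spark accumulate in exact decimal arithmetic instead of binary floating point.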

Use this query instead to cast the varchar into a decimal and then round the sum to 2 decimal places:
spark.sql("select round(sum(cast(TRANAMT as decimal(65,30))),2) as Expr1 from CMSDLG").show()

Related

Controlling Decimal Precision Overflow in Spark

We are using Spark 2.4.x.
We have precision loss in one of our division operations: (69362.86 / 111862.86). Both values are defined as decimal(38,3) on the table. When run through Beeline, it produces 0.620070504187002, but when run through Spark it produces 0.6200710. As you can see, Spark's result is truncated.

Reading further, we stumbled upon the Spark ticket SPARK-29123. A comment there asks us to set the parameter spark.sql.decimalOperations.allowPrecisionLoss to false to avoid precision loss. However, another comment in the same ticket warns that the result is NULL when an exact representation of the decimal value is not possible. The Stack Overflow thread does not discuss the warning from that second comment. Setting spark.sql.decimalOperations.allowPrecisionLoss to false and running the computation (69362.86 / 111862.86) gives 0.620070504187002, which is good, but we are concerned about the warning.
As per the rules laid out in the source code, the precision and scale of a division result are determined by the following formula:

Operation: e1 / e2
Result precision: p1 - s1 + s2 + max(6, s1 + p2 + 1)
Result scale: max(6, s1 + p2 + 1)
As per these rules, my precision is (38 - 3 + 3 + max(6, 3 + 38 + 1)) => 80 and my scale is max(6, 3 + 38 + 1) => 42. Since these exceed the default limit of 38 for both precision and scale, they are reduced to 38 and 6. One way to fix this decimal truncation is to use proper decimal precision and scale on the input columns. Based on the data in our table, we could easily set the input precision to 18 and the scale to 5 for both columns involved in the division. In that case the resultant precision and scale would be 38 and 24, which is good enough to represent our data without any noticeable truncation. But we can't do this manually for all the numeric columns in our space, so we are thinking of setting spark.sql.decimalOperations.allowPrecisionLoss to false at the cluster level. We are interested in learning in which situations the result will be NULL when this parameter is set to false, but would have been a value with precision loss had the parameter been left at its default.
Now my question is: in what situations does setting spark.sql.decimalOperations.allowPrecisionLoss to false result in NULL, where leaving it at the default (true) gives a value with precision loss? Can you provide an example that I can reproduce on my end? If no such example can be found, can we set this parameter to false at the cluster level so that arithmetic operations produce better results?
I found some examples where setting spark.sql.decimalOperations.allowPrecisionLoss to true or false produces different results; two such examples are given below.
From this analysis, I understood that there is no tolerance on the fractional portion of the decimal value when this parameter is set to false, as the name suggests. However, if the scale of the result of the arithmetic operation exceeds the limit of 38, the scale is reduced to 38. For the integer portion of the decimal value there is no such check: if the integer value fits within (precision - scale) digits, the proper value is returned; otherwise the computation returns NULL.
With this, we have decided to leave the parameter at its default (true), to avoid a situation where a decimal column is not defined as tightly around the actual values as possible and the arithmetic operation therefore results in NULL.
Case 1:
Case 2:
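The precision/scale arithmetic above can be sketched in Python. The division formula is the one quoted from the source code; the reduction step is an assumption reconstructed from the behavior described in this answer (a scale floor of 6 when allowPrecisionLoss is true, simple capping at 38 when it is false):

```python
MAX_PRECISION = 38
MIN_ADJUSTED_SCALE = 6  # minimum scale kept when reducing with precision loss

def div_result_type(p1, s1, p2, s2):
    """Unadjusted result type of decimal(p1,s1) / decimal(p2,s2)."""
    scale = max(6, s1 + p2 + 1)
    precision = p1 - s1 + s2 + scale
    return precision, scale

def bounded(precision, scale, allow_precision_loss=True):
    """Fit a result type into DECIMAL(38, _), per the rules described above."""
    if precision <= MAX_PRECISION:
        return precision, scale
    if not allow_precision_loss:
        # no tolerance on the fraction: precision and scale are simply capped
        return MAX_PRECISION, min(scale, MAX_PRECISION)
    int_digits = precision - scale
    adjusted_scale = max(MAX_PRECISION - int_digits, min(scale, MIN_ADJUSTED_SCALE))
    return MAX_PRECISION, adjusted_scale

# decimal(38,3) / decimal(38,3): raw type (80, 42), reduced to (38, 6)
print(div_result_type(38, 3, 38, 3), bounded(80, 42))
```

With inputs of decimal(18,5), the raw result type is (42, 24), and capping without precision loss gives the (38, 24) mentioned above.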

extraneous digits formatting within dataframe

I am running into a formatting / precision issue which I'm hoping to control.
I obtain a list of numbers such as:
x = [0.009947, 0.009447, 0.008947]
The finished product I'm after is a DataFrame with a column whose value is this list multiplied by 100 and rounded to 3 decimal places, e.g.
[0.995, 0.945, 0.895]
I proceed as follows:
x = 100*np.around([0.009947, 0.009447, 0.008947],5)
this displays as
array([0.995, 0.945, 0.895])
When I build the DataFrame:
pd.DataFrame({'test':[x]})
I get for the value in the 'test' column:
[0.9950000000000001, 0.9450000000000001, 0.895]
This does not happen in my other examples and I'm not sure how to control the behavior. I'd appreciate any suggestions.
This is a general issue with how floating-point numbers are represented in computers; see the explanation of floating-point arithmetic in the Python docs.
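A workaround in this particular case (a sketch with plain Python round; np.around behaves the same way) is to multiply first and round last, so that rounding is the final operation applied:

```python
x = [0.009947, 0.009447, 0.008947]

# Rounding before multiplying can reintroduce float error,
# e.g. 100 * 0.00995 may display as 0.9950000000000001.
before = [100 * round(v, 5) for v in x]

# Multiplying first and rounding last yields the intended values:
after = [round(100 * v, 3) for v in x]
print(after)  # [0.995, 0.945, 0.895]
```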

How to add trailing zeros to a pandas dataframe column?

I have a pandas dataframe column that I would like to be formatted to two decimal places.
Such that:
10.1
Appears as:
10.10
How can I do that? I have already tried rounding.
This can be accomplished by mapping a string format object to the column of floats:
df.colName.map('{:.2f}'.format)
(Credit to exp1orer)
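As a minimal sketch (colName and the values are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"colName": [10.1, 3.0, 2.25]})
# Mapping a format string turns the floats into fixed-width strings
formatted = df.colName.map("{:.2f}".format)
print(formatted.tolist())  # ['10.10', '3.00', '2.25']
```

Note that the result is a column of strings, so this suits display or export rather than further arithmetic.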
You can use:
pd.options.display.float_format = '{:,.2f}'.format
Note that this only changes how the numbers are displayed: every float in your DataFrames will show two decimals.
To go back to normal:
pd.reset_option('display.float_format')
From pyformat.info
Padding numbers
For floating points the padding value represents the length of the complete output. In the example below we want our output to have at least 6 characters with 2 after the decimal point.
'{:06.2f}'.format(3.141592653589793)
The :06 is the length of your output regardless of how many digits are in your input. The .2 indicates you want 2 places after the decimal point. The f indicates you want a float output.
Output
003.14
If you are using Python 3.6 or later, you can use f-strings. Check out this other answer: https://stackoverflow.com/a/45310389/12229158
>>> a = 10.1234
>>> f'{a:.2f}'
'10.12'

Calculations being rounded SQL Server 2012 [duplicate]

This question already has answers here:
How to get a float result by dividing two integer values using T-SQL?
(10 answers)
Closed 7 years ago.
I am trying to calculate some performance metrics as [RATE] in SQL but no matter what I do I am only getting integer values in return.
In my dataset I have a Numerator and a Denominator both stored as integers that are broken down into groups and subsets. The rate is calculated slightly differently depending on the group and subset. For group 1 the calculation is simply N/D. For group 2 the calculation is (N*1000)/D with the exception of 1 subset which is calculated (N*10000)/D.
I wrote the query as:
SELECT [Group]
      ,[SubSet]
      ,[Numerator] N
      ,[Denominator] D
      ,CASE WHEN [D] = 0 THEN NULL
            WHEN [Group] = 'G1' THEN [N]/[D]
            WHEN [SubSet] = 'S2' THEN ([N]*10000)/[D]
            WHEN [SubSet] NOT LIKE 'S2%' AND [Group] NOT LIKE 'G1%' THEN ([N]*1000)/[D]
       END AS [RATE]
No matter what I do, the output values are integers. I tried formatting RATE as varchar, decimal, and float with no success. I tried changing N's and D's types to varchar, decimal, and float as well. I tried changing the equations from (N*1000)/D to (N/D)*1000, but still no effect.
What am I missing/doing wrong?
The problem you are having is that SQL Server is doing integer division, which only returns whole numbers. To get a decimal return value, at least one operand must be a decimal.
Try this:
(CAST([N] as decimal(12,6))/[D]) * 1000
Adjust decimal(12,6) based on the precision you are expecting. 12,6 will return a decimal with 6 digits after the decimal point. If you wanted only 2 decimal places use 16,2.
If you then want to round the calculated value you will need to make use of the ROUND function in SQL.
Round to the second decimal place:
ROUND((CAST([N] as decimal(12,6))/[D]) * 1000, 2)
You need to use CAST:
CAST ((N*1000) AS FLOAT) / D
Hope this helps.
SELECT (n * 1000.0) will do it.
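The integer-division behavior the answers describe can be mimicked outside SQL Server with Python, where // is integer division (an illustration of the rule, not T-SQL; the values are hypothetical):

```python
n, d = 10, 3  # hypothetical Numerator and Denominator values

# int / int in T-SQL truncates, like Python's integer division:
truncated = (n * 1000) // d        # 3333

# Promoting one operand to a non-integer type first keeps the fraction,
# and T-SQL's ROUND(..., 2) corresponds to round(..., 2):
rate = round((n * 1000) / d, 2)    # 3333.33
```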

SELECT SUM in Android does not return decimals

I have this query that calculates a sum, but it does not give me decimal numbers. How can I get the decimals?
Cursor cursor = dataBase.rawQuery(
"SELECT ROUND(SUM(ore),2) AS totore FROM "+DbHelper.TURNI_TABLE+" WHERE MESE = 'Gennaio'",null);
SQLite does not have a decimal data type with a fixed number of digits.
ROUND returns a plain floating-point number, i.e., ROUND(42) will return 42.0; it is your program's responsibility to format this number as you want.
If the only reason you're using ROUND is that you want to format the number, you should not do this in the database but only in your program.
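This can be reproduced with Python's built-in sqlite3 module (the table and column names follow the question; the rows are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE turni (ore REAL, mese TEXT)")
con.executemany("INSERT INTO turni VALUES (?, ?)",
                [(7.5, "Gennaio"), (8.25, "Gennaio"), (6.0, "Febbraio")])

# ROUND returns a plain floating-point number...
(totore,) = con.execute(
    "SELECT ROUND(SUM(ore), 2) AS totore FROM turni WHERE mese = 'Gennaio'"
).fetchone()

# ...so fixed-digit formatting belongs in the program, not in the query:
formatted = f"{totore:.2f}"
print(formatted)  # 15.75
```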
