Aggregate function with Expr in PySpark 3.0.3 - apache-spark

The following code works well with PySpark 3.2.1
df.withColumn(
    "total_amount",
    f.aggregate(f.col("taxes"), f.lit(0.00), lambda acc, x: acc + x["amount"]),
)
I've downgraded to PySpark 3.0.3. How can I change the above code to something like this:
df.withColumn(
    "total_amount",
    # f.aggregate(f.col("taxes"), f.lit(0.00), lambda acc, x: acc + x["amount"]),
    f.lit(expr("aggregate(taxes,0,(acc,x)->acc+x['amount'])"))
)
x['amount'] does not work in our case! Is there something wrong with the expression, or must I change taxes to a list of numbers?
2nd case
df.withColumn(
    "total_amount_2",
    f.aggregate(
        f.filter(
            "lines",
            lambda x: (
                x["id"].isNotNull()
                & (
                    x["code"].isin(["CODE1", "CODE2"]) == False
                )
            ),
        ),
        f.lit(0.00),
        lambda acc, x: acc + x["amount"],
    ),
)
How can I refactor these cases using the Spark SQL expr function?

Try the following. I'm certain that one can access struct fields using dot notation too. I'm just not sure about the data type you used (0.00), as it should be of the same data type as before. I have added the D suffix, which indicates a double literal.
df.withColumn(
    "total_amount",
    F.expr("aggregate(taxes, 0D, (acc, x) -> acc + x.amount)")
)
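As a quick sanity check on a tiny DataFrame (a sketch; the schema below assumes taxes is an array<struct<amount:double>>, which may differ from your actual data):
from pyspark.sql import functions as F

# hypothetical input: one row with two tax structs of 1.5 and 2.5
df = spark.createDataFrame(
    [([(1.5,), (2.5,)],)],
    'taxes: array<struct<amount:double>>'
)
df.withColumn(
    "total_amount",
    F.expr("aggregate(taxes, 0D, (acc, x) -> acc + x.amount)")
).show()
# total_amount comes out as 4.0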
Regarding the 2nd case, review the following test case. I have tested it using Spark 3.0.3.
Input df:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [([('1', 'CODE3', 3.0), ('2', 'CODE3', 3.0)],)],
    'lines: array<struct<id:string,code:string,amount:double>>'
)
Script:
df = df.withColumn(
    "total_amount_2",
    F.expr("""
        aggregate(
            filter(
                lines,
                x -> x.id is not null and !x.code in ('CODE1', 'CODE2')
            ),
            0D,
            (acc, x) -> acc + x.amount
        )
    """)
)
df.show(truncate=0)
# +----------------------------------+--------------+
# |lines |total_amount_2|
# +----------------------------------+--------------+
# |[[1, CODE3, 3.0], [2, CODE3, 3.0]]|6.0 |
# +----------------------------------+--------------+

Related

Sum of PySpark array using SQL function AGGREGATE produces incorrect result when casting as float

The following code produces results that are slightly different from the correct result, and I was wondering if anyone can help identify why.
import pyspark.sql.functions as f
import pyspark.sql.types as t
import pandas as pd
v1 = [24005, 24806874, 114187]
v2 = [24005, 24806872, 114189]
df = pd.DataFrame({"index": range(2), "arr": [v1, v2]})
schema = t.StructType(
    [
        t.StructField("index", t.IntegerType(), True),
        t.StructField("arr", t.ArrayType(t.LongType(), True)),
    ]
)
df = spark.createDataFrame(df, schema=schema)
df = df.withColumn(
    "sum",
    f.expr("aggregate(arr, cast(0 as float), (acc, x) -> acc + x)")
)
df.show(truncate=False)
# Output
#+-----+-------------------------+-----------+
#|index|arr |sum |
#+-----+-------------------------+-----------+
#|0 |[24005, 24806874, 114187]|2.4945068E7|
#|1 |[24005, 24806872, 114189]|2.4945064E7|
#+-----+-------------------------+-----------+
However, updating float to double gives the correct result:
# Output
#+-----+-------------------------+-----------+
#|index|arr |sum |
#+-----+-------------------------+-----------+
#|0 |[24005, 24806874, 114187]|2.4945066E7|
#|1 |[24005, 24806872, 114189]|2.4945066E7|
#+-----+-------------------------+-----------+
I would love to hear your thoughts!
A 32-bit float carries only about 6-7 significant decimal digits of precision. Your sums are 8-digit numbers, which is beyond what a float can track exactly, so each intermediate addition in the aggregate gets rounded to the nearest representable float and the accuracy of the result cannot be guaranteed. Casting the accumulator to double avoids the problem.
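The rounding can be reproduced outside Spark with NumPy (a sketch; np.float32 is used here to imitate the 32-bit float accumulator, on the assumption that the addition is carried out in single precision when the accumulator is a float):
import numpy as np

v1 = [24005, 24806874, 114187]   # exact sum: 24945066

acc32 = np.float32(0.0)
for x in v1:
    # round each intermediate sum to the nearest float32, as a float accumulator would
    acc32 = np.float32(float(acc32) + x)
print(acc32)           # 24945068.0 -- off by 2 because the intermediates were rounded
print(float(sum(v1)))  # 24945066.0 -- a 64-bit double holds this sum exactly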

Combine pandas df columns and rows to get pairs in a string - python

Here is my df:
d = {'Equipment': ['A','B','C'], 'Downtime': [3,8, 4]}
df = pd.DataFrame(data=d)
I would like to create a string that looks like (A,3);(B,8);(C,4)
Even better would be:
(Equipment:A,Downtime:3);(Equipment:B,Downtime:8);(Equipment:C,Downtime:4)
Solution using df.apply and str.join:
print(
    ";".join(
        df.apply(
            lambda x: "("
            + ",".join(
                "{}:{}".format(name, value) for name, value in zip(x.index, x)
            )
            + ")",
            axis=1,
        ).values
    )
)
Prints:
(Equipment:A,Downtime:3);(Equipment:B,Downtime:8);(Equipment:C,Downtime:4)
EDIT: To get the output as a DataFrame:
x = ";".join(
df.apply(
lambda x: "("
+ ",".join(
"{}:{}".format(name, value) for name, value in zip(x.index, x)
)
+ ")",
axis=1,
).values
)
df_out = pd.DataFrame([{"Name": x}])
print(df_out)
Prints:
Name
0 (Equipment:A,Downtime:3);(Equipment:B,Downtime...
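An alternative, somewhat shorter way to build the same string is to iterate over the rows with itertuples (a sketch; it hard-codes the two column names rather than deriving them from the frame):
import pandas as pd

df = pd.DataFrame({'Equipment': ['A', 'B', 'C'], 'Downtime': [3, 8, 4]})
# format each row as (Equipment:<value>,Downtime:<value>) and join with ';'
s = ";".join(
    "(Equipment:{},Downtime:{})".format(r.Equipment, r.Downtime)
    for r in df.itertuples(index=False)
)
print(s)  # (Equipment:A,Downtime:3);(Equipment:B,Downtime:8);(Equipment:C,Downtime:4)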

PySpark UDF with multiple arguments returns null

I have a PySpark DataFrame with two columns, A and B (both of type double), whose values are either 0.0 or 1.0.
I am trying to add a new column, which is the sum of those two.
I followed examples in Pyspark: Pass multiple columns in UDF
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, StringType
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
This shows a Series of NULLs instead of the results I expect.
I tried each of the following to see if there's an issue with data types:
sum_cols = F.udf(lambda x: x[0], IntegerType())
sum_cols = F.udf(lambda x: int(x[0]), IntegerType())
but I'm still getting nulls.
I tried removing the array:
sum_cols = F.udf(lambda x: x, IntegerType())
df_with_sum = df.withColumn('SUM_COL',sum_cols(df.A))
This works fine and shows 0/1
I tried removing the UDF, but leaving the array:
df_with_sum = df.withColumn('SUM_COL', F.array('A','B'))
This works fine and shows a series of arrays of [0.0/1.0, 0.0/1.0]
So, array works fine, UDF works fine, it is just when I try to pass an array to UDF that things break down. What am I doing wrong?
The problem is that you are trying to return a double from a function that is declared to return an integer. The types don't match, and PySpark by default silently returns NULL when the cast fails:
df_with_doubles = spark.createDataFrame([(1.0,1.0), (2.0,2.0)], ['A', 'B'])
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df_with_doubles.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
You get:
SUM_COL
0 None
1 None
However, if you do:
df_with_integers = spark.createDataFrame([(1,1), (2,2)], ['A', 'B'])
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df_with_integers.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
You get:
SUM_COL
0 2
1 4
So, either cast your columns to IntegerType beforehand (or cast them in the UDF), or change the return type of the UDF to DoubleType.
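For example, keeping the columns as doubles and simply declaring the UDF's return type as DoubleType (a sketch of the second option):
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

df_with_doubles = spark.createDataFrame([(1.0, 1.0), (2.0, 2.0)], ['A', 'B'])
# the return type now matches what the lambda actually produces (a double)
sum_cols = F.udf(lambda x: x[0] + x[1], DoubleType())
df_with_sum = df_with_doubles.withColumn('SUM_COL', sum_cols(F.array('A', 'B')))
df_with_sum.select('SUM_COL').toPandas()
#    SUM_COL
# 0      2.0
# 1      4.0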

spark DataFrame using explode and describe functions

I have a Spark DataFrame with 3 columns (id: Int, x_axis: Array[Int], y_axis: Array[Int]) and some sample data below:
I want to get basic statistics of the y_axis column for each row in the dataframe. The output would be something like:
I have tried explode and then describe, but could not figure out how to get the expected output.
Any help or reference is much appreciated.
As you suggest, you could explode the Y column and then use a window over id to compute all the statistics you are interested in. However, since you would have to re-aggregate your data afterwards, you would generate a huge intermediate result for nothing.
Spark does not have a lot of predefined functions for arrays. Therefore the easiest way to achieve what you want is probably a UDF:
val extractFeatures = udf( (x: Seq[Int]) => {
    val mean = x.sum.toDouble / x.size
    val variance = x.map(i => i * i).sum.toDouble / x.size - mean * mean
    val std = scala.math.sqrt(variance)
    Map("count" -> x.size.toDouble,
        "mean" -> mean,
        "var" -> variance,
        "std" -> std,
        "min" -> x.min.toDouble,
        "max" -> x.max.toDouble)
})
val df = sc
    .parallelize(Seq((1, Seq(1,2,3,4,5)), (2, Seq(1,2,1,4))))
    .toDF("id", "y")
    .withColumn("described_y", extractFeatures('y))
df.show(false)
+---+---------------+---------------------------------------------------------------------------------------------+
|id |y |described_y |
+---+---------------+---------------------------------------------------------------------------------------------+
|1 |[1, 2, 3, 4, 5]|Map(count -> 5.0, mean -> 3.0, min -> 1.0, std -> 1.4142135623730951, max -> 5.0, var -> 2.0)|
|2 |[1, 2, 1, 4] |Map(count -> 4.0, mean -> 2.0, min -> 1.0, std -> 1.224744871391589, max -> 4.0, var -> 1.5) |
+---+---------------+---------------------------------------------------------------------------------------------+
And btw, the stddev you calculated is actually the variance. You need to take the square root to get the standard deviation.
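On Spark 2.4+ much of this can also be done without a UDF, using the built-in higher-order and array functions (a PySpark sketch, not the answer's original approach; it produces separate count/mean/min/max columns instead of a map):
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, [1, 2, 3, 4, 5]), (2, [1, 2, 1, 4])], ['id', 'y'])
df.select(
    'id',
    F.size('y').alias('count'),
    # sum the array with a SQL aggregate expression, then divide by its size
    (F.expr('aggregate(y, 0D, (acc, i) -> acc + i)') / F.size('y')).alias('mean'),
    F.array_min('y').alias('min'),
    F.array_max('y').alias('max'),
).show()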

PySpark flip key/value

I am trying to flip the key and value in a dataset in order to do sorting. However, the map function raises an invalid syntax error:
rdd = clean_headers_rdd.rdd\
.filter(lambda x: x['date'].year == 2016)\
.map(lambda x: (x['user_id'], 1)).reduceByKey(lambda x, y: x + y)\
.map(lambda (x, y): (y, x)).sortByKey(ascending = False)
This is PEP 3113 -- Removal of Tuple Parameter Unpacking: lambda (x, y): ... is no longer valid syntax in Python 3.
Method recommended by the transition plan:
rdd.map(lambda x_y: (x_y[1], x_y[0]))
Shortcut with the operator module:
from operator import itemgetter
rdd.map(itemgetter(1, 0))
Slicing:
rdd.map(lambda x: x[::-1])
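A quick check of the three variants on a toy pair RDD (a sketch; sc is assumed to be an existing SparkContext):
from operator import itemgetter

rdd = sc.parallelize([('alice', 3), ('bob', 5)])
rdd.map(lambda x_y: (x_y[1], x_y[0])).sortByKey(ascending=False).collect()
rdd.map(itemgetter(1, 0)).sortByKey(ascending=False).collect()
rdd.map(lambda x: x[::-1]).sortByKey(ascending=False).collect()
# each returns [(5, 'bob'), (3, 'alice')]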
