Python Spark Cumulative Sum by Group Using DataFrame - apache-spark

How do I compute a cumulative sum per group in PySpark, specifically using the DataFrame API?
With an example dataset as follows:
df = sqlContext.createDataFrame([(1, 2, "a"), (3, 2, "a"), (1, 3, "b"), (2, 2, "a"), (2, 3, "b")],
                                ["time", "value", "class"])
+----+-----+-----+
|time|value|class|
+----+-----+-----+
| 1| 2| a|
| 3| 2| a|
| 1| 3| b|
| 2| 2| a|
| 2| 3| b|
+----+-----+-----+
I would like to add a cumulative sum column of value for each class grouping over the (ordered) time variable.

This can be done using a combination of a window function and the Window.unboundedPreceding value in the window's range as follows:
from pyspark.sql import Window
from pyspark.sql import functions as F
windowval = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
| 1| 3| b| 3|
| 2| 3| b| 6|
| 1| 2| a| 2|
| 2| 2| a| 4|
| 3| 2| a| 6|
+----+-----+-----+-------+

To update the previous answer: the correct and precise way to do this is with rowsBetween rather than rangeBetween, because rangeBetween treats rows with the same time value as peers and gives them all the same cumulative total, whereas rowsBetween produces a strict row-by-row running sum:
from pyspark.sql import Window
from pyspark.sql import functions as F
windowval = (Window.partitionBy('class').orderBy('time')
             .rowsBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
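To see the difference concretely, here is a minimal sketch with duplicate time values (hypothetical data, not from the question; it assumes a SparkSession named spark is available). With the question's data the two frames give identical results because time is unique within each class.
from pyspark.sql import Window
from pyspark.sql import functions as F

dup = spark.createDataFrame(
    [(1, 2, "a"), (1, 5, "a"), (2, 1, "a")],
    ["time", "value", "class"])

range_w = Window.partitionBy("class").orderBy("time").rangeBetween(Window.unboundedPreceding, 0)
rows_w = Window.partitionBy("class").orderBy("time").rowsBetween(Window.unboundedPreceding, 0)

dup.select(
    "time", "value",
    F.sum("value").over(range_w).alias("range_cum_sum"),  # both time == 1 rows get 7
    F.sum("value").over(rows_w).alias("rows_cum_sum")     # e.g. 2, 7, 8 (order among tied times is not guaranteed)
).show()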

I have tried this way and it worked for me.
from pyspark.sql import Window
from pyspark.sql import functions as f
import sys

cum_sum = df.withColumn('cumsum', f.sum('value').over(
    Window.partitionBy('class').orderBy('time').rowsBetween(-sys.maxsize, 0)))
cum_sum.show()
(Here -sys.maxsize is an older idiom for the unbounded preceding frame that Window.unboundedPreceding expresses above.)

I created the following function for my own use (kolang/column_functions/cumulative_sum):
from typing import List, Union

from pyspark.sql import Column, Window
from pyspark.sql import functions as F


def cumulative_sum(col: Union[Column, str],
                   on_col: Union[Column, str],
                   ascending: bool = True,
                   partition_by: Union[Column, str, List[Union[Column, str]]] = None) -> Column:
    on_col = on_col if ascending else F.desc(on_col)
    if partition_by is None:
        w = Window.orderBy(on_col).rangeBetween(Window.unboundedPreceding, 0)
    else:
        w = Window.partitionBy(partition_by).orderBy(on_col).rangeBetween(Window.unboundedPreceding, 0)
    return F.sum(col).over(w)
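For illustration, a hypothetical call to this helper against the question's df (not part of the linked repository):
df.withColumn('cum_sum', cumulative_sum('value', on_col='time', partition_by='class')).show()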

Related

How to set the value of a Pyspark column based on two conditions of the value of another column

Say I have a dataframe:
+---+---+---+
| id|foo|bar|
+---+---+---+
|  1|baz|  0|
|  2|baz|  0|
|  3|333|  2|
|  4|444|  1|
+---+---+---+
I want to set the 'foo' column to a value depending on the value of bar.
If bar is 2: set the value of foo for that row to 'X';
else if bar is 1: set the value of foo for that row to 'Y';
and if neither condition is met, leave the foo value as it is.
pyspark's when seems like the closest method, but it doesn't seem to work based on another column's value.
when can work with other columns. You can use F.col to get the value of the other column and provide an appropriate condition:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'foo',
    F.when(F.col('bar') == 2, 'X')
     .when(F.col('bar') == 1, 'Y')
     .otherwise(F.col('foo'))
)
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
We can solve this using when or a UDF in Spark to insert a new column based on a condition.
Create Sample DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('AddConditionalColumn').getOrCreate()
data = [(1,"baz",0),(2,"baz",0),(3,"333",2),(4,"444",1)]
columns = ["id","foo","bar"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3|333| 2|
| 4|444| 1|
+---+---+---+
Using When:
from pyspark.sql.functions import when

df2 = df.withColumn("foo", when(df.bar == 2, "X")
                           .when(df.bar == 1, "Y")
                           .otherwise(df.foo))
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
Using UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import *

def executeRule(value):
    if value == 2:
        return 'X'
    elif value == 1:
        return 'Y'
    else:
        return value

# Converting the function to a UDF
ruleUDF = F.udf(executeRule, StringType())

df3 = df.withColumn("foo", ruleUDF("bar"))
df3.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1| 0| 0|
| 2| 0| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
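Note that this UDF only receives the bar column, so rows matching neither condition end up with bar's value instead of the original foo (rows 1 and 2 above). If the original foo should be preserved, one option is to pass both columns to the UDF. A sketch, not part of the original answer (the names executeRuleKeepFoo and ruleKeepFooUDF are made up here):
# Sketch: pass both columns so unmatched rows keep their original foo value.
def executeRuleKeepFoo(foo_value, bar_value):
    if bar_value == 2:
        return 'X'
    elif bar_value == 1:
        return 'Y'
    else:
        return foo_value

ruleKeepFooUDF = F.udf(executeRuleKeepFoo, StringType())
df4 = df.withColumn("foo", ruleKeepFooUDF("foo", "bar"))
df4.show()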

How to use groupBy, collect_list, arrays_zip, & explode together in pyspark to solve certain business problem

I am new to the PySpark world.
I want to join two DataFrames, df and df_sd, on the column days. While joining, it should also use the column Name from the df DataFrame. If there is no matching value for a Name and days combination from the df DataFrame, then it should contain null. Please see the code and desired output below for a better understanding.
import findspark
findspark.init("/opt/spark")
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType
Mydata = Row("Name", "Number", "days")
spark = SparkSession \
    .builder \
    .appName("DataFrame Learning") \
    .getOrCreate()
sqlContext = SQLContext(spark)
mydata1 = Mydata("A", 100, 1)
mydata2 = Mydata("A", 200, 2)
mydata3 = Mydata("B", 300, 1)
mydata4 = Mydata("B", 400, 2)
mydata5 = Mydata("B", 500, 3)
mydata6 = Mydata("C", 600, 1)
myDataAll = [mydata1, mydata2, mydata3, mydata4, mydata5, mydata6]
STANDARD_TENORS = [1, 2, 3]
df_sd = spark.createDataFrame(STANDARD_TENORS, IntegerType())
df_sd = df_sd.withColumnRenamed("value", "days")
df_sd.show()
df = spark.createDataFrame(myDataAll)
df.show()
# +----+
# |days|
# +----+
# | 1|
# | 2|
# | 3|
# +----+
#
# +----+------+----+
# |Name|Number|days|
# +----+------+----+
# | A| 100| 1|
# | A| 200| 2|
# | B| 300| 1|
# | B| 400| 2|
# | B| 500| 3|
# | C| 600| 1|
# +----+------+----+
Please see the expected results from the join below:
# +----+------+----+
# |Name|Number|days|
# +----+------+----+
# | A| 100| 1|
# | A| 200| 2|
# | A|Null | 3|
# | B| 300| 1|
# | B| 400| 2|
# | B| 500| 3|
# | C| 600| 1|
# | C|Null | 2|
# | C|Null | 3|
# +----+------+----+
If df_sd is not a huge list and you have Spark 2.4, you can do this by creating a new column in df with the list of days (1, 2, 3) and then using groupBy, collect_list, arrays_zip, and explode. The orderBy before the groupBy is there to ensure that the list gets collected in the right order.
df.show()
+----+------+----+
|Name|Number|days|
+----+------+----+
| A| 100| 1|
| A| 200| 2|
| B| 300| 1|
| B| 400| 2|
| B| 500| 3|
| C| 600| 1|
+----+------+----+
STANDARD_TENORS #-> [1, 2, 3]
#-> should be ordered
from pyspark.sql import functions as F
df.withColumn("days2", F.array(*[F.lit(x) for x in STANDARD_TENORS]))\
  .orderBy("Name", "days")\
  .groupBy("Name")\
  .agg(F.collect_list("Number").alias("Number"),
       F.first("days2").alias("days"))\
  .withColumn("zipped", F.explode(F.arrays_zip("Number", "days")))\
  .select("Name", "zipped.*")\
  .orderBy("Name", "days")\
  .show()
+----+------+----+
|Name|Number|days|
+----+------+----+
| A| 200| 1|
| A| 100| 2|
| A| null| 3|
| B| 300| 1|
| B| 400| 2|
| B| 500| 3|
| C| 600| 1|
| C| null| 2|
| C| null| 3|
+----+------+----+
If you want to use join, you can do it in a similar manner:
from pyspark.sql import functions as F

df_sd.agg(F.collect_list("days").alias("days"))\
    .join(df.orderBy("Name", "days")
            .groupBy("Name")
            .agg(F.collect_list("Number").alias("Number"),
                 F.collect_list("days").alias("days1")),
          F.size("days") >= F.size("days1"))\
    .drop("days1")\
    .withColumn("zipped", F.explode(F.arrays_zip("Number", "days")))\
    .select("Name", "zipped.*")\
    .orderBy("Name", "days")\
    .show()
UPDATE:
Updated to handle any row order and any values present in Number. I could have made the code a little more concise, but I kept it like this so you can see all the columns I used and follow the logic. Feel free to ask any questions.
df.show()
# new sample dataframe
+----+------+----+
|Name|Number|days|
+----+------+----+
| A| 100| 1|
| A| 200| 2|
| B| 300| 1|
| B| 400| 2|
| B| 500| 3|
| C| 600| 3|
+----+------+----+
#STANDARD_TENORS = [1, 2, 3]
from pyspark.sql import functions as F
df.withColumn("days2", F.array(*[F.lit(x) for x in STANDARD_TENORS]))\
  .groupBy("Name")\
  .agg(F.collect_list("Number").alias("col1"),
       F.first("days2").alias("days2"),
       F.collect_list("days").alias("x"))\
  .withColumn("days3", F.arrays_zip(F.col("col1"), F.col("x")))\
  .withColumn("days4", F.array_except("days2", "x"))\
  .withColumn("day5", F.expr("""transform(days4, x -> struct(bigint(-1), x))"""))\
  .withColumn("days3", F.explode(F.array_union("days3", "day5")))\
  .select("Name", "days3.*")\
  .withColumn("Number", F.when(F.col("col1") == -1, F.lit(None)).otherwise(F.col("col1")))\
  .drop("col1")\
  .select("Name", "Number", F.col("x").alias("days"))\
  .orderBy("Name", "days")\
  .show(truncate=False)

Passing multiple columns in Pandas UDF PySpark

I want to calculate the Jaro-Winkler distance between two columns of a PySpark DataFrame. The Jaro-Winkler distance is available through the pyjarowinkler package on all nodes.
pyjarowinkler works as follows:
from pyjarowinkler import distance
distance.get_jaro_distance("A", "A", winkler=True, scaling=0.1)
Output:
1.0
I am trying to write a Pandas UDF to pass two columns as Series and calculate the distance using a lambda function.
Here's how I am doing it:
@pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col1, col2):
    import pandas as pd
    distance_df = pd.DataFrame({'column_A': col1, 'column_B': col2})
    distance_df['distance'] = distance_df.apply(
        lambda x: distance.get_jaro_distance(str(distance_df['column_A']), str(distance_df['column_B']),
                                             winkler=True, scaling=0.1))
    return distance_df['distance']

temp = temp.withColumn('jaro_distance', get_distance(temp.x, temp.x))
I should be able to pass any two string columns in the above function.
I am getting the following output:
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| A| 1| 2| null|
| B| 3| 4| null|
| C| 5| 6| null|
| D| 7| 8| null|
+---+---+---+-------------+
Expected Output:
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| A| 1| 2| 1.0|
| B| 3| 4| 1.0|
| C| 5| 6| 1.0|
| D| 7| 8| 1.0|
+---+---+---+-------------+
I suspect this might be because str(distance_df['column_A']) is not correct. It contains the concatenated string of all row values.
While this code works for me:
@pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col):
    return col.apply(lambda x: distance.get_jaro_distance(x, "A", winkler=True, scaling=0.1))

temp = temp.withColumn('jaro_distance', get_distance(temp.x))
Output:
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| A| 1| 2| 1.0|
| B| 3| 4| 0.0|
| C| 5| 6| 0.0|
| D| 7| 8| 0.0|
+---+---+---+-------------+
Is there a way to do this with Pandas UDF? I'm dealing with millions of records so UDF will be expensive but still acceptable if it works. Thanks.
The error came from your function inside the df.apply method; adjusting it to the following should fix it:
@pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col1, col2):
    import pandas as pd
    distance_df = pd.DataFrame({'column_A': col1, 'column_B': col2})
    distance_df['distance'] = distance_df.apply(
        lambda x: distance.get_jaro_distance(x['column_A'], x['column_B'], winkler=True, scaling=0.1),
        axis=1)
    return distance_df['distance']
However, the Pandas df.apply method is not vectorised, which defeats the purpose of using pandas_udf over udf in PySpark. A faster, lower-overhead solution is to use a list comprehension to create the returned pd.Series (check this link for more discussion about Pandas df.apply and its alternatives):
from pandas import Series

@pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col1, col2):
    return Series([distance.get_jaro_distance(c1, c2, winkler=True, scaling=0.1) for c1, c2 in zip(col1, col2)])

df.withColumn('jaro_distance', get_distance('x', 'y')).show()
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| AB| 1B| 2| 0.67|
| BB| BB| 4| 1.0|
| CB| 5D| 6| 0.0|
| DB|B7F| 8| 0.61|
+---+---+---+-------------+
You can union all the DataFrames first, partition by the same partition key after the partitions have been shuffled and distributed to the worker nodes, and restore them before the pandas computation. Please check the example where I wrote a small toolkit for this scenario: SparkyPandas

how can I create a pyspark udf using multiple columns?

I need to write some custom code using multiple columns within a group of my data.
My custom code is to set a flag if a value is over a threshold, but suppress the flag if it is within a certain time of a previous flag.
Here is some sample code:
df = spark.createDataFrame(
    [
        ("a", 1, 0),
        ("a", 2, 1),
        ("a", 3, 1),
        ("a", 4, 1),
        ("a", 5, 1),
        ("a", 6, 0),
        ("a", 7, 1),
        ("a", 8, 1),
        ("b", 1, 0),
        ("b", 2, 1)
    ],
    ["group_col", "order_col", "flag_col"]
)
df.show()
+---------+---------+--------+
|group_col|order_col|flag_col|
+---------+---------+--------+
| a| 1| 0|
| a| 2| 1|
| a| 3| 1|
| a| 4| 1|
| a| 5| 1|
| a| 6| 0|
| a| 7| 1|
| a| 8| 1|
| b| 1| 0|
| b| 2| 1|
+---------+---------+--------+
from pyspark.sql.functions import udf, col, asc
from pyspark.sql.window import Window

def _suppress(dates=None, alert_flags=None, window=2):
    sup_alert_flag = alert_flag
    last_alert_date = None
    for i, alert_flag in enumerate(alert_flag):
        current_date = dates[i]
        if alert_flag == 1:
            if not last_alert_date:
                sup_alert_flag[i] = 1
                last_alert_date = current_date
            elif (current_date - last_alert_date) > window:
                sup_alert_flag[i] = 1
                last_alert_date = current_date
            else:
                sup_alert_flag[i] = 0
        else:
            alert_flag = 0
    return sup_alert_flag

suppress_udf = udf(_suppress, DoubleType())

df_out = df.withColumn("supressed_flag_col",
                       suppress_udf(dates=col("order_col"), alert_flags=col("flag_col"), window=4)
                       .Window.partitionBy(col("group_col")).orderBy(asc("order_col")))
df_out.show()
The above fails, but my expected output is the following:
+---------+---------+--------+------------------+
|group_col|order_col|flag_col|supressed_flag_col|
+---------+---------+--------+------------------+
| a| 1| 0| 0|
| a| 2| 1| 1|
| a| 3| 1| 0|
| a| 4| 1| 0|
| a| 5| 1| 0|
| a| 6| 0| 0|
| a| 7| 1| 1|
| a| 8| 1| 0|
| b| 1| 0| 0|
| b| 2| 1| 1|
+---------+---------+--------+------------------+
Editing answer after more thought.
The general problem seems to be that the result of the current row depends upon result of the previous row. In effect, there is a recurrence relationship. I haven't found a good way to implement a recursive UDF in Spark. There are several challenges that result from the assumed distributed nature of the data in Spark which would make this difficult to achieve. At least in my mind. The following solution should work but may not scale for large data sets.
from pyspark.sql import Row
import pyspark.sql.functions as F
import pyspark.sql.types as T

suppress_flag_row = Row("order_col", "flag_col", "res_flag")

def suppress_flag(date_alert_flags, window_size):
    sorted_alerts = sorted(date_alert_flags, key=lambda x: x["order_col"])
    res_flags = []
    last_alert_date = None
    for row in sorted_alerts:
        current_date = row["order_col"]
        aflag = row["flag_col"]
        if aflag == 1 and (not last_alert_date or (current_date - last_alert_date) > window_size):
            res = suppress_flag_row(current_date, aflag, True)
            last_alert_date = current_date
        else:
            res = suppress_flag_row(current_date, aflag, False)
        res_flags.append(res)
    return res_flags

in_fields = [T.StructField("order_col", T.IntegerType(), nullable=True)]
in_fields.append(T.StructField("flag_col", T.IntegerType(), nullable=True))

out_fields = in_fields
out_fields.append(T.StructField("res_flag", T.BooleanType(), nullable=True))
out_schema = T.StructType(out_fields)
suppress_udf = F.udf(suppress_flag, T.ArrayType(out_schema))

window_size = 4
tmp = df.groupBy("group_col").agg(F.collect_list(F.struct(F.col("order_col"), F.col("flag_col"))).alias("date_alert_flags"))
tmp2 = tmp.select(F.col("group_col"), suppress_udf(F.col("date_alert_flags"), F.lit(window_size)).alias("suppress_res"))
expand_fields = [F.col("group_col")] + [F.col("res_expand")[f.name].alias(f.name) for f in out_fields]
final_df = tmp2.select(F.col("group_col"), F.explode(F.col("suppress_res")).alias("res_expand")).select(expand_fields)
I think you don't need a custom function for this. You can use the rowsBetween option along with a window to get the 5-row range. Please check and let me know if I missed something.
>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> w = Window.partitionBy('group_col').orderBy('order_col').rowsBetween(-5,-1)
>>> df = df.withColumn('supr_flag_col',F.when(F.sum('flag_col').over(w) == 0,1).otherwise(0))
>>> df.orderBy('group_col','order_col').show()
+---------+---------+--------+-------------+
|group_col|order_col|flag_col|supr_flag_col|
+---------+---------+--------+-------------+
| a| 1| 0| 0|
| a| 2| 1| 1|
| a| 3| 1| 0|
| b| 1| 0| 0|
| b| 2| 1| 1|
+---------+---------+--------+-------------+

Adding a group count column to a PySpark dataframe

I am coming from R and the tidyverse to PySpark due to its superior Spark handling, and I am struggling to map certain concepts from one context to the other.
In particular, suppose that I had a dataset like the following
x | y
--+--
a | 5
a | 8
a | 7
b | 1
and I wanted to add a column containing the number of rows for each x value, like so:
x | y | n
--+---+---
a | 5 | 3
a | 8 | 3
a | 7 | 3
b | 1 | 1
In dplyr, I would just say:
library(tidyverse)
df <- read_csv("...")
df %>%
  group_by(x) %>%
  mutate(n = n()) %>%
  ungroup()
and that would be that. I can do something almost as simple in PySpark if I'm looking to summarize by number of rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
spark.read.csv("...") \
.groupBy(col("x")) \
.count() \
.show()
And I thought I understood that withColumn was equivalent to dplyr's mutate. However, when I do the following, PySpark tells me that withColumn is not defined for groupBy data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count
spark = SparkSession.builder.getOrCreate()
spark.read.csv("...") \
.groupBy(col("x")) \
.withColumn("n", count("x")) \
.show()
In the short run, I can simply create a second dataframe containing the counts and join it to the original dataframe. However, it seems like this could become inefficient in the case of large tables. What is the canonical way to accomplish this?
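For reference, the count-and-join workaround mentioned here would look roughly like this (a sketch, assuming a DataFrame df with columns x and y):
import pyspark.sql.functions as F

# Sketch: compute per-group counts, then join them back onto the original rows.
counts = df.groupBy("x").agg(F.count("x").alias("n"))
df_with_n = df.join(counts, on="x", how="left")
df_with_n.show()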
When you do a groupBy(), you have to specify the aggregation before you can display the results. For example:
import pyspark.sql.functions as f
data = [
    ('a', 5),
    ('a', 8),
    ('a', 7),
    ('b', 1),
]
df = sqlCtx.createDataFrame(data, ["x", "y"])
df.groupBy('x').count().select('x', f.col('count').alias('n')).show()
#+---+---+
#| x| n|
#+---+---+
#| b| 1|
#| a| 3|
#+---+---+
Here I used alias() to rename the column. But this only returns one row per group. If you want all rows with the count appended, you can do this with a Window:
from pyspark.sql import Window
w = Window.partitionBy('x')
df.select('x', 'y', f.count('x').over(w).alias('n')).sort('x', 'y').show()
#+---+---+---+
#| x| y| n|
#+---+---+---+
#| a| 5| 3|
#| a| 7| 3|
#| a| 8| 3|
#| b| 1| 1|
#+---+---+---+
Or if you're more comfortable with SQL, you can register the dataframe as a temporary table and take advantage of pyspark-sql to do the same thing:
df.registerTempTable('table')
sqlCtx.sql(
'SELECT x, y, COUNT(x) OVER (PARTITION BY x) AS n FROM table ORDER BY x, y'
).show()
#+---+---+---+
#| x| y| n|
#+---+---+---+
#| a| 5| 3|
#| a| 7| 3|
#| a| 8| 3|
#| b| 1| 1|
#+---+---+---+
As an appendix to @pault's answer:
import pyspark.sql.functions as F
...
(df
.groupBy(F.col('x'))
.agg(F.count('x').alias('n'))
.show())
#+---+---+
#| x| n|
#+---+---+
#| b| 1|
#| a| 3|
#+---+---+
enjoy
I found we can get even closer to the tidyverse example:
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('x')
df.withColumn('n', f.count('x').over(w)).sort('x', 'y').show()
