PySpark: calculating a 12-month moving average within groups - apache-spark

I am currently using PySpark to calculate a 12-month moving average for different company groups. The data looks like this:
| CALENDAR_DATE| COMPANY | VALUE
| 2021-11-01 | a | 31
| 2021-10-01 | a | 31
| 2021-09-01 | a | 33
| 2021-08-01 | a | 21
| 2021-07-01 | a | 25
| 2021-06-01 | a | 28
| 2021-05-01 | a | 31
| 2021-04-01 | a | 31
| 2021-03-01 | a | 33
| 2021-04-01 | a | 31
| 2021-03-01 | a | 33
| 2021-04-01 | a | 10
| 2021-03-01 | a | 25
| 2021-04-01 | a | 30
| 2021-03-01 | a | 27
| 2021-02-01 | a | 18
| 2021-01-01 | a | 15
| 2021-11-01 | b | 31
| 2021-10-01 | b | 30
| 2021-09-01 | b | 31
| 2021-08-01 | b | 32
I would like to get an extra column called ROLLING_AVERAGE for each company (a and b).
My code looks like this, but it doesn't give me the right answer and I can't figure out what the problem is:
from pyspark.sql.functions import *
from pyspark.sql.window import *
w = Window().partitionBy('COMPANY').orderBy('CALENDAR_DATE').rowsBetween(-11, 0)
df = df.withColumn('ROLLING_AVERAGE', round(avg('VALUE').over(w), 1))

You need to use Window rangeBetween instead of rowsBetween. But first, convert the CALENDAR_DATE column into a timestamp:
from pyspark.sql import Window
from pyspark.sql import functions as F
df = df.withColumn('calendar_timestamp', F.to_timestamp('CALENDAR_DATE').cast("long"))
# 2629800 is the average number of seconds in a month (365.25 days / 12)
w = Window().partitionBy('COMPANY').orderBy('calendar_timestamp').rangeBetween(-11 * 2629800, 0)
df1 = df.withColumn(
    'ROLLING_AVERAGE',
    F.round(F.avg('VALUE').over(w), 1)
).drop('calendar_timestamp')
df1.show()
#+-------------+-------+-----+---------------+
#|CALENDAR_DATE|COMPANY|VALUE|ROLLING_AVERAGE|
#+-------------+-------+-----+---------------+
#| 2021-08-01| b| 32| 32.0|
#| 2021-09-01| b| 31| 31.5|
#| 2021-10-01| b| 30| 31.0|
#| 2021-11-01| b| 31| 31.0|
#| 2021-01-01| a| 15| 15.0|
#| 2021-02-01| a| 18| 16.5|
#| 2021-03-01| a| 33| 25.2|
#| 2021-03-01| a| 33| 25.2|
#| 2021-03-01| a| 25| 25.2|
#| 2021-03-01| a| 27| 25.2|
#| 2021-04-01| a| 31| 25.3|
#| 2021-04-01| a| 31| 25.3|
#| 2021-04-01| a| 10| 25.3|
#| 2021-04-01| a| 30| 25.3|
#| 2021-05-01| a| 31| 25.8|
#| 2021-06-01| a| 28| 26.0|
#| 2021-07-01| a| 25| 25.9|
#| 2021-08-01| a| 21| 25.6|
#| 2021-09-01| a| 33| 26.1|
#| 2021-10-01| a| 31| 26.4|
#+-------------+-------+-----+---------------+
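Alternatively (a sketch of my own, not part of the original answer), you can avoid the fixed seconds-per-month approximation by turning each date into an integer month index and ranging over months directly:
from pyspark.sql import Window
from pyspark.sql import functions as F
# Month index, e.g. 2021-03-01 -> 2021 * 12 + 3
df = df.withColumn(
    'month_index',
    F.year(F.to_date('CALENDAR_DATE')) * 12 + F.month(F.to_date('CALENDAR_DATE'))
)
# Current month plus the 11 preceding months
w = Window.partitionBy('COMPANY').orderBy('month_index').rangeBetween(-11, 0)
df1 = (df.withColumn('ROLLING_AVERAGE', F.round(F.avg('VALUE').over(w), 1))
         .drop('month_index'))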

add specific rows on pyspark

I'm trying to convert a Python notebook into a PySpark pipeline, and I'm blocked by what seems like a simple problem.
I have this dataframe after a count aggregation grouped by Id:
| Id | count |
| 0 | 5 |
| 1 | 3 |
| 4 | 6 |
And I want this :
| Id | count |
| 0 | 5 |
| 1 | 3 |
| 2 | 0 |
| 3 | 0 |
| 4 | 6 |
| 5 | 0 |
I have tried adding a [0,1,3,4,5] array to each row, then exploding it with explode_outer, then finding a way to keep the rows I need, but that seems a bit complicated for this simple case.
Do you have any tips?
Thanks in advance.
One approach: build a dataframe with every Id you need and a default count of 0, then right-join the original onto it and coalesce the counts.
from pyspark.sql import functions as F
original.show()
+---+-----+
| id|count|
+---+-----+
| 1| 12|
| 3| 15|
+---+-----+
df = spark.createDataFrame([(0,0),(1,0),(2,0),(3,0),(4,0),(5,0)],['id', 'default_count'])
df.show()
+---+-------------+
| id|default_count|
+---+-------------+
| 0| 0|
| 1| 0|
| 2| 0|
| 3| 0|
| 4| 0|
| 5| 0|
+---+-------------+
result = (original.join(df, on='id', how='right')
                  .withColumn('count', F.coalesce(F.col('count'), F.col('default_count')))
                  .orderBy(F.col('id'))
                  .drop('default_count'))
result.show()
+---+-----+
| id|count|
+---+-----+
| 0| 0|
| 1| 12|
| 2| 0|
| 3| 15|
| 4| 0|
| 5| 0|
+---+-----+
Alternatively, create a dataframe with just the missing rows and union it with the original.
df.show()
+---+-----+
| Id|count|
+---+-----+
| 0| 5|
| 1| 3|
| 4| 6|
+---+-----+
extra_rows = spark.createDataFrame(
    [(2, 0), (3, 0), (5, 0)],
    ['Id', 'count'])
df.unionByName(extra_rows).orderBy('Id').show()
+---+-----+
| Id|count|
+---+-----+
| 0| 5|
| 1| 3|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
+---+-----+
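If the Id range is larger or not hard-coded, a more generic sketch (my own, assuming the Ids should cover the contiguous range 0 to 5) is to generate the range with spark.range and left-join:
from pyspark.sql import functions as F
# Assumption: Ids should cover the contiguous range 0..5
all_ids = spark.range(0, 6).withColumnRenamed('id', 'Id')
result = (all_ids.join(df, on='Id', how='left')
                 .withColumn('count', F.coalesce(F.col('count'), F.lit(0)))
                 .orderBy('Id'))
result.show()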

PySpark percentile for multiple columns

I want to convert multiple numeric columns of a PySpark dataframe into their percentile values, without changing the row order.
E.g. given an array of column names arr = [Salary, Age, Bonus], convert those columns into percentiles.
Input
+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1 | HR | 111 | 1000 | 45 | 100 |
| 2 | Sales | 596 | 500 | 30 | 50 |
| 3 | Manufacture | 895 | 600 | 50 | 400 |
| 4 | HR | 212 | 700 | 26 | 60 |
| 5 | Business | 754 | 350 | 18 | 22 |
+----------+-------------+---------+--------+-----+-------+
Expected output
+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1 | HR | 111 | 100 | 80 | 80 |
| 2 | Sales | 596 | 40 | 60 | 40 |
| 3 | Manufacture | 895 | 60 | 100 | 100 |
| 4 | HR | 212 | 80 | 40 | 60 |
| 5 | Business | 754 | 20 | 20 | 20 |
+----------+-------------+---------+--------+-----+-------+
The formula for the percentile of a given element x in the list is: (number of elements less than x / total number of elements) * 100.
You can use percentile_approx for this, in conjunction with groupBy over the columns for which you want the percentile calculated. It is built into Spark 3.x and later:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

input_list = [
    (1, "HR", 111, 1000, 45, 100),
    (2, "Sales", 112, 500, 30, 50),
    (3, "Manufacture", 127, 600, 50, 500),
    (4, "Hr", 821, 700, 26, 60),
    (5, "Business", 754, 350, 18, 22)
]
sparkDF = spark.createDataFrame(input_list, ['emp_no', 'dept', 'pincode', 'salary', 'age', 'bonus'])

sparkDF.groupBy(['emp_no', 'dept']).agg(
    F.first(F.col('pincode')).alias('pincode'),
    *[F.percentile_approx(F.col(c), 0.95).alias(c) for c in ['salary', 'age', 'bonus']]
).show()
+------+-----------+-------+------+---+-----+
|emp_no| dept|pincode|salary|age|bonus|
+------+-----------+-------+------+---+-----+
| 3|Manufacture| 127| 600| 50| 500|
| 1| HR| 111| 1000| 45| 100|
| 2| Sales| 112| 500| 30| 50|
| 5| Business| 754| 350| 18| 22|
| 4| Hr| 821| 700| 26| 60|
+------+-----------+-------+------+---+-----+
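As a side note (a usage sketch of my own, not part of the original answer), percentile_approx is more commonly used to compute an approximate quantile across a whole group rather than per employee, for example the median salary per department:
from pyspark.sql import functions as F
# Approximate median (50th percentile) salary per department
sparkDF.groupBy('dept').agg(
    F.percentile_approx('salary', 0.5).alias('median_salary')
).show()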
Spark has a window function for calculating percentiles which is called percent_rank.
Test df:
from pyspark.sql import SparkSession, functions as F, Window as W
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, "HR", 111, 1000, 45, 100),
(2, "Sales", 596, 500, 30, 50),
(3, "Manufacture", 895, 600, 50, 400),
(4, "HR", 212, 700, 26, 60),
(5, "Business", 754, 350, 18, 22)],
['Empl_No', 'Dept', 'Pincode', 'Salary', 'Age', 'Bonus'])
df.show()
# +-------+-----------+-------+------+---+-----+
# |Empl_No| Dept|Pincode|Salary|Age|Bonus|
# +-------+-----------+-------+------+---+-----+
# | 1| HR| 111| 1000| 45| 100|
# | 2| Sales| 596| 500| 30| 50|
# | 3|Manufacture| 895| 600| 50| 400|
# | 4| HR| 212| 700| 26| 60|
# | 5| Business| 754| 350| 18| 22|
# +-------+-----------+-------+------+---+-----+
percent_rank computes (rank - 1) / (number of rows - 1), so the smallest value gets percentile 0 and the biggest value gets 1.
arr = ['Salary', 'Age', 'Bonus']
df = df.select(
*[c for c in df.columns if c not in arr],
*[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')
df.show()
# +-------+-----------+-------+------+----+-----+
# |Empl_No| Dept|Pincode|Salary| Age|Bonus|
# +-------+-----------+-------+------+----+-----+
# | 1| HR| 111| 1.0|0.75| 0.75|
# | 2| Sales| 596| 0.25| 0.5| 0.25|
# | 3|Manufacture| 895| 0.5| 1.0| 1.0|
# | 4| HR| 212| 0.75|0.25| 0.5|
# | 5| Business| 754| 0.0| 0.0| 0.0|
# +-------+-----------+-------+------+----+-----+
However, your expectation is slightly different: you want 0 to be treated as the smallest value even though it does not appear in the columns.
To handle this, I will add a row of 0 values and delete it afterwards.
arr = ['Salary', 'Age', 'Bonus']
# Adding a row containing 0 values
df = df.limit(1).withColumn('Dept', F.lit('_tmp')).select(
*[c for c in df.columns if c not in arr],
*[F.lit(0).alias(c) for c in arr]
).union(df)
# Calculating percentiles
df = df.select(
*[c for c in df.columns if c not in arr],
*[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')
# Removing the fake row
df = df.filter("Dept != '_tmp'")
df.show()
# +-------+-----------+-------+------+---+-----+
# |Empl_No| Dept|Pincode|Salary|Age|Bonus|
# +-------+-----------+-------+------+---+-----+
# | 1| HR| 111| 1.0|0.8| 0.8|
# | 2| Sales| 596| 0.4|0.6| 0.4|
# | 3|Manufacture| 895| 0.6|1.0| 1.0|
# | 4| HR| 212| 0.8|0.4| 0.6|
# | 5| Business| 754| 0.2|0.2| 0.2|
# +-------+-----------+-------+------+---+-----+
You can multiply the percentile by 100 if you like:
*[(100 * F.percent_rank().over(W.orderBy(c))).alias(c) for c in arr]
Then you get...
+-------+-----------+-------+------+-----+-----+
|Empl_No| Dept|Pincode|Salary| Age|Bonus|
+-------+-----------+-------+------+-----+-----+
| 1| HR| 111| 100.0| 80.0| 80.0|
| 2| Sales| 596| 40.0| 60.0| 40.0|
| 3|Manufacture| 895| 60.0|100.0|100.0|
| 4| HR| 212| 80.0| 40.0| 60.0|
| 5| Business| 754| 20.0| 20.0| 20.0|
+-------+-----------+-------+------+-----+-----+
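Note that percent_rank is not exactly the formula stated in the question ((number of elements less than x / total number of elements) * 100). If you want the literal formula, here is a minimal sketch of my own, assuming integer-valued columns, that uses window counts:
from pyspark.sql import functions as F, Window as W
arr = ['Salary', 'Age', 'Bonus']
# Total row count (single window over the whole dataframe)
w_all = W.partitionBy().rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
for c in arr:
    # Rows with a strictly smaller value: range frame ends at (current value - 1)
    w_lt = W.orderBy(c).rangeBetween(W.unboundedPreceding, -1)
    df = df.withColumn(c, 100 * F.count(c).over(w_lt) / F.count(c).over(w_all))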

How to concatenate data frame column pyspark?

I have created a data frame using the code below:
df = spark.createDataFrame([("A", "20"), ("B", "30"), ("D", "80"),("A", "120"),("c", "20"),("Null", "20")],["Let", "Num"])
df.show()
+----+---+
| Let|Num|
+----+---+
| A| 20|
| B| 30|
| D| 80|
| A|120|
| c| 20|
|Null| 20|
+----+---+
I want to create a data frame like below:
+----+-------+
| Let|Num |
+----+-------+
| A| 20,120|
| B| 30 |
| D| 80 |
| c| 20 |
|Null| 20 |
+----+-------+
How can I achieve this?
You can group by Let and collect the values as a list with collect_list:
from pyspark.sql import functions as F
df.groupBy("Let").agg(F.collect_list("Num")).show()
Output as List:
+----+-----------------+
| Let|collect_list(Num)|
+----+-----------------+
| B| [30]|
| D| [80]|
| A| [20, 120]|
| c| [20]|
|Null| [20]|
+----+-----------------+
df.groupBy("Let").agg(F.concat_ws(",", F.collect_list("Num"))).show()
Output as String
+----+-------------------------------+
| Let|concat_ws(,, collect_list(Num))|
+----+-------------------------------+
| B| 30|
| D| 80|
| A| 20,120|
| c| 20|
|Null| 20|
+----+-------------------------------+
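If you also want the aggregated column to be named Num, as in the desired output, alias the expression (a small addition of my own, not in the original answer):
from pyspark.sql import functions as F
# Note: collect_list does not guarantee any particular ordering of the collected values
df.groupBy("Let") \
  .agg(F.concat_ws(",", F.collect_list("Num")).alias("Num")) \
  .show()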

Pyspark: pivot month-level records into a single column-based quarter-level record

I have a dataframe in the form:
+---------+-------+------------+------------------+-------+-------+
| quarter | month | month_rank | unique_customers | units | sales |
+---------+-------+------------+------------------+-------+-------+
|       1 |     1 |          1 |               15 |    30 |  1000 |
|       1 |     2 |          2 |               20 |    35 |  1200 |
|       1 |     3 |          3 |               18 |    40 |  1500 |
|       2 |     4 |          1 |               10 |    25 |   800 |
|       2 |     5 |          2 |               25 |    50 |  2000 |
|       2 |     6 |          3 |               28 |    45 |  1800 |
+---------+-------+------------+------------------+-------+-------+
...
I am trying to group on quarter and track the monthly sales in a columnar fashion such as the following:
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
| quarter | month_rank1 | rank1_unique_customers | rank1_units | rank1_sales | month_rank2 | rank2_unique_customers | rank2_units | rank2_sales | month_rank3 | rank3_unique_customers | rank3_units | rank3_sales |
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
|       1 |           1 |                     15 |          30 |        1000 |           2 |                     20 |          35 |        1200 |           3 |                     18 |          40 |        1500 |
|       2 |           4 |                     10 |          25 |         800 |           5 |                     25 |          50 |        2000 |           6 |                     28 |          45 |        1800 |
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
Is this achievable with multiple pivots? I have had no luck creating multiple columns from a pivot. I am thinking I might be able to achieve this result with windowing, but if anyone has run into a similar problem, any suggestions would be greatly appreciated. Thank you!
Use pivot on the month_rank column, then agg the other columns.
Example:
df = spark.createDataFrame(
    [(1, 1, 1, 15, 30, 1000), (1, 2, 2, 20, 35, 1200), (1, 3, 3, 18, 40, 1500),
     (2, 4, 1, 10, 25, 800), (2, 5, 2, 25, 50, 2000), (2, 6, 3, 28, 45, 1800)],
    ["quarter", "month", "month_rank", "unique_customers", "units", "sales"])
df.show()
#+-------+-----+----------+----------------+-----+-----+
#|quarter|month|month_rank|unique_customers|units|sales|
#+-------+-----+----------+----------------+-----+-----+
#| 1| 1| 1| 15| 30| 1000|
#| 1| 2| 2| 20| 35| 1200|
#| 1| 3| 3| 18| 40| 1500|
#| 2| 4| 1| 10| 25| 800|
#| 2| 5| 2| 25| 50| 2000|
#| 2| 6| 3| 28| 45| 1800|
#+-------+-----+----------+----------------+-----+-----+
from pyspark.sql.functions import *
df1 = df.groupBy("quarter") \
        .pivot("month_rank") \
        .agg(first(col("month")),
             first(col("unique_customers")),
             first(col("units")),
             first(col("sales")))
cols = ["quarter",
        "month_rank1", "rank1_unique_customers", "rank1_units", "rank1_sales",
        "month_rank2", "rank2_unique_customers", "rank2_units", "rank2_sales",
        "month_rank3", "rank3_unique_customers", "rank3_units", "rank3_sales"]
df1.toDF(*cols).show()
#+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
#|quarter|month_rank1|rank1_unique_customers|rank1_units|rank1_sales|month_rank2|rank2_unique_customers|rank2_units|rank2_sales|month_rank3|rank3_unique_customers|rank3_units|rank3_sales|
#+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
#| 1| 1| 15| 30| 1000| 2| 20| 35| 1200| 3| 18| 40| 1500|
#| 2| 4| 10| 25| 800| 5| 25| 50| 2000| 6| 28| 45| 1800|
#+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
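As a small refinement (my own sketch, not from the original answer), you can pass the pivot values explicitly so Spark skips the extra pass that collects them, and alias each aggregation so the generated column names are predictable without the final toDF rename:
from pyspark.sql import functions as F
df1 = df.groupBy("quarter").pivot("month_rank", [1, 2, 3]).agg(
    F.first("month").alias("month"),
    F.first("unique_customers").alias("unique_customers"),
    F.first("units").alias("units"),
    F.first("sales").alias("sales"),
)
# Columns come out as 1_month, 1_unique_customers, ..., 3_sales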

How do I calculate the start/end of an interval (set of rows) containing identical values?

Assume we have a spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1 | A |
| 2 | A |
| 3 | A |
| 4 | B |
| 5 | B |
| 6 | A |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A | 1 | 3 |
| B | 4 | 5 |
| A | 6 | 6 |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
The idea is to create an identifier for each consecutive group and use it to group by and compute your min and max time.
Assuming df is your dataframe:
from pyspark.sql import functions as F, Window
df = df.withColumn(
    "fg",
    F.when(
        F.lag("value").over(Window.orderBy("time")) == F.col("value"),
        0
    ).otherwise(1)
)
df = df.withColumn(
    "rn",
    F.sum("fg").over(
        Window.orderBy("time")
              .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
)
From that point, you have your dataframe with an identifier for each consecutive group.
df.show()
+----+-----+---+---+
|time|value| rn| fg|
+----+-----+---+---+
| 1| A| 1| 1|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| B| 2| 1|
| 5| B| 2| 0|
| 6| A| 3| 1|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy("value", "rn").agg(
    F.min("time").alias("start"),
    F.max("time").alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
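One caveat (my note, not from the original answer): both windows above use Window.orderBy without a partitionBy, which pulls the whole dataframe into a single partition. If your data has a natural key, say a hypothetical device_id column, add it to both windows:
from pyspark.sql import functions as F, Window
# 'device_id' is a hypothetical grouping column
w = Window.partitionBy("device_id").orderBy("time")
df = df.withColumn(
    "fg",
    F.when(F.lag("value").over(w) == F.col("value"), 0).otherwise(1)
).withColumn(
    "rn",
    F.sum("fg").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow))
)
# then group by device_id, value and rn as above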
