Learning Apache Spark through PySpark and having issues.
I have the following DF:
+----------+------------+-----------+----------------+
| game_id|posteam_type|total_plays|total_touchdowns|
+----------+------------+-----------+----------------+
|2009092003| home| 90| 3|
|2010091912| home| 95| 0|
|2010112106| home| 75| 0|
|2010121213| home| 85| 3|
|2009092011| null| 9| null|
|2010110703| null| 2| null|
|2010112111| null| 6| null|
|2011100909| home| 102| 3|
|2011120800| home| 72| 2|
|2012010110| home| 74| 6|
|2012110410| home| 68| 1|
|2012120911| away| 91| 2|
|2011103008| null| 6| null|
|2012111100| null| 3| null|
|2013092212| home| 86| 6|
|2013112407| home| 73| 4|
|2013120106| home| 99| 3|
|2014090705| home| 94| 3|
|2014101203| home| 77| 4|
|2014102611| home| 107| 6|
+----------+------------+-----------+----------------+
I'm attempting to find the average number of plays it takes to score a TD, i.e. sum(total_plays)/sum(total_touchdowns).
I figured out the code to get the sums, but I can't figure out how to get the overall average:
plays = nfl_game_play.groupBy().agg({'total_plays': 'sum'}).collect()
touchdowns = nfl_game_play.groupBy().agg({'total_touchdowns': 'sum'}).collect()
As you can see, I tried storing each sum as a variable, but beyond remembering what each value is and dividing them manually, I haven't gotten anywhere.
Try the code below:
Example:
df.show()
#+-----------+----------------+
#|total_plays|total_touchdowns|
#+-----------+----------------+
#| 90| 3|
#| 95| 0|
#| 9| null|
#+-----------+----------------+
from pyspark.sql.functions import *
total_avg=df.groupBy().agg(sum("total_plays")/sum("total_touchdowns")).collect()[0][0]
#64.66666666666667
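If you prefer to keep the two sums you already computed, you can also do both aggregations in one pass and divide the collected values yourself (a minimal sketch using the nfl_game_play DataFrame from the question; note that Spark's sum ignores null rows):
from pyspark.sql import functions as F
# both sums in a single aggregation, collected as one Row
sums = nfl_game_play.agg(
    F.sum("total_plays").alias("plays"),
    F.sum("total_touchdowns").alias("touchdowns")
).collect()[0]
total_avg = sums["plays"] / sums["touchdowns"]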
I have a table like the one below. I want to calculate an average of the median column, but only over the rows with Q=2 and Q=3. I don't want to include the other Qs in the average, but I still want to preserve their rows.
df = spark.createDataFrame([('2018-03-31',6,1),('2018-03-31',27,2),('2018-03-31',3,3),('2018-03-31',44,4),('2018-06-30',6,1),('2018-06-30',4,3),('2018-06-30',32,2),('2018-06-30',112,4),('2018-09-30',2,1),('2018-09-30',23,4),('2018-09-30',37,3),('2018-09-30',3,2)],['date','median','Q'])
+----------+--------+---+
| date| median | Q |
+----------+--------+---+
|2018-03-31| 6| 1|
|2018-03-31| 27| 2|
|2018-03-31| 3| 3|
|2018-03-31| 44| 4|
|2018-06-30| 6| 1|
|2018-06-30| 4| 3|
|2018-06-30| 32| 2|
|2018-06-30| 112| 4|
|2018-09-30| 2| 1|
|2018-09-30| 23| 4|
|2018-09-30| 37| 3|
|2018-09-30| 3| 2|
+----------+--------+---+
Expected output:
+----------+--------+---+------------+
| date| median | Q |result |
+----------+--------+---+------------+
|2018-03-31| 6| 1| null|
|2018-03-31| 27| 2| 15|
|2018-03-31| 3| 3| 15|
|2018-03-31| 44| 4| null|
|2018-06-30| 6| 1| null|
|2018-06-30| 4| 3| 18|
|2018-06-30| 32| 2| 18|
|2018-06-30| 112| 4| null|
|2018-09-30| 2| 1| null|
|2018-09-30| 23| 4| null|
|2018-09-30| 37| 3| 20|
|2018-09-30| 3| 2| 20|
+----------+--------+---+------------+
OR
+----------+--------+---+------------+
| date| median | Q |result |
+----------+--------+---+------------+
|2018-03-31| 6| 1| 15|
|2018-03-31| 27| 2| 15|
|2018-03-31| 3| 3| 15|
|2018-03-31| 44| 4| 15|
|2018-06-30| 6| 1| 18|
|2018-06-30| 4| 3| 18|
|2018-06-30| 32| 2| 18|
|2018-06-30| 112| 4| 18|
|2018-09-30| 2| 1| 20|
|2018-09-30| 23| 4| 20|
|2018-09-30| 37| 3| 20|
|2018-09-30| 3| 2| 20|
+----------+--------+---+------------+
I tried the following code, but when I include the where clause it drops the Q=1 and Q=4 rows.
window = (
Window
.partitionBy("date")
.orderBy("date")
)
df_avg = (
df
.where(
(F.col("Q") == 2) |
(F.col("Q") == 3)
)
.withColumn("result", F.avg("median").over(window))
)
For both of your expected outputs you can use conditional aggregation: avg combined with when (and otherwise).
For the 1st expected output:
from pyspark.sql import functions as F, Window

window = (
Window
.partitionBy("date", F.col("Q").isin([2, 3]))
)
df_avg = (
df.withColumn("result", F.when(F.col("Q").isin([2, 3]), F.avg("median").over(window)))
)
For the 2nd expected output:
window = (
Window
.partitionBy("date")
)
df_avg = (
df.withColumn("result", F.avg(F.when(F.col("Q").isin([2, 3]), F.col("median"))).over(window))
)
Alternatively, since you are really aggregating a (small?) subset, you can replace the window with an aggregation plus a join back to the original dataframe (this reproduces the 2nd expected output):
>>> df_avg = df.where(F.col("Q").isin([2, 3])).groupBy("date").agg(F.avg("median").alias("result"))
>>> df_result = df.join(df_avg, ["date"], "left")
This might turn out to be faster than using a window.
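If you need the 1st expected output from this join variant, a when on Q can null out the rows that were not aggregated (a small sketch building on df_result above):
>>> from pyspark.sql import functions as F
>>> df_result = df_result.withColumn("result", F.when(F.col("Q").isin([2, 3]), F.col("result")))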
I'm struggling to figure this out. For each run of backfill records I need to find the last one (the record with reason backfill and the greatest timestamp in the run) and use its app_data to update the latest non-backfill record that precedes it.
Here is what I've tried -
w = Window.orderBy("idx")
w1 = Window.partitionBy('reason').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df_uahr.withColumn('idx',F.monotonically_increasing_id()).withColumn("app_data_new",F.last(F.lead("app_data").over(w)).over(w1)).orderBy("idx").show()
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|upstart_application_id| reason| created_at| updated_at| app_data|idx|app_data_new|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
| 2|disqualified |2018-07-12 15:57:26|2018-07-12 15:57:26| app_data_a| 0| app_data_c|
| 2| backfill|2020-05-29 17:47:09|2021-05-29 17:47:09| app_data_c| 1| null|
| 2| backfill|2022-03-09 09:47:09|2022-03-09 09:47:09| app_data_d| 2| null|
| 2| test|2022-04-09 09:47:09|2022-04-09 09:47:09| app_data_e| 3| app_data_f|
| 2| test|2022-04-19 09:47:09|2022-04-19 09:47:09|app_data_e_a| 4| app_data_f|
| 2| backfill|2022-05-09 09:47:09|2022-05-09 09:47:09| app_data_f| 5| null|
| 2| after|2023-04-09 09:47:09|2023-04-09 09:47:09| app_data_g| 6| app_data_h|
| 2| backfill|2023-05-09 09:47:09|2023-05-09 09:47:09| app_data_h| 7| null|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
Expected output:
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|upstart_application_id| reason| created_at| updated_at| app_data|idx|app_data_new|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
| 2|disqualified |2018-07-12 15:57:26|2018-07-12 15:57:26| app_data_a| 0| app_data_d|
| 2| backfill|2020-05-29 17:47:09|2021-05-29 17:47:09| app_data_c| 1| null|
| 2| backfill|2022-03-09 09:47:09|2022-03-09 09:47:09| app_data_d| 2| null|
| 2| test|2022-04-09 09:47:09|2022-04-09 09:47:09| app_data_e| 3| null|
| 2| test|2022-04-19 09:47:09|2022-04-19 09:47:09|app_data_e_a| 4| app_data_f|
| 2| backfill|2022-05-09 09:47:09|2022-05-09 09:47:09| app_data_f| 5| null|
| 2| after|2023-04-09 09:47:09|2023-04-09 09:47:09| app_data_g| 6| app_data_h|
| 2| backfill|2023-05-09 09:47:09|2023-05-09 09:47:09| app_data_h| 7| null|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
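One way to produce the expected app_data_new column is sketched below. It is not a tested solution: it assumes the rows can be ordered by idx as in the attempt above, and the helper columns (is_backfill, run_id, next_app_data, etc.) are made up for the sketch. The idea is to number the alternating backfill / non-backfill runs, take the last app_data of each backfill run, and attach it to the greatest-timestamp row of the preceding non-backfill run.
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

w_idx = W.orderBy("idx")  # note: an un-partitioned ordered window pulls all rows into one partition

df2 = (
    df_uahr
    .withColumn("idx", F.monotonically_increasing_id())
    # flag backfill rows and assign a run id that increases whenever the flag flips
    .withColumn("is_backfill", (F.col("reason") == "backfill").cast("int"))
    .withColumn("flip", (F.col("is_backfill") != F.lag("is_backfill", 1, -1).over(w_idx)).cast("int"))
    .withColumn("run_id", F.sum("flip").over(w_idx))
)

# last app_data of each backfill run (the row with the greatest idx in the run)
backfill_last = (
    df2.filter("is_backfill = 1")
    .groupBy("run_id")
    .agg(F.max(F.struct("idx", "app_data"))["app_data"].alias("next_app_data"))
    .withColumnRenamed("run_id", "next_run_id")
)

# the greatest-timestamp row of each non-backfill run takes the value
# from the backfill run that immediately follows it
w_run = W.partitionBy("run_id")
result = (
    df2.withColumn("is_last_in_run", F.col("idx") == F.max("idx").over(w_run))
    .withColumn("next_run_id", F.col("run_id") + 1)
    .join(backfill_last, "next_run_id", "left")
    .withColumn("app_data_new", F.when((F.col("is_backfill") == 0) & F.col("is_last_in_run"), F.col("next_app_data")))
    .drop("is_backfill", "flip", "run_id", "is_last_in_run", "next_run_id", "next_app_data")
    .orderBy("idx")
)
result.show()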
I have a Spark dataframe that looks like this:
+---+-----------+-------------------------+---------------+
| id| Phase | Switch | InputFileName |
+---+-----------+-------------------------+---------------+
| 1| 2| 1| fileA|
| 2| 2| 1| fileA|
| 3| 2| 1| fileA|
| 4| 2| 0| fileA|
| 5| 2| 0| fileA|
| 6| 2| 1| fileA|
| 11| 2| 1| fileB|
| 12| 2| 1| fileB|
| 13| 2| 0| fileB|
| 14| 2| 0| fileB|
| 15| 2| 1| fileB|
| 16| 2| 1| fileB|
| 21| 4| 1| fileB|
| 22| 4| 1| fileB|
| 23| 4| 1| fileB|
| 24| 4| 1| fileB|
| 25| 4| 1| fileB|
| 26| 4| 0| fileB|
| 31| 1| 0| fileC|
| 32| 1| 0| fileC|
| 33| 1| 0| fileC|
| 34| 1| 0| fileC|
| 35| 1| 0| fileC|
| 36| 1| 0| fileC|
+---+-----------+-------------------------+---------------+
For each group (a combination of InputFileName and Phase) I need to run a validation function which checks that Switch equals 1 at the very start and end of the group, and transitions to 0 at any point in-between. The function should add the validation result as a new column. The expected output is below: (gaps are just to highlight the different groups)
+---+-----------+-------------------------+---------------+--------+
| id| Phase | Switch | InputFileName | Valid |
+---+-----------+-------------------------+---------------+--------+
| 1| 2| 1| fileA| true |
| 2| 2| 1| fileA| true |
| 3| 2| 1| fileA| true |
| 4| 2| 0| fileA| true |
| 5| 2| 0| fileA| true |
| 6| 2| 1| fileA| true |
| 11| 2| 1| fileB| true |
| 12| 2| 1| fileB| true |
| 13| 2| 0| fileB| true |
| 14| 2| 0| fileB| true |
| 15| 2| 1| fileB| true |
| 16| 2| 1| fileB| true |
| 21| 4| 1| fileB| false|
| 22| 4| 1| fileB| false|
| 23| 4| 1| fileB| false|
| 24| 4| 1| fileB| false|
| 25| 4| 1| fileB| false|
| 26| 4| 0| fileB| false|
| 31| 1| 0| fileC| false|
| 32| 1| 0| fileC| false|
| 33| 1| 0| fileC| false|
| 34| 1| 0| fileC| false|
| 35| 1| 0| fileC| false|
| 36| 1| 0| fileC| false|
+---+-----------+-------------------------+---------------+--------+
I have previously solved this using PySpark and a pandas UDF:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def validate_profile(df: pd.DataFrame):
    first_valid = df["Switch"].iloc[0] == 1
    during_valid = (df["Switch"].iloc[1:-1] == 0).any()
    last_valid = df["Switch"].iloc[-1] == 1
    df["Valid"] = first_valid & during_valid & last_valid
    return df

df = df.groupBy("InputFileName", "Phase").apply(validate_profile)
However, now I need to rewrite this in Scala. I just want to know the best way of accomplishing this.
I'm currently trying window functions to get the first and last ids of each group:
val minIdWindow = Window.partitionBy("InputFileName", "Phase").orderBy("id")
val maxIdWindow = Window.partitionBy("InputFileName", "Phase").orderBy(col("id").desc)
I can then add the min and max ids as separate columns and use when to get the start and end values of Switch:
df.withColumn("MinId", min("id").over(minIdWindow))
.withColumn("MaxId", max("id").over(maxIdWindow))
.withColumn("Valid", when(
col("id") === col("MinId"), col("Switch")
).when(
col("id") === col("MaxId"), col("Switch")
))
This gets me the start and end values, but I'm not sure how to check if Switch equals 0 in between. Am I on the right track using window functions? Or would you recommend an alternative solution?
Try this,
val wind = Window.partitionBy("InputFileName", "Phase").orderBy("id")
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val df1 = df.withColumn("Valid",
when(first("Switch").over(wind) === 1
&& last("Switch").over(wind) === 1
&& min("Switch").over(wind) === 0, true)
.otherwise(false))
df1.orderBy("id").show() //Ordering for display purpose
Output:
+---+-----+------+-------------+-----+
| id|Phase|Switch|InputFileName|Valid|
+---+-----+------+-------------+-----+
| 1| 2| 1| fileA| true|
| 2| 2| 1| fileA| true|
| 3| 2| 1| fileA| true|
| 4| 2| 0| fileA| true|
| 5| 2| 0| fileA| true|
| 6| 2| 1| fileA| true|
| 11| 2| 1| fileB| true|
| 12| 2| 1| fileB| true|
| 13| 2| 0| fileB| true|
| 14| 2| 0| fileB| true|
| 15| 2| 1| fileB| true|
| 16| 2| 1| fileB| true|
| 21| 4| 1| fileB|false|
| 22| 4| 1| fileB|false|
| 23| 4| 1| fileB|false|
| 24| 4| 1| fileB|false|
| 25| 4| 1| fileB|false|
| 26| 4| 0| fileB|false|
| 31| 1| 0| fileC|false|
| 32| 1| 0| fileC|false|
+---+-----+------+-------------+-----+
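For reference, the same window logic in PySpark (not part of the original answer; a sketch using the column names from the question):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

wind = (Window.partitionBy("InputFileName", "Phase")
        .orderBy("id")
        .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df1 = df.withColumn(
    "Valid",
    (F.first("Switch").over(wind) == 1)
    & (F.last("Switch").over(wind) == 1)
    & (F.min("Switch").over(wind) == 0)
)
df1.orderBy("id").show()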
I have a PySpark dataframe with columns ID and BALANCE.
I am trying to bucket the column balance into 100 percentile (1-100%) buckets and calculate how many IDs fall in each bucket.
I cannot use anything related to RDDs; I can only use PySpark DataFrame syntax. I've tried the code below.
w = Window.orderBy(df.BALANCE)
test = df.withColumn('percentile_col', F.percent_rank().over(w))
I am hoping to get a new column that calculates the percentile of each data point in the BALANCE column while ignoring the missing values.
Spark 3.1+ has unionByName which has an optional argument allowMissingColumns. This makes it easier.
Test data:
from pyspark.sql import functions as F, Window as W
df = spark.range(12).withColumn(
'balance',
F.when(~F.col('id').isin([0, 1, 2, 3, 4]), F.col('id') + 500))
df.show()
#+---+-------+
#| id|balance|
#+---+-------+
#| 0| null|
#| 1| null|
#| 2| null|
#| 3| null|
#| 4| null|
#| 5| 505|
#| 6| 506|
#| 7| 507|
#| 8| 508|
#| 9| 509|
#| 10| 510|
#| 11| 511|
#+---+-------+
percent_rank will give you exact percentiles: the results may have many digits after the decimal point, which is why percent_rank alone may not give you the 1-100 buckets you want.
df = (
df.filter(~F.isnull('balance'))
.withColumn('percentile', F.percent_rank().over(W.orderBy('balance')))
.unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+-------------------+
#| id|balance| percentile|
#+---+-------+-------------------+
#| 5| 505| 0.0|
#| 6| 506|0.16666666666666666|
#| 7| 507| 0.3333333333333333|
#| 8| 508| 0.5|
#| 9| 509| 0.6666666666666666|
#| 10| 510| 0.8333333333333334|
#| 11| 511| 1.0|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+-------------------+
The following should work; a rounding step (using ceil) is added to map the exact percentiles into 1-100 buckets.
pr = F.percent_rank().over(W.orderBy('balance'))
df = (
df.filter(~F.isnull('balance'))
.withColumn('bucket', F.when(pr == 0, 1).otherwise(F.ceil(pr * 100)))
.unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+------+
#| id|balance|bucket|
#+---+-------+------+
#| 5| 505| 1|
#| 6| 506| 17|
#| 7| 507| 34|
#| 8| 508| 50|
#| 9| 509| 67|
#| 10| 510| 84|
#| 11| 511| 100|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+------+
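To get the count of IDs per bucket that the question ultimately asks for, a simple groupBy on the new column should work (a sketch using the bucket column created above; the null-balance rows are dropped from the counts):
from pyspark.sql import functions as F
bucket_counts = (
    df.filter(~F.isnull('bucket'))
    .groupBy('bucket')
    .count()
    .orderBy('bucket')
)
bucket_counts.show()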
You may also consider ntile, which assigns every value to one of n buckets.
When n=100 (the test table has fewer than 100 rows, so only the first buckets get values):
df = (
df.filter(~F.isnull('balance'))
.withColumn('ntile_100', F.ntile(100).over(W.orderBy('balance')))
.unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+---------+
#| id|balance|ntile_100|
#+---+-------+---------+
#| 5| 505| 1|
#| 6| 506| 2|
#| 7| 507| 3|
#| 8| 508| 4|
#| 9| 509| 5|
#| 10| 510| 6|
#| 11| 511| 7|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+---------+
When n=4:
df = (
df.filter(~F.isnull('balance'))
.withColumn('ntile_100', F.ntile(4).over(W.orderBy('balance')))
.unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+---------+
#| id|balance|ntile_100|
#+---+-------+---------+
#| 5| 505| 1|
#| 6| 506| 1|
#| 7| 507| 2|
#| 8| 508| 2|
#| 9| 509| 3|
#| 10| 510| 3|
#| 11| 511| 4|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+---------+
Try this.
We first check whether the df.BALANCE column is null: for null values we return None, otherwise percent_rank() is applied.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy(df.BALANCE)
test = df.withColumn('percentile_col', F.when(df.BALANCE.isNull(), F.lit(None)).otherwise(F.percent_rank().over(w)))
I'm trying to force Spark to apply a window function only to a specified subset of a dataframe, while the actual window still has access to rows outside of this subset. Let me go through an example:
I have a Spark dataframe that has been saved to hdfs. The dataframe contains events, so each row has a timestamp, an id and an integer feature. There is also a column that I calculate as a running sum with a window function, like this:
df = spark.table("some_table_in_hdfs")
w = Window.partitionBy("id").orderBy("date")
df = df.withColumn("feat_int_sum", F.sum("feat_int").over(w))
df.show()
+----------+--------+---+------------+
| date|feat_int| id|feat_int_sum|
+----------+--------+---+------------+
|2018-08-10| 5| 0| 5|
|2018-08-12| 27| 0| 32|
|2018-08-14| 3| 0| 35|
|2018-08-11| 32| 1| 32|
|2018-08-12| 552| 1| 584|
|2018-08-16| 2| 1| 586|
+----------+--------+---+------------+
When I do a load of new data from a different source, I would like to append to the above dataframe in hdfs. In order to do this, I have to apply the window function to the new data as well. I would like to union the two dataframes so that the window function has access to the "old" feat_int values to do the sum over.
df_new = spark.table("some_new_table")
df_new.show()
+----------+--------+---+
| date|feat_int| id|
+----------+--------+---+
|2018-08-20| 65| 0|
|2018-08-23| 3| 0|
|2018-08-24| 4| 0|
|2018-08-21| 69| 1|
|2018-08-25| 37| 1|
|2018-08-26| 3| 1|
+----------+--------+---+
df_union = df.union(df_new.withColumn("feat_int_sum", F.lit(None)))
df_union.show()
+----------+--------+---+------------+
| date|feat_int| id|feat_int_sum|
+----------+--------+---+------------+
|2018-08-10| 5| 0| 5|
|2018-08-12| 27| 0| 32|
|2018-08-14| 3| 0| 35|
|2018-08-20| 65| 0| null|
|2018-08-23| 3| 0| null|
|2018-08-24| 4| 0| null|
|2018-08-11| 32| 1| 32|
|2018-08-12| 552| 1| 584|
|2018-08-16| 2| 1| 586|
|2018-08-21| 69| 1| null|
|2018-08-25| 37| 1| null|
|2018-08-26| 3| 1| null|
+----------+--------+---+------------+
The problem is that I would like to apply the sum window function to df_union, but only to the rows where feat_int_sum is null. The reason is that I don't want to recalculate the window function for the values that are already computed in df. So the desired result would be something like this:
+----------+--------+---+------------+-----------------+
| date|feat_int| id|feat_int_sum|feat_int_sum_temp|
+----------+--------+---+------------+-----------------+
|2018-08-10| 5| 0| 5| null|
|2018-08-12| 27| 0| 32| null|
|2018-08-14| 3| 0| 35| null|
|2018-08-20| 65| 0| null| 100|
|2018-08-23| 3| 0| null| 103|
|2018-08-24| 4| 0| null| 107|
|2018-08-11| 32| 1| 32| null|
|2018-08-12| 552| 1| 584| null|
|2018-08-16| 2| 1| 586| null|
|2018-08-21| 69| 1| null| 655|
|2018-08-25| 37| 1| null| 692|
|2018-08-26| 3| 1| null| 695|
+----------+--------+---+------------+-----------------+
I tried wrapping the window function in a when statement like this:
df_union.withColumn("feat_int_sum_temp", F.when(F.col('date') > '2018-08-16', F.sum('feat_int').over(w)))
But looking at the Spark explain plan, it seems that it will run the window function for all rows and only afterwards apply the when condition.
The whole reason I don't want to run the window function on the old rows is that I'm dealing with some really big tables, and I don't want to waste computational resources recalculating values that will not be used.
After this step I would coalesce the feat_int_sum and the feat_int_sum_temp columns, and append only the new part of the data to hdfs.
I would appreciate any hints on how to force spark to only apply the window function on the specified subset, while the actual window has access to rows outside of this subset.
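One workaround (not from the question) is to avoid windowing over the union altogether: take the last cumulative feat_int_sum per id from the old data and add it to a running sum computed over the new rows only. The sketch below assumes the window really is a cumulative sum with the default frame of an ordered window, as in the code above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# last cumulative sum per id in the old data (picked by the latest date)
last_old = df.groupBy("id").agg(
    F.max(F.struct("date", "feat_int_sum"))["feat_int_sum"].alias("prev_sum")
)

# running sum over the new rows only, shifted by the carried-over total
w_new = Window.partitionBy("id").orderBy("date")
df_new_sum = (
    df_new
    .withColumn("running_sum", F.sum("feat_int").over(w_new))
    .join(last_old, "id", "left")
    .withColumn("feat_int_sum", F.col("running_sum") + F.coalesce(F.col("prev_sum"), F.lit(0)))
    .drop("running_sum", "prev_sum")
)
df_new_sum.show()
This never touches the old rows, so only the freshly computed part needs to be appended to the table in hdfs.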