I have a data frame read with the sqlContext.sql function in PySpark.
It contains 4 numeric columns with information per client (ClientId is the key).
I need to calculate the max value per client and join this value to the data frame:
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
| 0| null| null| null| null|
| 1| null| null| null| null|
| 2| null| null| null| null|
| 3| null| null| null| null|
| 4| null| null| null| null|
| 5| null| null| null| null|
| 6| 23| 13| 17| 8|
| 7| null| null| null| null|
| 8| null| null| null| null|
| 9| null| null| null| null|
| 10| 34| 2| 4| 0|
| 11| 0| 0| 0| 0|
| 12| 0| 0| 0| 0|
| 13| 0| 0| 30| 0|
| 14| null| null| null| null|
| 15| null| null| null| null|
| 16| 37| 29| 29| 29|
| 17| 0| 0| 16| 0|
| 18| 0| 0| 0| 0|
| 19| null| null| null| null|
+--------+-------+-------+-------+-------+
In this case, the max value for client six is 23 and for client ten it is 34; rows where all values are null should naturally stay null in the new column.
Please show me how I can do this operation.
There is a function for that: pyspark.sql.functions.greatest.
>>> df = spark.createDataFrame([(1, 4, 3)], ['a', 'b', 'c'])
>>> df.select(greatest(df.a, df.b, df.c).alias("greatest")).collect()
[Row(greatest=4)]
The example was taken directly from the docs.
(Least does the opposite.)
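Applied to the question's dataframe (here assumed to be called df, with the columns shown above), a minimal sketch; greatest skips nulls and returns null only when every input is null:
import pyspark.sql.functions as F
df_with_max = df.withColumn('max', F.greatest('m_ant21', 'm_ant22', 'm_ant23', 'm_ant24'))
df_with_max.select('ClientId', 'max').show()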
I think combining the values into a list and then finding the max of it would be the simplest approach.
from pyspark.sql.types import *

schema = StructType([
    StructField("ClientId", IntegerType(), True),
    StructField("m_ant21", IntegerType(), True),
    StructField("m_ant22", IntegerType(), True),
    StructField("m_ant23", IntegerType(), True),
    StructField("m_ant24", IntegerType(), True)
])
df = spark.createDataFrame(
    data=[(0, None, None, None, None),
          (1, 23, 13, 17, 99),
          (2, 0, 0, 0, 1),
          (3, 0, None, 1, 0)],
    schema=schema)
import pyspark.sql.functions as F

def agg_to_list(m21, m22, m23, m24):
    return [m21, m22, m23, m24]

u_agg_to_list = F.udf(agg_to_list, ArrayType(IntegerType()))

# sort descending: nulls are placed last, so element 0 is the max (null only if all values are null)
df2 = df.withColumn('all_values', u_agg_to_list('m_ant21', 'm_ant22', 'm_ant23', 'm_ant24'))\
        .withColumn('max', F.sort_array("all_values", False)[0])\
        .select('ClientId', 'max')
df2.show()
Output:
+--------+----+
|ClientId|max |
+--------+----+
|0 |null|
|1 |99 |
|2 |1 |
|3 |1 |
+--------+----+
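On Spark 2.4+ the same result can also be obtained without a UDF; a sketch using array and array_max (array_max skips null elements, so the result is null only when all four values are null):
import pyspark.sql.functions as F
cols = ['m_ant21', 'm_ant22', 'm_ant23', 'm_ant24']
df2 = df.withColumn('max', F.array_max(F.array(*cols))).select('ClientId', 'max')
df2.show()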
I would like to interpolate time series data. The challenge is to interpolate only if the time interval between the existing values is not greater than a specified limit.
Input data
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.memory", "60g").getOrCreate()
df = spark.createDataFrame([{'timestamp': 1642205833225, 'value': 58.00},
{'timestamp': 1642205888654, 'value': float('nan')},
{'timestamp': 1642205899657, 'value': float('nan')},
{'timestamp': 1642205892970, 'value': 55.00},
{'timestamp': 1642206338180, 'value': float('nan')},
{'timestamp': 1642206353652, 'value': 56.45},
{'timestamp': 1642206853451, 'value': float('nan')},
{'timestamp': 1642207353652, 'value': 80.45}
])
df.show()
+-------------+-----+
| timestamp|value|
+-------------+-----+
|1642205833225| 58.0|
|1642205888654| NaN|
|1642205899657|  NaN|
|1642205892970| 55.0|
|1642206338180| NaN|
|1642206353652|56.45|
|1642206853451| NaN|
|1642207353652|80.45|
+-------------+-----+
First I want to calculate the time gap to the next existing value (next_timestamp - current_timestamp).
+-------------+-----+---------------+
| timestamp|value|timegap_to_next|
+-------------+-----+---------------+
|1642205833225| 58.0| 59745|
|1642205888654| NaN| NaN|
|1642205899657| NaN| NaN|
|1642205892970| 55.0| 460682|
|1642206338180| NaN| NaN|
|1642206353652|56.45| 1030300|
|1642206853451| NaN| NaN|
|1642207383952|80.45| NaN|
+-------------+-----+---------------+
The interpolation should then be done based on the calculated time gap. In this case the threshold is 500000.
Final Output:
+-------------+-----+---------------+
| timestamp|value|timegap_to_next|
+-------------+-----+---------------+
|1642205833225| 58.0| 59745|
|1642205888654| 57.0| NaN|
|1642205899657| 56.0| NaN|
|1642205892970| 55.0| 460682|
|1642206338180|55.75| NaN|
|1642206353652|56.45| 1030300|
|1642206853451| NaN| NaN|
|1642207383952|80.45| NaN|
+-------------+-----+---------------+
Can anybody help me with this special case? That would be very nice!
Having this input dataframe:
df = spark.createDataFrame([
(1642205833225, 58.00), (1642205888654, float('nan')),
(1642205899657, float('nan')), (1642205899970, 55.00),
(1642206338180, float('nan')), (1642206353652, 56.45),
(1642206853451, float('nan')), (1642207353652, 80.45)
], ["timestamp", "value"])
# replace NaN values with nulls
df = df.replace(float("nan"), None, ["value"])
You can use window functions (last, first) to get the next and previous non-null values for each row and calculate the time gap like this:
from pyspark.sql import functions as F, Window
w1 = Window.orderBy("timestamp").rowsBetween(1, Window.unboundedFollowing)
w2 = Window.orderBy("timestamp").rowsBetween(Window.unboundedPreceding, -1)
df = (
    df.withColumn("rn", F.row_number().over(Window.orderBy("timestamp")))
      .withColumn("next_val", F.first("value", ignorenulls=True).over(w1))
      .withColumn("next_rn", F.first(F.when(F.col("value").isNotNull(), F.col("rn")), ignorenulls=True).over(w1))
      .withColumn("prev_val", F.last("value", ignorenulls=True).over(w2))
      .withColumn("prev_rn", F.last(F.when(F.col("value").isNotNull(), F.col("rn")), ignorenulls=True).over(w2))
      .withColumn("timegap_to_next", F.when(F.col("value").isNotNull(), F.min(F.when(F.col("value").isNotNull(), F.col("timestamp"))).over(w1) - F.col("timestamp")))
)
Now you can do the linear interpolation of the value column depending on your threshold, using a when expression:
w3 = Window.orderBy("timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn(
    "value",
    F.coalesce(
        "value",
        F.when(
            F.last("timegap_to_next", ignorenulls=True).over(w3) < 500000,
            F.col("prev_val")
            + (F.col("next_val") - F.col("prev_val"))
            / (F.col("next_timestamp") - F.col("prev_next_timestamp"))
            * (F.col("timestamp") - F.col("prev_next_timestamp"))
        )
    )
).select("timestamp", "value", "timegap_to_next")
df.show()
#+-------------+------+---------------+
#| timestamp| value|timegap_to_next|
#+-------------+------+---------------+
#|1642205833225| 58.0| 66745|
#|1642205888654| 56.0| null|
#|1642205899657| 57.0| null|
#|1642205899970| 55.0| 453682|
#|1642206338180|55.725| null|
#|1642206353652| 56.45| 1000000|
#|1642206853451| null| null|
#|1642207353652| 80.45| null|
#+-------------+------+---------------+
I was looking for a solution to this problem, and I noticed that the last part of the answer above cannot be run directly as it is, since some columns are not defined. I edited and simplified it a bit (I am using window functions only in the first step) in case anyone else finds it useful.
Input dataframe:
df = spark.createDataFrame([
(0, None),
(10, 10.00),
(20, None), # 20
(30, 30.00),
(40, None), # 25
(50, None), # 20
(60, 15.00),
(70, None), # 20
(80, 25.00),
(90, None)
], ["timestamp", "value"])
Compute previous/next timestamp and values using window functions:
from pyspark.sql import functions as F, Window
w1 = Window.orderBy("timestamp").rowsBetween(1, Window.unboundedFollowing)
w2 = Window.orderBy("timestamp").rowsBetween(Window.unboundedPreceding, -1)
df = (
    df.withColumn("next_val", F.first("value", ignorenulls=True).over(w1))
      .withColumn("prev_val", F.last("value", ignorenulls=True).over(w2))
      .withColumn("next_timestamp", F.first(F.when(F.col("value").isNotNull(), F.col("timestamp")), ignorenulls=True).over(w1))
      .withColumn("prev_timestamp", F.last(F.when(F.col("value").isNotNull(), F.col("timestamp")), ignorenulls=True).over(w2))
)
df.show()
+---------+-----+--------+--------+--------------+--------------+
|timestamp|value|next_val|prev_val|next_timestamp|prev_timestamp|
+---------+-----+--------+--------+--------------+--------------+
| 0| null| 10.0| null| 10| null|
| 10| 10.0| 30.0| null| 30| null|
| 20| null| 30.0| 10.0| 30| 10|
| 30| 30.0| 15.0| 10.0| 60| 10|
| 40| null| 15.0| 30.0| 60| 30|
| 50| null| 15.0| 30.0| 60| 30|
| 60| 15.0| 25.0| 30.0| 80| 30|
| 70| null| 25.0| 15.0| 80| 60|
| 80| 25.0| null| 15.0| null| 60|
| 90| null| null| 25.0| null| 80|
+---------+-----+--------+--------+--------------+--------------+
Do the interpolation with a condition on the length of the missing-value intervals. In this case the time gap between two consecutive existing values needs to be less than 30.
df = (
    df.withColumn("timegap", F.when(F.col("value").isNull(), F.col("next_timestamp") - F.col("prev_timestamp")))
      .withColumn(
          "new_value",
          F.when(
              (F.col("value").isNull()) & (F.col("timegap") < 30),
              F.round(F.col("prev_val") + (F.col("next_val") - F.col("prev_val")) / (F.col("next_timestamp") - F.col("prev_timestamp")) * (F.col("timestamp") - F.col("prev_timestamp")), 2)
          ).otherwise(F.col("value"))
      )
)
df.select('timestamp','value', 'timegap', 'new_value').show()
+---------+-----+-------+---------+
|timestamp|value|timegap|new_value|
+---------+-----+-------+---------+
| 0| null| null| null|
| 10| 10.0| null| 10.0|
| 20| null| 20| 20.0|
| 30| 30.0| null| 30.0|
| 40| null| 30| null|
| 50| null| 30| null|
| 60| 15.0| null| 15.0|
| 70| null| 20| 20.0|
| 80| 25.0| null| 25.0|
| 90| null| null| null|
+---------+-----+-------+---------+
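If only the interpolated series is needed, the helper columns can be dropped afterwards, for example:
result = df.select('timestamp', F.col('new_value').alias('value'))
result.show()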
Learning Apache Spark through PySpark and having issues.
I have the following DF:
+----------+------------+-----------+----------------+
| game_id|posteam_type|total_plays|total_touchdowns|
+----------+------------+-----------+----------------+
|2009092003| home| 90| 3|
|2010091912| home| 95| 0|
|2010112106| home| 75| 0|
|2010121213| home| 85| 3|
|2009092011| null| 9| null|
|2010110703| null| 2| null|
|2010112111| null| 6| null|
|2011100909| home| 102| 3|
|2011120800| home| 72| 2|
|2012010110| home| 74| 6|
|2012110410| home| 68| 1|
|2012120911| away| 91| 2|
|2011103008| null| 6| null|
|2012111100| null| 3| null|
|2013092212| home| 86| 6|
|2013112407| home| 73| 4|
|2013120106| home| 99| 3|
|2014090705| home| 94| 3|
|2014101203| home| 77| 4|
|2014102611| home| 107| 6|
+----------+------------+-----------+----------------+
I'm attempting to find the average number of plays it takes to score a TD, i.e. sum(total_plays)/sum(total_touchdowns).
I figured out the code to get the sums but can't figure out how to get the total average:
plays = nfl_game_play.groupBy().agg({'total_plays': 'sum'}).collect()
touchdowns = nfl_game_play.groupBy().agg({'total_touchdowns': 'sum'}).collect()
As you can see, I tried storing each sum as a variable, but I don't see a way to combine them beyond remembering what each value is and dividing them manually.
Try the code below:
Example:
df.show()
#+-----------+----------------+
#|total_plays|total_touchdowns|
#+-----------+----------------+
#| 90| 3|
#| 95| 0|
#| 9| null|
#+-----------+----------------+
from pyspark.sql.functions import *
total_avg=df.groupBy().agg(sum("total_plays")/sum("total_touchdowns")).collect()[0][0]
#64.66666666666667
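Equivalently, with the usual F alias and a named aggregate column (a sketch of the same computation):
from pyspark.sql import functions as F
total_avg = df.agg((F.sum("total_plays") / F.sum("total_touchdowns")).alias("avg_plays_per_td")).first()["avg_plays_per_td"]
#64.66666666666667 for the small example above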
I have a dataframe similar to the one below. I originally filled all null values with -1 to do my joins in PySpark.
df = pd.DataFrame({'Number': ['1', '2', '-1', '-1'],
'Letter': ['A', '-1', 'B', 'A'],
'Value': [30, 30, 30, -1]})
pyspark_df = spark.createDataFrame(df)
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| -1| 30|
| -1| B| 30|
| -1| A| -1|
+------+------+-----+
After processing the dataset, I need to replace all the -1 values back with nulls.
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| null| 30|
| null| B| 30|
| null| A| null|
+------+------+-----+
What's the easiest way to do this?
Another way to do this in a less verbose manner is to use replace.
pyspark_df.replace(-1,None).replace('-1',None).show()
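If the sentinel should only be replaced in specific columns, replace also accepts a subset argument (a sketch, assuming the columns from the example):
pyspark_df.replace(-1, None, subset=['Value']).replace('-1', None, subset=['Number', 'Letter']).show()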
when+otherwise shall do the trick:
import pyspark.sql.functions as F
pyspark_df.select([
    F.when(F.col(i).cast("Integer") < 0, None).otherwise(F.col(i)).alias(i)
    for i in pyspark_df.columns
]).show()
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| null| 30|
| null| B| 30|
| null| A| null|
+------+------+-----+
You can scan all columns and replace -1's with None:
import pyspark.sql.functions as F
for x in pyspark_df.columns:
    pyspark_df = pyspark_df.withColumn(x, F.when(F.col(x) == -1, F.lit(None)).otherwise(F.col(x)))
pyspark_df.show()
Output:
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| null| 30|
| null| B| 30|
| null| A| null|
+------+------+-----+
Use reduce to apply when + otherwise to all columns of the dataframe.
df.show()
#+------+------+-----+
#|Number|Letter|Value|
#+------+------+-----+
#| 1| A| 30|
#| 2| -1| 30|
#| -1| B| 30|
#+------+------+-----+
from functools import reduce
from pyspark.sql.functions import when, col, lit

reduce(
    lambda new_df, col_name: new_df.withColumn(col_name, when(col(col_name) == '-1', lit(None)).otherwise(col(col_name))),
    df.columns,
    df
).show()
#+------+------+-----+
#|Number|Letter|Value|
#+------+------+-----+
#| 1| A| 30|
#| 2| null| 30|
#| null| B| 30|
#+------+------+-----+
I have a PySpark dataframe with columns ID and BALANCE.
I am trying to bucket the BALANCE column into 100 percentile buckets (1-100%) and calculate how many IDs fall into each bucket.
I cannot use anything related to RDD, I can only use PySpark syntax. I've tried the code below.
w = Window.orderBy(df.BALANCE)
test = df.withColumn('percentile_col', F.percent_rank().over(w))
I am hoping to get a new column that automatically calculates the percentile of each data point in the BALANCE column while ignoring missing values.
Spark 3.1+ has unionByName which has an optional argument allowMissingColumns. This makes it easier.
Test data:
from pyspark.sql import functions as F, Window as W
df = spark.range(12).withColumn(
'balance',
F.when(~F.col('id').isin([0, 1, 2, 3, 4]), F.col('id') + 500))
df.show()
#+---+-------+
#| id|balance|
#+---+-------+
#| 0| null|
#| 1| null|
#| 2| null|
#| 3| null|
#| 4| null|
#| 5| 505|
#| 6| 506|
#| 7| 507|
#| 8| 508|
#| 9| 509|
#| 10| 510|
#| 11| 511|
#+---+-------+
percent_rank will give you exact percentiles: the results may have many digits after the decimal point, so percent_rank alone may not provide what you want.
df = (
    df.filter(~F.isnull('balance'))
      .withColumn('percentile', F.percent_rank().over(W.orderBy('balance')))
      .unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+-------------------+
#| id|balance| percentile|
#+---+-------+-------------------+
#| 5| 505| 0.0|
#| 6| 506|0.16666666666666666|
#| 7| 507| 0.3333333333333333|
#| 8| 508| 0.5|
#| 9| 509| 0.6666666666666666|
#| 10| 510| 0.8333333333333334|
#| 11| 511| 1.0|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+-------------------+
The following should work; a bucketing step is added (a percentile of exactly 0 goes to bucket 1, otherwise ceil(pr * 100)).
pr = F.percent_rank().over(W.orderBy('balance'))
df = (
    df.filter(~F.isnull('balance'))
      .withColumn('bucket', F.when(pr == 0, 1).otherwise(F.ceil(pr * 100)))
      .unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+------+
#| id|balance|bucket|
#+---+-------+------+
#| 5| 505| 1|
#| 6| 506| 17|
#| 7| 507| 34|
#| 8| 508| 50|
#| 9| 509| 67|
#| 10| 510| 84|
#| 11| 511| 100|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+------+
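To answer the second part of the question (how many IDs fall in each bucket), a simple aggregation on top of this works, e.g.:
df.groupBy('bucket').count().orderBy('bucket').show()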
You may also consider ntile, which assigns every value to one of n "buckets".
When n=100 (the test table has fewer than 100 rows, so only the first buckets get values):
df = (
    df.filter(~F.isnull('balance'))
      .withColumn('ntile_100', F.ntile(100).over(W.orderBy('balance')))
      .unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+---------+
#| id|balance|ntile_100|
#+---+-------+---------+
#| 5| 505| 1|
#| 6| 506| 2|
#| 7| 507| 3|
#| 8| 508| 4|
#| 9| 509| 5|
#| 10| 510| 6|
#| 11| 511| 7|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+---------+
When n=4:
df = (
    df.filter(~F.isnull('balance'))
      .withColumn('ntile_4', F.ntile(4).over(W.orderBy('balance')))
      .unionByName(df.filter(F.isnull('balance')), True)
)
df.show()
#+---+-------+---------+
#| id|balance|  ntile_4|
#+---+-------+---------+
#| 5| 505| 1|
#| 6| 506| 1|
#| 7| 507| 2|
#| 8| 508| 2|
#| 9| 509| 3|
#| 10| 510| 3|
#| 11| 511| 4|
#| 0| null| null|
#| 1| null| null|
#| 2| null| null|
#| 3| null| null|
#| 4| null| null|
#+---+-------+---------+
Try this.
We first check whether the df.BALANCE column is null; where it is, we return None, otherwise the percent_rank() function gets applied.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy(df.BALANCE)
test = df.withColumn(
    'percentile_col',
    F.when(df.BALANCE.isNull(), F.lit(None)).otherwise(F.percent_rank().over(w))
)
Let's say I have a Spark DataFrame like this:
+------------------+----------+--------------+-----+
| user| dt| action|count|
+------------------+----------+--------------+-----+
|Albert |2018-03-24|Action1 | 19|
|Albert |2018-03-25|Action1 | 1|
|Albert |2018-03-26|Action1 | 6|
|Barack |2018-03-26|Action2 | 3|
|Barack |2018-03-26|Action3 | 1|
|Donald |2018-03-26|Action3 | 29|
|Hillary |2018-03-24|Action1 | 4|
|Hillary |2018-03-26|Action2 | 2|
and I'd like to have the counts for Action1/Action2/Action3 in separate columns, i.e. to convert it into another DataFrame like this:
+------------------+----------+-------------+-------------+-------------+
| user| dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert |2018-03-24| 19| 0| 0|
|Albert |2018-03-25| 1| 0| 0|
|Albert |2018-03-26| 6| 0| 0|
|Barack |2018-03-26| 0| 3| 0|
|Barack |2018-03-26| 0| 0| 1|
|Donald |2018-03-26| 0| 0| 29|
|Hillary |2018-03-24| 4| 0| 0|
|Hillary |2018-03-26| 0| 2| 0|
As I'm a newbie to Spark, my attempt to get there was quite dull and straightforward:
Get 3 new DFs by filtering on each "action"
Join the original DF with each of the new ones, taking the joined DF's "count" as the new column
The code I tried looked like this:
val a1 = originalDf.filter("action = 'Action1'")
val df1 = originalDf.as('o)
.join(a1,
($"o.user" === $"a1.user" && $"o.dt" === $"a1.dt"),
"left_outer")
.select($"o.user", $"o.dt", $"a1.count".as("action1_count"))
Then do the same with Action2/Action3, then join those.
However, even at this stage I've already run into several problems with this approach:
It doesn't work at all - I mean fails with an error the reason of which I don't understand: org.apache.spark.sql.AnalysisException: cannot resolve 'o.user' given input columns: [user, dt, action, count, user, dt, action, count];
Even if it succeeded, I assume I would have got nulls where I need zeros.
I feel there should be a better way to achieve this, like some map construct or something, but at the moment I don't feel able to construct the transform required to convert the first dataframe into the second one.
Since right now I don't have a working solution at all, I'll be very thankful for any suggestions.
UPD: I might also get DFs that don't contain all 3 possible "action" values, for instance
+------------------+----------+--------------+-----+
| user| dt| action|count|
+------------------+----------+--------------+-----+
|Albert |2018-03-24|Action1 | 19|
|Albert |2018-03-25|Action1 | 1|
|Albert |2018-03-26|Action1 | 6|
|Hillary |2018-03-24|Action1 | 4|
For those, I still need the resulting DF with 3 columns:
+------------------+----------+-------------+-------------+-------------+
| user| dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert |2018-03-24| 19| 0| 0|
|Albert |2018-03-25| 1| 0| 0|
|Albert |2018-03-26| 6| 0| 0|
|Hillary |2018-03-24| 4| 0| 0|
You can avoid multiple joins by using when to select the appropriate value for each column.
About your join, I don't really think it should fail with an exception like cannot resolve 'o.user'; you may want to check your code again.
val df = Seq(("Albert","2018-03-24","Action1",19),
("Albert","2018-03-25","Action1",1),
("Albert","2018-03-26","Action1",6),
("Barack","2018-03-26","Action2",3),
("Barack","2018-03-26","Action3",1),
("Donald","2018-03-26","Action3",29),
("Hillary","2018-03-24","Action1",4),
("Hillary","2018-03-26","Action2",2)).toDF("user", "dt", "action", "count")
val df2 = df.withColumn("count1", when($"action" === "Action1", $"count").otherwise(lit(0))).
withColumn("count2", when($"action" === "Action2", $"count").otherwise(lit(0))).
withColumn("count3", when($"action" === "Action3", $"count").otherwise(lit(0)))
+-------+----------+-------+-----+------+------+------+
|user |dt |action |count|count1|count2|count3|
+-------+----------+-------+-----+------+------+------+
|Albert |2018-03-24|Action1|19 |19 |0 |0 |
|Albert |2018-03-25|Action1|1 |1 |0 |0 |
|Albert |2018-03-26|Action1|6 |6 |0 |0 |
|Barack |2018-03-26|Action2|3 |0 |3 |0 |
|Barack |2018-03-26|Action3|1 |0 |0 |1 |
|Donald |2018-03-26|Action3|29 |0 |0 |29 |
|Hillary|2018-03-24|Action1|4 |4 |0 |0 |
|Hillary|2018-03-26|Action2|2 |0 |2 |0 |
+-------+----------+-------+-----+------+------+------+
Here's one approach using pivot and first, with the advantage of not having to know what the action values are:
val df = Seq(
("Albert", "2018-03-24", "Action1", 19),
("Albert", "2018-03-25", "Action1", 1),
("Albert", "2018-03-26", "Action1", 6),
("Barack", "2018-03-26", "Action2", 3),
("Barack", "2018-03-26", "Action3", 1),
("Donald", "2018-03-26", "Action3", 29),
("Hillary", "2018-03-24", "Action1", 4),
("Hillary", "2018-03-26", "Action2", 2)
).toDF("user", "dt", "action", "count")
val pivotDF = df.groupBy("user", "dt", "action").pivot("action").agg(first($"count")).
na.fill(0).
orderBy("user", "dt", "action")
// +-------+----------+-------+-------+-------+-------+
// | user| dt| action|Action1|Action2|Action3|
// +-------+----------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1| 19| 0| 0|
// | Albert|2018-03-25|Action1| 1| 0| 0|
// | Albert|2018-03-26|Action1| 6| 0| 0|
// | Barack|2018-03-26|Action2| 0| 3| 0|
// | Barack|2018-03-26|Action3| 0| 0| 1|
// | Donald|2018-03-26|Action3| 0| 0| 29|
// |Hillary|2018-03-24|Action1| 4| 0| 0|
// |Hillary|2018-03-26|Action2| 0| 2| 0|
// +-------+----------+-------+-------+-------+-------+
[UPDATE]
Per the comments, if you need more Action? columns than the values present in the pivot column, you can traverse the missing Action? values and add them as zero-filled columns:
val fullActionList = List("Action1", "Action2", "Action3", "Action4", "Action5")
val missingActions = fullActionList.diff(
pivotDF.select($"action").as[String].collect.toList.distinct
)
// missingActions: List[String] = List(Action4, Action5)
missingActions.foldLeft( pivotDF )( _.withColumn(_, lit(0)) ).
show
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | user| dt| action|Action1|Action2|Action3|Action4|Action5|
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1| 19| 0| 0| 0| 0|
// | Albert|2018-03-25|Action1| 1| 0| 0| 0| 0|
// | Albert|2018-03-26|Action1| 6| 0| 0| 0| 0|
// | Barack|2018-03-26|Action2| 0| 3| 0| 0| 0|
// | Barack|2018-03-26|Action3| 0| 0| 1| 0| 0|
// | Donald|2018-03-26|Action3| 0| 0| 29| 0| 0|
// |Hillary|2018-03-24|Action1| 4| 0| 0| 0| 0|
// |Hillary|2018-03-26|Action2| 0| 2| 0| 0| 0|
// +-------+----------+-------+-------+-------+-------+-------+-------+
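For reference, a rough PySpark equivalent of the same pivot approach (a sketch, assuming a PySpark DataFrame df with the same columns):
from pyspark.sql import functions as F
pivot_df = (df.groupBy("user", "dt", "action")
    .pivot("action")
    .agg(F.first("count"))
    .na.fill(0)
    .orderBy("user", "dt", "action"))
pivot_df.show()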