Calculate duration within groups in PySpark - apache-spark

I want to calculate the duration within groups of the same date_id, subs_no, year, month, and day. If it's the first entry, it should just display "first".
Here's my dataset:
+--------+---------------+--------+----+-----+---+
| date_id| ts| subs_no|year|month|day|
+--------+---------------+--------+----+-----+---+
|20200801|14:27:18.000000|10007239|2022| 6| 1|
|20200801|14:29:44.000000|10054647|2022| 6| 1|
|20200801|08:24:21.000000|10057750|2022| 6| 1|
|20200801|13:49:27.000000|10019958|2022| 6| 1|
|20200801|20:07:32.000000|10019958|2022| 6| 1|
+--------+---------------+--------+----+-----+---+
NB: column "ts" is of string type.
Here's my expected output:
+--------+---------------+--------+----+-----+---+---------+
| date_id| ts| subs_no|year|month|day| duration|
+--------+---------------+--------+----+-----+---+---------+
|20200801|14:27:18.000000|10007239|2022| 6| 1| first |
|20200801|14:29:44.000000|10054647|2022| 6| 1| first |
|20200801|08:24:21.000000|10057750|2022| 6| 1| first |
|20200801|13:49:27.000000|10019958|2022| 6| 1| first |
|20200801|20:07:32.000000|10019958|2022| 6| 1| 6:18:05 |
+--------+---------------+--------+----+-----+---+---------+

You could try combining the year, month, day, and ts columns into one real timestamp. Then compute the duration against min as a window function. Finally, replace durations of "00:00:00" with "first".
Input:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [('20200801', '14:27:18.000000', '10007239', 2022, 6, 1),
     ('20200801', '14:29:44.000000', '10054647', 2022, 6, 1),
     ('20200801', '08:24:21.000000', '10057750', 2022, 6, 1),
     ('20200801', '13:49:27.000000', '10019958', 2022, 6, 1),
     ('20200801', '20:07:32.000000', '10019958', 2022, 6, 1)],
    ['date_id', 'ts', 'subs_no', 'year', 'month', 'day'])
Script:
ts = F.to_timestamp(F.format_string('%d-%d-%d %s','year', 'month', 'day', 'ts'))
w = W.partitionBy('date_id', 'subs_no', 'year', 'month', 'day').orderBy(ts)
df = df.withColumn(
    'duration',
    F.regexp_extract(ts - F.min(ts).over(w), r'\d\d:\d\d:\d\d', 0)
)
df = df.replace('00:00:00', 'first', 'duration')
df.show()
# +--------+---------------+--------+----+-----+---+--------+
# |date_id |ts |subs_no |year|month|day|duration|
# +--------+---------------+--------+----+-----+---+--------+
# |20200801|14:27:18.000000|10007239|2022|6 |1 |first |
# |20200801|13:49:27.000000|10019958|2022|6 |1 |first |
# |20200801|20:07:32.000000|10019958|2022|6 |1 |06:18:05|
# |20200801|14:29:44.000000|10054647|2022|6 |1 |first |
# |20200801|08:24:21.000000|10057750|2022|6 |1 |first |
# +--------+---------------+--------+----+-----+---+--------+
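One caveat with the regexp approach (an aside, not from the original answer): the interval's string form only surfaces the HH:MM:SS part, so gaps of a day or more would be silently truncated. If that can happen, here is a sketch that works in whole seconds instead, reusing the `ts` expression and window `w` above (`df2`, `full_ts`, and `duration_sec` are illustrative names):
# sketch: keep the gap as seconds so multi-day gaps are not lost
df2 = (df
    .withColumn('full_ts', ts)  # the combined timestamp expression from above
    .withColumn('duration_sec',
                F.col('full_ts').cast('long') - F.min('full_ts').over(w).cast('long'))
    .withColumn('duration',
                F.when(F.col('duration_sec') == 0, 'first')
                 .otherwise(F.col('duration_sec').cast('string'))))
df2.show()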

Use a window function. Code and logic below:
from pyspark.sql import Window
from pyspark.sql.functions import to_timestamp, when, first, col, regexp_extract

w = Window.partitionBy('date_id', 'subs_no', 'year', 'month').orderBy('date_id', 'subs_no', 'year', 'month')
new = (df.withColumn('ty', to_timestamp('ts'))  # coerce ts to a timestamp
       # where ty equals the first ty in the window, label the row "first";
       # otherwise compute the gap from the first ty and extract the elapsed time
       .withColumn('duration', when(first('ty').over(w) == col('ty'), 'first')
                   .otherwise(regexp_extract(col('ty') - first('ty').over(w), r'\d{2}:\d{2}:\d{2}', 0)))
       .drop('ty')
       .orderBy('date_id', 'ts', 'subs_no', 'year', 'month'))
new.show()
+--------+---------------+--------+----+-----+---+--------+
| date_id| ts| subs_no|year|month|day|duration|
+--------+---------------+--------+----+-----+---+--------+
|20200801|08:24:21.000000|10057750|2022| 6| 1| first|
|20200801|13:49:27.000000|10019958|2022| 6| 1| first|
|20200801|14:27:18.000000|10007239|2022| 6| 1| first|
|20200801|14:29:44.000000|10054647|2022| 6| 1| first|
|20200801|20:07:32.000000|10019958|2022| 6| 1|06:18:05|
+--------+---------------+--------+----+-----+---+--------+

Related

How to fill up null values in Spark Dataframe based on other columns' value?

Given this dataframe:
+-----+-----+----+
|num_a|num_b| sum|
+-----+-----+----+
| 1| 1| 2|
| 12| 15| 27|
| 56| 11|null|
| 79| 3| 82|
| 111| 114| 225|
+-----+-----+----+
How would you fill up Null values in sum column if the value can be gathered from other columns? In this example 56+11 would be the value.
I've tried df.fillna with a UDF, but that doesn't seem to work, as it was just getting the column name, not the actual value. I want to compute the value only for rows with missing values, so creating a new column would not be a viable option.
If your requirement is to use a UDF, it can be done as follows:
import pyspark.sql.functions as F
from pyspark.sql.types import LongType

df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)

@F.udf(returnType=LongType())
def fill_with_sum(num_a, num_b, sum):
    # fill only when sum is missing; otherwise keep the existing value
    return num_a + num_b if sum is None else sum

df = df.withColumn("sum", fill_with_sum(F.col("num_a"), F.col("num_b"), F.col("sum")))
[Out]:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
| 1| 2| 3|
| 12| 15| 27|
| 56| 11| 67|
| 79| 3| 82|
+-----+-----+---+
You can use the coalesce function. Check this sample code:
import pyspark.sql.functions as f

df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)
df.withColumn("sum", f.coalesce(f.col("sum"), f.col("num_a") + f.col("num_b"))).show()
Output is:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
| 1| 2| 3|
| 12| 15| 27|
| 56| 11| 67|
| 79| 3| 82|
+-----+-----+---+
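For reference, the same logic also fits in a single SQL expression via expr (a sketch; the backticks just make the column reference unambiguous, since sum is also a SQL function name):
df.withColumn("sum", f.expr("coalesce(`sum`, num_a + num_b)")).show()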

pyspark convert rows to columns

I have a dataframe where I need to convert rows of the same group to columns, basically pivot them. Below is my df.
+------------+-------+-----+-------+
|Customer |ID |unit |order |
+------------+-------+-----+-------+
|John |123 |00015|1 |
|John |123 |00016|2 |
|John |345 |00205|3 |
|John |345 |00206|4 |
|John |789 |00283|5 |
|John |789 |00284|6 |
+------------+-------+-----+-------+
I need the resultant data for the above as..
+--------+-------+--------+----------+--------+--------+-----------+--------+-------+----------+
|state | ID_1 | unit_1 |seq_num_1 | ID_2 | unit_2 | seq_num_2 | ID_3 |unit_3 |seq_num_3 |
+--------+-------+--------+----------+--------+--------+-----------+--------+-------+----------+
|John | 123 | 00015 | 1 | 345 | 00205 | 3 | 789 |00283 | 5 |
|John | 123 | 00016 | 2 | 345 | 00206 | 4 | 789 |00284 | 6 |
+--------+-------+--------+----------+--------+--------+-----------+--------+-------+----------+
I tried the groupBy and pivot() functions, but it's throwing an error that says large pivot values found. Is there any way to get the result without using the pivot() function? Any help is greatly appreciated.
Thanks.
This looks like a typical case for the dense_rank() window function: create a generic sequence (dr in the code below) of distinct IDs within each group of Customer, then pivot on this sequence. We can do something similar for the order column using row_number(), so that it can be used in the groupby:
from pyspark.sql import Window, functions as F

# below I added an extra row for reference, for the case where the number of rows varies between IDs
df = spark.createDataFrame([
    ('John', '123', '00015', '1'), ('John', '123', '00016', '2'), ('John', '345', '00205', '3'),
    ('John', '345', '00206', '4'), ('John', '789', '00283', '5'), ('John', '789', '00284', '6'),
    ('John', '789', '00285', '7')
], ['Customer', 'ID', 'unit', 'order'])
Add two window specs: w1 to get the dense_rank() of IDs within each Customer, and w2 to get the row_number() of order within the same Customer and ID.
w1 = Window.partitionBy('Customer').orderBy('ID')
w2 = Window.partitionBy('Customer','ID').orderBy('order')
Add two new columns based on the above two window specs: dr (dense_rank) and sid (row_number):
df1 = df.select(
    "*",
    F.dense_rank().over(w1).alias('dr'),
    F.row_number().over(w2).alias('sid')
)
+--------+---+-----+-----+---+---+
|Customer| ID| unit|order| dr|sid|
+--------+---+-----+-----+---+---+
| John|123|00015| 1| 1| 1|
| John|123|00016| 2| 1| 2|
| John|345|00205| 3| 2| 1|
| John|345|00206| 4| 2| 2|
| John|789|00283| 5| 3| 1|
| John|789|00284| 6| 3| 2|
| John|789|00285| 7| 3| 3|
+--------+---+-----+-----+---+---+
Find the max(dr), so that we can pre-define the list to pivot on, which is range(1, N+1) (this improves the efficiency of the pivot method).
N = df1.agg(F.max('dr')).first()[0]
Group by Customer and sid, pivot on dr, and then aggregate:
df_new = df1.groupby('Customer', 'sid') \
    .pivot('dr', range(1, N+1)) \
    .agg(
        F.first('ID').alias('ID'),
        F.first('unit').alias('unit'),
        F.first('order').alias('order')
    )
df_new.show()
+--------+---+----+------+-------+----+------+-------+----+------+-------+
|Customer|sid|1_ID|1_unit|1_order|2_ID|2_unit|2_order|3_ID|3_unit|3_order|
+--------+---+----+------+-------+----+------+-------+----+------+-------+
| John| 1| 123| 00015| 1| 345| 00205| 3| 789| 00283| 5|
| John| 2| 123| 00016| 2| 345| 00206| 4| 789| 00284| 6|
| John| 3|null| null| null|null| null| null| 789| 00285| 7|
+--------+---+----+------+-------+----+------+-------+----+------+-------+
Rename the column names if needed:
import re
df_new.toDF(*['_'.join(reversed(re.split('_',c,1))) for c in df_new.columns]).show()
+--------+---+----+------+-------+----+------+-------+----+------+-------+
|Customer|sid|ID_1|unit_1|order_1|ID_2|unit_2|order_2|ID_3|unit_3|order_3|
+--------+---+----+------+-------+----+------+-------+----+------+-------+
| John| 1| 123| 00015| 1| 345| 00205| 3| 789| 00283| 5|
| John| 2| 123| 00016| 2| 345| 00206| 4| 789| 00284| 6|
| John| 3|null| null| null|null| null| null| 789| 00285| 7|
+--------+---+----+------+-------+----+------+-------+----+------+-------+
Below is my solution: doing the rank and then flattening the results.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, collect_list, max, col

df = spark.createDataFrame([
    ('John', '123', '00015', '1'), ('John', '123', '00016', '2'), ('John', '345', '00205', '3'),
    ('John', '345', '00206', '4'), ('John', '789', '00283', '5'), ('John', '789', '00284', '6'),
    ('John', '789', '00285', '7')
], ['Customer', 'ID', 'unit', 'order'])

w1 = Window.partitionBy("customer").orderBy("order")
rankedDF = df.withColumn("rank", row_number().over(w1))
groupedDF = (rankedDF
    .select("customer", "rank",
            collect_list("ID").over(w1).alias("ID"),
            collect_list("unit").over(w1).alias("unit"),
            collect_list("order").over(w1).alias("seq_num"))
    .groupBy("customer", "rank")
    .agg(max("ID").alias("ID"), max("unit").alias("unit"), max("seq_num").alias("seq_num")))

groupedColumns = [col("customer")]
pivotColumns = [[col(a)[i - 1].alias(a + "_" + str(i)) for a in ["ID", "unit", "seq_num"]] for i in [1, 2, 3]]
flattenedCols = [item for sublist in pivotColumns for item in sublist]
finalDf = groupedDF.select(groupedColumns + flattenedCols)
There may be multiple ways to do this, but a pandas UDF is one of them. Here is a toy example based on your data:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = pd.DataFrame({'Customer': ['John'] * 6,
                   'ID': [123] * 2 + [345] * 2 + [789] * 2,
                   'unit': ['00015', '00016', '00205', '00206', '00283', '00284'],
                   'order': range(1, 7)})
sdf = spark.createDataFrame(df)

# Spark 2.4 syntax. Spark 3.0 is less verbose
return_types = 'state string, ID_1 int, unit_1 string, seq_num_1 int, ID_2 int, unit_2 string, seq_num_2 int, ID_3 int, unit_3 string, seq_num_3 int'

@pandas_udf(returnType=return_types, functionType=PandasUDFType.GROUPED_MAP)
def convert_to_wide(pdf):
    groups = pdf.groupby('ID')
    out = pd.concat([group.set_index('Customer') for _, group in groups], axis=1).reset_index()
    out.columns = ['state', 'ID_1', 'unit_1', 'seq_num_1', 'ID_2', 'unit_2', 'seq_num_2', 'ID_3', 'unit_3', 'seq_num_3']
    return out

sdf.groupby('Customer').apply(convert_to_wide).show()
+-----+----+------+---------+----+------+---------+----+------+---------+
|state|ID_1|unit_1|seq_num_1|ID_2|unit_2|seq_num_2|ID_3|unit_3|seq_num_3|
+-----+----+------+---------+----+------+---------+----+------+---------+
| John| 123| 00015| 1| 345| 00205| 3| 789| 00283| 5|
| John| 123| 00016| 2| 345| 00206| 4| 789| 00284| 6|
+-----+----+------+---------+----+------+---------+----+------+---------+
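The comment above notes that Spark 3.0 is less verbose; here is a sketch of the Spark 3 form using applyInPandas (this assumes convert_to_wide is defined as a plain, undecorated function, since applyInPandas takes a regular function plus a schema):
# Spark 3.0+ sketch: pass the plain function and the DDL schema string
sdf.groupby('Customer').applyInPandas(convert_to_wide, schema=return_types).show()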

How to find the distribution of a column in PySpark dataframe for all the unique values present in that column?

I have a PySpark dataframe-
df = spark.createDataFrame([
    ("u1", 0),
    ("u2", 0),
    ("u3", 1),
    ("u4", 2),
    ("u5", 3),
    ("u6", 2)],
    ['user_id', 'medals'])
df.show()
Output-
+-------+------+
|user_id|medals|
+-------+------+
| u1| 0|
| u2| 0|
| u3| 1|
| u4| 2|
| u5| 3|
| u6| 2|
+-------+------+
I want to get the distribution of the medals column for all the users. So if there are n unique values in the medals column, I want n columns in the output dataframe with corresponding number of users who received that many medals.
The output for the data given above should look like-
+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
| 2| 1| 2| 1|
+--------+--------+--------+--------+
How do I achieve this?
It's a simple pivot:
df.groupBy().pivot("medals").count().show()
+---+---+---+---+
| 0| 1| 2| 3|
+---+---+---+---+
| 2| 1| 2| 1|
+---+---+---+---+
If you need some cosmetics to add the word medals to the column names, then you can do this:
medals_df = df.groupBy().pivot("medals").count()
for col in medals_df.columns:
    medals_df = medals_df.withColumnRenamed(col, "medals_{}".format(col))
medals_df.show()
+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
| 2| 1| 2| 1|
+--------+--------+--------+--------+
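If you prefer a one-pass rename instead of the loop, toDF can do the same thing (a sketch):
medals_df = df.groupBy().pivot("medals").count()
medals_df = medals_df.toDF(*["medals_{}".format(c) for c in medals_df.columns])
medals_df.show()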

PySpark: modify column values when another column value satisfies a condition

I have a PySpark Dataframe with two columns:
+---+----+
| Id|Rank|
+---+----+
| a| 5|
| b| 7|
| c| 8|
| d| 1|
+---+----+
For each row, I'm looking to replace Id column with "other" if Rank column is larger than 5.
If I use pseudocode to explain:
for row in df:
    if row.Rank > 5:
        replace(row.Id, "other")
The result should look like this:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Any clue how to achieve this? Thanks!!!
To create this Dataframe:
df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank'])
You can use when and otherwise like this:
from pyspark.sql.functions import *

df\
    .withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other'))\
    .drop(df.Id)\
    .select(col('Id_New').alias('Id'), col('Rank'))\
    .show()
This gives the output:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Starting with @Pushkr's solution, couldn't you just use the following?
from pyspark.sql.functions import *
df.withColumn('Id',when(df.Rank <= 5,df.Id).otherwise('other')).show()

Fill in null with previously known good value with pyspark

Is there a way to replace null values in a pyspark dataframe with the last valid value? There are additional timestamp and session columns if you think you need them for window partitioning and ordering. More specifically, I'd like to achieve the following conversion:
+---------+-----------+-----------+ +---------+-----------+-----------+
| session | timestamp | id| | session | timestamp | id|
+---------+-----------+-----------+ +---------+-----------+-----------+
| 1| 1| null| | 1| 1| null|
| 1| 2| 109| | 1| 2| 109|
| 1| 3| null| | 1| 3| 109|
| 1| 4| null| | 1| 4| 109|
| 1| 5| 109| => | 1| 5| 109|
| 1| 6| null| | 1| 6| 109|
| 1| 7| 110| | 1| 7| 110|
| 1| 8| null| | 1| 8| 110|
| 1| 9| null| | 1| 9| 110|
| 1| 10| null| | 1| 10| 110|
+---------+-----------+-----------+ +---------+-----------+-----------+
This uses last and ignores nulls.
Let's re-create something similar to the original data:
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
d = [{'session': 1, 'ts': 1}, {'session': 1, 'ts': 2, 'id': 109}, {'session': 1, 'ts': 3}, {'session': 1, 'ts': 4, 'id': 110}, {'session': 1, 'ts': 5}, {'session': 1, 'ts': 6}]
df = spark.createDataFrame(d)
df.show()
# +-------+---+----+
# |session| ts| id|
# +-------+---+----+
# | 1| 1|null|
# | 1| 2| 109|
# | 1| 3|null|
# | 1| 4| 110|
# | 1| 5|null|
# | 1| 6|null|
# +-------+---+----+
Now, let's use window function last:
df.withColumn("id", func.last('id', True).over(Window.partitionBy('session').orderBy('ts').rowsBetween(-sys.maxsize, 0))).show()
# +-------+---+----+
# |session| ts| id|
# +-------+---+----+
# | 1| 1|null|
# | 1| 2| 109|
# | 1| 3| 109|
# | 1| 4| 110|
# | 1| 5| 110|
# | 1| 6| 110|
# +-------+---+----+
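As an aside (not part of the original answer), newer PySpark versions expose named frame boundaries, so the -sys.maxsize bound can be written more explicitly:
# same frame, using the named window boundaries
w = (Window.partitionBy('session')
     .orderBy('ts')
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df.withColumn("id", func.last('id', ignorenulls=True).over(w)).show()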
This seems to be doing the trick using Window functions:
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
def fill_nulls(df):
    df_na = df.na.fill(-1)
    lag = df_na.withColumn('id_lag', func.lag('id', default=-1)
                           .over(Window.partitionBy('session')
                                 .orderBy('timestamp')))
    switch = lag.withColumn('id_change',
                            ((lag['id'] != lag['id_lag']) &
                             (lag['id'] != -1)).cast('integer'))
    switch_sess = switch.withColumn(
        'sub_session',
        func.sum("id_change")
        .over(
            Window.partitionBy("session")
            .orderBy("timestamp")
            .rowsBetween(-sys.maxsize, 0))
    )
    fid = switch_sess.withColumn('nn_id',
                                 func.first('id')
                                 .over(Window.partitionBy('session', 'sub_session')
                                       .orderBy('timestamp')))
    fid_na = fid.replace(-1, 'null')
    ff = (fid_na.drop('id').drop('id_lag')
          .drop('id_change')
          .drop('sub_session')
          .withColumnRenamed('nn_id', 'id'))
    return ff
Here is the full null_test.py.
@Oleksiy's answer is great, but it didn't fully work for my requirements. Within a session, if multiple nulls are observed, all are filled with the first non-null for the session. I needed the last non-null value to propagate forward.
The following tweak worked for my use case:
import sys
from pyspark.sql import Window
import pyspark.sql.functions as func

def fill_forward(df, id_column, key_column, fill_column):
    # Fill nulls with the last *non-null* value in the window
    ff = df.withColumn(
        'fill_fwd',
        func.last(fill_column, True)  # True: fill with last non-null
        .over(
            Window.partitionBy(id_column)
            .orderBy(key_column)
            .rowsBetween(-sys.maxsize, 0))
    )
    # Drop the old column and rename the new column
    ff_out = ff.drop(fill_column).withColumnRenamed('fill_fwd', fill_column)
    return ff_out
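A usage sketch against the frame from the question (the column names session, timestamp, and id are taken from the question's schema):
# forward-fill id within each session, ordered by timestamp
filled = fill_forward(df, id_column='session', key_column='timestamp', fill_column='id')
filled.show()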
Here is the trick I followed: convert the PySpark dataframe to a pandas dataframe, do the operation there (pandas has a built-in method to fill null values with the previously known good value), and then convert it back to a PySpark dataframe.
Here is the code:
import pandas as pd

d = [{'session': 1, 'ts': 1}, {'session': 1, 'ts': 2, 'id': 109}, {'session': 1, 'ts': 3},
     {'session': 1, 'ts': 4, 'id': 110}, {'session': 1, 'ts': 5}, {'session': 1, 'ts': 6},
     {'session': 1, 'ts': 7, 'id': 110}, {'session': 1, 'ts': 8}, {'session': 1, 'ts': 9},
     {'session': 1, 'ts': 10}]
dt = spark.createDataFrame(d)

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
psdf = dt.select("*").toPandas()
psdf["id"].fillna(method='ffill', inplace=True)
dt = spark.createDataFrame(psdf)
dt.show()
