Transpose each record into multiple columns in pyspark dataframe - apache-spark

I am looking to transpose each record into multiple columns in a PySpark dataframe.
This is my dataframe:
+--------+-------------+--------------+------------+------+
|level_1 |level_2      |level_3       |level_4     |UNQ_ID|
+--------+-------------+--------------+------------+------+
|D Group |Investments  |ORB           |ECM         |1     |
|E Group |Investment   |Origination   |Execution   |2     |
+--------+-------------+--------------+------------+------+
Required dataframe is:
+--------+---------------+------+
|level   |name           |UNQ_ID|
+--------+---------------+------+
|level_1 |D Group        |1     |
|level_1 |E Group        |2     |
|level_2 |Investments    |1     |
|level_2 |Investment     |2     |
|level_3 |ORB            |1     |
|level_3 |Origination    |2     |
|level_4 |ECM            |1     |
|level_4 |Execution      |2     |
+--------+---------------+------+

The easiest way is to use the stack function:
import pyspark.sql.functions as f
output_df = df.selectExpr('stack(4, "level_1", level_1, "level_2", level_2, "level_3", level_3, "level_4", level_4) as (level, name)', 'UNQ_ID')
output_df.show()
# +-------+-----------+------+
# |  level|       name|UNQ_ID|
# +-------+-----------+------+
# |level_1|    D Group|     1|
# |level_2|Investments|     1|
# |level_3|        ORB|     1|
# |level_4|        ECM|     1|
# |level_1|    E Group|     2|
# |level_2| Investment|     2|
# |level_3|Origination|     2|
# |level_4|  Execution|     2|
# +-------+-----------+------+
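If you are on Spark 3.4 or newer, the built-in DataFrame.unpivot (alias melt) gives the same reshape without writing the stack() expression by hand; a minimal sketch, assuming the same df as above:
# Spark 3.4+ only: unpivot the level_* columns into (level, name) rows
output_df = df.unpivot(
    ids=["UNQ_ID"],
    values=["level_1", "level_2", "level_3", "level_4"],
    variableColumnName="level",
    valueColumnName="name",
)
output_df.show()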

Related

Spark-Scala: Create split rows based on the value of another column

I have an input as below:
+---+----+
|id |size|
+---+----+
|1  |4   |
|2  |2   |
+---+----+
Output: if the size column value is 4, split the row 4 times (1-4), and if it is 2, split it 2 times (1-2).
+---+----+
|id |size|
+---+----+
|1  |1   |
|1  |2   |
|1  |3   |
|1  |4   |
|2  |1   |
|2  |2   |
+---+----+
You can create an array with the sequence from 1 to size using the sequence function and then explode it:
import org.apache.spark.sql.functions._

val df = Seq((1,4), (2,2)).toDF("id", "size")

df
  .withColumn("size", explode(sequence(lit(1), col("size"))))
  .show(false)
The output would be:
+---+----+
|id |size|
+---+----+
|1  |1   |
|1  |2   |
|1  |3   |
|1  |4   |
|2  |1   |
|2  |2   |
+---+----+
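For reference, a minimal PySpark sketch of the same sequence-and-explode idea, assuming a dataframe df with integer columns id and size:
from pyspark.sql import functions as F

# build [1, 2, ..., size] per row and explode it into separate rows
df.withColumn("size", F.explode(F.sequence(F.lit(1), F.col("size")))).show()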
You can first use the sequence function to create a sequence from 1 to size and then explode it.
val df = input.withColumn("seq", sequence(lit(1), $"size"))
df.show()
+---+----+------------+
| id|size|         seq|
+---+----+------------+
|  1|   4|[1, 2, 3, 4]|
|  2|   2|      [1, 2]|
+---+----+------------+
df.withColumn("size", explode($"seq")).drop("seq").show()
+---+----+
| id|size|
+---+----+
|  1|   1|
|  1|   2|
|  1|   3|
|  1|   4|
|  2|   1|
|  2|   2|
+---+----+
You could turn your size column into an incrementing sequence using Seq.range and then explode the arrays. Something like this:
import spark.implicits._
import org.apache.spark.sql.functions.{explode, col}

// Original dataframe
val df = Seq((1,4), (2,2)).toDF("id", "size")

// Mapping over this dataframe: turning each row into (id, array) and exploding the array
val output = df
  .map(row => (row.getInt(0), Seq.range(1, row.getInt(1) + 1)))
  .toDF("id", "array")
  .select(col("id"), explode(col("array")))

output.show()
+---+---+
| id|col|
+---+---+
|  1|  1|
|  1|  2|
|  1|  3|
|  1|  4|
|  2|  1|
|  2|  2|
+---+---+

Employing PySpark: How to determine the frequency of each event and its event-by-event frequency

I have a dataset like:
Data
a
a
a
a
a
b
b
b
a
a
b
I would like to include a column like the one below. Each value has the form a1,1: the first part (a1) identifies the event, i.e. which run of consecutive "a" values this is in the field, and the second part (,1) is the frequency within that event, i.e. how many times "a" has repeated so far before another element (b) appears. Can we carry this out with PySpark?
Data Frequency
a a1,1
a a1,2
a a1,3
a a1,4
a a1,5
b b1,1
b b1,2
b b1,3
a a2,1
a a2,2
b b2,1
You can achieve your desired result by doing this,
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b'], 'string').toDF("Data")
print("Original Data:")
df.show()
print("Result:")
df.withColumn("ID", F.monotonically_increasing_id()) \
.withColumn("group",
F.row_number().over(Window.orderBy("ID"))
- F.row_number().over(Window.partitionBy("Data").orderBy("ID"))
) \
.withColumn("element_freq", F.when(F.col('Data') != 'abcd', F.row_number().over(Window.partitionBy("group").orderBy("ID"))).otherwise(F.lit(0)))\
.withColumn("event_freq", F.when(F.col('Data') != 'abcd', F.dense_rank().over(Window.partitionBy("Data").orderBy("group"))).otherwise(F.lit(0)))\
.withColumn("Frequency", F.concat_ws(',', F.concat(F.col("Data"), F.col("event_freq")), F.col("element_freq"))) \
.orderBy("ID")\
.drop("ID", "group", "event_freq", "element_freq")\
.show()
Original Data:
+----+
|Data|
+----+
|   a|
|   a|
|   a|
|   a|
|   a|
|   b|
|   b|
|   b|
|   a|
|   a|
|   b|
+----+
Result:
+----+---------+
|Data|Frequency|
+----+---------+
|   a|     a1,1|
|   a|     a1,2|
|   a|     a1,3|
|   a|     a1,4|
|   a|     a1,5|
|   b|     b1,1|
|   b|     b1,2|
|   b|     b1,3|
|   a|     a2,1|
|   a|     a2,2|
|   b|     b2,1|
+----+---------+
Use Window functions. I give you two options just in case.
Option 1: separating group and Frequency
from pyspark.sql import Window
from pyspark.sql.functions import (monotonically_increasing_id, lag, when, col,
                                   concat, sum, rank, array, array_join)

# Window to use in the running computations
k = Window.partitionBy().orderBy('index')

(
    # Create an index on df1 to order by
    df1.withColumn('index', monotonically_increasing_id())
    # Put the previous Data value next to the current row
    .withColumn('group', lag('Data').over(k))
    # Where the current and previous values don't match, assign 1, else 0
    .withColumn('group', when(col('Data') != col('group'), 1).otherwise(0))
    # Concat Data with the running sum of the flags above, per Data value ordered by index
    .withColumn('group', concat('Data', sum('group').over(Window.partitionBy('Data').orderBy('index')) + 1))
    # Rank within each group in the order rows appeared in the initial df
    .withColumn('Frequency', rank().over(Window.partitionBy('group').orderBy('index')))
).sort('index').drop('index').show(truncate=False)
+----+-----+---------+
|Data|group|Frequency|
+----+-----+---------+
|a   |a1   |1        |
|a   |a1   |2        |
|a   |a1   |3        |
|a   |a1   |4        |
|a   |a1   |5        |
|b   |b2   |1        |
|b   |b2   |2        |
|b   |b2   |3        |
|a   |a2   |1        |
|a   |a2   |2        |
|b   |b3   |1        |
+----+-----+---------+
Option 2 gives the output you wanted:
# Window to use in the running computations
k = Window.partitionBy().orderBy('index')

(
    # Create an index on df1 to order by
    df1.withColumn('index', monotonically_increasing_id())
    # Put the previous Data value next to the current row
    .withColumn('Frequency', lag('Data').over(k))
    # Where the current and previous values don't match, assign 1, else 0
    .withColumn('Frequency', when(col('Data') != col('Frequency'), 1).otherwise(0))
    # Concat Data with the running sum of the flags above, per Data value ordered by index
    .withColumn('Frequency', concat('Data', sum('Frequency').over(Window.partitionBy('Data').orderBy('index')) + 1))
    # Join the group label with its rank in the order rows appeared in the initial df
    .withColumn('Frequency', array_join(array('Frequency', rank().over(Window.partitionBy('Frequency').orderBy('index'))), ','))
).sort('index').drop('index').show(truncate=False)
+----+---------+
|Data|Frequency|
+----+---------+
|a   |a1,1     |
|a   |a1,2     |
|a   |a1,3     |
|a   |a1,4     |
|a   |a1,5     |
|b   |b2,1     |
|b   |b2,2     |
|b   |b2,3     |
|a   |a2,1     |
|a   |a2,2     |
|b   |b3,1     |
+----+---------+

PySpark: applying ODM mapping on column level

I have the below 2 dataframes and I would like to apply a similar condition and return the values in a PySpark dataframe.
df1.show()
+---+-------+--------+
|id |tr_type|nominal |
+---+-------+--------+
|1  |K      |2.0     |
|2  |ZW     |7.0     |
|3  |V      |12.5    |
|4  |VW     |9.0     |
|5  |CI     |5.0     |
+---+-------+--------+
One-dimensional mapping (*abcefgh):
+-------+------------+------------+-----------+
|odm_id |return_value|odm_relation|input_value|
+-------+------------+------------+-----------+
|abcefgh|B           |EQ          |K          |
|abcefgh|B           |EQ          |ZW         |
|abcefgh|S           |EQ          |V          |
|abcefgh|S           |EQ          |VW         |
|abcefgh|I           |EQ          |CI         |
+-------+------------+------------+-----------+
I need to apply the below condition: the nominal volume is negated when there is a sell transaction.
IF (tr_type, $abcefgh.) == 'S' THEN ;
nominal = -nominal ;
The expected output:
+---+-------+-------+-----------+
|id |tr_type|nominal|nominal_new|
+---+-------+-------+-----------+
|1  |K      |2.0    |2.0        |
|2  |ZW     |7.0    |7.0        |
|3  |V      |12.5   |-12.5      |
|4  |VW     |9.0    |-9.0       |
|5  |CI     |5.0    |5.0        |
+---+-------+-------+-----------+
You could join the 2 dataframes on tr_type == input_value and use a when().otherwise() to create the new column.
See the example below using your samples:
import pyspark.sql.functions as func

data_sdf. \
    join(odm_sdf.selectExpr('return_value', 'input_value as tr_type').dropDuplicates(),
         ['tr_type'],
         'left'
         ). \
    withColumn('nominal_new',
               func.when(func.col('return_value') == 'S', func.col('nominal') * -1).
               otherwise(func.col('nominal'))
               ). \
    drop('return_value'). \
    show()
# +-------+---+-------+-----------+
# |tr_type| id|nominal|nominal_new|
# +-------+---+-------+-----------+
# |      K|  1|    2.0|        2.0|
# |     CI|  5|    5.0|        5.0|
# |      V|  3|   12.5|      -12.5|
# |     VW|  4|    9.0|       -9.0|
# |     ZW|  2|    7.0|        7.0|
# +-------+---+-------+-----------+
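Since the mapping table is usually tiny compared to the transaction data, you could also hint a broadcast join so the lookup table is shipped to every executor instead of shuffled; a sketch assuming the same data_sdf and odm_sdf names as above:
import pyspark.sql.functions as func

result = (
    data_sdf
    # broadcast the small ODM mapping so the join avoids a shuffle
    .join(func.broadcast(odm_sdf.selectExpr('return_value', 'input_value as tr_type').dropDuplicates()),
          ['tr_type'], 'left')
    .withColumn('nominal_new',
                func.when(func.col('return_value') == 'S', -func.col('nominal'))
                    .otherwise(func.col('nominal')))
    .drop('return_value')
)
result.show()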

Creating Total and percentage of total columns in Pyspark

Here is my test data
test = spark.createDataFrame([
    ("2018-06-03", 2, 4, 4),
    ("2018-06-04", 4, 3, 3),
    ("2018-06-03", 8, 1, 1),
    ("2018-06-01", 3, 1, 1),
    ("2018-06-05", 3, 2, 0),
])\
    .toDF("transactiondate", "SalesA", "SalesB", "SalesC")
test.show()
I would like to add a row-wise total column and a % of total column corresponding to each sales category (A, B and C).
Desired Output:
+---------------+------+------+------+----------+------+------+------+
|transactiondate|SalesA|SalesB|SalesC|TotalSales|Perc_A|Perc_B|Perc_C|
+---------------+------+------+------+----------+------+------+------+
|     2018-06-03|     2|     4|     4|        10|   0.2|   0.4|   0.4|
|     2018-06-04|     4|     3|     3|        10|   0.4|   0.3|   0.3|
|     2018-06-03|     8|     1|     1|        10|   0.8|   0.1|   0.1|
|     2018-06-01|     3|     1|     1|         5|   0.6|   0.2|   0.2|
|     2018-06-05|     3|     2|     0|         5|   0.6|   0.4|   0.0|
+---------------+------+------+------+----------+------+------+------+
How can I do it in pyspark?
Edit: I want the code to be adaptable even if I add more items, i.e. if I have one more column SalesD, the code should still create the total and percentage columns (i.e. the columns shouldn't be hardcoded).
You can use selectExpr and do simple arithmetic SQL operations for each added column:
test = test.selectExpr("*",
                       "SalesA+SalesB+SalesC as TotalSales",
                       "SalesA/(SalesA+SalesB+SalesC) as Perc_A",
                       "SalesB/(SalesA+SalesB+SalesC) as Perc_B",
                       "SalesC/(SalesA+SalesB+SalesC) as Perc_C"
                       )
or use a more flexible solution
from pyspark.sql.functions import col, expr

# columns to be included in the TotalSales calculation
cols = ['SalesA', 'SalesB', 'SalesC']

test = (test
        .withColumn('TotalSales', expr('+'.join(cols)))
        .select(col('*'),
                *[expr('{0}/TotalSales {1}'.format(c, 'Perc_' + c)) for c in cols]))
One option is to use several withColumn statements:
import pyspark.sql.functions as F

test\
    .withColumn('TotalSales', F.col('SalesA') + F.col('SalesB') + F.col('SalesC'))\
    .withColumn('Perc_A', F.col('SalesA') / F.col('TotalSales'))\
    .withColumn('Perc_B', F.col('SalesB') / F.col('TotalSales'))\
    .withColumn('Perc_C', F.col('SalesC') / F.col('TotalSales'))
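If the column names should not be hardcoded (per the edit in the question), the same withColumn approach can be generalized; a minimal sketch assuming every relevant column name starts with "Sales":
from functools import reduce
import pyspark.sql.functions as F

# pick up SalesA, SalesB, SalesC, SalesD, ... dynamically
sales_cols = [c for c in test.columns if c.startswith("Sales")]

result = test.withColumn("TotalSales",
                         reduce(lambda a, b: a + b, [F.col(c) for c in sales_cols]))
for c in sales_cols:
    # SalesA -> Perc_A, SalesB -> Perc_B, ...
    result = result.withColumn("Perc_" + c.replace("Sales", ""),
                               F.col(c) / F.col("TotalSales"))
result.show()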
Try this spark-sql solution
test.createOrReplaceTempView("sales_table")
sales=[ x for x in test.columns if x.upper().startswith("SALES") ]
sales2="+".join(sales)
print(str(sales)) # ['SalesA', 'SalesB', 'SalesC']
per_sales=[ x +"/TotalSales as " + "Perc_" +x for x in sales ]
per_sales2=",".join(per_sales)
print(str(per_sales)) # ['SalesA/TotalSales as Perc_SalesA', 'SalesB/TotalSales as Perc_SalesB', 'SalesC/TotalSales as Perc_SalesC']
spark.sql(f"""
with t1 ( select *, {sales2} TotalSales from sales_table )
select *, {per_sales2} from t1
""").show(truncate=False)
+---------------+------+------+------+----------+-----------+-----------+-----------+
|transactiondate|SalesA|SalesB|SalesC|TotalSales|Perc_SalesA|Perc_SalesB|Perc_SalesC|
+---------------+------+------+------+----------+-----------+-----------+-----------+
|2018-06-03     |2     |4     |4     |10        |0.2        |0.4        |0.4        |
|2018-06-04     |4     |3     |3     |10        |0.4        |0.3        |0.3        |
|2018-06-03     |8     |1     |1     |10        |0.8        |0.1        |0.1        |
|2018-06-01     |3     |1     |1     |5         |0.6        |0.2        |0.2        |
|2018-06-05     |3     |2     |0     |5         |0.6        |0.4        |0.0        |
+---------------+------+------+------+----------+-----------+-----------+-----------+
You can also use the aggregate() higher order function to sum the sales* columns. But for this the columns must be of Integer/double type, not long.
test2 = test.withColumn("SalesA", expr("cast(salesa as int)"))\
            .withColumn("SalesB", expr("cast(salesb as int)"))\
            .withColumn("SalesC", expr("cast(salesc as int)"))
test2.createOrReplaceTempView("sales_table2")
sales3=",".join(sales) # just join the sales columns with comma
spark.sql(f"""
with t1 ( select *, aggregate(array({sales3}),0,(acc,x) -> acc+x) TotalSales from sales_table2 )
select *, {per_sales2} from t1
""").show(truncate=False)
+---------------+------+------+------+----------+-----------+-----------+-----------+
|transactiondate|SalesA|SalesB|SalesC|TotalSales|Perc_SalesA|Perc_SalesB|Perc_SalesC|
+---------------+------+------+------+----------+-----------+-----------+-----------+
|2018-06-03     |2     |4     |4     |10        |0.2        |0.4        |0.4        |
|2018-06-04     |4     |3     |3     |10        |0.4        |0.3        |0.3        |
|2018-06-03     |8     |1     |1     |10        |0.8        |0.1        |0.1        |
|2018-06-01     |3     |1     |1     |5         |0.6        |0.2        |0.2        |
|2018-06-05     |3     |2     |0     |5         |0.6        |0.4        |0.0        |
+---------------+------+------+------+----------+-----------+-----------+-----------+
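For what it's worth, the same higher-order aggregate is also exposed through the DataFrame API as pyspark.sql.functions.aggregate (Spark 3.1+); a sketch reusing the sales column list from above:
import pyspark.sql.functions as F

test.withColumn(
    "TotalSales",
    # fold the array of (int-cast) sales columns into a single sum
    F.aggregate(F.array(*[F.col(c).cast("int") for c in sales]),
                F.lit(0),
                lambda acc, x: acc + x)
).show()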

duplicating records between date gaps within a selected time interval in a PySpark dataframe

I have a PySpark dataframe that keeps track of changes that occur in a product's price and status over months. This means that a new row is created only when a change occurred (in either status or price) compared to the previous month, like in the dummy data below
+----------+---------+-----+-------+
|product_id|status   |price|month  |
+----------+---------+-----+-------+
|1         |available|5    |2019-10|
|1         |available|8    |2020-08|
|1         |limited  |8    |2020-10|
|2         |limited  |1    |2020-09|
|2         |limited  |3    |2020-10|
+----------+---------+-----+-------+
I would like to create a dataframe that shows the values for each of the last 6 months. This means that I need to duplicate the records whenever there is a gap in the above dataframe. For example, if the last 6 months are 2020-07, 2020-08, ... 2020-12, then the result for the above dataframe should be
+----------+---------+-----+-------+
|product_id|status   |price|month  |
+----------+---------+-----+-------+
|1         |available|5    |2020-07|
|1         |available|8    |2020-08|
|1         |available|8    |2020-09|
|1         |limited  |8    |2020-10|
|1         |limited  |8    |2020-11|
|1         |limited  |8    |2020-12|
|2         |limited  |1    |2020-09|
|2         |limited  |3    |2020-10|
|2         |limited  |3    |2020-11|
|2         |limited  |3    |2020-12|
+----------+---------+-----+-------+
Notice that for product_id = 1 there was an older record from 2019-10 that was propagated until 2020-08 and then trimmed, whereas for product_id = 2 there were no records prior to 2020-09 and thus the months 2020-07, 2020-08 were not filled for it (as the product did not exist prior to 2020-09).
Since the dataframe consists of millions of records, a "brute-force" solution using for loops and checking for each product_id is rather slow. It seems that it should be possible to solve this using window functions, by creating another column next_month and then filling in the gaps based on that column, but I don't know how to achieve that.
With respect to the comment from @jxc, I have prepared an answer for this use case. Following is the code snippet.
Import the Spark SQL functions:
from pyspark.sql import functions as F, Window
Prepare the sample data
simpleData = ((1, "Available", 5, "2020-07"),
              (1, "Available", 8, "2020-08"),
              (1, "Limited",   8, "2020-12"),
              (2, "Limited",   1, "2020-09"),
              (2, "Limited",   3, "2020-12")
              )
columns= ["product_id", "status", "price", "month"]
Creating dataframe of sample data
df = spark.createDataFrame(data = simpleData, schema = columns)
Add a date column to the dataframe to get a properly formatted date:
df0 = df.withColumn("date",F.to_date('month','yyyy-MM'))
df0.show()
+----------+---------+-----+-------+----------+
|product_id|   status|price|  month|      date|
+----------+---------+-----+-------+----------+
|         1|Available|    5|2020-07|2020-07-01|
|         1|Available|    8|2020-08|2020-08-01|
|         1|  Limited|    8|2020-12|2020-12-01|
|         2|  Limited|    1|2020-09|2020-09-01|
|         2|  Limited|    3|2020-12|2020-12-01|
+----------+---------+-----+-------+----------+
Create a WindowSpec w1 and use the window function lead to find the next date over(w1), then shift it back one month to set up each row's end_date:
w1 = Window.partitionBy('product_id').orderBy('date')
df1 = df0.withColumn('end_date',F.coalesce(F.add_months(F.lead('date').over(w1),-1),'date'))
df1.show()
+----------+---------+-----+-------+----------+----------+
|product_id|   status|price|  month|      date|  end_date|
+----------+---------+-----+-------+----------+----------+
|         1|Available|    5|2020-07|2020-07-01|2020-07-01|
|         1|Available|    8|2020-08|2020-08-01|2020-11-01|
|         1|  Limited|    8|2020-12|2020-12-01|2020-12-01|
|         2|  Limited|    1|2020-09|2020-09-01|2020-11-01|
|         2|  Limited|    3|2020-12|2020-12-01|2020-12-01|
+----------+---------+-----+-------+----------+----------+
Use months_between(end_date, date) to calculate the number of months between the two dates, use the transform function to iterate through sequence(0, #months) and build a named struct with date=add_months(date,i) and the price, then use inline_outer to explode the array of structs.
df2 = df1.selectExpr(
    "product_id",
    "status",
    """inline_outer(
         transform(
           sequence(0, int(months_between(end_date, date)), 1),
           i -> (add_months(date, i) as date, IF(i=0, price, price) as price)
         )
       )"""
)
df2.show()
+----------+---------+----------+-----+
|product_id|   status|      date|price|
+----------+---------+----------+-----+
|         1|Available|2020-07-01|    5|
|         1|Available|2020-08-01|    8|
|         1|Available|2020-09-01|    8|
|         1|Available|2020-10-01|    8|
|         1|Available|2020-11-01|    8|
|         1|  Limited|2020-12-01|    8|
|         2|  Limited|2020-09-01|    1|
|         2|  Limited|2020-10-01|    1|
|         2|  Limited|2020-11-01|    1|
|         2|  Limited|2020-12-01|    3|
+----------+---------+----------+-----+
Partition the dataframe on product_id and add a rank column in df3 to get the row number of each row. Then store the maximum rank value per product_id in a new column max_rank in df4.
w2 = Window.partitionBy('product_id').orderBy('date')
df3 = df2.withColumn('rank',F.row_number().over(w2))
Schema: DataFrame[product_id: bigint, status: string, date: date, price: bigint, rank: int]
df3.show()
+----------+---------+----------+-----+----+
|product_id|   status|      date|price|rank|
+----------+---------+----------+-----+----+
|         1|Available|2020-07-01|    5|   1|
|         1|Available|2020-08-01|    8|   2|
|         1|Available|2020-09-01|    8|   3|
|         1|Available|2020-10-01|    8|   4|
|         1|Available|2020-11-01|    8|   5|
|         1|  Limited|2020-12-01|    8|   6|
|         2|  Limited|2020-09-01|    1|   1|
|         2|  Limited|2020-10-01|    1|   2|
|         2|  Limited|2020-11-01|    1|   3|
|         2|  Limited|2020-12-01|    3|   4|
+----------+---------+----------+-----+----+
df4 = df3.groupBy("product_id").agg(F.max('rank').alias('max_rank'))
Schema: DataFrame[product_id: bigint, max_rank: int]
df4.show()
+----------+--------+
|product_id|max_rank|
+----------+--------+
|         1|       6|
|         2|       4|
+----------+--------+
Join the df3 and df4 dataframes on product_id to get max_rank:
df5 = df3.join(df4, df3.product_id == df4.product_id, "inner") \
         .select(df3.product_id, df3.status, df3.date, df3.price, df3.rank, df4.max_rank)
Schema: DataFrame[product_id: bigint, status: string, date: date, price: bigint, rank: int, max_rank: int]
df5.show()
+----------+---------+----------+-----+----+--------+
|product_id|   status|      date|price|rank|max_rank|
+----------+---------+----------+-----+----+--------+
|         1|Available|2020-07-01|    5|   1|       6|
|         1|Available|2020-08-01|    8|   2|       6|
|         1|Available|2020-09-01|    8|   3|       6|
|         1|Available|2020-10-01|    8|   4|       6|
|         1|Available|2020-11-01|    8|   5|       6|
|         1|  Limited|2020-12-01|    8|   6|       6|
|         2|  Limited|2020-09-01|    1|   1|       4|
|         2|  Limited|2020-10-01|    1|   2|       4|
|         2|  Limited|2020-11-01|    1|   3|       4|
|         2|  Limited|2020-12-01|    3|   4|       4|
+----------+---------+----------+-----+----+--------+
Then finally filter the df5 dataframe using the between function to get the latest 6 months of data.
FinalResultDF = df5.filter(F.col('rank')
                           .between(F.when(F.col('max_rank') > 5, F.col('max_rank') - 6).otherwise(0),
                                    F.col('max_rank'))) \
                   .select(df5.product_id, df5.status, df5.date, df5.price)
FinalResultDF.show(truncate=False)
+----------+---------+----------+-----+
|product_id|status   |date      |price|
+----------+---------+----------+-----+
|1         |Available|2020-07-01|5    |
|1         |Available|2020-08-01|8    |
|1         |Available|2020-09-01|8    |
|1         |Available|2020-10-01|8    |
|1         |Available|2020-11-01|8    |
|1         |Limited  |2020-12-01|8    |
|2         |Limited  |2020-09-01|1    |
|2         |Limited  |2020-10-01|1    |
|2         |Limited  |2020-11-01|1    |
|2         |Limited  |2020-12-01|3    |
+----------+---------+----------+-----+
Using spark-sql:
Given input dataframe:
val df = spark.sql(""" with t1 (
select 1 c1, 'available' c2, 5 c3, '2019-10' c4 union all
select 1 c1, 'available' c2, 8 c3, '2020-08' c4 union all
select 1 c1, 'limited' c2, 8 c3, '2020-10' c4 union all
select 2 c1, 'limited' c2, 1 c3, '2020-09' c4 union all
select 2 c1, 'limited' c2, 3 c3, '2020-10' c4
) select c1 product_id, c2 status , c3 price, c4 month from t1
""")
df.createOrReplaceTempView("df")
df.show(false)
+----------+---------+-----+-------+
|product_id|status   |price|month  |
+----------+---------+-----+-------+
|1         |available|5    |2019-10|
|1         |available|8    |2020-08|
|1         |limited  |8    |2020-10|
|2         |limited  |1    |2020-09|
|2         |limited  |3    |2020-10|
+----------+---------+-----+-------+
Filter on the date window, i.e. the 6 months from 2020-07 to 2020-12, and store the rows in df1 (the two boundary months are handled separately below):
val df1 = spark.sql("""
select * from df where month > '2020-07' and month < '2020-12'
""")
df1.createOrReplaceTempView("df1")
df1.show(false)
+----------+---------+-----+-------+
|product_id|status   |price|month  |
+----------+---------+-----+-------+
|1         |available|8    |2020-08|
|1         |limited  |8    |2020-10|
|2         |limited  |1    |2020-09|
|2         |limited  |3    |2020-10|
+----------+---------+-----+-------+
Lower boundary - get the latest record per product with month <= '2020-07' and overwrite its month as '2020-07':
val df2 = spark.sql("""
select product_id, status, price, '2020-07' month from df where (product_id,month) in
( select product_id, max(month) from df where month <= '2020-07' group by 1 )
""")
df2.createOrReplaceTempView("df2")
df2.show(false)
+----------+---------+-----+-------+
|product_id|status   |price|month  |
+----------+---------+-----+-------+
|1         |available|5    |2020-07|
+----------+---------+-----+-------+
Upper boundary - get the latest record per product with month <= '2020-12' and overwrite its month as '2020-12':
val df3 = spark.sql("""
select product_id, status, price, '2020-12' month from df where (product_id, month) in
( select product_id, max(month) from df where month <= '2020-12' group by 1 )
""")
df3.createOrReplaceTempView("df3")
df3.show(false)
+----------+-------+-----+-------+
|product_id|status |price|month  |
+----------+-------+-----+-------+
|1         |limited|8    |2020-12|
|2         |limited|3    |2020-12|
+----------+-------+-----+-------+
Now union all 3 and store the result in df4:
val df4 = spark.sql("""
select product_id, status, price, month from df1 union all
select product_id, status, price, month from df2 union all
select product_id, status, price, month from df3
order by product_id, month
""")
df4.createOrReplaceTempView("df4")
df4.show(false)
+----------+---------+-----+-------+
|product_id|status   |price|month  |
+----------+---------+-----+-------+
|1         |available|5    |2020-07|
|1         |available|8    |2020-08|
|1         |limited  |8    |2020-10|
|1         |limited  |8    |2020-12|
|2         |limited  |1    |2020-09|
|2         |limited  |3    |2020-10|
|2         |limited  |3    |2020-12|
+----------+---------+-----+-------+
Result:
Use sequence(date1, date2, interval 1 month) to generate a date array covering the missing months.
Explode the array and you get the results.
spark.sql("""
  select product_id, status, price, month, explode(dt) res_month from
  (
    select t1.*,
           case when months_between(lm||'-01', month||'-01') = 1.0 then array(month||'-01')
                when month = '2020-12' then array(month||'-01')
                else sequence(to_date(month||'-01'), add_months(to_date(lm||'-01'), -1), interval 1 month)
           end dt
    from (
      select product_id, status, price, month,
             lead(month) over(partition by product_id order by month) lm
      from df4
    ) t1
  ) t2
  order by product_id, res_month
""")
.show(false)
+----------+---------+-----+-------+----------+
|product_id|status   |price|month  |res_month |
+----------+---------+-----+-------+----------+
|1         |available|5    |2020-07|2020-07-01|
|1         |available|8    |2020-08|2020-08-01|
|1         |available|8    |2020-08|2020-09-01|
|1         |limited  |8    |2020-10|2020-10-01|
|1         |limited  |8    |2020-10|2020-11-01|
|1         |limited  |8    |2020-12|2020-12-01|
|2         |limited  |1    |2020-09|2020-09-01|
|2         |limited  |3    |2020-10|2020-10-01|
|2         |limited  |3    |2020-10|2020-11-01|
|2         |limited  |3    |2020-12|2020-12-01|
+----------+---------+-----+-------+----------+
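For comparison, the whole gap-filling can also be written compactly with the PySpark DataFrame API using lead, sequence and explode; a minimal sketch that assumes df holds the original data (product_id, status, price, month) and hardcodes the 2020-07 to 2020-12 window for illustration:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("product_id").orderBy("date")

filled = (
    df.withColumn("date", F.to_date("month", "yyyy-MM"))
      # each row stays valid until the month before the next recorded change,
      # or until the end of the reporting window for the latest row
      .withColumn("end_date",
                  F.coalesce(F.add_months(F.lead("date").over(w), -1),
                             F.to_date(F.lit("2020-12-01"))))
      .withColumn("date", F.explode(F.expr("sequence(date, end_date, interval 1 month)")))
      # trim anything propagated from before the window start
      .filter(F.col("date") >= F.to_date(F.lit("2020-07-01")))
      .select("product_id", "status", "price",
              F.date_format("date", "yyyy-MM").alias("month"))
)
filled.show()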
