Mark all data points between two dates in pyspark - apache-spark

I am given a table with sales and data about promotions attached to it.
When an entry has promo data filled in, it means that a promo campaign started that day for an item X, and it will end at promo_end_date.
Here is an example:
date      promo_end_date  sales  item_id  promo_id
1.1.2020  3.1.2020        1      1        A
2.1.2020  null            1      1        null
3.1.2020  null            1      1        null
4.1.2020  null            1      1        null
5.1.2020  6.1.2020        1      1        B
6.1.2020  null            1      1        null
1.1.2020  null            1      2        null
2.1.2020  null            1      2        null
3.1.2020  null            1      2        null
4.1.2020  6.1.2020        1      2        C
5.1.2020  null            1      2        null
6.1.2020  null            1      2        null
I want to create a binary column on_promo which marks each day that is covered by a promo campaign.
So it should look like this:
date      promo_end_date  sales  item_id  promo_id  on_promo
1.1.2020  3.1.2020        1      1        A         1
2.1.2020  null            1      1        null      1
3.1.2020  null            1      1        null      1
4.1.2020  null            1      1        null      0
5.1.2020  6.1.2020        1      1        B         1
6.1.2020  null            1      1        null      1
1.1.2020  null            1      2        null      0
2.1.2020  null            1      2        null      0
3.1.2020  null            1      2        null      0
4.1.2020  6.1.2020        1      2        C         1
5.1.2020  null            1      2        null      1
6.1.2020  null            1      2        null      1
I thought this could be done with a window function, partitioning the data by item_id and promo_id and applying two conditions: a start date and an end date. However, I can't think of a way to make PySpark use the promo_end_date column as the end-date condition.

You can get the most recent promo_end_date using last with ignorenulls=True, and then compare the date with the promo_end_date to know whether there is a current promotion:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'date', F.to_date('date', 'd.M.yyyy')
).withColumn(
    'promo_end_date', F.to_date('promo_end_date', 'd.M.yyyy')
).withColumn(
    'promo_end_date',
    F.last('promo_end_date', ignorenulls=True).over(Window.partitionBy('item_id').orderBy('date'))
).withColumn(
    'on_promo', F.when(F.col('date') <= F.col('promo_end_date'), 1).otherwise(0)
)
df2.show()
+----------+--------------+-----+-------+--------+--------+
| date|promo_end_date|sales|item_id|promo_id|on_promo|
+----------+--------------+-----+-------+--------+--------+
|2020-01-01| 2020-01-03| 1| 1| A| 1|
|2020-01-02| 2020-01-03| 1| 1| null| 1|
|2020-01-03| 2020-01-03| 1| 1| null| 1|
|2020-01-04| 2020-01-03| 1| 1| null| 0|
|2020-01-05| 2020-01-06| 1| 1| B| 1|
|2020-01-06| 2020-01-06| 1| 1| null| 1|
|2020-01-01| null| 1| 2| null| 0|
|2020-01-02| null| 1| 2| null| 0|
|2020-01-03| null| 1| 2| null| 0|
|2020-01-04| 2020-01-06| 1| 2| C| 1|
|2020-01-05| 2020-01-06| 1| 2| null| 1|
|2020-01-06| 2020-01-06| 1| 2| null| 1|
+----------+--------------+-----+-------+--------+--------+
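For reference, the sample input from the question can be rebuilt as follows so the snippet above runs end to end (a minimal sketch using SparkSession.builder.getOrCreate(); column names are taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample input rebuilt from the question's table; dates are kept as d.M.yyyy
# strings so that the to_date calls above can parse them.
df = spark.createDataFrame(
    [
        ('1.1.2020', '3.1.2020', 1, 1, 'A'),
        ('2.1.2020', None, 1, 1, None),
        ('3.1.2020', None, 1, 1, None),
        ('4.1.2020', None, 1, 1, None),
        ('5.1.2020', '6.1.2020', 1, 1, 'B'),
        ('6.1.2020', None, 1, 1, None),
        ('1.1.2020', None, 1, 2, None),
        ('2.1.2020', None, 1, 2, None),
        ('3.1.2020', None, 1, 2, None),
        ('4.1.2020', '6.1.2020', 1, 2, 'C'),
        ('5.1.2020', None, 1, 2, None),
        ('6.1.2020', None, 1, 2, None),
    ],
    ['date', 'promo_end_date', 'sales', 'item_id', 'promo_id'],
)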

Related

add rows in pyspark dataframe and adjust the column sequence accordingly

We have a dataframe like below, say DF1:
col_name    col_seq  Hash_enc_ind
abc         1        0
first_name  2        1
last_name   3        1
full_name   4        1
XYZ         5        0
sal         6        1
AAA         7        0
Now I want to add 2 rows for each row where Hash_enc_ind = 1 and adjust col_seq accordingly, so that the output would be like:
DF1:
col_name      col_seq  Hash_enc_ind
abc           1        0
first_name_h  2        1
first_name_e  3        1
last_name_h   4        1
last_name_e   5        1
full_name_h   6        1
full_name_e   7        1
XYZ           8        0
sal_h         9        1
sal_e         10       1
AAA           11       0
You can explode an array constructed using when:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'col_name',
    F.expr("explode(transform(case when Hash_enc_ind = 1 then array('_h', '_e') else array('') end, x -> col_name || x))")
)
df2.show()
+------------+-------+------------+
| col_name|col_seq|Hash_enc_ind|
+------------+-------+------------+
| abc| 1| 0|
|first_name_h| 2| 1|
|first_name_e| 2| 1|
| last_name_h| 3| 1|
| last_name_e| 3| 1|
| full_name_h| 4| 1|
| full_name_e| 4| 1|
| XYZ| 5| 0|
| sal_h| 6| 1|
| sal_e| 6| 1|
| AAA| 7| 0|
+------------+-------+------------+
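Note that the exploded rows above keep their original col_seq (for example, 2 appears twice), while the expected output renumbers the sequence. A possible follow-up, sketched here with posexplode so the _h/_e order is preserved (column names are taken from the question):

from pyspark.sql import functions as F, Window

# Build the suffixed names together with their position inside each original row.
exploded = df.select(
    'col_seq',
    'Hash_enc_ind',
    F.posexplode(
        F.when(
            F.col('Hash_enc_ind') == 1,
            F.array(F.concat('col_name', F.lit('_h')), F.concat('col_name', F.lit('_e')))
        ).otherwise(F.array('col_name'))
    ).alias('pos', 'col_name')
)

# Renumber col_seq: order by the old sequence, then by the position within each row.
df3 = exploded.withColumn(
    'col_seq', F.row_number().over(Window.orderBy('col_seq', 'pos'))
).drop('pos')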

Loop 3 times and add a new value each time to a new column in spark DF

I want to create 3 rows for every row in a PySpark DF. I want to add a new column called loopVar = (val1, val2, val3). A different value must be added in each loop. Any idea how I do it?
Original:
a b c
1 2 3
1 2 3
Condition 1: loop = 1 and b is not null then loopvar = val1
Condition 2: loop = 2 and b is not null then loopvar = val2
Condition 3: loop = 3 and c is not null then loopvar = val3
Output :
a b c loopvar
1 2 3 val1
1 2 3 val1
1 2 3 val2
1 2 3 val2
1 2 3 val3
1 2 3 val3
Use a crossJoin:
df = spark.createDataFrame([[1,2,3], [1,2,3]]).toDF('a','b','c')
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 1| 2| 3|
+---+---+---+
df2 = spark.createDataFrame([['val1'], ['val2'], ['val3']]).toDF('loopvar')
df2.show()
+-------+
|loopvar|
+-------+
| val1|
| val2|
| val3|
+-------+
df3 = df.crossJoin(df2)
df3.show()
+---+---+---+-------+
| a| b| c|loopvar|
+---+---+---+-------+
| 1| 2| 3| val1|
| 1| 2| 3| val2|
| 1| 2| 3| val3|
| 1| 2| 3| val1|
| 1| 2| 3| val2|
| 1| 2| 3| val3|
+---+---+---+-------+
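If you also need the null checks from the question's conditions (val1 and val2 require b to be non-null, val3 requires c to be non-null), you could filter the cross-joined result afterwards; a sketch, assuming the column and value names above:

from pyspark.sql import functions as F

# Keep only the combinations allowed by the question's conditions.
df4 = df3.where(
    (F.col('loopvar').isin('val1', 'val2') & F.col('b').isNotNull())
    | ((F.col('loopvar') == 'val3') & F.col('c').isNotNull())
)
df4.show()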

Get the previous row value using spark sql

I have a table like this.
Id prod val
1 0 0
2 0 0
3 1 1000
4 0 0
5 1 2000
6 0 0
7 0 0
I want to add a new column new_val, and the condition for this column is: if prod = 0, then new_val should come from the preceding row where prod = 1.
If prod = 1, it should have the same value as the val column. How do I achieve this using Spark SQL?
Id prod val new_val
1 0 0 1000
2 0 0 1000
3 1 1000 1000
4 0 0 2000
5 1 2000 2000
6 1 4000 4000
7 1 3000 3000
Any help is greatly appreciated
You can use something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("id")
df = df.withColumn("new_val", F.when(F.col("prod") == 0, F.lag("val").over(w)).otherwise(F.col("val")))
What we are basically doing is using an if-else condition:
When prod == 0, take the lag of val, which is the value of the previous row (over a window ordered by the id column); when prod == 1, we use the present value of the column.
You can also achieve that with:
val w = Window.orderBy("id").rowsBetween(0, Window.unboundedFollowing)

df
  .withColumn("new_val", when($"prod" === 0, null).otherwise($"val"))
  .withColumn("new_val", first("new_val", ignoreNulls = true).over(w))
First, it creates a new column that is null wherever prod is 0:
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| null|
| 2| 0| 0| null|
| 3| 1|1000| 1000|
| 4| 0| 0| null|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+
Then it replaces the nulls with the first non-null value among the following records:
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| 1000|
| 2| 0| 0| 1000|
| 3| 1|1000| 1000|
| 4| 0| 0| 2000|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+
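For completeness, a rough PySpark equivalent of the Scala snippet above (a sketch, assuming the columns id, prod and val):

from pyspark.sql import functions as F, Window

# Forward-looking window: current row to the end of the (single) partition.
w = Window.orderBy('id').rowsBetween(0, Window.unboundedFollowing)

df2 = df.withColumn(
    'new_val', F.when(F.col('prod') == 0, F.lit(None)).otherwise(F.col('val'))
).withColumn(
    'new_val', F.first('new_val', ignorenulls=True).over(w)
)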

combine multiple row in Spark

I wonder if there is any easy way to combine multiple rows into one in PySpark. I am new to Python and Spark and have been using Spark SQL most of the time.
Here is a data example:
id count1 count2 count3
1 null 1 null
1 3 null null
1 null null 5
2 null 1 null
2 1 null null
2 null null 2
the expected output is :
id count1 count2 count3
1 3 1 5
2 1 1 2
I have been using Spark SQL to join them multiple times, and wonder if there is an easier way to do that.
Thank you!
Spark SQL ignores nulls when summing, so if you know there are no "overlapping" data elements, just group by the column you wish to aggregate on and sum.
Assuming that you want to keep your original column names (and not sum the id column), you'll need to specify the columns that are summed and then rename them after the aggregation.
before.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
| 1| null| 1| null|
| 1| 3| null| null|
| 1| null| null| 5|
| 2| null| 1| null|
| 2| 1| null| null|
| 2| null| null| 2|
+---+------+------+------+
from pyspark.sql.functions import col

after = before.groupby('id').sum(
    *[c for c in before.columns if c != 'id']
).select(
    [col(f"sum({c})").alias(c) for c in before.columns if c != 'id']
)
after.show()
+------+------+------+
|count1|count2|count3|
+------+------+------+
| 3| 1| 5|
| 1| 1| 2|
+------+------+------+
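If you also want to keep the id column, as in the expected output, add it to the final select; a minimal variation of the snippet above:

from pyspark.sql.functions import col

after = before.groupby('id').sum(
    *[c for c in before.columns if c != 'id']
).select(
    'id', *[col(f"sum({c})").alias(c) for c in before.columns if c != 'id']
)
after.show()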

Add a priority column in PySpark dataframe

I have a PySpark dataframe (input_dataframe) which looks like below:
id   col1  col2  col3  col4  col_check
101  1     0     1     1     -1
102  0     1     1     0     -1
103  1     1     0     1     -1
104  0     0     1     1     -1
I want to have a PySpark function (update_col_check) which updates the column col_check of this dataframe. I will pass one column name as an argument to this function. The function should check if the value of that column is 1, and if so, update the value of col_check to that column name. Let us say I am passing col2 as an argument to this function:
output_dataframe = update_col_check(input_dataframe, col2)
So, my output_dataframe should look like below:
id   col1  col2  col3  col4  col_check
101  1     0     1     1     -1
102  0     1     1     0     col2
103  1     1     0     1     col2
104  0     0     1     1     -1
Can I achieve this using PySpark? Any help will be appreciated.
You can do this fairly straightforwardly with the functions when and otherwise:
from pyspark.sql.functions import when, lit
def update_col_check(df, col_name):
    return df.withColumn(
        'col_check',
        when(df[col_name] == 1, lit(col_name)).otherwise(df['col_check'])
    )
update_col_check(df, 'col1').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101| 1| 0| 1| 1| col1|
|102| 0| 1| 1| 0| -1|
|103| 1| 1| 0| 1| col1|
|104| 0| 0| 1| 1| -1|
+---+----+----+----+----+---------+
update_col_check(df, 'col2').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101| 1| 0| 1| 1| -1|
|102| 0| 1| 1| 0| col2|
|103| 1| 1| 0| 1| col2|
|104| 0| 0| 1| 1| -1|
+---+----+----+----+----+---------+
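If you later need to run the check for several columns in priority order, the same function can be chained; a hypothetical usage where a later column overwrites an earlier match, mirroring the two calls above:

from functools import reduce

# Apply update_col_check once per column, feeding each result into the next call.
checked = reduce(update_col_check, ['col1', 'col2'], df)
checked.show()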
