Date-Shift to Roll-up in Pyspark? - apache-spark

I have data that looks similar to this:
What I want to do is replace the Stopdate of the first record with the Stopdate of the last record so that I can roll up all of the records that have a 1 in both gap columns. I suspect this needs an F.when statement, but everything I can think to construct in my head does not give me the result I want. How would I do this while making sure it only applies to records with the same ID?
Can anyone please help? Thanks!
Sample data in text
ID Startdate Stopdate gap_from_previous_in_days gap_to_next_in_days
1 1/1/2021 1/2/2021
1 1/3/2021 1/4/2021 1 1
1 1/5/2021 1/6/2021 1 1
1 1/7/2021 1/8/2021 1 1
1 1/9/2021 1/10/2021 1 1
1 1/11/2021 1/12/2021 1 1
1 1/13/2021 1/14/2021 1 1
1 1/15/2021 1/16/2021 1 1
1 1/17/2021 1/18/2021 1 1
1 1/19/2021 1/20/2021 1 2
My desired result:
ID Startdate Stopdate gap_from_previous_in_days gap_to_next_in_days
1 1/1/2021 1/20/2021
So basically I'm trying to create a table that instead of looking like this:
ID Startdate Stopdate gap_from_previous_in_days gap_to_next_in_days
1 1/1/2021 1/2/2021
1 1/3/2021 1/4/2021 1 1
1 1/5/2021 1/6/2021 1 1
1 1/7/2021 1/8/2021 1 1
1 1/9/2021 1/10/2021 1 1
1 1/11/2021 1/12/2021 1 1
1 1/13/2021 1/14/2021 1 1
1 1/15/2021 1/16/2021 1 1
1 1/17/2021 1/18/2021 1 1
1 1/19/2021 1/20/2021 1 3
1 1/23/2021 1/25/2021 3
Would look like this
ID Startdate Stopdate gap_from_previous_in_days gap_to_next_in_days
1 1/1/2021 1/20/2021 3
1 1/23/2021 1/25/2021 3
Hopefully that helps illustrate what I'm trying to do.
I'm basically trying to combine records that are only one day apart from each other.

The idea is to create a grouping column based on a running count of the rows where gap_from_previous_in_days != 1, group by that column and the ID, and take the earliest start date and latest stop date together with their associated gap values:
from pyspark.sql import functions as F, Window
result = df.withColumn(
    'Startdate2', F.to_date('Startdate', 'M/d/yyyy')
).withColumn(
    'Stopdate2', F.to_date('Stopdate', 'M/d/yyyy')
).withColumn(
    # running count of rows whose gap from the previous row is not 1
    # builds a group id within each ID
    'grp',
    F.count(
        F.when(F.col('gap_from_previous_in_days') != 1, 1)
    ).over(
        Window.partitionBy('ID').orderBy('Startdate2')
    )
).groupBy('ID', 'grp').agg(
    # structs compare by their first field, so min/max on the parsed date
    # carries the original string date and gap value along with it
    F.min(F.struct('Startdate2', 'Startdate', 'gap_from_previous_in_days')).alias('start'),
    F.max(F.struct('Stopdate2', 'Stopdate', 'gap_to_next_in_days')).alias('end')
).select(
    'ID',
    'start.Startdate', 'end.Stopdate',
    'start.gap_from_previous_in_days', 'end.gap_to_next_in_days'
)
result.show()
+---+---------+---------+-------------------------+-------------------+
| ID|Startdate| Stopdate|gap_from_previous_in_days|gap_to_next_in_days|
+---+---------+---------+-------------------------+-------------------+
| 1| 1/1/2021|1/20/2021| null| 3|
| 1|1/23/2021|1/25/2021| 3| null|
+---+---------+---------+-------------------------+-------------------+

Another approach is to construct all the dates covered by the given intervals.
Starting from this list of dates, new intervals can be recalculated.
This approach does not use the gap columns; they are recalculated at the end, but that step can be omitted if they are not needed.
# Create test dataframe
import pyspark.sql.functions as F

df = (spark.createDataFrame([[1, '2021-01-01', '2021-01-10'],
                             [1, '2021-01-11', '2021-01-12'],
                             [1, '2021-01-14', '2021-01-16'],
                             [1, '2021-01-17', '2021-01-20'],
                             [2, '2021-01-01', '2021-01-10'],
                             [2, '2021-01-12', '2021-01-14'],
                             [2, '2021-01-14', '2021-01-15'],
                             [2, '2021-01-19', '2021-01-20'],
                             ], schema="ID int, From string, To string")
      .selectExpr('ID',
                  'to_date(From, "yyyy-MM-dd") as StartDate',
                  'to_date(To, "yyyy-MM-dd") as StopDate')
      )

# Do actual calculation
df_result = (df
             # Get all included dates
             .selectExpr("ID", "explode(sequence(StartDate, StopDate)) as dates")
             # Get previous and next date
             .withColumn("Previous", F.expr('LAG(dates) OVER (PARTITION BY ID ORDER BY dates ASC)'))
             .withColumn("Next", F.expr('LAG(dates) OVER (PARTITION BY ID ORDER BY dates DESC)'))
             # Flag beginnings and endings of intervals
             .withColumn("Begin", F.expr("datediff(dates, Previous) > 1 OR Previous IS NULL"))
             .withColumn("End", F.expr("datediff(Next, dates) > 1 OR Next IS NULL"))
             # Only keep beginnings and endings
             .filter("Begin OR End")
             # Get the end next to each begin, then only keep beginnings
             .withColumn("IntervalEnd", F.expr('LAG(dates) OVER (PARTITION BY ID ORDER BY dates DESC)'))
             .filter("Begin")
             # Rename columns + calculate gaps
             .selectExpr(
                 "ID",
                 "dates as StartDate",
                 "IntervalEnd as StopDate",
                 "datediff(dates, LAG(IntervalEnd) OVER (PARTITION BY ID ORDER BY dates ASC)) as gap_from_previous_in_days",
                 "datediff(LAG(dates) OVER (PARTITION BY ID ORDER BY dates DESC), IntervalEnd) as gap_to_next_in_days"
             )
             )
df_result.show()
+---+----------+----------+-------------------------+-------------------+
| ID| StartDate| StopDate|gap_from_previous_in_days|gap_to_next_in_days|
+---+----------+----------+-------------------------+-------------------+
| 1|2021-01-01|2021-01-12| null| 2|
| 1|2021-01-14|2021-01-20| 2| null|
| 2|2021-01-01|2021-01-10| null| 2|
| 2|2021-01-12|2021-01-15| 2| 4|
| 2|2021-01-19|2021-01-20| 4| null|
+---+----------+----------+-------------------------+-------------------+
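Note that this approach materializes one row per calendar day per interval via explode(sequence(...)), which keeps the logic simple and handles overlapping or adjacent intervals naturally, but it can become expensive when intervals span long date ranges.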

Related

compare columns with NaN or <NA> values pandas

I have a dataframe containing NaN and <NA> values along with regular values, and I want to compare two columns in the same dataframe to check whether each row's values are null or not null. For example:
if column a_1 has a null value and column a_2 has a non-null value, then for that particular row the result should be 1 in the new column a_12.
If the values in both a_1 (value 123) and a_2 (value 345) are not null and the values are not equal, then the result should be 3 in column a_12.
Below is the code snippet I have used for the comparison. For scenario 1, I am getting the result 3 instead of 1. Please guide me to the correct output.
try:
    if (x[cols[0]] == x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 0
    elif (np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 0
    elif (~np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 1
    elif (np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 3
    else:
        pass
except Exception as exc:
    if (x[cols[0]] == x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 0
    elif (pd.isna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 0
    elif (pd.notna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 1
    elif (pd.isna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 3
    else:
        pass
I have used pd.isna() and pd.notna() as well as np.isnan() and ~np.isnan(), because np.isnan() works for some columns but just throws an error for others.
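(For context: np.isnan only supports numeric dtypes, so it raises a TypeError on object columns such as strings, while pd.isna handles any dtype. A minimal illustration with made-up data:)
import numpy as np
import pandas as pd

s = pd.Series(['abc', None, 'xyz'])   # object-dtype column
try:
    np.isnan(s)                       # raises TypeError: isnan not supported for object dtype
except TypeError as e:
    print('np.isnan failed:', e)
print(pd.isna(s).tolist())            # [False, True, False] -- works for any dtype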
Please guide me to achieve the expected result.
Expected Output:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 1 |
| <NA> | qweweqw | 2 |
| adsadgsgd | wwuwquq | 3 |
Output Got with the above code:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 3 |
| <NA> | qweweqw | 3 |
| adsadgsgd | wwuwquq | 3 |
Going by the logic in your code, you'd want to define a function and apply it across your DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a_1': [1, 2, np.nan, np.nan, 1], 'a_2': [2, np.nan, 1, np.nan, 1]})
The categories you want map neatly to binary numbers, which you can use to write a short function like -
def nan_check(row):
    x, y = row
    if x != y:
        return int(f'{int(pd.notna(y))}{int(pd.notna(x))}', base=2)
    return 0

df['flag'] = df.apply(nan_check, axis=1)
Output
a_1 a_2 flag
0 1.0 2.0 3
1 2.0 NaN 1
2 NaN 1.0 2
3 NaN NaN 0
4 1.0 1.0 0
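The flag here is effectively a two-bit number: the high bit records whether a_2 is non-null and the low bit whether a_1 is non-null, so '01' = 1 (only a_1 has a value), '10' = 2 (only a_2 has a value), '11' = 3 (both present but unequal) and '00' = 0 (both missing); equal non-null values return 0 directly.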
You can try np.select, but I think you need to rethink the conditions and the expected output.
Condition 1: if the column a_1 have null values, column a_2 have not null values, then for that particular row, the result should be 1 in the new column a_12.
Condition 2: If the values in both a_1 & a_2 is not null, and the values are not equal, then the result should be 3 in column a_12.
df['a_12'] = np.select(
    [df['a_1'].isna() & df['a_2'].notna(),
     df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],
    [1, 3],
    default=0
)
print(df)
a_1 a_2 result a_12
0 gssfwe gssfwe 0 0
1 NaN NaN 0 0
2 fsfsfw NaN 1 0 # Shouldn't be Condition 1 since a_1 is not NaN
3 NaN qweweqw 2 1 # Condition 1
4 adsadgsgd wwuwquq 3 3

Filter DataFrame to delete duplicate values in pyspark

I have the following dataframe
date | value | ID
--------------------------------------
2021-12-06 15:00:00 25 1
2021-12-06 15:15:00 35 1
2021-11-30 00:00:00 20 2
2021-11-25 00:00:00 10 2
I want to join this DF with another one like this:
idUser | Name | Gender
-------------------
1 John M
2 Anne F
My expected output is:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
What I need is: get only the most recent value from the first dataframe and join only that value with my second dataframe. However, my Spark script is joining both values:
My code:
df = df1.select(
    col("date"),
    col("value"),
    col("ID"),
).orderBy(
    col("ID").asc(),
    col("date").desc(),
).groupBy(
    col("ID"), col("date").cast(StringType()).substr(0, 10).alias("date")
).agg(
    max(col("value")).alias("value")
)
final_df = df2.join(
    df,
    (col("idUser") == col("ID")),
    how="left"
)
When I perform this join (formatting of the columns is abstracted away in this post), I get the following output:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
2 Anne F 10
I use substr to remove the hours and minutes so that I filter only by date. But when I have the same ID on different days, my output df has both values instead of only the most recent one. How can I fix this?
Note: I'm using only PySpark functions to do this (I do not want to use spark.sql(...)).
You can use the window and row_number functions in PySpark:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

windowSpec = Window.partitionBy("ID").orderBy(col("date").desc())
df1_latest_val = df1.withColumn("row_number", row_number().over(windowSpec)).filter(
    col("row_number") == 1
)
The resulting df1_latest_val dataframe will look something like this:
date | value | ID | row_number |
-----------------------------------------------------
2021-12-06 15:15:00 35 1 1
2021-11-30 00:00:00 20 2 1
Now you have a dataframe with the latest value per ID, which you can join directly with the other table.
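A minimal sketch of that join, assuming df2 is the user dataframe with the idUser, Name and Gender columns from the question (names are illustrative):
latest = df1_latest_val.select("ID", "value")
final_df = df2.join(latest, df2["idUser"] == latest["ID"], how="left")
final_df.select("ID", "Name", "Gender", "value").show()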

How to join the two dataframe by condition in PySpark?

I have two dataframes, as described below.
Dataframe 1
P_ID P_Name P_Description P_Size
100 Moto Mobile 16
200 Apple Mobile 15
300 Oppo Mobile 18
Dataframe 2
P_ID List_Code P_Amount
100 ALPHA 20000
100 BETA 60000
300 GAMMA 15000
Requirement:
Need to join the two dataframes by P_ID.
Information about the dataframes:
In dataframe 1, P_ID is a primary key; dataframe 2 doesn't have any primary attribute.
How to join the dataframes:
New columns need to be created in dataframe 1 from the values of dataframe 2's List_Code, with "_price" appended. If dataframe 2's List_Code contains 20 unique values, we need to create 20 columns in dataframe 1. Then the newly created columns in dataframe 1 have to be filled from dataframe 2's P_Amount column, matched on P_ID where present and zero otherwise. Once those columns exist we can join the dataframes on P_ID; my problem is creating the new columns with the expected values.
The expected dataframe is shown below
Expected dataframe
P_ID P_Name P_Description P_Size ALPHA_price BETA_price GAMMA_price
100 Moto Mobile 16 20000 60000 0
200 Apple Mobile 15 0 0 0
300 Oppo Mobile 18 0 0 15000
Can you please help me solve this problem? Thanks in advance.
For your application, you need to pivot the second dataframe and then left-join the first dataframe onto the pivoted result on P_ID.
See the code below.
import pandas as pd
from pyspark.sql import functions as f

df_1 = pd.DataFrame({'P_ID': [100, 200, 300], 'P_Name': ['Moto', 'Apple', 'Oppo'], 'P_Size': [16, 15, 18]})
sdf_1 = spark.createDataFrame(df_1)
df_2 = pd.DataFrame({'P_ID': [100, 100, 300], 'List_Code': ['ALPHA', 'BETA', 'GAMMA'], 'P_Amount': [20000, 60000, 10000]})
sdf_2 = spark.createDataFrame(df_2)

sdf_pivoted = sdf_2.groupby('P_ID').pivot('List_Code').agg(f.sum('P_Amount')).fillna(0)
sdf_joined = sdf_1.join(sdf_pivoted, on='P_ID', how='left').fillna(0)
sdf_joined.show()
+----+------+------+-----+-----+-----+
|P_ID|P_Name|P_Size|ALPHA| BETA|GAMMA|
+----+------+------+-----+-----+-----+
| 300| Oppo| 18| 0| 0|10000|
| 200| Apple| 15| 0| 0| 0|
| 100| Moto| 16|20000|60000| 0|
+----+------+------+-----+-----+-----+
You can change the column names or ordering of the dataframe as needed.
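For example, to get the "_price" suffix asked for in the question, the pivoted columns could be renamed before the join (a small sketch, reusing sdf_pivoted and sdf_1 from above):
renamed = sdf_pivoted
for c in sdf_pivoted.columns:
    if c != 'P_ID':
        renamed = renamed.withColumnRenamed(c, c + '_price')
sdf_joined = sdf_1.join(renamed, on='P_ID', how='left').fillna(0)
sdf_joined.show()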

How to efficiently disaggregate data from Google Analytics?

I have Google Analytics data which I am trying to disaggregate.
Below is a simplified version of the dataframe I am dealing with:
date | users | goal_completions
20150101| 2 | 1
20150102| 3 | 2
I would like to disaggregate the data such that each "user" has its own row. In addition, the third column, "goal_completions" will also be disaggregated with the assumption that each user can only have 1 "goal_completion".
The output I am seeking will be something like this:
date | users | goal_completions
20150101| 1 | 1
20150101| 1 | 0
20150102| 1 | 1
20150102| 1 | 1
20150102| 1 | 0
I was able to duplicate each row based on the number of users on a given date, however I can't seem to find a way to disaggregate the "goal_completion" column. Here is what I currently have after duplicating the "users" column:
date | users | goal_completions
20150101| 1 | 1
20150101| 1 | 1
20150102| 1 | 2
20150102| 1 | 2
20150102| 1 | 2
Any help will be appreciated - thanks!
IIUC, use repeat to expand your dataframe, then adjust the two columns using cumcount with np.where:
import numpy as np

df = df.reindex(df.index.repeat(df.users))
df = df.assign(users=1)
df.goal_completions = np.where(df.groupby(level=0).cumcount() < df.goal_completions, 1, 0)
df
Out[609]:
date users goal_completions
0 20150101 1 1
0 20150101 1 0
1 20150102 1 1
1 20150102 1 1
1 20150102 1 0
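The groupby(level=0) works because reindex with a repeated index keeps the original index values, so cumcount numbers the copies of each original row (0, 1, 2, ...); comparing that counter against goal_completions marks the first N copies with 1 and the rest with 0.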

Aggregating past and current values (monthly data) of Target column using pandas

I have a dataframe like the one below in pandas,
EMP_ID| Date| Target_GWP
1 | Jan-2017| 100
2 | Jan 2017| 300
1 | Feb-2017| 500
2 | Feb-2017| 200
and I need the output printed in the form below:
EMP_ID| Date| Target_GWP | past_Target_GWP
1 | Feb-2017| 600 |100
2 | Feb-2017| 500 |300
Basically, I have monthly data coming in from Excel and I want to aggregate Target_GWP for each EMP_ID against the latest (current) month, while also keeping a backup column in the pandas dataframe for the previous month's Target_GWP. So how do I bring back the previous month's Target_GWP and add it to the current month's Target_GWP?
Any leads on this would be appreciated.
Use:
#convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
#sorting and get last 2 rows
df = df.sort_values(['EMP_ID','Date']).groupby('EMP_ID').tail(2)
#aggregation
df = df.groupby('EMP_ID', as_index=False).agg({'Date':'last', 'Target_GWP':['sum','first']})
df.columns = ['EMP_ID','Date','Target_GWP','past_Target_GWP']
print (df)
EMP_ID Date Target_GWP past_Target_GWP
0 1 2017-02-01 600 100
1 2 2017-02-01 500 300
Or, if you need the most recent value of Target_GWP instead of the sum, use last:
df = df.groupby('EMP_ID', as_index=False).agg({'Date':'last', 'Target_GWP':['last','first']})
df.columns = ['EMP_ID','Date','Target_GWP','past_Target_GWP']
print (df)
EMP_ID Date Target_GWP past_Target_GWP
0 1 2017-02-01 500 100
1 2 2017-02-01 200 300
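An alternative sketch (not from the original answer) that uses groupby().shift() to carry the previous month's value, assuming df holds the sample data above:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['EMP_ID', 'Date'])
df['past_Target_GWP'] = df.groupby('EMP_ID')['Target_GWP'].shift()
latest = df.groupby('EMP_ID').tail(1).copy()
# for the summed variant, add the previous month back onto the current month
latest['Target_GWP'] = latest['Target_GWP'] + latest['past_Target_GWP']
print(latest[['EMP_ID', 'Date', 'Target_GWP', 'past_Target_GWP']])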
