PySpark join on ID then on year and month from 'date' column - apache-spark

I have two PySpark dataframes and want to join them on "ID", then on the year of the "date1" and "date2" columns, and then on the month of the same date columns.
df1:
ID col1 date1
1 1 2018-01-05
1 2 2018-02-05
2 4 2018-04-05
2 1 2018-05-05
3 1 2019-01-05
3 4 2019-02-05
df2:
ID col2 date2
1 1 2018-01-08
1 1 2018-02-08
2 4 2018-04-08
2 3 2018-05-08
3 1 2019-01-08
3 4 2019-02-08
Expected output:
ID col1 date1 col2 date2
1 1 2018-01-05 1 2018-01-08
1 2 2018-02-05 1 2018-02-08
2 4 2018-04-05 4 2018-04-08
2 1 2018-05-05 3 2018-05-08
3 1 2019-01-05 1 2019-01-08
3 4 2019-02-05 4 2019-02-08
I tried something along the lines of:
df = df1.join(df2, (ID & (df1.F.year(date1) == df2.F.year(date2)) & (df1.F.month(date1) == df2.F.month(date2))
How to join on date's month and year?

You can do it like this:
join_on = (df1.ID == df2.ID) & \
    (F.year(df1.date1) == F.year(df2.date2)) & \
    (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on)
Full example:
from pyspark.sql import functions as F
df1 = spark.createDataFrame(
    [(1, 1, '2018-01-05'),
     (1, 2, '2018-02-05'),
     (2, 4, '2018-04-05'),
     (2, 1, '2018-05-05'),
     (3, 1, '2019-01-05'),
     (3, 4, '2019-02-05')],
    ['ID', 'col1', 'date1'])
df2 = spark.createDataFrame(
    [(1, 1, '2018-01-08'),
     (1, 1, '2018-02-08'),
     (2, 4, '2018-04-08'),
     (2, 3, '2018-05-08'),
     (3, 1, '2019-01-08'),
     (3, 4, '2019-02-08')],
    ['ID', 'col2', 'date2'])
join_on = (df1.ID == df2.ID) & \
    (F.year(df1.date1) == F.year(df2.date2)) & \
    (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on).drop(df2.ID)
df.show()
# +---+----+----------+----+----------+
# | ID|col1| date1|col2| date2|
# +---+----+----------+----+----------+
# | 1| 1|2018-01-05| 1|2018-01-08|
# | 1| 2|2018-02-05| 1|2018-02-08|
# | 2| 4|2018-04-05| 4|2018-04-08|
# | 2| 1|2018-05-05| 3|2018-05-08|
# | 3| 1|2019-01-05| 1|2019-01-08|
# | 3| 4|2019-02-05| 4|2019-02-08|
# +---+----+----------+----+----------+
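If you prefer a plain equi-join (for instance to keep column resolution simple, or to allow broadcast/hash join strategies), an alternative sketch, assuming the same df1/df2 as above, is to derive year and month columns first and join on those:
# A sketch of the same join expressed as an equi-join on derived columns.
df1_keyed = df1.withColumn('y', F.year('date1')).withColumn('m', F.month('date1'))
df2_keyed = df2.withColumn('y', F.year('date2')).withColumn('m', F.month('date2'))
df = df1_keyed.join(df2_keyed, on=['ID', 'y', 'm']).drop('y', 'm')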

Related

compare columns with NaN or <NA> values pandas

I have a dataframe with NaN and regular values, and I want to compare two columns of that dataframe, checking row by row whether each value is null or not. For example:
If column a_1 has a null value and column a_2 has a non-null value, then for that particular row the result should be 1 in the new column a_12.
If the values in both a_1 (value 123) and a_2 (value 345) are non-null and the values are not equal, then the result should be 3 in column a_12.
Below is the code snippet I have used for the comparison; for scenario 1 I am getting the result 3 instead of 1. Please guide me to get the correct output.
try:
    if (x[cols[0]] == x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 0
    elif (np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 0
    elif (~np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 1
    elif (np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 3
    else:
        pass
except Exception as exc:
    if (x[cols[0]] == x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 0
    elif (pd.isna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 0
    elif (pd.notna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 1
    elif (pd.isna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 3
    else:
        pass
I have used both pd.isna()/pd.notna() and np.isnan()/~np.isnan(), because np.isnan() works for some columns but just throws an error for others.
Please guide me to achieve the expected result.
Expected Output:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 1 |
| <NA> | qweweqw | 2 |
| adsadgsgd | wwuwquq | 3 |
Output Got with the above code:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 3 |
| <NA> | qweweqw | 3 |
| adsadgsgd | wwuwquq | 3 |
Going by the logic in your code, you'd want to define a function and apply it across your DataFrame.
df = pd.DataFrame({'a_1': [1, 2, np.nan, np.nan, 1], 'a_2': [2, np.nan, 1, np.nan, 1]})
The categories you want map neatly to binary numbers, which you can use to write a short function like -
def nan_check(row):
    x, y = row
    if x != y:
        return int(f'{int(pd.notna(y))}{int(pd.notna(x))}', base=2)
    return 0

df['flag'] = df.apply(nan_check, axis=1)
Output
a_1 a_2 flag
0 1.0 2.0 3
1 2.0 NaN 1
2 NaN 1.0 2
3 NaN NaN 0
4 1.0 1.0 0
You can try np.select, but I think you need to rethink the condition and the expected output
Condition 1: if the column a_1 have null values, column a_2 have not null values, then for that particular row, the result should be 1 in the new column a_12.
Condition 2: If the values in both a_1 & a_2 is not null, and the values are not equal, then the result should be 3 in column a_12.
df['a_12'] = np.select(
    [df['a_1'].isna() & df['a_2'].notna(),
     df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],
    [1, 3],
    default=0
)
print(df)
a_1 a_2 result a_12
0 gssfwe gssfwe 0 0
1 NaN NaN 0 0
2 fsfsfw NaN 1 0 # Shouldn't be Condition 1 since a_1 is not NaN
3 NaN qweweqw 2 1 # Condition 1
4 adsadgsgd wwuwquq 3 3
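If you do want all four result codes from the question in a single vectorized pass, a sketch along the same np.select lines (assuming the 0/1/2/3 mapping described in the question, with nulls handled via isna/notna) could be:
conditions = [
    df['a_1'].isna() & df['a_2'].isna(),                              # both null          -> 0
    df['a_1'].notna() & df['a_2'].isna(),                             # only a_2 null      -> 1
    df['a_1'].isna() & df['a_2'].notna(),                             # only a_1 null      -> 2
    df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2']),  # both set, unequal  -> 3
]
df['a_12'] = np.select(conditions, [0, 1, 2, 3], default=0)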

PySpark window function - within n months from current row

I want to remove all rows within x months of current row (before and after based on date), when the current row is equal to 1.
E.g. given this PySpark df:
| id | date         | target |
|----|--------------|--------|
| a  | "2020-01-01" | 0      |
| a  | "2020-02-01" | 0      |
| a  | "2020-03-01" | 0      |
| a  | "2020-04-01" | 1      |
| a  | "2020-05-01" | 0      |
| a  | "2020-06-01" | 0      |
| a  | "2020-07-01" | 0      |
| a  | "2020-08-01" | 0      |
| a  | "2020-09-01" | 0      |
| a  | "2020-10-01" | 1      |
| a  | "2020-11-01" | 0      |
| b  | "2020-01-01" | 0      |
| b  | "2020-02-01" | 0      |
| b  | "2020-03-01" | 0      |
| b  | "2020-05-01" | 1      |
(Notice that the month of April does not exist for id b.)
If using an x value of 2, the resulting df would be:
| id | date         | target |
|----|--------------|--------|
| a  | "2020-01-01" | 0      |
| a  | "2020-04-01" | 1      |
| a  | "2020-07-01" | 0      |
| a  | "2020-10-01" | 1      |
| b  | "2020-01-01" | 0      |
| b  | "2020-02-01" | 0      |
| b  | "2020-05-01" | 1      |
I am able to remove the xth row before and after the row of interest using the code below, but I want to remove all rows between the current row and x in both directions, based on date.
window = 2
windowSpec = Window.partitionBy("id").orderBy(['id','date'])
df= df.withColumn("lagvalue", lag('target', window).over(windowSpec))
df= df.withColumn("leadvalue", lead('target', window).over(windowSpec))
df= df.where(col("lagvalue") == 0 & col("leadvalue") == 0)
In your case, rangeBetween can be very useful. It pays attention to the values and takes only the values which fall into the range. E.g. rangeBetween(-2, 2) would take all the values from 2 below to 2 above. As rangeBetween does not work with dates (or strings), I translated them into integers using months_between.
from pyspark.sql import functions as F, Window
df = spark.createDataFrame(
    [('a', '2020-01-01', 0),
     ('a', '2020-02-01', 0),
     ('a', '2020-03-01', 0),
     ('a', '2020-04-01', 1),
     ('a', '2020-05-01', 0),
     ('a', '2020-06-01', 0),
     ('a', '2020-07-01', 0),
     ('a', '2020-08-01', 0),
     ('a', '2020-09-01', 0),
     ('a', '2020-10-01', 1),
     ('a', '2020-11-01', 0),
     ('b', '2020-01-01', 0),
     ('b', '2020-02-01', 0),
     ('b', '2020-03-01', 0),
     ('b', '2020-05-01', 1)],
    ['id', 'date', 'target']
)
window = 2
windowSpec = Window.partitionBy('id').orderBy(F.months_between('date', F.lit('1970-01-01'))).rangeBetween(-window, window)
df = df.withColumn('to_remove', F.sum('target').over(windowSpec) - F.col('target'))
df = df.where(F.col('to_remove') == 0).drop('to_remove')
df.show()
# +---+----------+------+
# | id| date|target|
# +---+----------+------+
# | a|2020-01-01| 0|
# | a|2020-04-01| 1|
# | a|2020-07-01| 0|
# | a|2020-10-01| 1|
# | b|2020-01-01| 0|
# | b|2020-02-01| 0|
# | b|2020-05-01| 1|
# +---+----------+------+
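A caveat to keep in mind (not part of the original answer): months_between returns fractional values when the two dates do not fall on the same day of the month, which is fine for this data (everything is on the 1st) but can distort the range otherwise. A sketch that sidesteps this is to order by an integer month index instead:
# Assumes the same df and window value as above; year*12 + month is a whole
# number, so rangeBetween(-window, window) counts exact calendar months.
month_idx = F.year('date') * 12 + F.month('date')
windowSpec = Window.partitionBy('id').orderBy(month_idx).rangeBetween(-window, window)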

Date-Shift to Roll-up in Pyspark?

I have data that looks similar to the sample shown in text below.
What I want to do is replace the Stopdate of the first record with the Stopdate of the last record, so that I can roll up all of the records that have a 1 in both gap columns. I know this calls for an F.when statement, but nothing I can construct in my head gives me the result I want. How would I do this while making sure it only applies to records with this ID?
Can anyone please help? Thanks!
Sample data in text
ID Startdate Stopdate gap_from_previous_in_days gap_to_next_in_days
1 1/1/2021 1/2/2021
1 1/3/2021 1/4/2021 1 1
1 1/5/2021 1/6/2021 1 1
1 1/7/2021 1/8/2021 1 1
1 1/9/2021 1/10/2021 1 1
1 1/11/2021 1/12/2021 1 1
1 1/13/2021 1/14/2021 1 1
1 1/15/2021 1/16/2021 1 1
1 1/17/2021 1/18/2021 1 1
1 1/19/2021 1/20/2021 1 2
My desired result:
ID Startdate Stopdate gap_from_previous_in_days gap_to_next_in_days
1 1/1/2021 1/20/2021
So basically I'm trying to create a table that instead of looking like this:
ID Startdate Stopdate gap_from_previous_in_days gap_to_next_in_days
1 1/1/2021 1/2/2021
1 1/3/2021 1/4/2021 1 1
1 1/5/2021 1/6/2021 1 1
1 1/7/2021 1/8/2021 1 1
1 1/9/2021 1/10/2021 1 1
1 1/11/2021 1/12/2021 1 1
1 1/13/2021 1/14/2021 1 1
1 1/15/2021 1/16/2021 1 1
1 1/17/2021 1/18/2021 1 1
1 1/19/2021 1/20/2021 1 3
1 1/23/2021 1/25/2021 3
Would look like this
ID Startdate Stopdate gap_from_previous_in_days gap_to_next_in_days
1 1/1/2021 1/2/2021 3
1 1/23/2021 1/25/2021 3
Hopefully that helps illustrate what I'm trying to do.
I'm basically trying to combine records that are only one day apart from each other.
The idea is to create a grouping column based on the count of rows with gap_from_previous_in_days != 1, group by that grouping column and the ID, get the earliest start date and latest stop date, and their associated values of the gap:
from pyspark.sql import functions as F, Window
result = df.withColumn(
    'Startdate2', F.to_date('Startdate', 'M/d/yyyy')
).withColumn(
    'Stopdate2', F.to_date('Stopdate', 'M/d/yyyy')
).withColumn(
    'grp',
    F.count(
        F.when(F.col('gap_from_previous_in_days') != 1, 1)
    ).over(
        Window.partitionBy('ID').orderBy('Startdate2')
    )
).groupBy('ID', 'grp').agg(
    F.min(F.struct('Startdate2', 'Startdate', 'gap_from_previous_in_days')).alias('start'),
    F.max(F.struct('Stopdate2', 'Stopdate', 'gap_to_next_in_days')).alias('end')
).select(
    'ID',
    'start.Startdate', 'end.Stopdate',
    'start.gap_from_previous_in_days', 'end.gap_to_next_in_days'
)
result.show()
+---+---------+---------+-------------------------+-------------------+
| ID|Startdate| Stopdate|gap_from_previous_in_days|gap_to_next_in_days|
+---+---------+---------+-------------------------+-------------------+
| 1| 1/1/2021|1/20/2021| null| 3|
| 1|1/23/2021|1/25/2021| 3| null|
+---+---------+---------+-------------------------+-------------------+
Another approach is to construct all dates covered by the given intervals. Starting from this list of dates, new intervals can be recalculated.
The gap columns are not used or needed for this approach. They are therefore recalculated at the end, but if they are not needed this step can be omitted.
# Create test dataframe
import pyspark.sql.functions as F
df = (spark.createDataFrame([[1, '2021-01-01', '2021-01-10'],
                             [1, '2021-01-11', '2021-01-12'],
                             [1, '2021-01-14', '2021-01-16'],
                             [1, '2021-01-17', '2021-01-20'],
                             [2, '2021-01-01', '2021-01-10'],
                             [2, '2021-01-12', '2021-01-14'],
                             [2, '2021-01-14', '2021-01-15'],
                             [2, '2021-01-19', '2021-01-20'],
                             ], schema="ID int, From string, To string")
      .selectExpr('ID',
                  'to_date(From, "yyyy-MM-dd") as StartDate',
                  'to_date(To, "yyyy-MM-dd") as StopDate')
      )

# Do actual calculation
df_result = (df
    # Get all included dates
    .selectExpr("ID", "explode(sequence(StartDate, StopDate)) as dates")
    # Get previous and next date
    .withColumn("Previous", F.expr('LAG(dates) OVER (PARTITION BY ID ORDER BY dates ASC)'))
    .withColumn("Next", F.expr('LAG(dates) OVER (PARTITION BY ID ORDER BY dates DESC)'))
    # Flag beginnings and endings of intervals
    .withColumn("Begin", F.expr("datediff(dates, previous) > 1 OR previous IS NULL"))
    .withColumn("End", F.expr("datediff(next, dates) > 1 OR next IS NULL"))
    # Only keep beginnings and endings
    .filter("Begin OR End")
    # Get end next to begin and only keep beginnings
    .withColumn("IntervalEnd", F.expr('LAG(dates) OVER (PARTITION BY ID ORDER BY dates DESC)'))
    .filter("Begin")
    # Rename columns + calculate gaps
    .selectExpr(
        "ID",
        "dates as StartDate",
        "IntervalEnd as StopDate",
        "datediff(dates, LAG(IntervalEnd) OVER (PARTITION BY ID ORDER BY dates ASC)) as gap_from_previous_in_days",
        "datediff(LAG(dates) OVER (PARTITION BY ID ORDER BY dates DESC), IntervalEnd) as gap_to_next_in_days"
    )
)
df_result.show()
+---+----------+----------+-------------------------+-------------------+
| ID| StartDate| StopDate|gap_from_previous_in_days|gap_to_next_in_days|
+---+----------+----------+-------------------------+-------------------+
| 1|2021-01-01|2021-01-12| null| 2|
| 1|2021-01-14|2021-01-20| 2| null|
| 2|2021-01-01|2021-01-10| null| 2|
| 2|2021-01-12|2021-01-15| 2| 4|
| 2|2021-01-19|2021-01-20| 4| null|
+---+----------+----------+-------------------------+-------------------+

Fetch column value based on dynamic input

I have a dataframe with one column that contains, for each row, the name of another column that satisfies certain conditions.
Say the columns of the dataframe are Index, Col1, Col2, Col3 and Col_Name, where Col_Name holds either 'Col1', 'Col2' or 'Col3' for each row.
Now, in a new column, say Col_New, I want the looked-up value for each row: e.g. if Col_Name in the 5th row says 'Col1', then Col_New should hold the value of Col1 in the 5th row.
I am sorry I cannot post the code I am working on, hence this hypothetical example.
Obliged for any help, thanks.
IIUC you could use:
df['col_new'] = df.reset_index().apply(lambda x: df.at[x['index'], x['col_name']], axis=1)
Example:
cols = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(np.random.rand(10, 3), columns=cols)
df['Col_Name'] = np.random.choice(cols, 10)
print(df)
Col1 Col2 Col3 Col_Name
0 0.833988 0.939254 0.256450 Col2
1 0.675909 0.609494 0.641944 Col3
2 0.877474 0.971299 0.218273 Col3
3 0.201189 0.265742 0.800580 Col2
4 0.397945 0.135153 0.941313 Col2
5 0.666252 0.697983 0.164768 Col2
6 0.863377 0.839421 0.601316 Col2
7 0.138975 0.731359 0.379258 Col3
8 0.412148 0.541033 0.197861 Col2
9 0.980040 0.506752 0.823274 Col3
df['Col_New'] = df.reset_index().apply(lambda x: df.at[x['index'], x['Col_Name']], axis=1)
[out]
Col1 Col2 Col3 Col_Name Col_New
0 0.833988 0.939254 0.256450 Col2 0.939254
1 0.675909 0.609494 0.641944 Col3 0.641944
2 0.877474 0.971299 0.218273 Col3 0.218273
3 0.201189 0.265742 0.800580 Col2 0.265742
4 0.397945 0.135153 0.941313 Col2 0.135153
5 0.666252 0.697983 0.164768 Col2 0.697983
6 0.863377 0.839421 0.601316 Col2 0.839421
7 0.138975 0.731359 0.379258 Col3 0.379258
8 0.412148 0.541033 0.197861 Col2 0.541033
9 0.980040 0.506752 0.823274 Col3 0.823274
Example 2 (based on integer col references)
cols = [1, 2, 3]
np.random.seed(0)
df = pd.DataFrame(np.random.rand(10, 3), columns=cols)
df[13] = np.random.choice(cols, 10)
print(df)
1 2 3 13
0 0.548814 0.715189 0.602763 3
1 0.544883 0.423655 0.645894 3
2 0.437587 0.891773 0.963663 1
3 0.383442 0.791725 0.528895 3
4 0.568045 0.925597 0.071036 1
5 0.087129 0.020218 0.832620 1
6 0.778157 0.870012 0.978618 1
7 0.799159 0.461479 0.780529 2
8 0.118274 0.639921 0.143353 2
9 0.944669 0.521848 0.414662 3
Instead use:
df['Col_New'] = df.reset_index().apply(lambda x: df.at[int(x['index']), int(x[13])], axis=1)
1 2 3 13 Col_New
0 0.548814 0.715189 0.602763 3 0.602763
1 0.544883 0.423655 0.645894 3 0.645894
2 0.437587 0.891773 0.963663 1 0.437587
3 0.383442 0.791725 0.528895 3 0.528895
4 0.568045 0.925597 0.071036 1 0.568045
5 0.087129 0.020218 0.832620 1 0.087129
6 0.778157 0.870012 0.978618 1 0.778157
7 0.799159 0.461479 0.780529 2 0.461479
8 0.118274 0.639921 0.143353 2 0.639921
9 0.944669 0.521848 0.414662 3 0.414662
Using the example DataFrame from Chris A., you could do it like this:
cols = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(np.random.rand(10, 3), columns=cols)
df['Col_Name'] = np.random.choice(cols, 10)
print(df)
df['Col_New'] = [df.loc[df.index[i], j] for i, j in enumerate(df.Col_Name)]
print(df)
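An equivalent sketch (not from the original answer), assuming the index labels are unique, that skips the positional df.index[i] lookup:
df['Col_New'] = [df.loc[i, c] for i, c in df['Col_Name'].items()]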
Pandas has DataFrame.lookup for exactly this. It also seems to need the same type of values in the value columns and the lookup column, so it is possible to convert both to strings:
np.random.seed(123)
cols = [1, 2, 3]
df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=cols).rename(columns=str)
df['Col_Name'] = np.random.choice(cols, 5)
df['Col_New'] = df.lookup(df.index, df['Col_Name'].astype(str))
print(df)
1 2 3 Col_Name Col_New
0 2 2 6 3 6
1 1 3 9 2 3
2 6 1 0 1 6
3 1 9 0 1 1
4 0 9 3 1 0
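Note (not part of the original answer): DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A NumPy-based sketch of the same lookup, reusing the string-column example from Chris A.'s answer above:
# factorize gives, per row, an integer code pointing at the matching column
idx, cols = pd.factorize(df['Col_Name'])
df['Col_New'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]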

pandas - with restructuring data in data frame

I have a data frame that has data in format
time | name | value
01/01/1970 | A | 1
02/01/1970 | A | 2
03/01/1970 | A | 1
01/01/1970 | B | 5
02/01/1970 | B | 3
I want to change this data to something like
time | A | B
01/01/1970 | 1 | 5
02/01/1970 | 2 | 3
03/01/1970 | 1 | NA
How can I achieve this in pandas? I have tried groupby on the dataframe and then joining, but it's not coming out right.
thanks in advance
Use DataFrame.pivot (doc):
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'name': ['A', 'A', 'A', 'B', 'B'],
     'time': ['01/01/1970', '02/01/1970', '03/01/1970', '01/01/1970', '02/01/1970'],
     'value': [1, 2, 1, 5, 3]})
print(df.pivot(index='time', columns='name', values='value'))
yields
A B
time
01/01/1970 1 5
02/01/1970 2 3
03/01/1970 1 NaN
Note that time is now the index. If you wish to make it a column, call reset_index():
df.pivot(index='time', columns='name', values='value').reset_index()
# name time A B
# 0 01/01/1970 1 5
# 1 02/01/1970 2 3
# 2 03/01/1970 1 NaN
Use the .pivot function:
df = pd.DataFrame({'time': [0, 1, 2, 3],
                   'name': ['A', 'A', 'B', 'B'],
                   'value': [10, 20, 30, 40]})
df.pivot(index='time', columns='name', values='value')
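One caveat worth adding (not from the original answers): DataFrame.pivot raises an error if any (time, name) pair occurs more than once. If duplicates are possible in your data, pivot_table with an aggregation function is the usual fallback, for example:
df.pivot_table(index='time', columns='name', values='value', aggfunc='first')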
