Adjust overlapping dates in groupby with priority from another column - python-3.x

As the title suggests, I am working on a problem to find overlapping dates based on ID and adjust the overlapping date based on priority (weight). The following piece of code helped to find the overlapping dates:
from datetime import timedelta

# flag rows whose start date falls before the previous row's end date (per ID);
# Start_date and End_date must already be datetime for the subtraction to work
df['overlap'] = (df.groupby('ID')
                   .apply(lambda x: (x['End_date'].shift() - x['Start_date']) > timedelta(0))
                   .reset_index(level=0, drop=True))
df
Now the issue I'm facing is how to introduce the priority (weight) and adjust Start_date by it. In the image below, I have highlighted the adjusted dates based on weight, where A takes precedence over B and B over C.
Should I create a dictionary mapping the string weights to numeric values, and then what? I'm stuck here on setting up the logic.
Dataframe:
op_d = {'ID': [1, 1, 1, 2, 2, 3, 3, 3],
        'Start_date': ['9/1/2020', '10/10/2020', '11/18/2020', '4/1/2015', '5/12/2016', '4/1/2015', '5/15/2016', '8/1/2018'],
        'End_date': ['10/9/2020', '11/25/2020', '12/31/2020', '5/31/2016', '12/31/2016', '5/29/2016', '9/25/2018', '10/15/2020'],
        'Weight': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C']}
df = pd.DataFrame(data=op_d)

You have already identified the overlap condition; you can then add a day to End_date, shift it down, and assign the result to the start date wherever the overlap column is True:
import numpy as np

# where a row overlaps the previous one, start it the day after that row's end date
arr = np.where(df['overlap'],
               df['End_date'].add(pd.Timedelta(1, unit='d')).shift(),
               df['Start_date'])
out = df.assign(Output_Start_Date=arr, Output_End_Date=df['End_date'])
print(out)
ID Start_date End_date Weight overlap Output_Start_Date Output_End_Date
0 1 2020-09-01 2020-10-09 A False 2020-09-01 2020-10-09
1 1 2020-10-10 2020-11-25 B False 2020-10-10 2020-11-25
2 1 2020-11-18 2020-12-31 C True 2020-11-26 2020-12-31
3 2 2015-04-01 2016-05-31 A False 2015-04-01 2016-05-31
4 2 2016-05-12 2016-12-31 B True 2016-06-01 2016-12-31
5 3 2015-04-01 2016-05-29 A False 2015-04-01 2016-05-29
6 3 2016-05-15 2018-09-25 B True 2016-05-30 2018-09-25
7 3 2018-08-01 2020-10-15 C True 2018-09-26 2020-10-15
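To the follow-up question about the weight dictionary: one way (a sketch, assuming A outranks B outranks C, which matches the sample data) is to map Weight to a numeric rank and sort within each ID before computing the overlap flag, so the higher-priority row keeps its dates and the lower-priority row is the one whose start date gets pushed:
import pandas as pd

# assumption: A outranks B outranks C
rank = {'A': 0, 'B': 1, 'C': 2}

df['Start_date'] = pd.to_datetime(df['Start_date'])
df['End_date'] = pd.to_datetime(df['End_date'])

# within each ID, put the higher-priority row first; the overlap/shift
# logic above then adjusts the start date of the lower-priority row
df = (df.assign(priority=df['Weight'].map(rank))
        .sort_values(['ID', 'priority'])
        .drop(columns='priority')
        .reset_index(drop=True))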

Related

Pandas : Finding correct time window

I have a pandas dataframe which gets updated every hour with the latest hourly data. I have to filter out IDs based upon a threshold, i.e. PR_Rate > 50 and CNT_12571 < 30, for 3 consecutive hours within a lookback period of 5 hours. I was using the statements below to accomplish this:
df_thld = df[(df['Date'] > df['Date'].max() - pd.Timedelta(hours=5)) & (df.PR_Rate > 50) & (df.CNT_12571 < 30)]
df_thld.loc[:, 'HR_CNT'] = df_thld.groupby('ID')['Date'].transform('nunique')
df_thld[df_thld['HR_CNT'] > 3]
The problem with this approach is that, since the lookback period is 5 hours, HR_CNT can count non-consecutive hours that breach the criteria.
My dataset is as below:
DataFrame
Date IDs CT_12571 PR_Rate
16/06/2021 10:00 A1 15 50.487
16/06/2021 11:00 A1 31 40.806
16/06/2021 12:00 A1 25 52.302
16/06/2021 13:00 A1 13 61.45
16/06/2021 14:00 A1 7 73.805
In the above dataframe, the threshold was not breached at 11:00, but this approach counts 10:00, 12:00 and 13:00 as the hours that breached the threshold instead of 12:00, 13:00 and 14:00 as required. Each ID may or may not breach this criteria in a single day. Any idea how I can fix this issue?
Please excuse me if I have misinterpreted your problem. As I understand the issue, you have a dataframe which is updated hourly; an example of this dataframe is illustrated below as df. From this dataframe, you want to keep only those rows which satisfy the following two conditions:
PR_Rate > 50 and CNT_12571 < 30
If and only if the threshold is surpassed for three consecutive hours
Given these assumptions, I would proceed as follows:
df:
Date IDs CT_1257 PR_Rate
0 2021-06-16 10:00:00 A1 15 50.487
1 2021-06-16 12:00:00 A1 31 40.806
2 2021-06-16 14:00:00 A1 25 52.302
3 2021-06-16 15:00:00 A1 13 61.450
4 2021-06-16 16:00:00 A1 7 73.805
Note that in this dataframe, the only time frame which satisfies the above conditions is the entries for 14:00, 15:00 and 16:00.
def filterFrame(df, dur, pr_threshold, ct_threshold):
    # keep only the rows that breach both thresholds
    ff = df[(df['CT_1257'] < ct_threshold) & (df['PR_Rate'] > pr_threshold)].reset_index()
    # count how many breaching rows fall inside each trailing `dur`-hour window
    ml = list(ff.rolling(f'{dur}h', on='Date').count()['IDs'])
    r = len(ml) - 1
    rows = []
    # walk backwards: a window count of `dur` marks the end of a full consecutive run
    while r >= 0:
        if int(ml[r]) < dur:
            r -= 1
        else:
            k = int(ml[r])
            for i in range(k):
                rows.append(r - i)
            r -= k
    rows = rows[::-1]
    return ff.filter(items=rows, axis=0).reset_index()
Running filterFrame(df, 3, 50, 30) yields:
level_0 index Date IDs CT_1257 PR_Rate
0 1 2 2021-06-16 14:00:00 A1 25 52.302
1 2 3 2021-06-16 15:00:00 A1 13 61.450
2 3 4 2021-06-16 16:00:00 A1 7 73.805
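For what it's worth, a shorter run-length sketch (not the method above, and assuming the raw frame has one row per hour per ID and is sorted by Date; column names follow the example df):
import pandas as pd

# flag rows that breach both thresholds
breach = (df['PR_Rate'] > 50) & (df['CT_1257'] < 30)

# label each consecutive run of identical breach values within an ID,
# then measure the length of the run each row belongs to
run_id = (breach != breach.shift()).cumsum()
run_len = breach.groupby([df['IDs'], run_id]).transform('size')

# keep only rows inside a breaching run of at least 3 consecutive hours
out = df[breach & (run_len >= 3)]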

Find Matching rows in the data frame by comparing all rows based on certain conditions

I'm fairly new to python and would appreciate if someone can guide me in the right direction.
I have a dataset that has unique trades in each row. I need to find all rows that match on certain conditions. Basically, find any offsetting trades that fit a certain condition. For example:
Find trades that have the same REF_RATE, RECEIVE is within a difference of 5, and MATURITY_DATE is within 7 days of each other. I have attached an image of the data.
Thank You.
You can use groupby to achieve this. As per your requirement, specific to this ask: "Find trades that have the same REF_RATE, RECEIVE is within a difference of 5, MATURITY_DATE is within 7 days of each other", you can proceed like this:
#sample data created from the image of your dataset
>>> data = {'Maturity_Date':['2/01/2021','10/01/2021','10/01/2021','6/06/2021'],'Trade_id':['10484','12880','11798','19561'],'REF_RATE':['BBSW','BBSW','OIS','BBSW'],'Recive':[1.5,1.25,2,10]}
>>> df = pd.DataFrame(data)
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2/01/2021 10484 BBSW 1.50
1 10/01/2021 12880 BBSW 1.25
2 10/01/2021 11798 OIS 2.00
3 6/06/2021 19561 BBSW 10.00
#convert Maturity_Date to datetime format and sort REF_RATE by date if needed
>>> df['Maturity_Date'] = pd.to_datetime(df['Maturity_Date'], dayfirst=True)
>>> df['Maturity_Date'] = df.groupby('REF_RATE')['Maturity_Date'].apply(lambda x: x.sort_values()) #if needed
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2021-01-02 10484 BBSW 1.50
1 2021-01-10 12880 BBSW 1.25
2 2021-01-10 11798 OIS 2.00
3 2021-06-06 19561 BBSW 10.00
#groupby of REF_RATE and apply condition on date and receive column
>>> df['date_diff>7'] = df.groupby('REF_RATE')['Maturity_Date'].diff() / np.timedelta64(1, 'D') > 7
>>> df['rate_diff>5'] = df.groupby('REF_RATE')['Recive'].diff() > 5
>>> df
Maturity_Date Trade_id REF_RATE Recive date_diff>7 rate_diff>5
0 2021-01-02 10484 BBSW 1.50 False False
1 2021-01-10 12880 BBSW 1.25 True False #date_diff true as for BBSW Maturity date is more than 7
2 2021-01-10 11798 OIS 2.00 False False
3 2021-06-06 19561 BBSW 10.00 True True #rate_diff and date_diff true because date>7 and receive difference>5
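If you need every pair of trades compared rather than only consecutive rows, a self-merge is one option; a sketch of that (my own addition, not part of the answer above; it reuses the df built above, including the Recive spelling):
import pandas as pd

# pair up every two trades that share a REF_RATE
pairs = df.merge(df, on='REF_RATE', suffixes=('_x', '_y'))
pairs = pairs[pairs['Trade_id_x'] < pairs['Trade_id_y']]  # each unordered pair once

# keep pairs within 5 of each other on Recive and 7 days on Maturity_Date
matched = pairs[
    ((pairs['Recive_x'] - pairs['Recive_y']).abs() <= 5) &
    ((pairs['Maturity_Date_x'] - pairs['Maturity_Date_y']).abs() <= pd.Timedelta(days=7))
]
print(matched[['Trade_id_x', 'Trade_id_y', 'REF_RATE']])
With the four sample rows no pair passes both conditions, since the two BBSW trades that are close in rate mature 8 days apart.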

update columns based on id pandas

df_2:
order_id date amount name interval is_sent
123 2020-01-02 3 white today false
456 NaT 2 blue weekly false
789 2020-10-11 0 red monthly false
135 2020-6-01 3 orange weekly false
I am merging two dataframes, locating rows where a date is greater than the previous value, as well as rows where a previously missing date is now populated:
df_1['date'] = pd.to_datetime(df_1['date'])
df_2['date'] = pd.to_datetime(df_2['date'])
res = df_1.merge(df_2, on='order_id', suffixes=['_orig', ''])
m = res['date'].gt(res['date_orig']) | (res['date_orig'].isnull() & res['date'].notnull())
changes_df = res.loc[m, ['order_id', 'date', 'amount', 'name', 'interval', 'is_sent']]
After locating all my entities, I am changing changes_df['is_sent'] to True:
changes_df['is_sent'] = True
After the above is run, changes_df is:
order_id date amount name interval is_sent
123 2020-01-03 3 white today true
456 2020-12-01 2 blue weekly true
135 2020-6-02 3 orange weekly true
I then want to update only the values in df_2['date'] and df_2['is_sent'] to equal changes_df['date'] and changes_df['is_sent'].
Any insight is greatly appreciated.
Let us try update with set_index:
cf = changes_df[['order_id','date','is_sent']].set_index('order_id')
df_2 = df_2.set_index('order_id')
df_2.update(cf)
df_2.reset_index(inplace=True)
df_2
order_id date amount name interval is_sent
0 123 2020-01-03 3 white today True
1 456 2020-12-01 2 blue weekly True
2 789 2020-10-11 0 red monthly False
3 135 2020-6-02 3 orange weekly True
This is my solution:
df3 = df2.combine_first(cap_df1).reindex(df.index)
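For context, since cap_df1 is not defined in this thread: combine_first keeps the caller's non-null values and fills the gaps from the other frame, aligned on the index. A minimal illustration with made-up frames:
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan]})
b = pd.DataFrame({'x': [9.0, 2.0]})

# non-null values in `a` win; the NaN slot is filled from `b`
print(a.combine_first(b))
#      x
# 0  1.0
# 1  2.0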

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv", quotechar="'", usecols=["Day", "Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'])
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing to get the "balance" as of the 11th day of each month?
You can do groupby with factorize:
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x :x.factorize()[0])==n
df_sub = df[m].copy()
You can try filtering the dataframe to where the day is less than 12, then take the last of each group (grouped by month):
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
   .groupby([df['Day'].dt.year, df['Day'].dt.month], sort=False).last()
   .reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, expanding the date range:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0

Define function to classify records within a df and adding new columns. Pandas dfs

I have a list of about 20 dfs and I want to clean the data for analysis.
Can there be a function that loops through all the dfs in the list & performs the tasks below, if all the columns are the same?
Create a column [time_class] that classifies each arrival time as "early" or "late" by comparing it with the [appt_time] column. Next, I want to classify each record as "early_yes", "early_no", "late_yes" or "late_no" in another column called [time_response]; this column would check the values of [time_class], [YES] and [NO]. If a record is 'early' and '1' for yes, then the [time_response] column should say "early_yes". Then I want a frequency table to count the [time_response] occurrences; the frequency table headers will be from the [time_response] column.
How can I check to make sure the time columns are reading as times in pandas?
How can I change the values in the yes and no column to 'yes' and 'no' instead of the 1's?
each df has this format for these specific columns:
Arrival_time Appt_Time YES NO
07:25:00 08:00 1
08:24:00 08:40 1
08:12:00 09:00 1
09:20:00 09:30 1
10:01:00 10:00 1
09:33:00 09:30 1
10:22:00 10:20 1
10:29:00 10:30 1
I also have an age column in each df that I have tried binning using the cut() method, and I usually get the error that the input must be a one-dimensional array. Does this mean I cannot use this method if the df has columns other than just the age?
How can you define a function to check the age column and create bins grouped by 10 [20-100], then use these bins to create a frequency table? Ideally I'd like the freq table to be columns in each df. I am using pandas.
Any help is appreciated!!
UPDATE: When I try to compare arrival time and scheduled time, I get a type error TypeError: '<=' not supported between instances of 'int' and 'datetime.time'
Hopefully this helps you get started - you'll see that there are a few useful methods like replace in pandas and select from the numpy library. Also, if you want to apply any of this code to multiple dataframes that are all in the same format, you'll want to wrap the code in a function (a sketch of that is at the end of this answer).
import numpy as np
import pandas as pd
### this code helps recreate the df you posted
df = pd.DataFrame({
    "Arrival_time": ['07:25:00', '08:24:00', '08:12:00', '09:20:00', '10:01:00', '09:33:00', '10:22:00', '10:29:00'],
    "Appt_Time": ['08:00', '08:40', '09:00', '09:30', '10:00', '09:30', '10:20', '10:30'],
    "YES": ['1', '1', '', '', '', '1', '', '1'],
    "NO": ['', '', '1', '1', '1', '', '1', '']})
df.Arrival_time = pd.to_datetime(df.Arrival_time, format='%H:%M:%S').dt.time
df.Appt_Time = pd.to_datetime(df.Appt_Time, format='%H:%M').dt.time
### end here
# you can start using the code from this line onward:
# creates "time_class" column based on Arrival_time being before Appt_Time
df["time_class"] = (df.Arrival_time <= df.Appt_Time).replace({True: "early", False: "late"})
# creates a new column "time_response" based on conditions
# this may need to be changed depending on whether your "YES" and "NO" columns
# are a string or an int... I just assumed a string so you can modify this code as needed
conditions = [
    (df.time_class == "early") & (df.YES == '1'),
    (df.time_class == "early") & (df.YES != '1'),
    (df.time_class == "late") & (df.YES == '1'),
    (df.time_class == "late") & (df.YES != '1')]
choices = ["early_yes", "early_no", "late_yes", "late_no"]
df["time_response"] = np.select(conditions, choices)
# creates a new df to sum up each time_response
df_time_response_count = pd.DataFrame({"Counts": df["time_response"].value_counts()})
# replace 1 with YES and 1 with NO in your YES and NO columns
df.YES = df.YES.replace({'1': "YES"})
df.NO = df.NO.replace({'1': "NO"})
Output:
>>> df
Arrival_time Appt_Time YES NO time_class time_response
0 07:25:00 08:00:00 YES early early_yes
1 08:24:00 08:40:00 YES early early_yes
2 08:12:00 09:00:00 NO early early_no
3 09:20:00 09:30:00 NO early early_no
4 10:01:00 10:00:00 NO late late_no
5 09:33:00 09:30:00 YES late late_yes
6 10:22:00 10:20:00 NO late late_no
7 10:29:00 10:30:00 YES early early_yes
>>> df_time_response_count
Counts
early_yes 3
late_no 2
early_no 2
late_yes 1
To answer your question about binning, I think np.linspace() is the easiest way to create the bins you want.
So I'll add some random ages between 20 and 100 to the df:
df['age'] = [21,31,34,26,46,70,56,55]
So the dataframe looks like this:
df
Arrival_time Appt_Time YES NO time_class time_response age
0 07:25:00 08:00:00 YES early early_yes 21
1 08:24:00 08:40:00 YES early early_yes 31
2 08:12:00 09:00:00 NO early early_no 34
3 09:20:00 09:30:00 NO early early_no 26
4 10:01:00 10:00:00 NO late late_no 46
5 09:33:00 09:30:00 YES late late_yes 70
6 10:22:00 10:20:00 NO late late_no 56
7 10:29:00 10:30:00 YES early early_yes 55
Then use the value_counts method in pandas with the bins parameter:
df_age_counts = pd.DataFrame({"Counts": df.age.value_counts(bins = np.linspace(20,100,9))})
df_age_counts = df_age_counts.sort_index()
Output:
>>> df_age_counts
Counts
(19.999, 30.0] 2
(30.0, 40.0] 2
(40.0, 50.0] 1
(50.0, 60.0] 2
(60.0, 70.0] 1
(70.0, 80.0] 0
(80.0, 90.0] 0
(90.0, 100.0] 0
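Finally, since the question mentions a list of about 20 dataframes: a minimal sketch of wrapping the classification steps into a function and looping over the list (process_df and dfs are illustrative names, not from the original post):
import numpy as np
import pandas as pd

def process_df(df):
    # classify early/late exactly as in the answer above
    df = df.copy()
    df['time_class'] = (df.Arrival_time <= df.Appt_Time).replace({True: 'early', False: 'late'})
    conditions = [
        (df.time_class == 'early') & (df.YES == '1'),
        (df.time_class == 'early') & (df.YES != '1'),
        (df.time_class == 'late') & (df.YES == '1'),
        (df.time_class == 'late') & (df.YES != '1')]
    df['time_response'] = np.select(conditions, ['early_yes', 'early_no', 'late_yes', 'late_no'])
    return df

# dfs is assumed to be your list of ~20 dataframes with identical columns
cleaned = [process_df(d) for d in dfs]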
