Update columns based on id in pandas

df_2:
  order_id        date  amount    name  interval  is_sent
       123  2020-01-02       3   white     today    false
       456         NaT       2    blue    weekly    false
       789  2020-10-11       0     red   monthly    false
       135  2020-6-01        3  orange    weekly    false
I am merging two dataframes and locating rows where the date is greater than the previous one, or where a date is now present that was previously missing:
# parse both date columns so they compare as datetimes
df_1['date'] = pd.to_datetime(df_1['date'])
df_2['date'] = pd.to_datetime(df_2['date'])

# merge on order_id; the overlapping 'date' column from df_1 gets the '_orig' suffix
res = df_1.merge(df_2, on='order_id', suffixes=['_orig', ''])

# rows where 'date' moved past 'date_orig', or where a missing date was filled in
m = res['date'].gt(res['date_orig']) | (res['date_orig'].isnull() & res['date'].notnull())
changes_df = res.loc[m, ['order_id', 'date', 'amount', 'name', 'interval', 'is_sent']]
After locating the matching rows, I set changes_df['is_sent'] to True:
changes_df['is_sent'] = True
After the above is run, changes_df is:
  order_id        date  amount    name  interval  is_sent
       123  2020-01-03       3   white     today     true
       456  2020-12-01       2    blue    weekly     true
       135  2020-6-02        3  orange    weekly     true
I then want to update only df_2['date'] and df_2['is_sent'] with the corresponding values from changes_df.
Any insight is greatly appreciated.

Let us try update with set_index. DataFrame.update aligns on the index and overwrites df_2 in place with the non-NA values from cf, so rows absent from changes_df (like 789) stay untouched:
cf = changes_df[['order_id','date','is_sent']].set_index('order_id')
df_2 = df_2.set_index('order_id')
df_2.update(cf)
df_2.reset_index(inplace=True)
df_2
   order_id        date  amount    name  interval  is_sent
0       123  2020-01-03       3   white     today     True
1       456  2020-12-01       2    blue    weekly     True
2       789  2020-10-11       0     red   monthly    False
3       135  2020-6-02        3  orange    weekly     True

This is my solution:
df3 = df2.combine_first(cap_df1).reindex(df.index)
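For completeness, here is a minimal sketch of how combine_first could be applied to the question's actual frames, assuming cap_df1 above is meant to be changes_df (the names df2/cap_df1/df in that snippet do not match the question, so this is a guess at the intent):

# Sketch, assuming cap_df1 refers to changes_df.
base = df_2.set_index('order_id')
cf = changes_df.set_index('order_id')[['date', 'is_sent']]

# combine_first prefers non-NA values from cf and falls back to base,
# so only the changed rows get new date/is_sent values.
df_2 = (cf.combine_first(base)
          .reindex(base.index)[base.columns]  # restore row and column order
          .reset_index())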

Related

How to create new columns in dataframe based on conditional matches on another dataframe?

Situation
I have two dataframes. df1 holds some information about cars:
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'Price': [22000,25000,27000,35000]
}
and df2 that holds media types corresponding to the cars in df1:
images = {'Brand': ['Honda Civic','Honda Civic','Honda Civic','Toyota Corolla','Toyota Corolla','Audi A4'],
'MediaType': ['A','B','C','A','B','C']
}
Expected result
As a result, I want to create an overview in df1 that tells whether each media type is available for the car or not:
result = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'Price': [22000,25000,27000,35000],
'MediaTypeA' : [True,True,False,False],
'MediaTypeB' : [True,True,False,False],
'MediaTypeC' : [False,False,False,True]
}
How can I achieve this?
I can already check whether a Brand from df1 exists in df2, which tells me whether there is any media type available at all:
df1['check'] = df1['Brand'].isin(df2['Brand'])
but I am not sure how to combine it with the check for the specific media types.
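For reference, the two dictionaries above become the DataFrames that the answer below operates on:

import pandas as pd

df1 = pd.DataFrame(cars)
df2 = pd.DataFrame(images)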
Use get_dummies for the indicators, reduce to one row per Brand by taking the max, add the result to the first DataFrame with DataFrame.join, and finally replace missing values:
df11 = pd.get_dummies(df2.set_index('Brand')['MediaType'], dtype=bool).groupby(level=0).max()
df = df1.join(df11, on='Brand').fillna(False)
print(df)
            Brand  Price      A      B      C
0     Honda Civic  22000   True   True   True
1  Toyota Corolla  25000   True   True  False
2      Ford Focus  27000  False  False  False
3         Audi A4  35000  False  False   True
If some brands in df1 might have no rows in df2, you instead need DataFrame.reindex with fill_value=False:
df22 = pd.get_dummies(df2.set_index('Brand')['MediaType'], dtype=bool).groupby(level=0).max()
df = df1.join(df22.reindex(df1['Brand'].unique(), fill_value=False), on='Brand')
print(df)
            Brand  Price      A      B      C
0     Honda Civic  22000   True   True   True
1  Toyota Corolla  25000   True   True  False
2      Ford Focus  27000  False  False  False
3         Audi A4  35000  False  False   True
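If you need the exact MediaTypeA/MediaTypeB/MediaTypeC column names from the expected result, one small extension of the approach above is DataFrame.add_prefix:

df11 = (pd.get_dummies(df2.set_index('Brand')['MediaType'], dtype=bool)
          .groupby(level=0).max()
          .add_prefix('MediaType'))
df = df1.join(df11, on='Brand').fillna(False)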

Find Matching rows in the data frame by comparing all rows based on certain conditions

I'm fairly new to Python and would appreciate it if someone could guide me in the right direction.
I have a dataset that has unique trades in each row. I need to find all rows that match on certain conditions. Basically, find any offsetting trades that fit a certain condition. For example:
Find trades that have the same REF_RATE, RECEIVE within a difference of 5, and MATURITY_DATE within 7 days of each other. I have attached an image of the data.
Thank You.
You can use groupby to achieve this. For your specific requirement (find trades that have the same REF_RATE, RECEIVE within a difference of 5, and MATURITY_DATE within 7 days of each other) you can proceed like this:
# sample data created from the image of your dataset
>>> import pandas as pd
>>> import numpy as np
>>> data = {'Maturity_Date':['2/01/2021','10/01/2021','10/01/2021','6/06/2021'],'Trade_id':['10484','12880','11798','19561'],'REF_RATE':['BBSW','BBSW','OIS','BBSW'],'Recive':[1.5,1.25,2,10]}
>>> df = pd.DataFrame(data)
>>> df
  Maturity_Date Trade_id REF_RATE  Recive
0     2/01/2021    10484     BBSW    1.50
1    10/01/2021    12880     BBSW    1.25
2    10/01/2021    11798      OIS    2.00
3     6/06/2021    19561     BBSW   10.00
# convert Maturity_Date to datetime format and, if needed, sort the dates within each REF_RATE
>>> df['Maturity_Date'] = pd.to_datetime(df['Maturity_Date'], dayfirst=True)
>>> df['Maturity_Date'] = df.groupby('REF_RATE')['Maturity_Date'].apply(lambda x: x.sort_values()) #if needed
>>> df
  Maturity_Date Trade_id REF_RATE  Recive
0    2021-01-02    10484     BBSW    1.50
1    2021-01-10    12880     BBSW    1.25
2    2021-01-10    11798      OIS    2.00
3    2021-06-06    19561     BBSW   10.00
# group by REF_RATE and apply the conditions to the date and Recive columns
>>> df['date_diff>7'] = df.groupby('REF_RATE')['Maturity_Date'].diff() / np.timedelta64(1, 'D') > 7
>>> df['rate_diff>5'] = df.groupby('REF_RATE')['Recive'].diff() > 5
>>> df
  Maturity_Date Trade_id REF_RATE  Recive  date_diff>7  rate_diff>5
0    2021-01-02    10484     BBSW    1.50        False        False
1    2021-01-10    12880     BBSW    1.25         True        False  # date_diff>7 is True: the BBSW maturity gap exceeds 7 days
2    2021-01-10    11798      OIS    2.00        False        False
3    2021-06-06    19561     BBSW   10.00         True         True  # both True: date gap > 7 days and Recive difference > 5
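Note that diff only compares each trade with the one before it inside its REF_RATE group. If the goal is to find every offsetting pair rather than just adjacent rows, a hedged sketch using a self-merge (the suffixes and variable names are illustrative) could look like this:

# Sketch: compare every pair of trades that share a REF_RATE via a self-merge.
pairs = df.merge(df, on='REF_RATE', suffixes=('_a', '_b'))
pairs = pairs[pairs['Trade_id_a'] < pairs['Trade_id_b']]  # each unordered pair once, no self-matches

within_7_days = (pairs['Maturity_Date_a'] - pairs['Maturity_Date_b']).abs().dt.days.le(7)
within_5_rate = (pairs['Recive_a'] - pairs['Recive_b']).abs().le(5)

offsetting = pairs[within_7_days & within_5_rate]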

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
         Day  Accumulative Number
0   9/1/2020                  100
1  11/1/2020                  102
2  18/1/2020                   98
3  11/2/2020                  105
4  24/2/2020                   95
5   6/3/2020                  120
6  13/3/2020                  100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that exact day, so I am trying to get the latest day before the 12th of each month. The result would be:
         Day  Accumulative Number
0  11/1/2020                  102
1  11/2/2020                  105
2   6/3/2020                  120
What I managed to do so far is to just get the latest day of each month:
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv", quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'])
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
         Day  Accumulative Number
1 2020-01-18                   98
2 2020-02-24                   95
3 2020-03-13                  100
Is there a way to achieve this in pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing to get the "balance" as of the 11th day of each month?
You can do groupby with factorize, which numbers each month's dates in order:
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x :x.factorize()[0])==n
df_sub = df[m].copy()
You can try filtering the dataframe where the day is less than 12, then taking the last row of each group (grouped by year and month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
         Day  Accumulative_Number
0 2020-01-11                  102
1 2020-02-11                  105
2 2020-03-06                  120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
         Day  Accumulative Number
1 2020-01-11                  102
3 2020-02-11                  105
5 2020-03-06                  120
Here's another way, expanding the date range (note that this reports the balance as of the 11th itself, carried forward from the last activity):
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill ('Number' stands for the accumulative column here)
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
        Date  Number
0 2020-01-11   102.0
1 2020-02-11   105.0
2 2020-03-11   120.0

Adjust the overlapping dates in group by with priority from another columns

As the title suggests, I am working on a problem of finding overlapping dates based on ID and adjusting the overlapping date based on priority (weight). The following piece of code helped to find the overlapping dates:
from datetime import timedelta

df['overlap'] = (df.groupby('ID')
                   .apply(lambda x: (x['End_date'].shift() - x['Start_date']) > timedelta(0))
                   .reset_index(level=0, drop=True))
df
Now the issue I'm facing is how to introduce priority (weight) and adjust Start_date by it. In the image below I have highlighted the adjusted dates based on weight, where A takes precedence over B, and B over C.
Should I create a dictionary mapping the string weights to numeric values, and then what? I'm stuck on setting up the logic.
Dataframe:
op_d = {'ID': [1,1,1,2,2,3,3,3],
        'Start_date': ['9/1/2020','10/10/2020','11/18/2020','4/1/2015','5/12/2016','4/1/2015','5/15/2016','8/1/2018'],
        'End_date': ['10/9/2020','11/25/2020','12/31/2020','5/31/2016','12/31/2016','5/29/2016','9/25/2018','10/15/2020'],
        'Weight': ['A','B','C','A','B','A','B','C']}
df = pd.DataFrame(data=op_d)
You have already identified the overlap condition; you can then try adding a day to End_date, shifting it, and assigning the shifted values to the start date where the overlap column is True:
import numpy as np

# assumes Start_date/End_date were already parsed with pd.to_datetime
arr = np.where(df['overlap'],
               df['End_date'].add(pd.Timedelta(1, unit='d')).shift(),
               df['Start_date'])
out = df.assign(Output_Start_Date=arr, Output_End_Date=df['End_date'])
print(out)
   ID Start_date   End_date Weight  overlap Output_Start_Date Output_End_Date
0   1 2020-09-01 2020-10-09      A    False        2020-09-01      2020-10-09
1   1 2020-10-10 2020-11-25      B    False        2020-10-10      2020-11-25
2   1 2020-11-18 2020-12-31      C     True        2020-11-26      2020-12-31
3   2 2015-04-01 2016-05-31      A    False        2015-04-01      2016-05-31
4   2 2016-05-12 2016-12-31      B     True        2016-06-01      2016-12-31
5   3 2015-04-01 2016-05-29      A    False        2015-04-01      2016-05-29
6   3 2016-05-15 2018-09-25      B     True        2016-05-30      2018-09-25
7   3 2018-08-01 2020-10-15      C     True        2018-09-26      2020-10-15
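To address the dictionary idea from the question directly: mapping the weights to a sortable rank and ordering each ID group by it before the overlap check ensures the higher-priority row keeps its dates. A minimal sketch (the rank values are arbitrary; only their order matters):

# Sketch: sort each ID group so A comes before B before C,
# then run the overlap detection and adjustment from above.
weight_rank = {'A': 0, 'B': 1, 'C': 2}
df = (df.assign(rank=df['Weight'].map(weight_rank))
        .sort_values(['ID', 'rank'])
        .drop(columns='rank')
        .reset_index(drop=True))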

How to remove duplicate values in dataframe while preserving the rest of the row in Pandas?

I am working on some gross profit reports in a Jupyter notebook. I exported the data out of our CRM as a CSV and am using pandas to work with the data. Some of the data is duplicated in a couple of columns. I need to remove those duplicate values in those columns but preserve the rest of the row.
I have tried drop_duplicates on a subset of the two columns, but it removes the entire row.
          INV        INV SUB PO Number  PO Subtotal
0  INV-002504     USD 350.00  PO-03977       240
1  INV-002507   USD 1,400.00  PO-03846       603.56
2         NaN            NaN  PO-03847       295
3  INV-002489     USD 891.25  PO-03861       658.31
4  INV-002453   USD 3,132.50  PO-03889      4751.19
5  INV-002537   USD 3,856.29  PO-03889      4751.19
6  INV-002420     USD 592.43  PO-03577      1188.46
7  INV-002415  USD 10,779.00  PO-03727      5389.21
Rows 4 & 5 are an example, duplicated in the PO Number & PO Subtotal columns.
I expect the output to remove the duplicates so each value is only shown once:
          INV        INV SUB PO Number  PO Subtotal
0  INV-002504     USD 350.00  PO-03977       240
1  INV-002507   USD 1,400.00  PO-03846       603.56
2         NaN            NaN  PO-03847       295
3  INV-002489     USD 891.25  PO-03861       658.31
4  INV-002453   USD 3,132.50  PO-03889      4751.19
5  INV-002537   USD 3,856.29
6  INV-002420     USD 592.43  PO-03577      1188.46
7  INV-002415  USD 10,779.00  PO-03727      5389.21
Use DataFrame.duplicated to check which rows contain duplicates based on PO Number & PO Subtotal, then conditionally replace the values with '' using np.where:
import numpy as np

m = df.duplicated(['PO Number', 'PO Subtotal'])
df['PO Number'] = np.where(m, '', df['PO Number'])
df['PO Subtotal'] = np.where(m, '', df['PO Subtotal'])
Or use .loc to select the duplicated rows and columns and replace those values with '':
m = df.duplicated(['PO Number', 'PO Subtotal'])
df.loc[m, ['PO Number', 'PO Subtotal']] = ''
Output
          INV        INV SUB PO Number  PO Subtotal
0  INV-002504     USD 350.00  PO-03977       240.0
1  INV-002507   USD 1,400.00  PO-03846       603.56
2         NaN            NaN  PO-03847       295.0
3  INV-002489     USD 891.25  PO-03861       658.31
4  INV-002453   USD 3,132.50  PO-03889      4751.19
5  INV-002537   USD 3,856.29
6  INV-002420     USD 592.43  PO-03577      1188.46
7  INV-002415  USD 10,779.00  PO-03727      5389.21
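One caveat: replacing values with '' turns the numeric PO Subtotal column into object dtype. If the dtypes should survive, a sketch blanking with NaN instead:

import numpy as np

m = df.duplicated(['PO Number', 'PO Subtotal'])
df.loc[m, ['PO Number', 'PO Subtotal']] = np.nan  # shows as NaN/blank, keeps dtypes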
