Find matching rows in a DataFrame by comparing all rows based on certain conditions - python-3.x

I'm fairly new to Python and would appreciate it if someone could guide me in the right direction.
I have a dataset with a unique trade in each row. I need to find all rows that match on certain conditions, basically any offsetting trades that fit a certain condition. For example:
Find trades that have the same REF_RATE, where RECEIVE is within a difference of 5 and MATURITY_DATE is within 7 days of each other. I have attached an image of the data.
Thank You.

You can use groupby to achieve this. For your specific requirement (find trades that have the same REF_RATE, where RECEIVE is within a difference of 5 and MATURITY_DATE is within 7 days of each other) you can proceed like this:
#sample data created from the image of your dataset
>>> import pandas as pd
>>> import numpy as np
>>> data = {'Maturity_Date':['2/01/2021','10/01/2021','10/01/2021','6/06/2021'],'Trade_id':['10484','12880','11798','19561'],'REF_RATE':['BBSW','BBSW','OIS','BBSW'],'Recive':[1.5,1.25,2,10]}
>>> df = pd.DataFrame(data)
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2/01/2021 10484 BBSW 1.50
1 10/01/2021 12880 BBSW 1.25
2 10/01/2021 11798 OIS 2.00
3 6/06/2021 19561 BBSW 10.00
#convert Maturity_Date to datetime format; the sample data is already ordered
#by date within each REF_RATE, otherwise sort first with
#df = df.sort_values(['REF_RATE', 'Maturity_Date'])
>>> df['Maturity_Date'] = pd.to_datetime(df['Maturity_Date'], dayfirst=True)
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2021-01-02 10484 BBSW 1.50
1 2021-01-10 12880 BBSW 1.25
2 2021-01-10 11798 OIS 2.00
3 2021-06-06 19561 BBSW 10.00
#groupby REF_RATE and apply the date and receive conditions
>>> df['date_diff>7'] = df.groupby('REF_RATE')['Maturity_Date'].diff() / np.timedelta64(1, 'D') > 7
>>> df['rate_diff>5'] = df.groupby('REF_RATE')['Recive'].diff().abs() > 5
>>> df
Maturity_Date Trade_id REF_RATE Recive date_diff>7 rate_diff>5
0 2021-01-02 10484 BBSW 1.50 False False
1 2021-01-10 12880 BBSW 1.25 True False #date_diff True: the BBSW maturity dates are 8 days apart
2 2021-01-10 11798 OIS 2.00 False False
3 2021-06-06 19561 BBSW 10.00 True True #both True: date gap > 7 days and receive difference > 5
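These flags mark each trade whose gap to the previous trade in its REF_RATE group exceeds the thresholds. If you instead need the actual offsetting pairs, a self-merge sketch could look like this (the pairing logic is my assumption about the intent, reusing the sample columns above):
#pair every trade with every other trade on the same REF_RATE, then filter
>>> pairs = df.merge(df, on='REF_RATE', suffixes=('_a', '_b'))
>>> pairs = pairs[pairs['Trade_id_a'] < pairs['Trade_id_b']] #drop self-matches and duplicate orderings
>>> near_date = (pairs['Maturity_Date_a'] - pairs['Maturity_Date_b']).abs().dt.days <= 7
>>> near_rate = (pairs['Recive_a'] - pairs['Recive_b']).abs() <= 5
>>> pairs.loc[near_date & near_rate, ['REF_RATE', 'Trade_id_a', 'Trade_id_b']]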

Related

Adjust the overlapping dates in group by with priority from another columns

As the title suggests, I am working on a problem to find overlapping dates based on ID and adjust the overlapping date based on priority (weight). The following piece of code helped to find the overlapping dates.
from datetime import timedelta

df['overlap'] = (df.groupby('ID')
                   .apply(lambda x: (x['End_date'].shift() - x['Start_date']) > timedelta(0))
                   .reset_index(level=0, drop=True))
df
Now the issue I'm facing is how to introduce priority (weight) and adjust Start_date by it. In the below image, I have highlighted the adjusted dates based on weight, where A takes precedence over B and B takes precedence over C.
Should I create a dictionary mapping the string weights to numeric values, and then what? I'm stuck on how to set up the logic.
Dataframe:
op_d = {'ID': [1, 1, 1, 2, 2, 3, 3, 3],
        'Start_date': ['9/1/2020', '10/10/2020', '11/18/2020', '4/1/2015', '5/12/2016', '4/1/2015', '5/15/2016', '8/1/2018'],
        'End_date': ['10/9/2020', '11/25/2020', '12/31/2020', '5/31/2016', '12/31/2016', '5/29/2016', '9/25/2018', '10/15/2020'],
        'Weight': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C']}
df = pd.DataFrame(data=op_d)
You have already identified the overlap condition. You can then try adding a day to End_date and shifting, then assigning the result to the start date where the overlap column is True:
#assumes Start_date and End_date have been converted with pd.to_datetime
arr = np.where(df['overlap'],
               df['End_date'].add(pd.Timedelta(1, unit='d')).shift(),
               df['Start_date'])
out = df.assign(Output_Start_Date = arr,Output_End_Date=df['End_date'])
print(out)
ID Start_date End_date Weight overlap Output_Start_Date Output_End_Date
0 1 2020-09-01 2020-10-09 A False 2020-09-01 2020-10-09
1 1 2020-10-10 2020-11-25 B False 2020-10-10 2020-11-25
2 1 2020-11-18 2020-12-31 C True 2020-11-26 2020-12-31
3 2 2015-04-01 2016-05-31 A False 2015-04-01 2016-05-31
4 2 2016-05-12 2016-12-31 B True 2016-06-01 2016-12-31
5 3 2015-04-01 2016-05-29 A False 2015-04-01 2016-05-29
6 3 2016-05-15 2018-09-25 B True 2016-05-30 2018-09-25
7 3 2018-08-01 2020-10-15 C True 2018-09-26 2020-10-15
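The plain shift() is safe here only because overlap is never True on the first row of an ID; a group-aware variant (my addition, not part of the original answer) avoids reading across an ID boundary entirely:
#shift end dates within each ID so a new ID never inherits the previous ID's end date
prev_end = df.groupby('ID')['End_date'].shift() + pd.Timedelta(1, unit='d')
out = df.assign(Output_Start_Date=np.where(df['overlap'], prev_end, df['Start_date']),
                Output_End_Date=df['End_date'])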

Converting timedeltas to integers for consecutive time points in pandas

Suppose I have the dataframe
import pandas as pd
df = pd.DataFrame({"Time": ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04']})
print(df)
Time
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04
If I want to calculate the time from the lowest time point for each time in the dataframe, I can use the apply function like this:
df['Time'] = pd.to_datetime(df['Time'])
df.sort_values('Time', inplace=True)
df['Time'] = df['Time'].apply(lambda x: (x - df['Time'].iloc[0]).days)
print(df)
Time
0 0
1 1
2 2
3 3
Is there a function in Pandas that does this already?
I would recommend not using apply:
(df.Time-df.Time.iloc[0]).dt.days
0 0
1 1
2 2
3 3
Name: Time, dtype: int64
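If the frame might not be sorted, measuring from the minimum instead of the first row gives the same result without sorting (a small variation on the answer above):
import pandas as pd

df = pd.DataFrame({"Time": ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04']})
df['Time'] = pd.to_datetime(df['Time'])
df['Days'] = (df['Time'] - df['Time'].min()).dt.days #days from the earliest timestamp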

subselect columns pandas with count

I have a table below
I was trying to create an additional column that counts how many of Std_1, Std_2 and Std_3 are greater than the row's mean value.
For example, for the ACCMGR row, only Std_2 is greater than the average, so the new column should be 1.
I'm not sure how to do it.
You need to be a bit careful with how you specify the axes, but you can just use .gt + .mean + .sum
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'APPL': ['ACCMGR', 'ACCOUNTS', 'ADVISOR', 'AUTH', 'TEST'],
                   'Std_1': [106.875, 121.703, np.nan, 116.8585, 1],
                   'Std_2': [130.1899, 113.4927, np.nan, 112.4486, 4],
                   'Std_3': [107.186, 114.5418, np.nan, 115.2699, np.nan]})
Code
df = df.set_index('APPL')
df['cts'] = df.gt(df.mean(axis=1), axis=0).sum(axis=1)
df = df.reset_index()
Output:
APPL Std_1 Std_2 Std_3 cts
0 ACCMGR 106.8750 130.1899 107.1860 1
1 ACCOUNTS 121.7030 113.4927 114.5418 1
2 ADVISOR NaN NaN NaN 0
3 AUTH 116.8585 112.4486 115.2699 2
4 TEST 1.0000 4.0000 NaN 1
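An equivalent sketch that skips the set_index/reset_index round-trip by selecting the numeric columns explicitly (the column list is assumed from the sample data):
cols = ['Std_1', 'Std_2', 'Std_3']
df['cts'] = df[cols].gt(df[cols].mean(axis=1), axis=0).sum(axis=1)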
Consider the dataframe:
quantity price
0 6 1.45
1 3 1.85
2 2 2.25
Apply a lambda function on axis=1: for each row, check which columns hold a value greater than the row mean and get the index of the first such column.
df['> than mean'] = df.apply(lambda x: df.columns.get_loc(x[x > np.mean(x)].index[0]), axis=1)
df
Out:
quantity price > than mean
0 6 1.45 0
1 3 1.85 0
2 2 2.25 1

Combine rows based on index or column

I have three dataframes: df1, df2, df3. I am trying to add a list of ART_UNIT values to df1.
df1 is 260846 rows x 4 columns:
Index SYMBOL level not-allocatable additional-only
0 A 2 True False
1 A01 4 True False
2 A01B 5 True False
3 A01B1/00 7 False False
4 A01B1/02 8 False False
5 A01B1/022 9 False False
6 A01B1/024 9 False False
7 A01B1/026 9 False False
df2 is 941516 rows x 2 columns:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
0 A44C27/00 3715
1 A44C27/001 2015
2 A44C27/001 3715
3 A44C27/001 2615
4 A44C27/005 2815
5 A44C27/006 3725
6 A44C27/007 3215
7 A44C27/008 3715
8 F41A33/00 3715
9 F41A33/02 3715
10 F41A33/04 3715
11 F41A33/06 3715
12 G07C13/00 3715
13 G07C13/005 3715
14 G07C13/02 3716
And df3 is the same format as df2, but has 673023 rows x 2 columns
The 'CLASSIFICATION_SYMBOL_CD' in df2 and df3 are not unique.
For each 'CLASSIFICATION_SYMBOL_CD' in df2 and df3, I want to find the same string in df1 'SYMBOL' and add a new column to df1 'ART_UNIT' that contains all of the 'ART_UNIT' from df2 and df3.
For example, in df2, 'CLASSIFICATION_SYMBOL_CD' A44C27/001 has ART_UNIT 2015, 3715, and 2615.
I want to write those ART_UNIT values to the correct row in df1 so that it reads:
Index SYMBOL level not-allocatable additional-only ART_UNIT
211 A44C27/001 2 True False [2015, 3715, 2615]
So far, I've tried to group df2/df3 by 'CLASSIFICATION_SYMBOL_CD'
gp = df2.groupby(['CLASSIFICATION_SYMBOL_CD'])
for x in df2['CLASSIFICATION_SYMBOL_CD'].unique():
    df2_g = gp.get_group(x)
Which gives me:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
1354 A61N1/3714 3762
117752 A61N1/3714 3766
347573 A61N1/3714 3736
548026 A61N1/3714 3762
560771 A61N1/3714 3762
566120 A61N1/3714 3766
566178 A61N1/3714 3762
799486 A61N1/3714 3736
802408 A61N1/3714 3736
Since df2 and df3 have the same format, concatenate them first.
import pandas as pd
df = pd.concat([df2, df3])
Then to get the lists of all art units, groupby and apply list.
df = df.groupby('CLASSIFICATION_SYMBOL_CD').ART_UNIT.apply(list).reset_index()
# CLASSIFICATION_SYMBOL_CD ART_UNIT
#0 A44C27/00 [3715]
#1 A44C27/001 [2015, 3715, 2615]
#2 A44C27/005 [2815]
#3 A44C27/006 [3725]
#...
Finally, bring this information to df1 with a merge (you could map or something else too). Rename the column first to have less to clean up after the merge.
df = df.rename(columns={'CLASSIFICATION_SYMBOL_CD': 'SYMBOL'})
df1 = df1.merge(df, on='SYMBOL', how='left')
Output:
Index SYMBOL level not-allocatable additional-only ART_UNIT
0 0 A 2 True False NaN
1 1 A01 4 True False NaN
2 2 A01B 5 True False NaN
3 3 A01B1/00 7 False False NaN
4 4 A01B1/02 8 False False NaN
5 5 A01B1/022 9 False False NaN
6 6 A01B1/024 9 False False NaN
7 7 A01B1/026 9 False False NaN
Sadly, you didn't provide any overlapping SYMBOLs in df1, so nothing merged. But this will work with your full data.
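For reference, the map alternative mentioned above might look like this (a sketch using the grouped series before the reset_index and rename steps):
art_units = pd.concat([df2, df3]).groupby('CLASSIFICATION_SYMBOL_CD')['ART_UNIT'].apply(list)
df1['ART_UNIT'] = df1['SYMBOL'].map(art_units) #symbols with no match get NaN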

Using relative positioning with Python 3.5 and pandas

I am formatting some csv files, and I need to add columns that use other columns for arithmetic, like in Excel: B3 = SUM(A1:A3)/3, then B4 = SUM(A2:A4)/3. I've looked up relative indexes and haven't found what I'm trying to do.
def formula_columns(csv_list, dir_env):
    for file in csv_list:
        df = pd.read_csv(dir_env + file)
        avg_12(df)
        print(df[10:20])

# Create AVG(12) column
def avg_12(df):
    df['AVG(12)'] = df['Price']
    # Right here I want to set each value of 'AVG(12)' to equal
    # the sum of the values of Price from its own index plus the
    # previous 11 indexes
    df.loc[:10, 'AVG(12)'] = 0
I would imagine this to be a common task, so I assume I'm looking in the wrong places. If anyone has some advice I would appreciate it. Thanks.
That can be done with the rolling method:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 5, 10), columns = ['A'])
df
Out[151]:
A
0 2
1 4
2 1
3 1
4 4
5 2
6 4
7 2
8 4
9 1
Take the averages of A1:A3, A2:A4, and so on:
df.rolling(3).mean()
Out[152]:
A
0 NaN
1 NaN
2 2.333333
3 2.000000
4 2.000000
5 2.333333
6 3.333333
7 2.666667
8 3.333333
9 2.333333
It requires pandas 0.18. For earlier versions, use pd.rolling_mean():
pd.rolling_mean(df['A'], 3)
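Dropped into the original avg_12 helper, the rolling version might look like this (assuming a Price column and a 12-row window, as the function name suggests):
def avg_12(df):
    # average of the current Price and the previous 11 rows; the first 11 rows stay NaN
    df['AVG(12)'] = df['Price'].rolling(12).mean()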
