I'm fairly new to Python and would appreciate it if someone could point me in the right direction.
I have a dataset with one unique trade per row. I need to find all rows that match on certain conditions; essentially, any offsetting trades that fit a given condition. For example:
Find trades that have the same REF_RATE, where RECEIVE is within a difference of 5 and MATURITY_DATE is within 7 days of each other. I have attached an image of the data.
Thank you.
You can use groupby to achieve this. For your specific requirement (find trades that have the same REF_RATE, RECEIVE within a difference of 5, and MATURITY_DATE within 7 days of each other) you can proceed like this:
#sample data created from the image of your dataset
>>> import pandas as pd
>>> import numpy as np
>>> data = {'Maturity_Date':['2/01/2021','10/01/2021','10/01/2021','6/06/2021'],'Trade_id':['10484','12880','11798','19561'],'REF_RATE':['BBSW','BBSW','OIS','BBSW'],'Recive':[1.5,1.25,2,10]}
>>> df = pd.DataFrame(data)
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2/01/2021 10484 BBSW 1.50
1 10/01/2021 12880 BBSW 1.25
2 10/01/2021 11798 OIS 2.00
3 6/06/2021 19561 BBSW 10.00
#convert Maturity_Date to datetime format and sort by date so diff() compares consecutive dates within each group
>>> df['Maturity_Date'] = pd.to_datetime(df['Maturity_Date'], dayfirst=True)
>>> df = df.sort_values('Maturity_Date') #if needed
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2021-01-02 10484 BBSW 1.50
1 2021-01-10 12880 BBSW 1.25
2 2021-01-10 11798 OIS 2.00
3 2021-06-06 19561 BBSW 10.00
#group by REF_RATE and apply the conditions on the date and receive columns
>>> df['date_diff>7'] = df.groupby('REF_RATE')['Maturity_Date'].diff() / np.timedelta64(1, 'D') > 7
>>> df['rate_diff>5'] = df.groupby('REF_RATE')['Recive'].diff() > 5
>>> df
Maturity_Date Trade_id REF_RATE Recive date_diff>7 rate_diff>5
0 2021-01-02 10484 BBSW 1.50 False False
1 2021-01-10 12880 BBSW 1.25 True False #date_diff True: the BBSW maturity dates are more than 7 days apart
2 2021-01-10 11798 OIS 2.00 False False
3 2021-06-06 19561 BBSW 10.00 True True #both True: the date gap is more than 7 days and the Recive difference is more than 5
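Note that diff() only compares consecutive rows within each group. If you need every offsetting pair rather than just adjacent ones, a self-merge sketch like the following may help (column names are taken from the sample above; the tolerances are applied as "within", i.e. inverted relative to the flag columns):

```python
import pandas as pd

# Sketch: pair every two trades that share a REF_RATE, then keep
# pairs within the stated tolerances (within 7 days, within 5).
data = {'Maturity_Date': ['2/01/2021', '10/01/2021', '10/01/2021', '6/06/2021'],
        'Trade_id': ['10484', '12880', '11798', '19561'],
        'REF_RATE': ['BBSW', 'BBSW', 'OIS', 'BBSW'],
        'Recive': [1.5, 1.25, 2, 10]}
df = pd.DataFrame(data)
df['Maturity_Date'] = pd.to_datetime(df['Maturity_Date'], dayfirst=True)

# self-merge on REF_RATE to form candidate pairs
pairs = df.merge(df, on='REF_RATE', suffixes=('_a', '_b'))
# keep each unordered pair once and drop self-pairs
pairs = pairs[pairs['Trade_id_a'] < pairs['Trade_id_b']]

match = pairs[((pairs['Maturity_Date_a'] - pairs['Maturity_Date_b']).abs().dt.days <= 7)
              & ((pairs['Recive_a'] - pairs['Recive_b']).abs() <= 5)]
# with this small sample no pair satisfies both tolerances, so match is empty
print(len(pairs), len(match))
```

On real data you would read off `Trade_id_a`/`Trade_id_b` from `match` to get the offsetting trade pairs.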
As the title suggests, I am working on a problem of finding overlapping dates based on ID and adjusting the overlapping date based on priority (weight). The following piece of code helped to find the overlapping dates.
from datetime import timedelta

df['overlap'] = (df.groupby('ID')
                   .apply(lambda x: (x['End_date'].shift() - x['Start_date']) > timedelta(0))
                   .reset_index(level=0, drop=True))
df
The issue I'm now facing is how to introduce priority (weight) and adjust Start_date accordingly. In the image below, I have highlighted the adjusted dates based on weight, where A takes precedence over B and B over C.
Should I create a dictionary mapping the string weights to numeric values, and then what? I'm stuck setting up the logic.
Dataframe:
op_d = {'ID': [1, 1, 1, 2, 2, 3, 3, 3],
        'Start_date': ['9/1/2020', '10/10/2020', '11/18/2020', '4/1/2015',
                       '5/12/2016', '4/1/2015', '5/15/2016', '8/1/2018'],
        'End_date': ['10/9/2020', '11/25/2020', '12/31/2020', '5/31/2016',
                     '12/31/2016', '5/29/2016', '9/25/2018', '10/15/2020'],
        'Weight': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C']}
df = pd.DataFrame(data=op_d)
You have already identified the overlap condition. You can then add a day to End_date, shift it, and assign it to the start date wherever the overlap column is True:
import numpy as np

# assumes Start_date and End_date have already been converted with pd.to_datetime
arr = np.where(df['overlap'],
               df['End_date'].add(pd.Timedelta(1, unit='d')).shift(),
               df['Start_date'])
out = df.assign(Output_Start_Date=arr, Output_End_Date=df['End_date'])
print(out)
ID Start_date End_date Weight overlap Output_Start_Date Output_End_Date
0 1 2020-09-01 2020-10-09 A False 2020-09-01 2020-10-09
1 1 2020-10-10 2020-11-25 B False 2020-10-10 2020-11-25
2 1 2020-11-18 2020-12-31 C True 2020-11-26 2020-12-31
3 2 2015-04-01 2016-05-31 A False 2015-04-01 2016-05-31
4 2 2016-05-12 2016-12-31 B True 2016-06-01 2016-12-31
5 3 2015-04-01 2016-05-29 A False 2015-04-01 2016-05-29
6 3 2016-05-15 2018-09-25 B True 2016-05-30 2018-09-25
7 3 2018-08-01 2020-10-15 C True 2018-09-26 2020-10-15
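Putting it together, a self-contained sketch (using the op_d data above, with priority handled by sorting on Weight before computing the overlap, since 'A' < 'B' < 'C') might look like:

```python
import numpy as np
import pandas as pd

op_d = {'ID': [1, 1, 1, 2, 2, 3, 3, 3],
        'Start_date': ['9/1/2020', '10/10/2020', '11/18/2020', '4/1/2015',
                       '5/12/2016', '4/1/2015', '5/15/2016', '8/1/2018'],
        'End_date': ['10/9/2020', '11/25/2020', '12/31/2020', '5/31/2016',
                     '12/31/2016', '5/29/2016', '9/25/2018', '10/15/2020'],
        'Weight': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C']}
df = pd.DataFrame(op_d)
df['Start_date'] = pd.to_datetime(df['Start_date'])
df['End_date'] = pd.to_datetime(df['End_date'])

# higher priority sorts first ('A' < 'B' < 'C'), so within each ID the
# higher-weight row wins and lower-weight rows get pushed back
df = df.sort_values(['ID', 'Weight']).reset_index(drop=True)

# grouped shift instead of apply: a row overlaps when the previous row
# in its ID group ends after this row starts
prev_end = df.groupby('ID')['End_date'].shift()
df['overlap'] = prev_end > df['Start_date']

# where overlapping, push Start_date to the day after the previous End_date
df['Output_Start_Date'] = np.where(df['overlap'],
                                   prev_end + pd.Timedelta(days=1),
                                   df['Start_date'])
print(df[['ID', 'Weight', 'Output_Start_Date']])
```

The grouped `shift` avoids the `groupby().apply()` round-trip and naturally yields False for the first row of each group, since `NaT > date` is False.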
Suppose I have the dataframe
import pandas as pd
df = pd.DataFrame({"Time": ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04']})
print(df)
Time
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04
If I want to calculate the time from the lowest time point for each time in the dataframe, I can use the apply function like
df['Time'] = pd.to_datetime(df['Time'])
df.sort_values('Time', inplace=True)
df['Time'] = df['Time'].apply(lambda x: (x - df['Time'].iloc[0]).days)
print(df)
Time
0 0
1 1
2 2
3 3
Is there a function in Pandas that does this already?
I would recommend not using apply:
(df.Time-df.Time.iloc[0]).dt.days
0 0
1 1
2 2
3 3
Name: Time, dtype: int64
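If the frame may not already be sorted, subtracting the minimum gives the same result without relying on row order, as in this small sketch:

```python
import pandas as pd

df = pd.DataFrame({"Time": ['2010-01-03', '2010-01-01', '2010-01-04', '2010-01-02']})
df['Time'] = pd.to_datetime(df['Time'])

# subtract the earliest timestamp; no sort needed
days = (df['Time'] - df['Time'].min()).dt.days
print(days.tolist())  # [2, 0, 3, 1]
```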
I have a table below.
I was trying to create an additional column that counts whether Std_1, Std_2, and Std_3 are greater than their row mean.
For example, for the ACCMGR row, only Std_2 is greater than the average, so the new column should be 1.
Not sure how to do it.
You need to be a bit careful with how you specify the axes, but you can just use .gt + .mean + .sum
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'APPL': ['ACCMGR', 'ACCOUNTS', 'ADVISOR', 'AUTH', 'TEST'],
                   'Std_1': [106.875, 121.703, np.nan, 116.8585, 1],
                   'Std_2': [130.1899, 113.4927, np.nan, 112.4486, 4],
                   'Std_3': [107.186, 114.5418, np.nan, 115.2699, np.nan]})
Code
df = df.set_index('APPL')
df['cts'] = df.gt(df.mean(axis=1), axis=0).sum(axis=1)
df = df.reset_index()
Output:
APPL Std_1 Std_2 Std_3 cts
0 ACCMGR 106.8750 130.1899 107.1860 1
1 ACCOUNTS 121.7030 113.4927 114.5418 1
2 ADVISOR NaN NaN NaN 0
3 AUTH 116.8585 112.4486 115.2699 2
4 TEST 1.0000 4.0000 NaN 1
Considered dataframe
quantity price
0 6 1.45
1 3 1.85
2 2 2.25
Apply a lambda function along axis=1: for each row, find the columns whose value is greater than the row mean and take the positional index of the first such column.
df['col>mean'] = df.apply(lambda x: df.columns.get_loc(x[x > np.mean(x)].index[0]), axis=1)
Out:
   quantity  price  col>mean
0         6   1.45         0
1         3   1.85         0
2         2   2.25         1
I have three dataframes: df1, df2, df3. I am trying to add a list of ART_UNIT to df1.
df1 is 260846 rows x 4 columns:
Index SYMBOL level not-allocatable additional-only
0 A 2 True False
1 A01 4 True False
2 A01B 5 True False
3 A01B1/00 7 False False
4 A01B1/02 8 False False
5 A01B1/022 9 False False
6 A01B1/024 9 False False
7 A01B1/026 9 False False
df2 is 941516 rows x 2 columns:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
0 A44C27/00 3715
1 A44C27/001 2015
2 A44C27/001 3715
3 A44C27/001 2615
4 A44C27/005 2815
5 A44C27/006 3725
6 A44C27/007 3215
7 A44C27/008 3715
8 F41A33/00 3715
9 F41A33/02 3715
10 F41A33/04 3715
11 F41A33/06 3715
12 G07C13/00 3715
13 G07C13/005 3715
14 G07C13/02 3716
And df3 is the same format as df2, but has 673023 rows x 2 columns
The 'CLASSIFICATION_SYMBOL_CD' in df2 and df3 are not unique.
For each 'CLASSIFICATION_SYMBOL_CD' in df2 and df3, I want to find the same string in df1 'SYMBOL' and add a new column to df1 'ART_UNIT' that contains all of the 'ART_UNIT' from df2 and df3.
For example, in df2, 'CLASSIFICATION_SYMBOL_CD' A44C27/001 has ART_UNIT 2015, 3715, and 2615.
I want to write those ART_UNIT to the correct row in df1 so that is reads:
Index SYMBOL level not-allocatable additional-only ART_UNIT
211 A44C27/001 2 True False [2015, 3715, 2615]
So far, I've tried to group df2/df3 by 'CLASSIFICATION_SYMBOL_CD'
gp = df2.groupby(['CLASSIFICATION_SYMBOL_CD'])
for x in df2['CLASSIFICATION_SYMBOL_CD'].unique():
    df2_g = gp.get_group(x)
Which gives me:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
1354 A61N1/3714 3762
117752 A61N1/3714 3766
347573 A61N1/3714 3736
548026 A61N1/3714 3762
560771 A61N1/3714 3762
566120 A61N1/3714 3766
566178 A61N1/3714 3762
799486 A61N1/3714 3736
802408 A61N1/3714 3736
Since df2 and df3 have the same format, concatenate them first.
import pandas as pd
df = pd.concat([df2, df3])
Then to get the lists of all art units, groupby and apply list.
df = df.groupby('CLASSIFICATION_SYMBOL_CD').ART_UNIT.apply(list).reset_index()
# CLASSIFICATION_SYMBOL_CD ART_UNIT
#0 A44C27/00 [3715]
#1 A44C27/001 [2015, 3715, 2615]
#2 A44C27/005 [2815]
#3 A44C27/006 [3725]
#...
Finally, bring this information to df1 with a merge (you could use map or something else too). Rename the column first to have less to clean up after the merge.
df = df.rename(columns={'CLASSIFICATION_SYMBOL_CD': 'SYMBOL'})
df1 = df1.merge(df, on='SYMBOL', how='left')
Output:
Index SYMBOL level not-allocatable additional-only ART_UNIT
0 0 A 2 True False NaN
1 1 A01 4 True False NaN
2 2 A01B 5 True False NaN
3 3 A01B1/00 7 False False NaN
4 4 A01B1/02 8 False False NaN
5 5 A01B1/022 9 False False NaN
6 6 A01B1/024 9 False False NaN
7 7 A01B1/026 9 False False NaN
Sadly, you didn't provide any overlapping SYMBOLs in df1, so nothing merged. But this will work with your full data.
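The map route mentioned above could look like this; the tiny frames here are made-up stand-ins for df1 and the concatenated df2/df3:

```python
import pandas as pd

# hypothetical miniature versions of df1 and pd.concat([df2, df3])
df1 = pd.DataFrame({'SYMBOL': ['A44C27/00', 'A44C27/001', 'F41A33/00']})
df23 = pd.DataFrame({'CLASSIFICATION_SYMBOL_CD': ['A44C27/00', 'A44C27/001',
                                                  'A44C27/001', 'A44C27/001'],
                     'ART_UNIT': [3715, 2015, 3715, 2615]})

# Series indexed by symbol, each value the list of its art units
lookup = df23.groupby('CLASSIFICATION_SYMBOL_CD')['ART_UNIT'].apply(list)
df1['ART_UNIT'] = df1['SYMBOL'].map(lookup)
print(df1)
```

Symbols with no match in the lookup come back as NaN, just like the `how='left'` merge.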
I am formatting some CSV files, and I need to add columns that use other columns for arithmetic, like in Excel: B3 = SUM(A1:A3)/3, then B4 = SUM(A2:A4)/3. I've looked up relative indexes and haven't found what I'm trying to do.
def formula_columns(csv_list, dir_env):
    for file in csv_list:
        df = pd.read_csv(dir_env + file)
        avg_12(df)
        print(df[10:20])

# Create AVG(12) Column
def avg_12(df):
    df['AVG(12)'] = df['Price']
    # Right here I want to set each value of 'AVG(12)' to equal
    # the sum of the value of Price from its own index plus the
    # previous 11 indexes
    df.loc[:10, 'AVG(12)'] = 0
I would imagine this is a common task, so I assume I'm looking in the wrong places. If anyone has some advice I would appreciate it. Thanks.
That can be done with the rolling method:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 5, 10), columns = ['A'])
df
Out[151]:
A
0 2
1 4
2 1
3 1
4 4
5 2
6 4
7 2
8 4
9 1
Take the averages of A1:A3, A2:A4 etc:
df.rolling(3).mean()
Out[152]:
A
0 NaN
1 NaN
2 2.333333
3 2.000000
4 2.000000
5 2.333333
6 3.333333
7 2.666667
8 3.333333
9 2.333333
This requires pandas 0.18 or later. For earlier versions, use pd.rolling_mean():
pd.rolling_mean(df['A'], 3)
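Applied to the original avg_12 function (assuming a numeric 'Price' column as in the question), that could be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'Price': range(1, 25)})  # stand-in for the CSV data

# 12-row rolling mean: each value averages its own row and the previous 11
df['AVG(12)'] = df['Price'].rolling(12).mean()
# mimic the question's df.loc[:10, 'AVG(12)'] = 0 by replacing the leading NaNs
df['AVG(12)'] = df['AVG(12)'].fillna(0)
print(df.loc[11, 'AVG(12)'])  # mean of 1..12 = 6.5
```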