Combine rows based on index or column - python-3.x

I have three dataframes: df1, df2, df3. I am trying to add a list of ART_UNIT to df1.
df1 is 260846 rows x 4 columns:
Index SYMBOL level not-allocatable additional-only
0 A 2 True False
1 A01 4 True False
2 A01B 5 True False
3 A01B1/00 7 False False
4 A01B1/02 8 False False
5 A01B1/022 9 False False
6 A01B1/024 9 False False
7 A01B1/026 9 False False
df2 is 941516 rows x 2 columns:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
0 A44C27/00 3715
1 A44C27/001 2015
2 A44C27/001 3715
3 A44C27/001 2615
4 A44C27/005 2815
5 A44C27/006 3725
6 A44C27/007 3215
7 A44C27/008 3715
8 F41A33/00 3715
9 F41A33/02 3715
10 F41A33/04 3715
11 F41A33/06 3715
12 G07C13/00 3715
13 G07C13/005 3715
14 G07C13/02 3716
And df3 is the same format as df2, but has 673023 rows x 2 columns
The 'CLASSIFICATION_SYMBOL_CD' in df2 and df3 are not unique.
For each 'CLASSIFICATION_SYMBOL_CD' in df2 and df3, I want to find the matching string in df1's 'SYMBOL' column and add a new column 'ART_UNIT' to df1 that contains all of the 'ART_UNIT' values from df2 and df3.
For example, in df2, 'CLASSIFICATION_SYMBOL_CD' A44C27/001 has ART_UNIT 2015, 3715, and 2615.
I want to write those ART_UNIT to the correct row in df1 so that it reads:
Index SYMBOL level not-allocatable additional-only ART_UNIT
211 A44C27/001 2 True False [2015, 3715, 2615]
So far, I've tried to group df2/df3 by 'CLASSIFICATION_SYMBOL_CD'
gp = df2.groupby(['CLASSIFICATION_SYMBOL_CD'])
for x in df2['CLASSIFICATION_SYMBOL_CD'].unique():
    df2_g = gp.get_group(x)
Which gives me:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
1354 A61N1/3714 3762
117752 A61N1/3714 3766
347573 A61N1/3714 3736
548026 A61N1/3714 3762
560771 A61N1/3714 3762
566120 A61N1/3714 3766
566178 A61N1/3714 3762
799486 A61N1/3714 3736
802408 A61N1/3714 3736

Since df2 and df3 have the same format, concatenate them first.
import pandas as pd
df = pd.concat([df2, df3])
Then to get the lists of all art units, groupby and apply list.
df = df.groupby('CLASSIFICATION_SYMBOL_CD').ART_UNIT.apply(list).reset_index()
# CLASSIFICATION_SYMBOL_CD ART_UNIT
#0 A44C27/00 [3715]
#1 A44C27/001 [2015, 3715, 2615]
#2 A44C27/005 [2815]
#3 A44C27/006 [3725]
#...
Finally, bring this information to df1 with a merge (you could map or something else too). Rename the column first to have less to clean up after the merge.
df = df.rename(columns={'CLASSIFICATION_SYMBOL_CD': 'SYMBOL'})
df1 = df1.merge(df, on='SYMBOL', how='left')
Output:
Index SYMBOL level not-allocatable additional-only ART_UNIT
0 0 A 2 True False NaN
1 1 A01 4 True False NaN
2 2 A01B 5 True False NaN
3 3 A01B1/00 7 False False NaN
4 4 A01B1/02 8 False False NaN
5 5 A01B1/022 9 False False NaN
6 6 A01B1/024 9 False False NaN
7 7 A01B1/026 9 False False NaN
Sadly, you didn't provide any overlapping SYMBOLs in df1, so nothing merged. But this will work with your full data.
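As mentioned above, map works too. A minimal sketch of that route (the grouped Series is indexed by CLASSIFICATION_SYMBOL_CD, and symbols with no match simply end up as NaN, just like with the left merge):
art_units = pd.concat([df2, df3]).groupby('CLASSIFICATION_SYMBOL_CD')['ART_UNIT'].apply(list)
df1['ART_UNIT'] = df1['SYMBOL'].map(art_units)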

Related

Stack row under row from two different dataframe using python? [duplicate]

df1 = pd.DataFrame({'a':[1,2,3],'x':[4,5,6],'y':[7,8,9]})
df2 = pd.DataFrame({'b':[10,11,12],'x':[13,14,15],'y':[16,17,18]})
I'm trying to merge the two data frames using the keys from df1. I think I should use pd.merge for this, but how can I tell pandas to place the values from the b column of df2 in the a column of df1? This is the output I'm trying to achieve:
a x y
0 1 4 7
1 2 5 8
2 3 6 9
3 10 13 16
4 11 14 17
5 12 15 18
Just use concat and rename the column for df2 so it aligns:
In [92]:
pd.concat([df1,df2.rename(columns={'b':'a'})], ignore_index=True)
Out[92]:
a x y
0 1 4 7
1 2 5 8
2 3 6 9
3 10 13 16
4 11 14 17
5 12 15 18
similarly you can use merge but you'd need to rename the column as above:
In [103]:
df1.merge(df2.rename(columns={'b':'a'}),how='outer')
Out[103]:
a x y
0 1 4 7
1 2 5 8
2 3 6 9
3 10 13 16
4 11 14 17
5 12 15 18
Use numpy to concatenate the dataframes, so you don't have to rename all of the columns (or explicitly ignore indexes). np.concatenate also works on an arbitrary number of dataframes.
import numpy as np

df = pd.DataFrame(np.concatenate((df1.values, df2.values), axis=0))
df.columns = ['a', 'x', 'y']
df
You can rename the columns and then use concat (DataFrame.append did the same thing, but it was deprecated in pandas 1.4 and removed in 2.0):
df2.columns = df1.columns
pd.concat([df1, df2], ignore_index=True)
# df1.append(df2, ignore_index=True)  # older pandas only
You can also concatenate both dataframes with vstack from numpy and convert the resulting ndarray to dataframe:
pd.DataFrame(np.vstack([df1, df2]), columns=df1.columns)
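One caveat with the numpy-based approaches: if the frames have mixed dtypes, stacking their .values produces an object array, so the numeric dtypes are lost. A hedged follow-up that tries to restore them:
out = pd.DataFrame(np.vstack([df1, df2]), columns=df1.columns)
out = out.infer_objects()  # convert object columns back to numeric dtypes where possible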

pandas drop rows based on condition on groupby

I have a DataFrame like below.
I am trying to group by the cell column and drop the "NA" values where the group size > 1.
Required output:
How do I get my expected output? How can I filter on a condition and drop rows within a groupby?
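The DataFrame and the expected output were apparently posted as images; a minimal reconstruction, inferred from the merged output shown in the first answer below, might look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cell':  ['A', 'A', 'A', 'B', 'D', 'D', 'D'],
    'value': [5.0, 6.0, np.nan, np.nan, 8.0, np.nan, np.nan],
    'kpi':   ['thpt', 'ret', 'thpt', 'acc', 'int', 'ps', 'yret'],
})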
From your DataFrame, first we group by cell to get the size of each group:
>>> df_grouped = df.groupby(['cell'], as_index=False).size()
>>> df_grouped
cell size
0 A 3
1 B 1
2 D 3
Then, we merge the result with the original DataFrame like so:
>>> df_merged = pd.merge(df, df_grouped, on='cell', how='left')
>>> df_merged
cell value kpi size
0 A 5.0 thpt 3
1 A 6.0 ret 3
2 A NaN thpt 3
3 B NaN acc 1
4 D 8.0 int 3
5 D NaN ps 3
6 D NaN yret 3
To finish, we filter the DataFrame to get the expected result:
>>> df_filtered = df_merged[~((df_merged['value'].isna()) & (df_merged['size'] > 1))]
>>> df_filtered[['cell', 'value', 'kpi']]
cell value kpi
0 A 5.0 thpt
1 A 6.0 ret
3 B NaN acc
4 D 8.0 int
Use a boolean mask:
>>> df[df.groupby('cell').cumcount().eq(0) | df['value'].notna()]
cell value kpi
0 A crud thpt
1 A 6 ret
3 B NaN acc
4 D hi int
Details:
m1 = df.groupby('cell').cumcount().eq(0)
m2 = df['value'].notna()
df.assign(keep_at_least_one=m1, keep_notna=m2, keep_rows=m1|m2)
# Output:
cell value kpi keep_at_least_one keep_notna keep_rows
0 A crud thpt True True True
1 A 6 ret False True True
2 A NaN thpt False False False
3 B NaN acc True False True
4 D hi int True True True
5 D NaN ps False False False
6 D NaN yret False False False

Find Matching rows in the data frame by comparing all rows based on certain conditions

I'm fairly new to python and would appreciate if someone can guide me in the right direction.
I have a dataset that has unique trades in each row. I need to find all rows that match on certain conditions. Basically, find any offsetting trades that fit a certain condition. For example:
Find trades that have the same REF_RATE, where RECEIVE is within a difference of 5 and MATURITY_DATE is within 7 days of each other. I have attached the image of the data.
Thank You.
You can use groupby to achieve this. For your specific requirement (find trades that have the same REF_RATE, where RECEIVE is within a difference of 5 and MATURITY_DATE is within 7 days of each other), you can proceed like this.
#sample data created from the image of your dataset
>>> data = {'Maturity_Date':['2/01/2021','10/01/2021','10/01/2021','6/06/2021'],'Trade_id':['10484','12880','11798','19561'],'REF_RATE':['BBSW','BBSW','OIS','BBSW'],'Recive':[1.5,1.25,2,10]}
>>> df = pd.DataFrame(data)
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2/01/2021 10484 BBSW 1.50
1 10/01/2021 12880 BBSW 1.25
2 10/01/2021 11798 OIS 2.00
3 6/06/2021 19561 BBSW 10.00
#convert Maturity_Date to datetime format and sort REF_RATE by date if needed
>>> df['Maturity_Date'] = pd.to_datetime(df['Maturity_Date'], dayfirst=True)
>>> df['Maturity_Date'] = df.groupby('REF_RATE')['Maturity_Date'].apply(lambda x: x.sort_values()) #if needed
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2021-01-02 10484 BBSW 1.50
1 2021-01-10 12880 BBSW 1.25
2 2021-01-10 11798 OIS 2.00
3 2021-06-06 19561 BBSW 10.00
#groupby REF_RATE and apply the conditions on the date and receive columns
>>> import numpy as np  # needed for the timedelta conversion below
>>> df['date_diff>7'] = df.groupby('REF_RATE')['Maturity_Date'].diff() / np.timedelta64(1, 'D') > 7
>>> df['rate_diff>5'] = df.groupby('REF_RATE')['Recive'].diff() > 5
>>> df
Maturity_Date Trade_id REF_RATE Recive date_diff>7 rate_diff>5
0 2021-01-02 10484 BBSW 1.50 False False
1 2021-01-10 12880 BBSW 1.25 True False #date_diff true as for BBSW Maturity date is more than 7
2 2021-01-10 11798 OIS 2.00 False False
3 2021-06-06 19561 BBSW 10.00 True True #rate_diff and date_diff true because date>7 and receive difference>5

How can I sort 3 columns and assign it to one python pandas

I have a dataframe:
df = {A:[1,1,1], B:[2012,3014,3343], C:[12,13,45], D:[111,222,444]}
but I need to join the last 3 columns in consecutive order horizontally and thus assign it to the first column, some like this:
df2 = {A:[1,1,1,2,2,2], Fusion3:[2012,12,111,3014,13,222]}
I have tried with .melt, but I am struggling with some ideas and would be grateful for your comments.
From the desired output I'm making the assumption that the initial dataframe should have 1,2,3 in the A column rather than 1,1,1.
import pandas as pd
df= pd.DataFrame({'A':[1,2,3], 'B':[2012,3014,3343], 'C':[12,13,45], 'D':[111,222,444]})
df = df.set_index('A')
df = df.stack().droplevel(1)
will give you this series:
A
1 2012
1 12
1 111
2 3014
2 13
2 222
3 3343
3 45
3 444
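If you also want the two-column layout from the question (A and Fusion3), one possible follow-up is to name the Series and reset the index (Fusion3 is just the column name the question uses):
result = df.rename('Fusion3').reset_index()
#    A  Fusion3
# 0  1     2012
# 1  1       12
# 2  1      111
# ...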
Check melt:
out = df.melt('A').drop(columns='variable')
Out[15]:
A value
0 1 2012
1 2 3014
2 3 3343
3 1 12
4 2 13
5 3 45
6 1 111
7 2 222
8 3 444
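If you also need the rows grouped by A and the value column renamed to Fusion3, as in the question's desired output, a follow-up along these lines should work (mergesort is stable, so the B, C, D order is preserved within each A; ignore_index requires pandas >= 1.0):
out = (out.sort_values('A', kind='mergesort', ignore_index=True)
          .rename(columns={'value': 'Fusion3'}))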

How to use the na_values='?' option in the pd.read_csv() function?

I am trying to understand the na_values='?' option in the pd.read_csv() function, so that I can find the rows containing the "?" value and then remove them.
Sample:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv')
df = pd.read_csv(StringIO(temp))
print (df)
id col1 col2 col3
0 1 13? 15 14
1 1 13 15 ?
2 1 12 15 13
3 2 ? 15 ?
4 2 18 15 13
5 2 18? 15 13
If you want to remove values containing ?, whether standalone or as a substring, create a mask with str.contains and then check whether at least one value per row is True with DataFrame.any:
print (df.astype(str).apply(lambda x: x.str.contains('?', regex=False)))
id col1 col2 col3
0 False True False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False True False False
m = ~df.astype(str).apply(lambda x: x.str.contains('?', regex=False)).any(axis=1)
print (m)
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
2 1 12 15 13
4 2 18 15 13
If you want to match only standalone ? values, simply compare the values:
print (df.astype(str) == '?')
id col1 col2 col3
0 False False False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False False False False
m = ~(df.astype(str) == '?').any(axis=1)
print (m)
0 True
1 False
2 True
3 False
4 True
5 True
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
0 1 13? 15 14
2 1 12 15 13
4 2 18 15 13
5 2 18? 15 13
To replace all standalone ? values with NaN, pass the na_values parameter, and use dropna if you want to remove all rows with NaNs:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv', na_values='?')
df = pd.read_csv(StringIO(temp), na_values='?')
print (df)
id col1 col2 col3
0 1 13? 15 14.0
1 1 13 15 NaN
2 1 12 15 13.0
3 2 NaN 15 NaN
4 2 18 15 13.0
5 2 18? 15 13.0
df = df.dropna()
print (df)
id col1 col2 col3
0 1 13? 15 14.0
2 1 12 15 13.0
4 2 18 15 13.0
5 2 18? 15 13.0
Create a list of the values that should be treated as missing and pass it when reading from the file:
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('some-data.csv', na_values=na_values)
"??" or "####" type of junk values can be converted into missing value, since in python all the blank values can be replaced with nan. Hence you can also replace these type of junk value to missing value by passing them as as list to the parameter
'na_values'.
data_csv = pd.read_csv('test.csv',na_values = ["??"])
If you want to remove the rows that contain "?" in a pandas DataFrame, you can try the following.
Suppose you have df:
import pandas as pd
df = pd.read_csv('test.csv')
df:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
3 test?dsfsa 9/15/2016
Check whether column A contains "?" to generate a new df1:
df1 = df[df.A.str.contains(r"\?") == False]
df1 will be:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
This gives you a new df1 that doesn't contain "?".
