Pandas DataFrame merge on two different keys to get original data - python-3.x

The question title might be confusing, but here is an example of what I intend to perform.
Below is the main dataframe with the request data:
import pandas as pd

d = {'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4'],'B':[-1,5,6,7000],'ExtD':['CA','CB','CC','CD']}
df = pd.DataFrame(data=d)
df
The response may be keyed on either the ID or the ID2 column, and looks like this:
d = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3]}
df2 = pd.DataFrame(data=d)
df2
where RetID could be either ID or ID2 from the request, along with the additional data C. Once the response is received, I need to merge it back with the original dataframe to recover ExtD.
The solution I have come up with is:
df2 = df2.merge(df[['ID','ExtD']],'left',left_on=['RetID'],right_on=['ID'])
df2 = df2.merge(df[['ID2','ExtD']],'left',left_on=['RetID'],right_on=['ID2'],suffixes=('_d1','_d2'))
df2.rename({'ExtD_d1':'ExtD'},axis=1,inplace=True)
df2.loc[df2['ExtD'].isnull(),'ExtD'] = df2['ExtD_d2']
df2.drop(['ID2','ExtD_d2'],axis=1,inplace=True)
so the expected output is:
res = {'RetID':['A1','A2','B3','B4'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD']}
df2= pd.DataFrame(data=res)
df2
EDIT2: updated requirement tweak.
res = {'RetID':['A1','A2','B1','B2'],'C':[1.3,5.4,4.5,1.3],'ExtD':['CA','CB','CC','CD'],'ID':['A1','A2','A3','A4'],'ID2': ['B1','B2','B3','B4']}
Is there an efficient way to do this? There might be more than two IDs - ID, ID2, ID3 - and more than one column to join from the request dataframe. TIA.
EDIT: Fixed the typo.

Use melt to transform your first dataframe, then merge it with the second:
tmp = df.melt('ExtD', value_vars=['ID', 'ID2'], value_name='RetID')
df2 = df2.merge(tmp[['ExtD', 'RetID']])
>>> df2
RetID C ExtD
0 A1 1.3 CA
1 A2 5.4 CB
2 B1 4.5 CA
3 B2 1.3 CB
>>> tmp
ExtD variable RetID
0 CA ID A1
1 CB ID A2
2 CC ID A3
3 CD ID A4
4 CA ID2 B1
5 CB ID2 B2
6 CC ID2 B3
7 CD ID2 B4
Update
What if I need to merge ID and ID2 columns as well?
df2 = df2.merge(df[['ID', 'ID2', 'ExtD']], on='ExtD')
>>> df2
RetID C ExtD ID ID2
0 A1 1.3 CA A1 B1
1 A2 5.4 CB A2 B2
2 B3 4.5 CC A3 B3
3 B4 1.3 CD A4 B4
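The melt approach should scale to any number of key columns and payload columns. A minimal sketch (the columns ID3 and ExtE are hypothetical, added here only to illustrate the generalization):

```python
import pandas as pd

# Hypothetical request frame with three candidate keys (ID, ID2, ID3)
# and two payload columns (ExtD, ExtE) to carry into the response.
df = pd.DataFrame({'ID':   ['A1', 'A2', 'A3', 'A4'],
                   'ID2':  ['B1', 'B2', 'B3', 'B4'],
                   'ID3':  ['C1', 'C2', 'C3', 'C4'],
                   'ExtD': ['CA', 'CB', 'CC', 'CD'],
                   'ExtE': ['EA', 'EB', 'EC', 'ED']})
df2 = pd.DataFrame({'RetID': ['A1', 'B2', 'C3', 'A4'],
                    'C':     [1.3, 5.4, 4.5, 1.3]})

id_cols = ['ID', 'ID2', 'ID3']     # any number of candidate keys
carry_cols = ['ExtD', 'ExtE']      # any number of columns to join back

# Stack all candidate keys into a single RetID column, then merge once.
tmp = df.melt(id_vars=carry_cols, value_vars=id_cols, value_name='RetID')
out = df2.merge(tmp.drop(columns='variable'), on='RetID', how='left')
```

Because every candidate key ends up in one RetID column, the merge stays a single operation no matter how many ID columns the request has.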

Related

How to subset a DataFrame based on similar column names

How can I subset similar columns in pandas based on keywords like A, B, C, D? I have taken this as an example; is there a better way that would keep working if new columns were added?
df
A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 D3 D4
1 a x 1 a x 3 c 7 d s 4
2 b 5 2 b 5 4 d s c 7 d
3 c 7 3 c 7 1 a x 1 a x
4 d s 4 d s b 5 2 b s 7
You can use pandas.Index.groupby
groups = df.columns.groupby(df.columns.str[0])
#{'A': ['A1', 'A2', 'A3'],
# 'B': ['B1', 'B2', 'B3'],
# 'C': ['C1', 'C2'],
# 'D': ['D1', 'D2', 'D3', 'D4']}
Then you can access data this way:
df[groups['B']]
# B1 B2 B3
#0 1 a x
#1 2 b 5
#2 3 c 7
#3 4 d s
Keep in mind groups is a dict, so you can use any dict method too.
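For instance, since groups maps each prefix to its column labels, you can build one sub-frame per prefix with a single dict comprehension (a small sketch on made-up data):

```python
import pandas as pd

df = pd.DataFrame({'A1': [1, 2], 'A2': ['a', 'b'],
                   'B1': [3, 4], 'C1': ['x', 'y']})

# Index.groupby returns a dict of {prefix: column labels}
groups = df.columns.groupby(df.columns.str[0])

# One sub-frame per prefix
subframes = {key: df[list(cols)] for key, cols in groups.items()}
```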
Another approach is to use df.columns in conjunction with str.contains:
a_col_lst = df.columns[df.columns.str.contains('A')]
b_col_lst = df.columns[df.columns.str.contains('B')]
df_A = df[a_col_lst]
df_B = df[b_col_lst]
You can apply regex as well within columns.str.contains
You could use filter along with a regex pattern, e.g.
df_A = df.filter(regex=(r'^A.*'))
You could also use select along with startswith, but note that DataFrame.select was deprecated in pandas 0.21 and removed in 1.0; on current versions use df.loc[:, df.columns.str.startswith('A')] instead:
df_A = df.select(lambda col: col.startswith('A'), axis=1)

Merging 2 data frames on 3 columns where data sometimes exists

I am attempting to merge two data frames and fill in missing values in one from the other. Hopefully this isn't too long of an explanation; I have been wracking my brain over this for too long. I am working with two huge CSV files, so I made a small example here. I have included the entire code at the end in case you are curious. Thank you so much in advance! Here we go:
print(df1)
A B C D E
0 1 B1 D1 E1
1 C1 D1 E1
2 1 B1 D1 E1
3 2 B2 D2 E2
4 B2 C2 D2 E2
5 3 D3 E3
6 3 B3 C3 D3 E3
7 4 C4 D4 E4
print(df2)
A B C F G
0 1 C1 F1 G1
1 B2 C2 F2 G2
2 3 B3 F3 G3
3 4 B4 C4 F4 G4
I would essentially like to merge df2 into df1 by 3 different columns. I understand that you can merge on multiple column names, but that does not seem to give me the desired result. I would like to KEEP all data in df1 and fill in the data from df2, so I use how='left'.
I am fairly new to Python and have done a lot of research, but I have hit a stuck point. Here is what I have tried:
data3 = df1.merge(df2, how='left', on=['A'])
print(data3)
A B_x C_x D E B_y C_y F G
0 1 B1 D1 E1 C1 F1 G1
1 C1 D1 E1 B2 C2 F2 G2
2 1 B1 D1 E1 C1 F1 G1
3 2 B2 D2 E2 NaN NaN NaN NaN
4 B2 C2 D2 E2 B2 C2 F2 G2
5 3 D3 E3 B3 F3 G3
6 3 B3 C3 D3 E3 B3 F3 G3
7 4 C4 D4 E4 B4 C4 F4 G4
As you can see, it sort of worked with just A. However, since this is a CSV file with blank values, the blank values merge together, which I do not want: because df2 was blank in row 2, it filled in the data wherever it saw blanks. It should be NaN if it could not find a match.
Whenever I put additional columns into my on=['A', 'B'], it does not do anything different; in fact, A no longer merges.
data3 = df1.merge(df2, how='left', on=['A', 'B'])
print(data3)
A B C_x D E C_y F G
0 1 B1 D1 E1 NaN NaN NaN
1 C1 D1 E1 NaN NaN NaN
2 1 B1 D1 E1 NaN NaN NaN
3 2 B2 D2 E2 NaN NaN NaN
4 B2 C2 D2 E2 C2 F2 G2
5 3 D3 E3 NaN NaN NaN
6 3 B3 C3 D3 E3 F3 G3
7 4 C4 D4 E4 NaN NaN NaN
Columns A, B, and C are the values I want to correlate and merge on. Using both data frames it should know enough to fill in all the gaps. My ending df should look like:
print(desired_output):
A B C D E F G
0 1 B1 C1 D1 E1 F1 G1
1 1 B1 C1 D1 E1 F1 G1
2 1 B1 C1 D1 E1 F1 G1
3 2 B2 C2 D2 E2 F2 G2
4 2 B2 C2 D2 E2 F2 G2
5 3 B3 C3 D3 E3 F3 G3
6 3 B3 C3 D3 E3 F3 G3
7 4 B4 C4 D4 E4 F4 G4
Even though A, B, and C have repeating rows, I want to keep ALL the data and just fill in the data from df2 where it might fit, even if it is repeat data. I also do not want all of the _x and _y suffixes from merging. I know how to rename, but doing 3 different merges and then merging those merges gets really complicated really fast, with repeated rows and suffixes.
Long story short: how can I merge both data frames by A, and then B, and then C? The order in which it happens is irrelevant.
Here is a sample of the actual data. My own data has additional columns, and I relate it to this data by certain identifiers - basically by MMSI, Name and IMO. I want to keep duplicates because they aren't actually duplicates, just additional data points for each vessel.
MMSI BaseDateTime LAT LON VesselName IMO CallSign
366940480.0 2017-01-04T11:39:36 52.48730 -174.02316 EARLY DAWN 7821130 WDB7319
366940480.0 2017-01-04T13:51:07 52.41575 -174.60041 EARLY DAWN 7821130 WDB7319
273898000.0 2017-01-06T16:55:33 63.83668 -174.41172 MYS CHUPROVA NaN UAEZ
352844000.0 2017-01-31T22:51:31 51.89778 -176.59334 JACHA 8512920 3EFC4
352844000.0 2017-01-31T23:06:31 51.89795 -176.59333 JACHA 8512920 3EFC4
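One detail worth knowing before picking an approach: unlike SQL, pandas merge treats missing join keys as equal to each other, so blank or NaN keys match blank or NaN keys, which is exactly the spurious matching described above. A minimal sketch (toy frames, not the asker's data) of filtering out keyless rows before merging:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [1.0, 2.0, np.nan],
                    'B': ['B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': [1.0, np.nan],
                    'F': ['F1', 'F2']})

# Naive merge: the NaN key in df1 matches the NaN key in df2,
# wrongly attaching F2 to the B3 row.
naive = df1.merge(df2, on='A', how='left')

# Safer: merge only the rows whose key is actually present, then
# reattach the keyless df1 rows untouched.
has_key = df1['A'].notna()
matched = df1[has_key].merge(df2.dropna(subset=['A']), on='A', how='left')
result = pd.concat([matched, df1[~has_key]], ignore_index=True)
```

Applied once per key column (A, then B, then C) with the per-key results patched together (e.g. via combine_first), this avoids the blank-on-blank matches while still filling gaps from the second frame.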

Search (row values) data from another dataframe

I have two dataframes, df1 and df2 respectively.
In one dataframe I have a list of search values (Actually Col1)
Col1 Col2
A1 val1, val2
B2 val4, val1
C3 val2, val5
I have another dataframe where I have a list of items
value items
val1 apples, oranges
val2 honey, mustard
val3 banana, milk
val4 biscuit
val5 chocolate
I want to iterate through the first DF and use each val as a key to search for items in the second DF.
Expected output:
A1 apples, oranges, honey, mustard
B2 biscuit, apples, oranges
C3 honey, mustard, chocolate
I am able to add the values into a dataframe and iterate through the first DF:
for index, row in df1.iterrows():
    # list to hold all the values
    finalList = []
    values = row['Col2'].split(', ')
    for i in values:
        print(i)
I just need help to fetch values from the second dataframe.
Would appreciate any help. Thanks.
The idea is to use a lambda function with split and a lookup in a dictionary:
d = df2.set_index('value')['items'].to_dict()
df1['Col2'] = df1['Col2'].apply(lambda x: ', '.join(d[y] for y in x.split(', ') if y in d))
print (df1)
Col1 Col2
0 A1 apples, oranges, honey, mustard
1 B2 biscuit, apples, oranges
2 C3 honey, mustard, chocolate
If the items values are lists, the solution changes to include flattening:
d = df2.set_index('value')['items'].to_dict()
f = lambda x: ', '.join(z for y in x.split(', ') if y in d for z in d[y])
df1['Col2'] = df1['Col2'].apply(f)
print (df1)
Col1 Col2
0 A1 apples, oranges, honey, mustard
1 B2 biscuit, apples, oranges
2 C3 honey, mustard, chocolate
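An alternative to the lambda, sketched on the same toy frames: split Col2 into one row per value with explode, map each value through the lookup Series, then join back per Col1:

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': ['A1', 'B2', 'C3'],
                    'Col2': ['val1, val2', 'val4, val1', 'val2, val5']})
df2 = pd.DataFrame({'value': ['val1', 'val2', 'val3', 'val4', 'val5'],
                    'items': ['apples, oranges', 'honey, mustard',
                              'banana, milk', 'biscuit', 'chocolate']})

lookup = df2.set_index('value')['items']
out = (df1.assign(Col2=df1['Col2'].str.split(', '))
          .explode('Col2')                       # one row per value
          .assign(Col2=lambda x: x['Col2'].map(lookup))
          .groupby('Col1', sort=False)['Col2']   # re-join per key
          .agg(', '.join)
          .reset_index())
```

This keeps every step vectorized, at the cost of a groupby; for small frames the dictionary lambda above is just as good.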

get distinct columns dataframe

Hello, how can I keep only the lines where val differs between the two dataframes?
Notice that I can have id1 or id2 or both, as below.
import numpy as np
import pandas as pd

d2 = {'id1': ['X22', 'X13',np.nan,'X02','X14'],'id2': ['Y1','Y2','Y3','Y4',np.nan],'VAL1':[1,0,2,3,0]}
F1 = pd.DataFrame(data=d2)
d2 = {'id1': ['X02', 'X13',np.nan,'X22','X14'],'id2': ['Y4','Y2','Y3','Y1','Y22'],'VAL2':[1,0,4,3,1]}
F2 = pd.DataFrame(data=d2)
Expected Output
d2 = {'id1': ['X02',np.nan,'X22','X14'],'id2': ['Y4','Y3','Y1',np.nan],'VAL1':[3,2,1,0],'VAL2':[1,4,3,1]}
F3 = pd.DataFrame(data=d2)
First merge on all columns with the left_on and right_on parameters, then filter out the rows marked 'both' and remove missing values by reshaping with stack and unstack:
df=pd.merge(F1, F2, left_on=['id1','id2','VAL1'],
right_on=['id1','id2','VAL2'], how="outer", indicator=True)
df=(df[df['_merge'] !='both']
.set_index(['id1','id2'])
.drop('_merge', axis=1)
.stack()
.unstack()
.reset_index())
print (df)
id1 id2 VAL2 VAL1
0 X02 Y4 3 1
1 X22 Y1 1 3
F1.merge(F2,how='left',left_on=['id1','id2'],right_on=['id1','id2'])\
.query("VAL1!=VAL2")

Sorting data based on column entries

I have a text file containing two columns, let's say col1 and col2.
col1 Col2
A20 A19
A120 A117
A120 A118
A120 B19
A120 B20
.
.
.
B40 A205
and so on.
I want to filter the above columns so that I get only those entries which have an A and a B side by side, like:
col1 col2
A120 B20
B40 A205
I've tried using pd.DataFrame.sort but it doesn't return the required output.
Any help will be highly appreciated.
Use str indexing with boolean indexing to check whether the first characters are not equal:
df = df[df['col1'].str[0] != df['Col2'].str[0]]
print (df)
col1 Col2
3 A120 B19
4 A120 B20
5 B40 A205
If multiple starting letters are possible and you need to test only for A and B:
print (df)
col1 Col2
0 A20 C19 <-changed sample data
1 A120 A117
2 A120 A118
3 A120 B19
4 A120 B20
5 B40 A205
a = df['col1'].str[0]
b = df['Col2'].str[0]
m1 = a.isin(['A','B'])
m2 = b.isin(['A','B'])
m3 = a != b
df = df[m1 & m2 & m3]
print (df)
col1 Col2
3 A120 B19
4 A120 B20
5 B40 A205
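If the letter prefixes can be longer than one character (say 'AB12'), str[0] is no longer enough; a variant of the same idea extracts the whole leading letter run instead (a sketch on the sample data, assuming single-letter prefixes A and B are still the ones of interest):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A20', 'A120', 'A120', 'A120', 'A120', 'B40'],
                   'Col2': ['A19', 'A117', 'A118', 'B19', 'B20', 'A205']})

# Pull out the full alphabetic prefix of each entry.
a = df['col1'].str.extract(r'^([A-Za-z]+)', expand=False)
b = df['Col2'].str.extract(r'^([A-Za-z]+)', expand=False)

# Keep rows whose prefixes are A/B and differ from each other.
out = df[a.isin(['A', 'B']) & b.isin(['A', 'B']) & (a != b)]
```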
