merge dataframes on multiple columns ignoring order - python-3.x

I have the following dataframes:
df1=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3]})
df2=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'X':[0.4,0.5,0.6]})
I would like to merge these two dataframes on fr and to, ignoring the order of fr and to, i.e., (2,5) is the same as (5,2). The desired output is:
dfO=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
or
dfO=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
I can do the following:
pd.merge(df1,df2,on=['fr','to'],how='left')
However, as expected, the X value of the second row is NaN.
Thank you for your help.

You need to do a numpy sort on the fr/to pairs first:
import numpy as np

df1[['fr','to']] = np.sort(df1[['fr','to']].values, axis=1)
df2[['fr','to']] = np.sort(df2[['fr','to']].values, axis=1)
out = df1.merge(df2, how='left')
out
Out[44]:
   fr  to    R    X
0   1   4  0.1  0.4
1   2   5  0.2  0.5
2   3   6  0.3  0.6

You can create a temp field and then join on it:
df1['tmp'] = df1.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
df2['tmp'] = df2.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
This will give the result that you expect:
pd.merge(df1,df2[['tmp', 'X']],on=['tmp'], how='left').drop(columns=['tmp'])
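For reference, with the question's sample frames (fresh copies, before the in-place sort from the first answer), this left join keeps df1's original fr/to values:
   fr  to    R    X
0   1   4  0.1  0.4
1   2   5  0.2  0.5
2   3   6  0.3  0.6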

Related

How to find the correspondence of unique values between 2 tables?

I am fairly new to Python and I am trying to create a new function for my project.
The function aims to detect which unique values of one table are present in a column of another table.
First, the function keeps only the unique values of the two tables, then merges them into a new dataframe.
The rest is where it gets complicated, because I would like to report which values are missing and from which table.
If you have any other leads or approaches, I'm also interested.
Here is my code:
def correspondance_cle(df1, df2, col):
    df11 = pd.DataFrame(df1[col].unique(), columns=[col])
    df11['test1'] = 1
    df21 = pd.DataFrame(df2[col].unique(), columns=[col])
    df21['test2'] = 1
    df3 = pd.merge(df11, df21, on=col, how='outer')
    # keep only the values that are missing on one side
    df3 = df3.loc[df3['test1'].isna() | df3['test2'].isna(), :]
    df3.info()
    for _, row in df3.iterrows():
        if pd.isna(row['test1']):
            print(row[col], "is not in df1")
        else:
            print(row[col], 'is not in df2')
Thanks to everyone who took the time to read the post.
First use an outer join, removing duplicates with Series.drop_duplicates and adding Series.reset_index so the original indices are not lost:
df1 = pd.DataFrame({'a':[1,2,5,5]})
df2 = pd.DataFrame({'a':[2,20,5,8]})
col = 'a'
df = (df1[col].drop_duplicates().reset_index()
          .merge(df2[col].drop_duplicates().reset_index(),
                 indicator=True,
                 how='outer',
                 on=col))
print (df)
   index_x   a  index_y      _merge
0      0.0   1      NaN   left_only
1      1.0   2      0.0        both
2      2.0   5      2.0        both
3      NaN  20      1.0  right_only
4      NaN   8      3.0  right_only
Then filter the rows by the helper column _merge:
print (df[df['_merge'].eq('left_only')])
   index_x  a  index_y     _merge
0      0.0  1      NaN  left_only

print (df[df['_merge'].eq('right_only')])
   index_x   a  index_y      _merge
3      NaN  20      1.0  right_only
4      NaN   8      3.0  right_only
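To package this as the function the question asks for, here is a minimal sketch built on the indicator approach above (the function name and print messages mirror the question's code; everything else is standard pandas):
import pandas as pd

def correspondance_cle(df1, df2, col):
    # outer-merge the de-duplicated key columns; indicator=True adds a
    # _merge column telling which side each value came from
    df = (df1[col].drop_duplicates().reset_index()
              .merge(df2[col].drop_duplicates().reset_index(),
                     indicator=True, how='outer', on=col))
    for _, row in df.iterrows():
        if row['_merge'] == 'left_only':
            print(row[col], 'is not in df2')
        elif row['_merge'] == 'right_only':
            print(row[col], 'is not in df1')

correspondance_cle(pd.DataFrame({'a': [1, 2, 5, 5]}),
                   pd.DataFrame({'a': [2, 20, 5, 8]}),
                   'a')
# 1 is not in df2
# 20 is not in df1
# 8 is not in df1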

Python: dictionaries of different dimensions to excel

Is there any efficient way to write dictionaries of different dimensions to Excel using pandas?
Example:
import pandas as pd
mylist1=[7,8,'woo']
mylist2=[[1,2,3],[4,5,6],['foo','boo','doo']]
d=dict(y=mylist1,x=mylist2)
df=pd.DataFrame.from_dict(d, orient='index').transpose().fillna('')
writer = pd.ExcelWriter('output.xls',engine = 'xlsxwriter')
df.to_excel(writer)
writer.save()
Currently each inner list of x ends up in a single Excel cell. The desired result spreads each list's elements across their own columns next to y.
Please note that my database is much bigger than this simple example. So a generic answer would be appreciated.
You can fix your dataframe first before exporting to Excel:
df=pd.DataFrame.from_dict(d, orient='index').transpose()
df = pd.concat([df["y"],
                pd.DataFrame(df["x"].tolist(), columns=list("x" * len(df["x"])))],
               axis=1)
Or do it upstream:
df = pd.DataFrame([[a, *b] for a,b in zip(mylist1, mylist2)],columns=list("yxxx"))
Both yield the same result:
     y    x    x    x
0    7    1    2    3
1    8    4    5    6
2  woo  foo  boo  doo
First get the data into the appropriate format, then save it to Excel.
df = df.join(df.x.apply(pd.Series)).drop(columns='x')
df.columns = list('yxxx')
df
     y    x    x    x
0    7    1    2    3
1    8    4    5    6
2  woo  foo  boo  doo
For dynamic column names:
df.columns = ['y'] + list('x' * (len(df.columns)-1))
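Putting the pieces together, a self-contained sketch of the whole round trip (output.xlsx is an illustrative file name; note that the xlsxwriter engine writes the xlsx format):
import pandas as pd

mylist1 = [7, 8, 'woo']
mylist2 = [[1, 2, 3], [4, 5, 6], ['foo', 'boo', 'doo']]

# build one row per (y, x-list) pair, spreading each inner list across columns
df = pd.DataFrame([[a, *b] for a, b in zip(mylist1, mylist2)])
df.columns = ['y'] + list('x' * (len(df.columns) - 1))

writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
df.to_excel(writer, index=False)
writer.save()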

Python Pandas Merge data from different Dataframes on specific index and create new one

I have two data frames a and b, and I want to create a new data frame c by merging specific rows of a and b. My code is given below:
import pandas as pd
a = [10,20,30,40,50,60]
b = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
a = pd.DataFrame(a,columns=['Voltage'])
b = pd.DataFrame(b,columns=['Current'])
c = pd.merge(a,b,left_index=True, right_index=True)
print(c)
The actual output is:
   Voltage  Current
0       10      0.1
1       20      0.2
2       30      0.3
3       40      0.4
4       50      0.5
5       60      0.6
I don't want all the rows, but only specific index rows, something like:
   Voltage  Current
0       30      0.3
1       40      0.4
How can I modify c = pd.merge(a,b,left_index=True, right_index=True) so that c contains only the third and fourth rows, renumbered from 0 as shown above?
Use iloc to select rows by position, and add reset_index with drop=True to get a default index in both DataFrames.
Solution 1 with concat:
c = pd.concat([a.iloc[2:4].reset_index(drop=True),
               b.iloc[2:4].reset_index(drop=True)], axis=1)
Or use merge:
c = pd.merge(a.iloc[2:4].reset_index(drop=True),
             b.iloc[2:4].reset_index(drop=True),
             left_index=True,
             right_index=True)
print(c)
   Voltage  Current
0       30      0.3
1       40      0.4
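If the rows of interest are not contiguous, the same pattern works with a list of positions (positions is an illustrative name; DataFrame.take selects rows by position, like iloc):
positions = [2, 3]  # any row positions, contiguous or not
c = pd.concat([a.take(positions).reset_index(drop=True),
               b.take(positions).reset_index(drop=True)], axis=1)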

Removing negative values in pandas column keeping NaN

I was wondering how I can remove rows which have a negative value but keep the NaNs. At the moment I am using:
DF = DF.ix[DF['RAF01Time'] >= 0]
But this removes the NaNs.
Thanks in advance.
You need boolean indexing with an additional isnull condition:
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
Sample:
import numpy as np
import pandas as pd

DF = pd.DataFrame({'RAF01Time': [-1, 2, 3, np.nan]})
print (DF)
   RAF01Time
0       -1.0
1        2.0
2        3.0
3        NaN
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
print (DF)
   RAF01Time
1        2.0
2        3.0
3        NaN
Another solution with query:
DF = DF.query("~(RAF01Time < 0)")
print (DF)
   RAF01Time
1        2.0
2        3.0
3        NaN
You can just use < 0 and take the inverse of the condition; comparisons against NaN evaluate to False, so negating the mask keeps the NaN rows.
DF = DF[~(DF['RAF01Time'] < 0)]
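A quick illustration of why the negation keeps NaN, using a throwaway Series:
import numpy as np
import pandas as pd

s = pd.Series([-1.0, 2.0, np.nan])
print(s < 0)     # NaN compares as False: [True, False, False]
print(~(s < 0))  # negating flips it, so the NaN row is kept: [False, True, True]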

Pandas Pivot Table Count Values (Exclude "NaN")

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Site': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'x': [1, 1, 0, 1, np.nan, 0],
                   'y': [1, np.nan, 0, 1, 1, 0]})
df
  Site    y    x
0    a  1.0    1
1    a  NaN    1
2    a  0.0    0
3    b  1.0    1
4    b  1.0  NaN
5    b  0.0    0
I'd like to pivot this data frame to get the count of values (excluding "NaN") for each column.
I tried what I found in other posts, but nothing seems to work (maybe there was a change in pandas 0.18)?
Desired result:
      Item  count
Site
a        y      2
b        y      3
a        x      3
b        x      2
Thanks in advance!
pvt = (pd.pivot_table(df, index="Site", values=["x", "y"], aggfunc="count")
         .stack()
         .reset_index(level=1))
pvt.columns = ["Item", "count"]
pvt
Out[38]:
     Item  count
Site
a       x      3
a       y      2
b       x      2
b       y      3
You can add pvt.sort_values("Item", ascending = False) if you want y's to appear first.
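For reference, a groupby-based sketch that produces the same table (count() skips NaN just like the pivot above):
pvt = (df.groupby('Site')[['x', 'y']].count()  # count() excludes NaN
         .stack()
         .reset_index(level=1))
pvt.columns = ['Item', 'count']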
