In Pandas, how to filter against another dataframe with a MultiIndex

I have two dataframes. The first one (df1) has a Multi-Index A,B.
The second one (df2) has those fields A and B as columns.
For a large dataset (2 million rows in each), how do I filter df2 to get only the rows where A and B are not in the MultiIndex of df1?
import pandas as pd
df1 = pd.DataFrame([(1,2,3),(1,2,4),(1,2,4),(2,3,4),(2,3,1)],
                   columns=('A','B','C')).set_index(['A','B'])
df2 = pd.DataFrame([(7,7,1,2,3),(7,7,1,2,4),(6,6,1,2,4),
                    (5,5,6,3,4),(2,7,2,2,1)],
                   columns=('X','Y','A','B','C'))
df1:
     C
A B
1 2  3
  2  4
  2  4
2 3  4
  3  1
df2 before filtering:
X Y A B C
0 7 7 1 2 3
1 7 7 1 2 4
2 6 6 1 2 4
3 5 5 6 3 4
4 2 7 2 2 1
df2 wanted result:
X Y A B C
3 5 5 6 3 4
4 2 7 2 2 1

Create a MultiIndex in df2 from the A and B columns and filter with Index.isin, inverting the boolean mask with ~ for boolean indexing:
df = df2[~df2.set_index(['A','B']).index.isin(df1.index)]
print (df)
X Y A B C
3 5 5 6 3 4
4 2 7 2 2 1
Another similar solution with MultiIndex.from_arrays:
df = df2[~pd.MultiIndex.from_arrays([df2['A'],df2['B']]).isin(df1.index)]
Another solution, by @Sandeep Kadapa (note: this compares the two frames row by row by position, so it works for this sample data but is not a general membership test):
df = df2[df2[['A','B']].ne(df1.reset_index()[['A','B']]).any(axis=1)]
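A merge-based alternative (a sketch of mine, not one of the original answers) marks the rows of df2 that have no match in df1 via indicator; drop_duplicates guards against the repeated (A, B) pairs in df1 multiplying rows:
keys = df1.reset_index()[['A','B']].drop_duplicates()
marked = df2.merge(keys, on=['A','B'], how='left', indicator=True)
df = df2[marked['_merge'].eq('left_only').to_numpy()]
Because the right-hand keys are unique, the left merge preserves df2's row count and order, so the boolean mask lines up positionally with df2.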

Related

Merging and joining two pandas data frames without including the right-side data frame's columns

I have two dataframes like given below.
df1 = pd.DataFrame({'a':[1,1,2,2,3,3], 'b':[1,2,1,2,1,2], 'c':[1,2,4,0,0,2]})
df1
a b c
0 1 1 1
1 1 2 2
2 2 1 4
3 2 2 0
4 3 1 0
5 3 2 2
df2 = pd.DataFrame({'a':[1,1,2,2], 'b':[1,2,1,2], 'c':[1,5,6,2]})
df2
a b c
0 1 1 1
1 1 2 5
2 2 1 6
3 2 2 2
I want to apply an inner join of the two data frames without the columns from df2, so I tried the code below.
merged_df = df1.merge(df2, how='inner', left_on=["a", "b"], right_on=["a","b"])
a b c_x c_y
0 1 1 1 1
1 1 2 2 5
2 2 1 4 6
3 2 2 0 2
Is there any way to avoid merging in the right dataframe's (df2's) columns, without dropping c_x and c_y manually afterwards?
Basically, I want all the columns from df1 and no columns from df2 after merging.
Thanks in advance.
The idea is to select only the columns used for merging, here a and b. When merging by both columns, the on parameter can be omitted (pandas then merges on the intersection of the columns in both DataFrames):
merged_df = df1.merge(df2[["a", "b"]])
which works the same as:
merged_df = df1.merge(df2[["a", "b"]], on=['a','b'])
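For this sample data the merge keeps only df1's columns:
print (merged_df)
   a  b  c
0  1  1  1
1  1  2  2
2  2  1  4
3  2  2  0
One caveat worth hedging: if df2 could contain repeated (a, b) pairs, add drop_duplicates so rows of df1 are not multiplied, e.g. df1.merge(df2[["a", "b"]].drop_duplicates()).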

Dropping a column by id removes all same-named columns in a dataframe

import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df = pd.concat([df1,df2],axis=1)
Let's look at the concatenated df: the first and third columns share the same column name, A.
df
A B A C
0 14 1 14 5
1 4 2 4 6
2 5 3 5 7
3 4 4 4 8
I want to get the following format.
df
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
Dropping a column by id:
result = df.drop(df.columns[2],axis=1)
result
B C
0 1 5
1 2 6
2 3 7
3 4 8
I can get what I expect this way:
import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df2 = df2.drop(df2.columns[0],axis=1)
df = pd.concat([df1,df2],axis=1)
It is strange that both the first and the third columns are removed when dropping a single column by id.
1. Please tell me the reason for this behavior.
2. How can I remove the third column while keeping the first column?
This happens because df.columns[2] is the label 'A', and drop works by label, so every column named 'A' is removed. Here's a way to drop by position instead, using indexes:
index_to_drop = 2
# get indexes to keep
col_idxs = [en for en, _ in enumerate(df.columns) if en != index_to_drop]
# subset the df
df = df.iloc[:,col_idxs]
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
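Another option (my sketch, not from the original answer): since the two A columns are identical here, keeping only the first occurrence of each column name gives the wanted result directly:
import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df = pd.concat([df1, df2], axis=1)
# Index.duplicated marks repeated column names; ~ keeps each first occurrence
df = df.loc[:, ~df.columns.duplicated()]
print (df)
    A  B  C
0  14  1  5
1   4  2  6
2   5  3  7
3   4  4  8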

Create a new column with the minimum of other columns on same row

I have the following DataFrame
Input:
A B C D E
2 3 4 5 6
1 1 2 3 2
2 3 4 5 6
I want to add a new column that has the minimum of A, B and C for that row.
Output:
A B C D E Goal
2 3 4 5 6 2
1 1 2 3 2 1
2 3 4 5 6 2
I have tried to use
df = df[['A','B','C']].min()
but I get errors about hashing lists, and I also think this gives the min of each whole column, while I only want the row-wise min over those specific columns.
How can I best accomplish this?
Use min along the columns with axis=1.
An inline solution that produces a copy and doesn't alter the original:
df.assign(Goal=lambda d: d[['A', 'B', 'C']].min(axis=1))
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
The same answer, put differently.
Add the column to the existing dataframe:
new = df[['A', 'B', 'C']].min(axis=1)
df['Goal'] = new
df
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
Add axis=1 to your min:
df['Goal'] = df[['A','B','C']].min(axis = 1)
You have to define the axis across which you are applying the min function, which here is 1 (across columns).
df['ABC_row_min'] = df[['A', 'B', 'C']].min(axis = 1)
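On large frames, a NumPy-based variant (a sketch, assuming the three columns are numeric) can be faster than the pandas call:
df['Goal'] = df[['A', 'B', 'C']].to_numpy().min(axis=1)  # row-wise min on the underlying array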

Pandas: How to extract only the latest date in a pivot table dataframe

How do I create a new dataframe which includes as index only the latest 'txn_date' for each 'day', based on the pivot table in the picture?
Thank you
import pandas as pd
d1 = pd.to_datetime(['2016-06-25']*2 + ['2016-06-28']*4)
df = pd.DataFrame({'txn_date':pd.date_range('2012-03-05 10:20:03', periods=6),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'day':d1}).set_index(['day','txn_date'])
print (df)
                                B  C  D  E
day        txn_date
2016-06-25 2012-03-05 10:20:03  4  7  1  5
           2012-03-06 10:20:03  5  8  3  3
2016-06-28 2012-03-07 10:20:03  4  9  5  6
           2012-03-08 10:20:03  5  4  7  9
           2012-03-09 10:20:03  5  2  1  2
           2012-03-10 10:20:03  4  3  0  4
1.
First sort_index if necessary, then groupby by the level day and aggregate with last:
df1 = df.sort_index().reset_index(level=1).groupby(level='day').last()
print (df1)
                      txn_date  B  C  D  E
day
2016-06-25 2012-03-06 10:20:03  5  8  3  3
2016-06-28 2012-03-10 10:20:03  4  3  0  4
2.
Filter by boolean indexing with duplicated:
#if necessary
df = df.sort_index()
df2 = df[~df.index.get_level_values('day').duplicated(keep='last')]
print(df2)
                                B  C  D  E
day        txn_date
2016-06-25 2012-03-06 10:20:03  5  8  3  3
2016-06-28 2012-03-10 10:20:03  4  3  0  4
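A third variant (my sketch, not part of the original answers): after sorting, GroupBy.tail keeps the last row per day while preserving the full MultiIndex:
df3 = df.sort_index().groupby(level='day').tail(1)  # last row per group is the latest txn_date once sorted
print (df3)
                                B  C  D  E
day        txn_date
2016-06-25 2012-03-06 10:20:03  5  8  3  3
2016-06-28 2012-03-10 10:20:03  4  3  0  4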

Pandas Conditionally Combine (and sum) Rows

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'A':['A','A','A','B','B','B'],
'B':[1,1,2,1,1,1],
'C':[2,4,6,3,5,7]})
df
A B C
0 A 1 2
1 A 1 4
2 A 2 6
3 B 1 3
4 B 1 5
5 B 1 7
Wherever there are duplicate rows per columns 'A' and 'B', I'd like to combine those rows and sum the value under column 'C' like this:
A B C
0 A 1 6
2 A 2 6
3 B 1 15
So far, I can at least identify the duplicates like this:
df['Dup']=df.duplicated(['A','B'],keep=False)
Thanks in advance!
Use groupby() and sum():
In [94]: df.groupby(['A','B']).sum().reset_index()
Out[94]:
A B C
0 A 1 6
1 A 2 6
2 B 1 15
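Equivalently (a sketch): as_index=False avoids the reset_index step, and selecting 'C' restricts the aggregation to that column:
df.groupby(['A','B'], as_index=False)['C'].sum()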
