Compare two dataframes and export unmatched data using pandas or other packages? - python-3.x

I have two dataframes, one of which is a subset of the other. I am not sure whether pandas can compare the two dataframes, filter out the rows that are not in the subset, and export them as a new dataframe. Or is there a package that does this kind of task?
The subset dataframe was generated with RandomUnderSampler, but RandomUnderSampler does not have a function that exports the unselected data. Any comments are welcome.

Concatenate the two dataframes and use drop_duplicates with keep=False, which drops every row that appears more than once:
Example:
>>> df1
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
>>> df2
A B
0 0 1
1 2 3
2 6 7
>>> pd.concat([df1, df2]).drop_duplicates(keep=False)
A B
2 4 5
4 8 9
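One caveat with the concat approach is that keep=False also drops rows that are duplicated within df1 itself. As an alternative, here is a minimal sketch using a left merge with indicator=True, assuming both dataframes share the same columns (A and B, as in the example above). The names merged and unmatched are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 2, 4, 6, 8], "B": [1, 3, 5, 7, 9]})
df2 = pd.DataFrame({"A": [0, 2, 6], "B": [1, 3, 7]})

# Left-merge on all shared columns; the _merge column records whether
# each row of df1 was also found in df2 ("both") or not ("left_only").
# df2 is de-duplicated first so the merge cannot multiply rows of df1.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

# Keep only the rows of df1 that have no match in df2.
unmatched = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(unmatched)
```

The result is an ordinary dataframe, so it can be written out with to_csv or passed on like any other.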

Related

Pandas: Getting new dataframe from existing dataframe from list of substring present in column name

Hello, I have a dataframe called df and a list of substrings that may appear in its column names. The main problem I am facing is that some of the substrings are not present in the dataframe.
ls = ["SRR123", "SRR154", "SRR655", "SRR224","SRR661"]
data = {'SRR123_em1': [1,2,3], 'SRR123_em2': [4,5,6], 'SRR661_em1': [7,8,9], 'SRR661_em2': [6,7,8],'SRR453_em2': [10,11,12]}
df = pd.DataFrame(data)
Desired output:
SRR123_em1 SRR123_em2 SRR661_em1 SRR661_em2
1 4 7 6
2 5 8 7
3 6 9 8
Could anyone suggest how I can obtain this output?
Filter the columns with str.contains:
sub_df=df.loc[:,df.columns.str.contains('|'.join(ls))].copy()
Out[295]:
SRR123_em1 SRR123_em2 SRR661_em1 SRR661_em2
0 1 4 7 6
1 2 5 8 7
2 3 6 9 8
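Because str.contains treats the joined string as a regular expression by default, substrings containing characters such as "." or "+" could match more than intended. A hedged variant that escapes each substring first, and also reports which substrings matched no column, might look like this (pattern and missing are illustrative names, not part of the original answer):

```python
import re
import pandas as pd

ls = ["SRR123", "SRR154", "SRR655", "SRR224", "SRR661"]
data = {'SRR123_em1': [1, 2, 3], 'SRR123_em2': [4, 5, 6],
        'SRR661_em1': [7, 8, 9], 'SRR661_em2': [6, 7, 8],
        'SRR453_em2': [10, 11, 12]}
df = pd.DataFrame(data)

# Escape each substring so it is matched literally, then OR them together.
pattern = "|".join(re.escape(s) for s in ls)
sub_df = df.loc[:, df.columns.str.contains(pattern)].copy()

# Substrings from the list that do not occur in any column name.
missing = [s for s in ls if not df.columns.str.contains(s, regex=False).any()]

print(sub_df)
print(missing)
```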

Pandas data frame concat return same data of first dataframe

I have this dataframe:
PNN_sh NN_shap PNN_corr NN_corr
1 25005 1 25005
2 25012 2 25001
3 25011 3 25009
4 25397 4 25445
5 25006 5 25205
Then I made two dataframes from this one.
NN_sh = data[['PNN_sh', 'NN_shap']]
NN_corr = data[['PNN_corr', 'NN_corr']]
Thereafter, I sorted them and saved in new dataframes.
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'])
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'])
Now I want to combine two columns from the two dataframes above.
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
But what I got is the first column copied into the second one.
PNN_sh PNN_corr
1 1
5 5
3 3
2 2
4 4
The second column should be
PNN_corr
2
1
3
5
4
Any idea how to fix it? Thanks in advance
Pass ignore_index=True to sort_values():
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'], ignore_index=True)
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'], ignore_index=True)
Then the result after concat will be:
PNN_sh PNN_corr
0 1 2
1 5 1
2 3 3
3 2 5
4 4 4
I think that when you sort, you preserve the original indices of the example DataFrames. concat with axis=1 aligns rows by index label, so each PNN_sh value is joined with the PNN_corr value that was originally in the same row (at the same index). Try resetting the index of each DataFrame after sorting, then join/concat. Passing drop=True discards the old index instead of adding it as a new column.
NN_sh_sort = NN_sh.sort_values(by=['NN_shap']).reset_index(drop=True)
NN_corr_sort = NN_corr.sort_values(by=['NN_corr']).reset_index(drop=True)
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
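To see the index alignment at work, the following self-contained sketch rebuilds the question's data (assuming a default integer index) and compares concat before and after resetting the index. The names aligned and positional are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({
    "PNN_sh": [1, 2, 3, 4, 5],
    "NN_shap": [25005, 25012, 25011, 25397, 25006],
    "PNN_corr": [1, 2, 3, 4, 5],
    "NN_corr": [25005, 25001, 25009, 25445, 25205],
})

sh_sorted = data.sort_values("NN_shap")["PNN_sh"]
corr_sorted = data.sort_values("NN_corr")["PNN_corr"]

# concat aligns on index labels, so the sort order of the second
# series is lost and both columns end up identical.
aligned = pd.concat([sh_sorted, corr_sorted], axis=1)

# After dropping the old labels, rows are paired by position instead.
positional = pd.concat(
    [sh_sorted.reset_index(drop=True), corr_sorted.reset_index(drop=True)],
    axis=1,
)
print(positional)
```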

Replace missing dataframe with values from a reference dataframe in Python

This is regarding a project using pandas in Python 3.7
I have a reference Dataframe df1
code name
0 1 A
2 2 B
3 3 C
4 4 D
And I have another bigger data frame df2 with missing values
code name
0 3 C
1 2
2 1 A
3 4
4 3
5 1 B
6 4
7 2
8 3 C
9 2
As you see here df2 has missing values.
How can I fill these values from the reference dataframe df1 using
I used the following:
'''
df2 = df2.merge(df1,on='code',how='left')
'''
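The left merge above keeps both name columns (suffixed _x and _y by default) rather than filling the gaps. One possible sketch, assuming the blanks in df2 are stored as NaN and that each code appears only once in df1, is to map code to name and fill only the missing entries. The sample below is a shortened version of the data in the question, and lookup is an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"code": [1, 2, 3, 4], "name": ["A", "B", "C", "D"]})
df2 = pd.DataFrame({
    "code": [3, 2, 1, 4, 3],
    "name": ["C", np.nan, "A", np.nan, np.nan],  # assumption: blanks are NaN
})

# Series indexed by code, used as a lookup table for names.
lookup = df1.set_index("code")["name"]

# Fill only the missing names; existing values in df2 are left untouched.
df2["name"] = df2["name"].fillna(df2["code"].map(lookup))
print(df2)
```

If the blanks are empty strings rather than NaN, they would need to be converted first, for example with replace("", np.nan).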

How to remove the repeated row spanning two dataframe index in python

I have a dataframe as follows:
import pandas as pd
d = {'location1': [1, 2,3,8,6], 'location2':
[2,1,4,6,8]}
df = pd.DataFrame(data=d)
Each row of the dataframe df means there is a road between two locations. It looks like:
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
The first row means there is a road between location 1 and location 2; however, the second row encodes the same information. The fourth and fifth rows also repeat each other. I am trying to remove those repeats by keeping only one row of each pair. Either row is okay.
For example, my expected output is
location1 location2
0 1 2
2 3 4
4 6 8
Is there an efficient way to do that? I have a large dataframe with lots of repeated rows.
Thanks a lot,
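In the sample data the repeated pairs happen to be removed by keeping every second row, so taking every other row gives the expected output. Note that this slices by position and does not check whether two rows actually describe the same road.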
import pandas as pd
d = {'location1': [1, 2,3,8,6], 'location2':
[2,1,4,6,8]}
df = pd.DataFrame(data=d)
print(df)
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
def Every_other_row(a):
    return a[::2]
Every_other_row(df)
location1 location2
0 1 2
2 3 4
4 6 8
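For data where the repeats are not guaranteed to fall on alternating rows, a more general sketch is to treat each row as an unordered pair: sort the two values within each row, then drop rows whose sorted pair has already been seen. This is a different technique from the slicing above, and the names pairs and is_repeat are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"location1": [1, 2, 3, 8, 6],
                   "location2": [2, 1, 4, 6, 8]})

# Sort the two values inside each row so (2, 1) and (1, 2) look the same.
pairs = np.sort(df[["location1", "location2"]].to_numpy(), axis=1)

# Flag rows whose sorted pair already appeared earlier in the frame.
is_repeat = pd.DataFrame(pairs, index=df.index).duplicated()

# Keep the first occurrence of each road, in its original orientation.
result = df[~is_repeat]
print(result)
```

Because the first occurrence is kept, this retains the row (8, 6) rather than (6, 8), which the question says is acceptable. The sorting is vectorised, so it should scale reasonably to large dataframes.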

Creating a sub-index in pandas dataframe [duplicate]

Okay, this is tricky. I have a pandas dataframe of machine log data. The data has an index, but the dataframe contains various jobs. I want to give each individual job an index of its own, so that I can compare the jobs with each other. So I want another column with an index that begins at zero, runs to the end of the job, and then resets to zero for the next job. Or do I have to do this line by line?
I think you need set_index with cumcount, which numbers the rows within each group:
df = df.set_index(df.groupby('Job Columns').cumcount(), append=True)
Sample:
import numpy as np
import pandas as pd
np.random.seed(456)
df = pd.DataFrame({'Jobs':np.random.choice(['a','b','c'], size=10)})
#solution with sorting
df1 = df.sort_values('Jobs').reset_index(drop=True)
df1 = df1.set_index(df1.groupby('Jobs').cumcount(), append=True)
print (df1)
Jobs
0 0 a
1 1 a
2 2 a
3 0 b
4 1 b
5 2 b
6 3 b
7 0 c
8 1 c
9 2 c
#solution with no sorting
df2 = df.set_index(df.groupby('Jobs').cumcount(), append=True)
print (df2)
Jobs
0 0 b
1 1 b
2 0 c
3 0 a
4 1 c
5 2 c
6 1 a
7 2 b
8 2 a
9 3 b
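Since the question asks for "another column" rather than an extra index level, the same cumcount result can also be assigned directly as a column. A minimal sketch, using the job order from the unsorted sample above and a hypothetical column name job_step:

```python
import pandas as pd

# Same job sequence as the unsorted sample output above.
df = pd.DataFrame({"Jobs": ["b", "b", "c", "a", "c", "c", "a", "b", "a", "b"]})

# cumcount numbers the rows of each job 0, 1, 2, ... in order of appearance.
df["job_step"] = df.groupby("Jobs").cumcount()
print(df)
```

Keeping the counter as a regular column leaves the original index untouched, which may be convenient when the counter is later used for filtering or pivoting.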
