Pandas: Getting a new dataframe from an existing dataframe using a list of substrings present in column names

Hello, I have a dataframe called df and a list of substrings that can appear in its column names. The main problem I am facing is that some of the substrings are not present in the dataframe.
import pandas as pd
ls = ["SRR123", "SRR154", "SRR655", "SRR224", "SRR661"]
data = {'SRR123_em1': [1,2,3], 'SRR123_em2': [4,5,6], 'SRR661_em1': [7,8,9], 'SRR661_em2': [6,7,8], 'SRR453_em2': [10,11,12]}
df = pd.DataFrame(data)
Expected output:
SRR123_em1 SRR123_em2 SRR661_em1 SRR661_em2
1 4 7 6
2 5 8 7
3 6 9 8
Could anyone suggest how I can obtain this output?

Filter the columns with str.contains, joining the substrings into a single regex pattern:
sub_df=df.loc[:,df.columns.str.contains('|'.join(ls))].copy()
Out[295]:
SRR123_em1 SRR123_em2 SRR661_em1 SRR661_em2
0 1 4 7 6
1 2 5 8 7
2 3 6 9 8
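A minimal sketch of two variations on the filter above, assuming the substrings are meant to be matched literally. The names pattern, prefix_cols and prefix_df are introduced here for illustration only. The first variation escapes each substring with re.escape so that characters such as "." or "+" are not treated as regex syntax; the second assumes every substring is a column-name prefix, which holds for the sample data but may not hold for yours.

```python
import re

import pandas as pd

ls = ["SRR123", "SRR154", "SRR655", "SRR224", "SRR661"]
data = {
    "SRR123_em1": [1, 2, 3],
    "SRR123_em2": [4, 5, 6],
    "SRR661_em1": [7, 8, 9],
    "SRR661_em2": [6, 7, 8],
    "SRR453_em2": [10, 11, 12],
}
df = pd.DataFrame(data)

# Variation 1: escape each substring so it is matched literally by the regex.
pattern = "|".join(re.escape(s) for s in ls)
sub_df = df.loc[:, df.columns.str.contains(pattern)].copy()

# Variation 2 (assumption: substrings are prefixes): plain str.startswith
# accepts a tuple of candidates, so no regex is needed at all.
prefix_cols = [c for c in df.columns if c.startswith(tuple(ls))]
prefix_df = df[prefix_cols].copy()

print(sub_df)
```

Substrings that match no column (SRR154, SRR655, SRR224 here) are simply ignored by both variations.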

Related

Stack row under row from two different dataframe using python? [duplicate]

df1 = pd.DataFrame({'a':[1,2,3],'x':[4,5,6],'y':[7,8,9]})
df2 = pd.DataFrame({'b':[10,11,12],'x':[13,14,15],'y':[16,17,18]})
I'm trying to merge the two data frames using the keys from df1. I think I should use pd.merge for this, but how can I tell pandas to place the values in the b column of df2 in the a column of df1? This is the output I'm trying to achieve:
a x y
0 1 4 7
1 2 5 8
2 3 6 9
3 10 13 16
4 11 14 17
5 12 15 18
Just use concat and rename the column for df2 so it aligns:
In [92]:
pd.concat([df1,df2.rename(columns={'b':'a'})], ignore_index=True)
Out[92]:
a x y
0 1 4 7
1 2 5 8
2 3 6 9
3 10 13 16
4 11 14 17
5 12 15 18
Similarly, you can use merge, but you'd need to rename the column as above:
In [103]:
df1.merge(df2.rename(columns={'b':'a'}),how='outer')
Out[103]:
a x y
0 1 4 7
1 2 5 8
2 3 6 9
3 10 13 16
4 11 14 17
5 12 15 18
Use numpy to concatenate the dataframes, so you don't have to rename all of the columns (or explicitly ignore indexes). np.concatenate also works on an arbitrary number of dataframes.
df = pd.DataFrame(np.concatenate((df1.values, df2.values), axis=0))
df.columns = ['a', 'x', 'y']
df
You can rename the columns and then use concat (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0):
df2.columns = df1.columns
pd.concat([df1, df2], ignore_index=True)
You can also concatenate both dataframes with vstack from numpy and convert the resulting ndarray to dataframe:
pd.DataFrame(np.vstack([df1, df2]), columns=df1.columns)

Compare two dataframes and export unmatched data using pandas or other packages?

I have two dataframes, and one is a subset of the other (picture below). I am not sure whether pandas can compare the two dataframes, filter out the rows that are not in the subset, and export them as a dataframe. Or is there a package that does this kind of task?
The subset dataframe was generated by RandomUnderSampler, but RandomUnderSampler does not have a function that exports the unselected data. Any comments are welcome.
Concatenate the two dataframes and use drop_duplicates with the keep=False parameter, which drops every row that appears more than once:
Example:
>>> df1
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
>>> df2
A B
0 0 1
1 2 3
2 6 7
>>> pd.concat([df1, df2]).drop_duplicates(keep=False)
A B
2 4 5
4 8 9
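One caveat with keep=False is that it also drops rows that are duplicated within df1 itself, even if they never appear in df2. As an alternative (not the method from the answer above), here is a minimal sketch of a left merge with indicator=True, using the same sample frames. The names merged and unmatched are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 2, 4, 6, 8], "B": [1, 3, 5, 7, 9]})
df2 = pd.DataFrame({"A": [0, 2, 6], "B": [1, 3, 7]})

# Left merge on all shared columns; the extra "_merge" column records
# whether each df1 row also exists in df2 ("both") or not ("left_only").
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

# Keep only the df1 rows that have no match in df2.
unmatched = merged.loc[merged["_merge"] == "left_only", ["A", "B"]]

print(unmatched)
```

The result can then be exported with unmatched.to_csv(...) or any other writer you prefer.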

Append Dataframes of different dimensions

I have multiple dataframes with different numbers of rows and columns.
example:
df1:
a b c d
0 1 5 6
8 9 8 7
and df2:
g h
9 8
4 5
6 7
I have to append both the dataframes without a change in their dimensions.
The desired output should be one dataframe Result_df as:
a b c d
0 1 5 6
8 9 8 7
g h
9 8
4 5
6 7
Can anyone please help me append the dataframes without changing their structure?
Thank you.

Replace missing dataframe with values from a reference dataframe in Python

This is regarding a project using pandas in Python 3.7.
I have a reference dataframe df1:
code name
0 1 A
2 2 B
3 3 C
4 4 D
And I have another, bigger dataframe df2 with missing values:
code name
0 3 C
1 2
2 1 A
3 4
4 3
5 1 B
6 4
7 2
8 3 C
9 2
As you can see, df2 has missing values.
How can I fill these values from the reference dataframe df1 using
I used the following:
df2 = df2.merge(df1,on='code',how='left')
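The merge above adds a second name column (name_x and name_y) instead of filling the gaps. A minimal sketch of one way to fill only the missing names, assuming the blanks in df2 are NaN rather than empty strings and that code is unique in df1. The names lookup and filled are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"code": [1, 2, 3, 4], "name": ["A", "B", "C", "D"]})
df2 = pd.DataFrame({
    "code": [3, 2, 1, 4, 3, 1, 4, 2, 3, 2],
    # Assumption: the blank cells in the question are NaN.
    "name": ["C", np.nan, "A", np.nan, np.nan, "B", np.nan, np.nan, "C", np.nan],
})

# Build a code -> name lookup from the reference frame.
lookup = df1.set_index("code")["name"]

# Map every code to its reference name, then use that only where name is missing.
filled = df2.copy()
filled["name"] = filled["name"].fillna(filled["code"].map(lookup))

print(filled)
```

Existing names are left untouched, so row 5 keeps "B" even though the reference maps code 1 to "A". If the reference should always win, assign filled["code"].map(lookup) directly instead of passing it to fillna.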

How to remove repeated rows spanning two dataframe columns in Python

I have a dataframe as follows:
import pandas as pd
d = {'location1': [1, 2, 3, 8, 6], 'location2': [2, 1, 4, 6, 8]}
df = pd.DataFrame(data=d)
Each row of df means there is a road between two locations. It looks like this:
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
The first row means there is a road between location 1 and location 2; however, the second row encodes the same information. The fourth and fifth rows also repeat each other. I am trying to remove the repeated rows, keeping only one row per pair. Either row is okay.
For example, my expected output is:
location1 location2
0 1 2
2 3 4
4 6 8
Is there an efficient way to do that? I have a large dataframe with lots of repeated rows.
Thanks a lot.
It looks like you want every other row in your dataframe. For the sample data shown, slicing with a step of 2 gives the expected output:
import pandas as pd
d = {'location1': [1, 2, 3, 8, 6], 'location2': [2, 1, 4, 6, 8]}
df = pd.DataFrame(data=d)
print(df)
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
def Every_other_row(a):
    # Keep rows 0, 2, 4, ... by slicing with a step of 2.
    return a[::2]
Every_other_row(df)
location1 location2
0 1 2
2 3 4
4 6 8
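Slicing every other row only gives the right answer when the repeated pairs happen to fall on the skipped rows, as they do in the sample. For the general case, here is a minimal sketch of a different technique (not the one from the answer above): sort the two values within each row so that (2, 1) and (1, 2) become the same key, then drop the duplicated keys. The names pairs and deduped are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"location1": [1, 2, 3, 8, 6], "location2": [2, 1, 4, 6, 8]})

# Sort the values inside each row so the pair order no longer matters.
pairs = pd.DataFrame(
    np.sort(df[["location1", "location2"]].to_numpy(), axis=1),
    index=df.index,
)

# duplicated() flags the second and later occurrences of each sorted pair;
# keep only the rows that are not flagged.
deduped = df[~pairs.duplicated()]

print(deduped)
```

This keeps the first row of each pair (rows 0, 2 and 3 for the sample). Passing keep="last" to duplicated() keeps the last one instead. The sort is vectorized, so it should scale reasonably to large dataframes.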
