I have two pandas DataFrames x and y, both with the same 3 columns A, B, C (not nullable). I need to create a new DataFrame z, obtained by "subtracting from x the rows that are entirely identical to rows of y", i.e. a
x left join y on x.A = y.A and x.B = y.B and x.C = y.C
where y.A is null
How would I do that? I got stuck with indexes, concat, merge, join, ...
Example:
dataframe x
A B C
q1 q2 q3
q4 q2 q3
q7 q2 q9
dataframe y
A B C
q4 q2 q3
dataframe z
A B C
q1 q2 q3
q7 q2 q9
I think you need merge with indicator and then filter only the rows that come from the left DataFrame:
df = x.merge(y, indicator='i', how='outer').query('i == "left_only"').drop('i', axis=1)
print (df)
A B C
0 q1 q2 q3
2  q7  q2  q9
In older code you may see .drop('i', 1) instead of .drop('i', axis=1); use the keyword form, since it avoids deprecation warnings (and eventually errors) in later versions of pandas.
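For reference, here is a minimal, self-contained sketch of the above, reconstructing x and y from the question's example:

import pandas as pd

x = pd.DataFrame({'A': ['q1', 'q4', 'q7'],
                  'B': ['q2', 'q2', 'q2'],
                  'C': ['q3', 'q3', 'q9']})
y = pd.DataFrame({'A': ['q4'], 'B': ['q2'], 'C': ['q3']})

z = (x.merge(y, indicator='i', how='outer')
      .query('i == "left_only"')
      .drop('i', axis=1))
print(z)
#     A   B   C
# 0  q1  q2  q3
# 2  q7  q2  q9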
Here are a few other ways to remove certain rows from a DataFrame using another DataFrame:
pd.concat([dfx,dfy]).drop_duplicates(keep=False)
or
dfx.loc[[i not in dfy.to_records(index = False) for i in dfx.to_records(index = False)]]
or
dfx.loc[~dfx.apply(tuple,axis=1).isin(dfy.to_records(index = False))]
or
pd.MultiIndex.from_frame(dfx).symmetric_difference(pd.MultiIndex.from_frame(dfy)).to_frame().reset_index(drop=True)
pd.DataFrame(set(dfx.apply(tuple,axis=1)).symmetric_difference(dfy.apply(tuple,axis=1)))
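One caveat on the concat-based alternative: pd.concat([dfx, dfy]).drop_duplicates(keep=False) computes a symmetric difference, so it only reproduces the merge/indicator result when every row of dfy also appears in dfx and dfx has no fully duplicated rows of its own. A small illustration (dfx and dfy are the example frames, with one extra row added to dfy):

import pandas as pd

dfx = pd.DataFrame({'A': ['q1', 'q4', 'q7'],
                    'B': ['q2', 'q2', 'q2'],
                    'C': ['q3', 'q3', 'q9']})
# dfy has one row that does not exist in dfx
dfy = pd.DataFrame({'A': ['q4', 'q8'], 'B': ['q2', 'q2'], 'C': ['q3', 'q3']})

print(pd.concat([dfx, dfy]).drop_duplicates(keep=False))
#     A   B   C
# 0  q1  q2  q3
# 2  q7  q2  q9
# 1  q8  q2  q3   <- row from dfy leaks into the result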
CHR  SNP         BP         A1  A2  OR       P
8    rs62513865  101592213  T   C   1.00652  0.8086
8    rs79643588  106973048  A   T   1.01786  0.4606
I have this table example, and I want to filter rows by comparing column A1 with A2. If any of these four combinations occurs, delete the row:
A1  A2
A   T
T   A
C   G
G   C
(e.g. row 2 in the first table).
How can I do that using Python pandas?
Here is one way to do it: combine the two columns in each of the two DataFrames, turn the combinations from the second DataFrame into a list, and check whether each combination from the first one appears in it.
df[~(df['A1'] + df['A2']).isin((df2['A1'] + df2['A2']).tolist())]
CHR SNP BP A1 A2 OR P
0 8 rs62513865 101592213 T C 1.00652 0.8086
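A small variant in case the column values are ever longer than a single character: concatenating strings can collide (e.g. 'A' + 'TG' equals 'AT' + 'G'), so comparing the two columns as tuples avoids that. A sketch of the same idea, using the same df and df2 as above:

# compare (A1, A2) pairs as tuples instead of concatenated strings
pairs = list(df2[['A1', 'A2']].itertuples(index=False, name=None))
out = df[~df[['A1', 'A2']].apply(tuple, axis=1).isin(pairs)]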
keeping
Assuming df1 and df2, you can simply merge to keep the common values:
out = df1.merge(df2)
output:
CHR SNP BP A1 A2 OR P
0 8 rs79643588 106973048 A T 1.01786 0.4606
dropping
For removing the rows, perform a negative merge:
out = (df1.merge(df2, how='outer', indicator=True)
.loc[lambda d: d.pop('_merge').eq('left_only')]
)
Or merge and get the remaining indices to drop (requires unique indices):
out = df1.drop(df1.reset_index().merge(df2)['index'])
output:
CHR SNP BP A1 A2 OR P
0 8.0 rs62513865 101592213.0 T C 1.00652 0.8086
alternative approach
As it seems you have nucleotides and want to drop the rows where A1/A2 form a complementary A/T or G/C pair, you could translate each base in A1 to its complement (A↔T, C↔G) and check that the result is not identical to the value in A2:
m = df1['A1'].map({'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}).fillna(df1['A1']).ne(df1['A2'])
out = df1[m]
I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, row index 2 would be removed because row index 0 has the same information in columns A and B, and an X in column C.
As this data is slightly large, I hope to avoid iterating over rows, if possible. Ignore Index is the closest thing I've found to the built-in drop_duplicates().
If there is no X in column C, then the rows should only be deduplicated when C is identical as well.
In the case where rows have matching A and B but multiple different versions of C containing an X, the following would be expected.
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "0X0"],
                        [1, 2, "X00"],
                        [1, 2, "0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 that marks rows whose (A, B) pair is not duplicated, then use Series.str.contains together with Series.duplicated on column C to create a mask m2 that marks rows where C contains the string X and is not itself duplicated. Finally, filter the rows of df using these masks.
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
Does the column "C" always have X as the last character of each value? You could try creating a column D with 1 if column C has an X, or 0 if it does not. Then sort the values using sort_values and finally use drop_duplicates with keep='last'.
import pandas as pd
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This assumes you also want to drop duplicates in case there is no X in the 'C' column among the duplicates of columns A and B.
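If you do not need the helper column D (or the changed row order) afterwards, an optional cleanup could be:

# drop the helper column and restore the original row order
df = df.drop(columns='D').sort_index()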
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1
I made a pandas df from parts of 2 others:
Here is the pseudocode for what I want to do.
I have a 4-column pandas DataFrame whose values are all single words, with columns A B C D, and I want to end up with columns A B C D E F.
In pseudocode: for every s in A, if s == any string (not substring) in D, write Yes to E (a new column), else write No to E; if the string in B (same row as s) == the string in C (same row as the string found in D), write Yes to F (a new column), else write No to F.
The following code works, but now I need a function to do what is described above. (I'm not allowed to paste images of the sample data and expected outcome.)
cols = [1,2,3,5]
df3.drop(df3.columns[cols],axis=1, inplace=True)
df4.drop(df4.columns[1],axis=1, inplace=True)
listi = [df4]
listi.append(df3)
df5 = pd.concat(listi, axis = 1)
Hope this helps. I created a sample data frame:
>>> df
A B C D
0 alpha spiderman theta superman
1 beta batman alpha spiderman
2 gamma superman epsilon hulk
Now add column E that shows whether the item in A appears anywhere in column C, and column F that shows whether the item in B appears anywhere in column D:
>>> df['E'] = df.A.isin(df.C).replace({True: "Yes", False: "No"})
>>> df['F'] = df.B.isin(df.D).replace({True: "Yes", False: "No"})
>>> df
A B C D E F
0 alpha spiderman theta superman Yes Yes
1 beta batman alpha spiderman No No
2 gamma superman epsilon hulk No Yes
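The code above checks membership column-wide (A against C, B against D). For the stricter, row-aligned rule in the original pseudocode (match A against D, then compare B with the C of the matching row), a merge-based sketch could look like the following; it assumes the strings in D are unique so that each value of A has at most one match, and df stands for the 4-column frame from the question:

import numpy as np
import pandas as pd

# one row per distinct D value, renamed to avoid column-name clashes
lookup = (df[['C', 'D']]
          .drop_duplicates('D')
          .rename(columns={'C': 'C_match', 'D': 'D_match'}))
merged = df.merge(lookup, left_on='A', right_on='D_match', how='left')
df['E'] = np.where(merged['D_match'].notna(), 'Yes', 'No')       # A found in D?
df['F'] = np.where(merged['B'].eq(merged['C_match']), 'Yes', 'No')  # B equals C of the matched row?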
I have two DataFrames in PySpark:
d1: (x, y, value) and d2: (k, v, value). The entries in d1 are unique (you can consider column x alone as a key, and y alone as a key).
x y value
a b 0.2
c d 0.4
e f 0.8
d2 is the following format:
k v value
a c 0.7
k k 0.3
j h 0.8
e p 0.1
a b 0.1
I need to filter d2 according to the co-occurrences in d1, i.e., (a, c, 0.7) and (e, p, 0.1) should be deleted, as a can occur only with b, and similarly for e.
I tried to select from d1 the x and y columns.
sourceList = d1.select("x").collect()
sourceList = [row.x for row in sourceList]
sourceList_b = sc.broadcast(sourceList)
then
check_id_isin = sf.udf(lambda x: x in sourceList , BooleanType())
d2 = d2.where(~d2.k.isin(sourceList_b.value))
For small datasets it works well, but for large ones the collect causes an exception. I want to know if there is better logic to compute this step.
One way could be to join d1 onto d2, fill the missing values in column y from column v using coalesce, and then keep only the rows where y and v are equal, such as:
import pyspark.sql.functions as F
(d2.join(d1.select('x', 'y').withColumnRenamed('x', 'k'),  # rename x to k for an easier join
         on=['k'], how='left')                             # left join to keep only d2 rows
   .withColumn('y', F.coalesce('y', 'v'))                  # fill missing y values with the ones from v
   .filter(F.col('v') == F.col('y'))                       # keep only rows where v equals y
   .drop('y')                                              # drop the helper column y, no longer needed
   .show())
and you get:
+---+---+-----+
| k| v|value|
+---+---+-----+
| k| k| 0.3|
| j| h| 0.8|
+---+---+-----+
It also keeps any row of d2 whose (k, v) pair appears as an (x, y) pair in d1.
So you have two problems here:
Logic for joining these two tables:
This can be done by performing an inner join on two columns instead of one. This is the code for that:
# Create an expression for an inner join on two columns
joinExpr = (d1.x == d2.k) & (d1.y == d2.v)
joinDF = d1.join(d2, joinExpr)
The second problem is speed. There are multiple ways of fixing it. Here are my top two:
a. If one of the dataframes is significantly smaller (usually under 2 GB) than the other dataframe, then you can use the broadcast join. It essentially copies the smaller dataframe to all the workers so that there is no need to shuffle while joining. Here is an example:
from pyspark.sql.functions import broadcast
joinExpr = (d1.x == d2.k) & (d1.y == d2.v)
joinDF = d1.join(broadcast(d2), joinExpr)
b. Try adding more workers and increasing the memory.
What you probably want to do is think of this in relational terms: join d1 and d2 on d1.x = d2.k AND d1.y = d2.v. An inner join will drop any records from d2 that don't have a corresponding pair in d1. By using a join, Spark will do a cluster-wide shuffle of the data, allowing for much greater parallelism and scalability compared to a broadcast exchange, which generally caps out at about ~10 MB of data (which is what Spark uses as the cut-over point between a shuffle join and a broadcast join).
Also, as an FYI, WHERE (a, b) IN (...) gets translated into a join in most cases, unless (...) is a small set of data.
https://github.com/vaquarkhan/vaquarkhan/wiki/Apache-Spark--Shuffle-hash-join-vs--Broadcast-hash-join
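As a sketch of that relational formulation in PySpark (assuming the frames are named d1 and d2 as above): a left semi join keeps only the d2 rows whose (k, v) pair exists as (x, y) in d1, and returns only d2's columns; an inner join would also bring in d1's columns.

# keep the rows of d2 whose (k, v) pair appears as (x, y) in d1
filtered = d2.join(d1, (d2.k == d1.x) & (d2.v == d1.y), how='left_semi')
filtered.show()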
I have one Pandas DF with three columns like below:
City1 City2 Totalamount
0 A B 1000
1 A C 2000
2 B A 1000
3 B C 500
4 C A 2000
5 C B 500
I want to delete the duplicated rows where (City1, City2) == (City2, City1). The result should be:
City1 City2 Totalamount
0 A B 1000
1 A C 2000
2 B C 500
I tried
res=DFname.drop(DFname[(DFname.City1,DFname.City2) == (DFname.City2,DFname.City1)].index)
but it's giving an error.
Could you please help? Thanks.
You sort first, then drop the duplicates:
import numpy as np
cols = ['City1', 'City2']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates()
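Note that this rewrites City1 and City2 with their sorted values. If you would rather keep the original order of the names within each row, one option (a sketch) is to build the sorted pair as a separate key and use it only to detect duplicates:

import numpy as np
import pandas as pd

# sorted city pair used purely as a deduplication key; df keeps its original values
key = pd.DataFrame(np.sort(df[['City1', 'City2']].values, axis=1), index=df.index)
df = df[~key.duplicated()]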
If the entire dataframe follows the pattern you show in your sample, where:
All rows are duplicated like (A, B) and (B, A)
There are no unpaired entries
City1 and City2 are always different (no instances of (A, A))
then you can simply do
df = df[df['City1'] < df['City2']]
If the sample is not representative of your whole dataframe, please include a sample that is.