Pandas Filter rows by comparing columns A1 with A2 - python-3.x

CHR  SNP         BP         A1  A2  OR       P
8    rs62513865  101592213  T   C   1.00652  0.8086
8    rs79643588  106973048  A   T   1.01786  0.4606
I have this example table, and I want to filter rows by comparing column A1 with A2.
If any of these four A1/A2 combinations occurs, delete the row:
A1  A2
A   T
T   A
C   G
G   C
(e.g. line 2 in the first table).
How can I do that using pandas in Python?

Here is one way to do it.
Concatenate the two columns in each of the two DataFrames (df2 here holds the four A1/A2 combinations to drop). Turn the combinations from the second DataFrame into a list, and keep only the rows of df whose combination is not in it:
df[~(df['A1'] + df['A2']).str.strip()
   .isin((df2['A1'] + df2['A2']).tolist())]
CHR SNP BP A1 A2 OR P
0 8 rs62513865 101592213 T C 1.00652 0.8086
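A minimal self-contained sketch of this answer (the two frames are rebuilt here from the tables above; df2 is an assumed name for the table of combinations to drop):

```python
import pandas as pd

# data table from the question
df = pd.DataFrame({'CHR': [8, 8],
                   'SNP': ['rs62513865', 'rs79643588'],
                   'BP': [101592213, 106973048],
                   'A1': ['T', 'A'],
                   'A2': ['C', 'T'],
                   'OR': [1.00652, 1.01786],
                   'P': [0.8086, 0.4606]})

# the four A1/A2 combinations that mark a row for deletion
df2 = pd.DataFrame({'A1': ['A', 'T', 'C', 'G'],
                    'A2': ['T', 'A', 'G', 'C']})

# keep only rows whose concatenated A1+A2 is not among the combinations
out = df[~(df['A1'] + df['A2']).isin((df2['A1'] + df2['A2']).tolist())]
```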

keeping
Assuming df1 and df2, you can simply merge to keep the common values:
out = df1.merge(df2)
output:
CHR SNP BP A1 A2 OR P
0 8 rs79643588 106973048 A T 1.01786 0.4606
dropping
For removing the rows, perform a negative merge:
out = (df1.merge(df2, how='outer', indicator=True)
.loc[lambda d: d.pop('_merge').eq('left_only')]
)
Or merge and get the remaining indices to drop (requires unique indices):
out = df1.drop(df1.reset_index().merge(df2)['index'])
output:
CHR SNP BP A1 A2 OR P
0 8.0 rs62513865 101592213.0 T C 1.00652 0.8086
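For instance, the negative merge runs end-to-end like this (frames rebuilt from the tables above, keeping only the columns needed to show the idea):

```python
import pandas as pd

df1 = pd.DataFrame({'CHR': [8, 8],
                    'SNP': ['rs62513865', 'rs79643588'],
                    'A1': ['T', 'A'],
                    'A2': ['C', 'T']})
df2 = pd.DataFrame({'A1': ['A', 'T', 'C', 'G'],
                    'A2': ['T', 'A', 'G', 'C']})

# outer merge on the shared columns A1/A2, then keep the rows
# that came only from df1 (the '_merge' indicator is 'left_only')
out = (df1.merge(df2, how='outer', indicator=True)
          .loc[lambda d: d.pop('_merge').eq('left_only')])
```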
alternative approach
As it seems you have nucleotides and want to drop the rows where A1/A2 form a complementary pair (A/T, T/A, C/G, G/C), you could map each A1 value to its complement and keep the rows where the result is not identical to A2:
m = df1['A1'].map({'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}).ne(df1['A2'])
out = df1[m]
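A runnable sketch of the complement-map idea (using the full four-base map so every pair in the drop table is covered):

```python
import pandas as pd

df1 = pd.DataFrame({'SNP': ['rs62513865', 'rs79643588'],
                    'A1': ['T', 'A'],
                    'A2': ['C', 'T']})

# map each A1 base to its complement; when the complement equals A2
# the row is one of the A/T, T/A, C/G, G/C pairs to drop
complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
m = df1['A1'].map(complement).ne(df1['A2'])
out = df1[m]
```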

Related

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, row index 2 would be removed because row index 0 has the same information in columns A and B, and an X in column C.
As this data is slightly large, I hope to avoid iterating over rows, if possible. Ignore Index is the closest thing I've found to the built-in drop_duplicates().
If there is no X in column C then the row should require that C is identical to be deduplicated.
In the case in which there are matching A and B in a row, but have multiple versions of having an X in C, the following would be expected.
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "0X0"],
                        [1, 2, "X00"],
                        [1, 2, "0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 that is True where the A/B pair has not been seen before. Then use Series.str.contains together with Series.duplicated on column C to create a second mask m2 that is True where C contains the string X and is not itself a duplicated value. Finally, combine the two masks to filter the rows of df.
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
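Wrapped in a small helper (the function name is mine), the masks reproduce both expected outputs:

```python
import pandas as pd

def dedup(df):
    # first occurrence of each (A, B) pair
    m1 = ~df[['A', 'B']].duplicated()
    # rows where C contains 'X' and the C value is not a repeat
    m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
    return df[m1 | m2]

df1 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "00X"], [1, 3, "010"], [1, 2, "002"]])
df2 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "0X0"], [1, 2, "X00"], [1, 2, "0X0"]])
out1 = dedup(df1)
out2 = dedup(df2)
```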
Does column "C" always have X as the last character of each value? You could try creating a column D with 1 if column C has an X, or 0 if it does not. Then sort the values using sort_values, and finally use drop_duplicates with keep='last'.
import pandas as pd
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This is assuming you also want to drop duplicates when there is no X in the 'C' column among the duplicated A/B rows.
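Run end-to-end on the sample frame, this keeps one row per (A, B) pair, preferring the X row:

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])

# flag rows whose C ends in 'X' so a stable sort puts them last
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
# within each (A, B) pair keep the last row, i.e. the flagged one if present
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
```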
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1

Pandas filter, group-by and then transform

I have a pandas dataframe, which looks like the following:
df =
a   b
a1  1
a2  0
a1  0
a3  1
a2  1
a1  1
I would like to first filter b on 1, then group by a and count the number of times each group occurs (call this column count), and then attach this column to the original df. Every value of a is guaranteed to have b equal to 1 at least once.
Expected output:
df =
a   b  count
a1  1  2
a2  0  1
a1  0  2
a3  1  1
a2  1  1
a1  1  2
I tried:
df['count'] = df.groupby('a').b.transform('size')
But, this counts zeros as well. I want to filter for b == 1 first.
I also tried:
df['count'] = df[df['b'] == 1].groupby('a').b.transform('size')
But this introduces NaNs in the count column.
How can I do this in one line?
Build the condition on b, then group it by a and take the transformed sum:
df['b'].eq(1).groupby(df['a']).transform('sum')
Out[103]:
0 2.0
1 1.0
2 2.0
3 1.0
4 1.0
5 2.0
Name: b, dtype: float64
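To attach the count to the frame as integers (casting in case the transformed sum comes back as float, as in the output above):

```python
import pandas as pd

df = pd.DataFrame({'a': ['a1', 'a2', 'a1', 'a3', 'a2', 'a1'],
                   'b': [1, 0, 0, 1, 1, 1]})

# per group of a, count the rows where b == 1 and broadcast back to each row
df['count'] = df['b'].eq(1).groupby(df['a']).transform('sum').astype(int)
```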

Sqlite join columns on mapping of values

I want to be able to join two tables, where there is a mapping between the column values, rather than their values matching.
So rather than:
A: m   f        B: m   f
   a1  1           b1  1
   a2  2           b2  3
   a3  3           b3  5
SELECT A.m, A.f, B.f, B.m
FROM A
INNER JOIN B ON B.f = A.f
giving:
m   A.f  B.f  m
a1  1    1    b1
a3  3    3    b2
Given the mapping (1->a), (2->b), (3->c):
A: m   f        B: m   f
   a1  1           b1  a
   a2  2           b2  b
   a3  3           b3  c
to give, when joined on f:
m   A.f  B.f  m
a1  1    a    b1
a3  3    c    b2
The question below seems to be trying something similar, but they want to change the column values; I just want the mapping to be part of the query, without changing the column values themselves. Besides, it is in R and I'm working in Python.
Mapping column values
One solution is to create a temporary table of mappings AB:
CREATE TEMP TABLE AB (a TEXT, b TEXT, PRIMARY KEY(a, b));
Then insert mappings,
INSERT INTO temp.AB VALUES (1, 'a'), (2, 'b'), (3, 'c');
or executemany with params.
Then select using intermediary table.
SELECT A.m AS Am, A.f AS Af, B.f AS Bf, B.m AS Bm
FROM A
LEFT JOIN temp.AB ON A.f=AB.a
LEFT JOIN B ON B.f=AB.b;
If you don't want to create an intermediary table, another solution would be building the query yourself.
mappings = ((1, 'a'), (3, 'c'))
sql = ('SELECT A.m AS Am, A.f AS Af, B.f AS Bf, B.m AS Bm FROM A, B WHERE '
       + ' OR '.join(['(A.f=? AND B.f=?)'] * len(mappings)))
c.execute(sql, [i for m in mappings for i in m])
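A runnable sqlite3 sketch of the temp-table approach (data taken from the second pair of tables above; with the full (1,a)(2,b)(3,c) mapping all three rows pair up):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE A (m TEXT, f INTEGER);
    CREATE TABLE B (m TEXT, f TEXT);
    INSERT INTO A VALUES ('a1', 1), ('a2', 2), ('a3', 3);
    INSERT INTO B VALUES ('b1', 'a'), ('b2', 'b'), ('b3', 'c');
    CREATE TEMP TABLE AB (a INTEGER, b TEXT, PRIMARY KEY (a, b));
    INSERT INTO AB VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

# join A to B through the mapping table
rows = con.execute("""
    SELECT A.m, A.f, B.f, B.m
    FROM A
    JOIN temp.AB ON A.f = AB.a
    JOIN B ON B.f = AB.b
    ORDER BY A.m
""").fetchall()
```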

Can I use SUMPRODUCT to accomplish this?

Need to sum a range based on if a value is in a column and one of a set of values is in another column, or vice versa.
e.g. I have The following table:
A  B  C  D
M  C  C  1
F  C  C  2
S  N  C  3
S  N  N  4
M  -  C  5
N  C  C  6
M  C  N  7
If (column A contains "M" or "S") AND ((column B contains "C" AND column C contains "C", "N" or "-") OR (column C contains "C" AND column B contains "C", "N" or "-")), then sum column D.
So from my table my results would be
1 + 3 + 5 + 7 = 16
You can use SUMPRODUCT like this:
=SUMPRODUCT(ISNUMBER(MATCH(A2:A10,{"M","S"},0)*MATCH(B2:B10&"^"&C2:C10,{"C^C","C^N","C^-","N^C","-^C"},0))+0,D2:D10)
MATCH is used to check for both valid possibilities in column A, and then all 5 possibilities for concatenated columns B and C; if those conditions are met, column D will be summed. Extend column ranges as required, but preferably don't use whole columns.
...or shorter with SUMIFS like this:
=SUM(SUMIFS(D:D,A:A,{"M";"S"},B:B,{"C","C","C","N","-"},C:C,{"C","N","-","C","C"}))
For that version you can use whole columns with no loss of efficiency.
Note that in this version all the separators in the array constants are commas, EXCEPT for the semi-colon in {"M";"S"}, which needs to be that way.
I would add a fifth column (say E) with a formula that returns the value in D for the current row if all conditions are true, or 0 otherwise:
=IF(AND(OR($A1="M",$A1="S"),OR(AND($B1="C",OR($C1="C",$C1="N",$C1="-")),AND($C1="C",OR($B1="C",$B1="N",$B1="-")))),$D1,0)
Then in a cell somewhere write =SUM($E:$E). With your example, I get 16, as intended.
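As a cross-check of the arithmetic (in Python rather than Excel; the frame mirrors the table in the question):

```python
import pandas as pd

df = pd.DataFrame({'A': ['M', 'F', 'S', 'S', 'M', 'N', 'M'],
                   'B': ['C', 'C', 'N', 'N', '-', 'C', 'C'],
                   'C': ['C', 'C', 'C', 'N', 'C', 'C', 'N'],
                   'D': [1, 2, 3, 4, 5, 6, 7]})

# column A must be M or S
cond_a = df['A'].isin(['M', 'S'])
# B is C with C in {C, N, -}, or C is C with B in {C, N, -}
pair = (df['B'].eq('C') & df['C'].isin(['C', 'N', '-'])) | \
       (df['C'].eq('C') & df['B'].isin(['C', 'N', '-']))
total = df.loc[cond_a & pair, 'D'].sum()
```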

Excel Formula comparing two columns

Below is a sample of the data I have. I want to match the data in Column A and B. If column B is not matching column A, I want to add a row and copy the data from Column A to B. For example, "4" is missing in column B, so I want to add a space and add "4" to column B so it will match column A. I have a large set of data, so I am trying to find a different way instead of checking for duplicate values in the two columns and manually adding one row at a time. Thanks!
A   B   C  D
3   3   Y  B
4   5   G  B
5   6   B  G
6   8   P  G
7   9   Y  P
8   11  G  Y
9   12  B  Y
10
11
12
I would move columns B, C, D to separate columns, say E, F, G, then use INDEX/MATCH against columns A and E to identify which records are missing.
For col C: =IFERROR(INDEX(F:F,MATCH(A1,E:E,0)),"N/A")
For col D: =IFERROR(INDEX(G:G,MATCH(A1,E:E,0)),"N/A")
Following this you can filter for C="N/A" to identify the cases where a B value is missing for an A value, and edit those manually. Since you want A and B to match, column B is then unnecessary; the final result after removing column B (and shifting C->B, D->C):
A  B    C
3  Y    B
4  N/A  N/A
5  G    B
6  B    G
7  N/A  N/A
Hope this helps!
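In Python terms (an aside, assuming the data sits in a DataFrame rather than a sheet), the same align-B/C/D-to-column-A step is a reindex:

```python
import pandas as pd

# full key range (column A) and the partial records (columns B, C, D)
a = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
recs = pd.DataFrame({'B': [3, 5, 6, 8, 9, 11, 12],
                     'C': ['Y', 'G', 'B', 'P', 'Y', 'G', 'B'],
                     'D': ['B', 'B', 'G', 'G', 'P', 'Y', 'Y']})

# align records on B to the full range of A; missing keys become NaN
out = recs.set_index('B').reindex(a).rename_axis('A').reset_index()
```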
