Remove duplicates within a column depending on 2 lists in Python - python-3.x

I have a dataframe such as :
Groups NAME LETTER
G1 Canis_lupus A
G1 Canis_lupus B
G1 Canis_lupus F
G1 Cattus_cattus C
G1 Cattus_cattus C
G2 Canis_lupus C
G2 Zebra_fish A
G2 Zebra_fish D
G2 Zebra_fish B
G2 Cattus_cattus D
G2 Cattus_cattus E
and the idea is that, within each Groups value, I would like to keep only one row per duplicated NAME for each of the two lists list1=['A','B','C'] and list2=['D','E','F'].
When a duplicated NAME has, for instance, both A and B, I keep A, the first in alphabetical order.
In the example I should then get :
Groups NAME LETTER
G1 Canis_lupus A
G1 Canis_lupus F
G1 Cattus_cattus C
G2 Canis_lupus C
G2 Zebra_fish A
G2 Zebra_fish D
G2 Cattus_cattus D
Here is the dataframe in dict format:
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G1', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G2', 9: 'G2', 10: 'G2'}, 'NAME': {0: 'Canis_lupus', 1: 'Canis_lupus', 2: 'Canis_lupus', 3: 'Cattus_cattus', 4: 'Cattus_cattus', 5: 'Canis_lupus', 6: 'Zebra_fish', 7: 'Zebra_fish', 8: 'Zebra_fish', 9: 'Cattus_cattus', 10: 'Cattus_cattus'}, 'LETTER': {0: 'A', 1: 'B', 2: 'F', 3: 'C', 4: 'C', 5: 'C', 6: 'A', 7: 'D', 8: 'B', 9: 'D', 10: 'E'}}
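If it helps to follow along, here is a minimal sketch that loads the dict above into a DataFrame (assuming only that pandas is installed):
```python
import pandas as pd

# the dict posted above, loaded into a DataFrame
df = pd.DataFrame({'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G1', 5: 'G2',
                              6: 'G2', 7: 'G2', 8: 'G2', 9: 'G2', 10: 'G2'},
                   'NAME': {0: 'Canis_lupus', 1: 'Canis_lupus', 2: 'Canis_lupus',
                            3: 'Cattus_cattus', 4: 'Cattus_cattus', 5: 'Canis_lupus',
                            6: 'Zebra_fish', 7: 'Zebra_fish', 8: 'Zebra_fish',
                            9: 'Cattus_cattus', 10: 'Cattus_cattus'},
                   'LETTER': {0: 'A', 1: 'B', 2: 'F', 3: 'C', 4: 'C', 5: 'C',
                              6: 'A', 7: 'D', 8: 'B', 9: 'D', 10: 'E'}})
```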

The idea is to first sort by LETTER if necessary, then filter by each list with Series.isin, drop duplicates with DataFrame.drop_duplicates, and finally join both parts back together with concat:
# sorting per group by LETTER
df = df.sort_values(['Groups', 'LETTER'])
# or sorting by one column only:
# df = df.sort_values('LETTER')

list1 = ['A', 'B', 'C']
list2 = ['D', 'E', 'F']
# drop_duplicates keeps the first row per Groups/NAME, i.e. the alphabetically smallest LETTER
df1 = df[df['LETTER'].isin(list1)].drop_duplicates(['Groups', 'NAME'])
df2 = df[df['LETTER'].isin(list2)].drop_duplicates(['Groups', 'NAME'])
df = pd.concat([df1, df2]).sort_index(ignore_index=True)
print (df)
Groups NAME LETTER
0 G1 Canis_lupus A
1 G1 Canis_lupus F
2 G1 Cattus_cattus C
3 G2 Canis_lupus C
4 G2 Zebra_fish A
5 G2 Zebra_fish D
6 G2 Cattus_cattus D
Another idea is to map the letters to a helper column using merged dictionaries, similar to the other solution below, except rows matching neither list are also removed with DataFrame.dropna; last, remove the helper column and sort:
d = {**dict.fromkeys(list1, 'a'),
     **dict.fromkeys(list2, 'b')}
df = (df.assign(new=df.LETTER.map(d))
        .dropna(subset=['new'])                         # remove rows matching neither list
        .drop_duplicates(subset=['Groups', 'NAME', 'new'])
        .sort_index(ignore_index=True)
        .drop(columns='new'))                           # drop the helper column
print (df)
Groups NAME LETTER
0 G1 Canis_lupus A
1 G1 Canis_lupus F
2 G1 Cattus_cattus C
3 G2 Canis_lupus C
4 G2 Zebra_fish A
5 G2 Zebra_fish D
6 G2 Cattus_cattus D

Create a list_id column to identify which list a particular letter belongs to. Then just drop the duplicates using the subset parameter.
import numpy as np

condlist = [df.LETTER.isin(list1),
            df.LETTER.isin(list2)]
choicelist = ['list1', 'list2']

df['list_id'] = np.select(condlist, choicelist)
df = (df.sort_values('LETTER')
        .drop_duplicates(subset=['Groups', 'NAME', 'list_id'])
        .drop(columns='list_id')
        .sort_values(['Groups', 'NAME']))
OUTPUT:
Groups NAME LETTER
0 G1 Canis_lupus A
2 G1 Canis_lupus F
3 G1 Cattus_cattus C
5 G2 Canis_lupus C
9 G2 Cattus_cattus D
6 G2 Zebra_fish A
7 G2 Zebra_fish D
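One hedged caveat about this approach: with no default argument, np.select fills rows that match neither condition with 0, so letters outside both lists would survive the deduplication under their own 0 tag. A minimal sketch that filters such rows out first, assuming they should simply be excluded:
```python
# keep only rows whose LETTER belongs to one of the two lists,
# so np.select's default value (0) never shows up in list_id
df = df[df.LETTER.isin(list1 + list2)].copy()
df['list_id'] = np.select([df.LETTER.isin(list1), df.LETTER.isin(list2)],
                          ['list1', 'list2'])
```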

Related

Add count of elements for each group within a list in pandas

I have a dataframe such as:
The_list=["A","B","D"]
Groups Values
G1 A
G1 B
G1 C
G1 D
G2 A
G2 B
G2 A
G3 A
G3 D
G4 Z
G4 D
G4 E
G4 C
And I would like to add, for each Groups value, the number of Values elements that are within The_list, putting this number in a New_column.
Here I should then get:
Groups Values New_column
G1 A 3
G1 B 3
G1 C 3
G1 D 3
G2 A 2
G2 B 2
G2 A 2
G3 A 1
G3 D 1
G4 Z 0
G4 D 0
G4 E 0
G4 C 0
Thanks a lot for your help
Here is the table in dict format if it helps:
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G2', 5: 'G2', 6: 'G2', 7: 'G3', 8: 'G3', 9: 'G4', 10: 'G4', 11: 'G4', 12: 'G4'}, 'Values': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'A', 7: 'A', 8: 'D', 9: 'Z', 10: 'D', 11: 'E', 12: 'C'}}
In your case, do the transform after the isin check:
df['new'] = df['Values'].isin(The_list).groupby(df['Groups']).transform('sum')
Out[37]:
0 3
1 3
2 3
3 3
4 3
5 3
6 3
7 2
8 2
9 1
10 1
11 1
12 1
Name: Values, dtype: int64
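A self-contained sketch of this answer, rebuilding df from the dict above; isin builds a boolean mask and transform('sum') counts the True values per group, so every occurrence counts (which is why G2, where A appears twice, gets 3 here rather than the 2 in the expected output):
```python
import pandas as pd

The_list = ["A", "B", "D"]
df = pd.DataFrame({'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G2', 5: 'G2', 6: 'G2',
                              7: 'G3', 8: 'G3', 9: 'G4', 10: 'G4', 11: 'G4', 12: 'G4'},
                   'Values': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'A',
                              7: 'A', 8: 'D', 9: 'Z', 10: 'D', 11: 'E', 12: 'C'}})

# mark values that appear in The_list, then count the marks per group
df['New_column'] = df['Values'].isin(The_list).groupby(df['Groups']).transform('sum')
print(df)
```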

Check if dictionaries are equal in df

I have a df, in which a column contains dictionaries:
a b c d
0 a1 b1 c1 {0.0: 'a', 1.0: 'b'}
1 a2 b2 c2 NaN
2 a3 b3 c3 {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}
and another dict:
di = {0.0: 'a', 1.0: 'b'}
I want to add a new column with 'yes' in it when d equals di, and 'no', NaN, or just empty when it does not. I tried the below but it doesn't work:
df.loc[df['d'] == di, 'e'] = 'yes'
The result would be:
a b c d e
0 a1 b1 c1 {0.0: 'a', 1.0: 'b'} yes
1 a2 b2 c2 NaN
2 a3 b3 c3 {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}
Is anyone able to help me here?
Thank you in advance!
You can try:
df['new'] = df['d'].eq(di).map({True:'yes',False:''})
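A quick usage sketch with the sample frame above; Series.eq compares each cell to the dict object, and the NaN row compares unequal, so it maps to the empty string (the column is named e here to match the expected output):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
                   'b': ['b1', 'b2', 'b3'],
                   'c': ['c1', 'c2', 'c3'],
                   'd': [{0.0: 'a', 1.0: 'b'}, np.nan,
                         {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}]})
di = {0.0: 'a', 1.0: 'b'}

# elementwise equality against the dict, then booleans mapped to labels
df['e'] = df['d'].eq(di).map({True: 'yes', False: ''})
print(df)
```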

How to subset a DataFrame based on similar column names

How can I subset similar columns in pandas based on keywords like A, B, C, and D? I have taken this as an example; is there a better way whose logic would still work if new columns were added?
df
A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 D3 D4
1 a x 1 a x 3 c 7 d s 4
2 b 5 2 b 5 4 d s c 7 d
3 c 7 3 c 7 1 a x 1 a x
4 d s 4 d s b 5 2 b s 7
You can use pandas.Index.groupby
groups = df.columns.groupby(df.columns.str[0])
#{'A': ['A1', 'A2', 'A3'],
# 'B': ['B1', 'B2', 'B3'],
# 'C': ['C1', 'C2'],
# 'D': ['D1', 'D2', 'D3', 'D4']}
Then you can access data this way:
df[groups['B']]
# B1 B2 B3
#0 1 a x
#1 2 b 5
#2 3 c 7
#3 4 d s
Keep in mind groups is a dict, so you can use any dict method too.
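Since groups is a plain dict of column labels, one handy pattern (a small sketch, not part of the original answer) is to split the frame into one sub-DataFrame per prefix:
```python
# one sub-DataFrame per first-letter prefix
subframes = {prefix: df[cols] for prefix, cols in groups.items()}
subframes['B']  # same as df[groups['B']] above
```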
Another approach can be to use df.columns in conjunction with str.contains:
a_col_lst = df.columns[df.columns.str.contains('A')]
b_col_lst = df.columns[df.columns.str.contains('B')]
df_A = df.loc[:, a_col_lst]
df_B = df.loc[:, b_col_lst]
You can apply a regex within columns.str.contains as well.
You could use filter along with a regex pattern, e.g.
df_A = df.filter(regex=(r'^A.*'))
You could also select columns by name prefix with startswith; DataFrame.select(lambda col: col.startswith('A'), axis=1) worked in older pandas, but select was removed in pandas 1.0, so use loc instead:
df_A = df.loc[:, df.columns.str.startswith('A')]

Assigning a single key to all similar products/rows in a data frame based on product description and one other key

Based on 3 keys/columns uniqueid, uniqueid2 and uniqueid3, I need to generate a column new_key that will tag all associated products/rows with a single key.
```python
df = pd.DataFrame({'uniqueid':  {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd',
                                 6: 'e', 7: 'e', 8: 'g', 9: 'g', 10: 'h', 11: 'l', 12: 'm'},
                   'uniqueid2': {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd',
                                 6: 'e', 7: 'e', 8: 'g', 9: 'g', 10: 'h', 11: 'l', 12: 'l'},
                   'uniqueid3': {0: 'z', 1: 'y', 2: 'x', 3: 'y', 4: 'x', 5: 'v',
                                 6: 'x', 7: 'u', 8: 'h', 9: 'i', 10: 'k', 11: 'k', 12: 'n'}})
```
This is the data I have, based on columns uniqueid, uniqueid2 and uniqueid3; I need to create new_key as already shown. In this dummy data, all the rows except the first belong to the same product, based on the associations in the first two columns.
But I am unsure how to proceed further. Quick help needed, please.
Expected Output:
[1]: https://i.stack.imgur.com/yAl56.png
This will give you the correct output, but I'm not sure it is exactly what you want to do in order to generate the new_key column. This solution checks uniqueid2 to see whether all values are unique within each uniqueid group as well as across the entire uniqueid2 column.
import pandas as pd
import numpy as np
df = pd.DataFrame({'uniqueid':  {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd',
                                 6: 'e', 7: 'e', 8: 'g', 9: 'g', 10: 'h', 11: 'l'},
                   'uniqueid2': {0: 'z', 1: 'y', 2: 'x', 3: 'y', 4: 'x', 5: 'v',
                                 6: 'x', 7: 'u', 8: 'h', 9: 'i', 10: 'k', 11: 'k'}})
df['m1'] = (df.groupby('uniqueid2')['uniqueid2'].transform('count') == 1)
df['m2'] = (df.groupby('uniqueid')['m1'].transform(sum))
df['m3'] = (df.groupby('uniqueid')['uniqueid2'].transform('size'))
df['m4'] = (df.groupby('uniqueid')['uniqueid'].transform('count') == 1)
df['new_key'] = np.where((df['m2'] == df['m3']) | df['m4'], df['uniqueid'], 'b')
df
Out[13]:
uniqueid uniqueid2 m1 m2 m3 m4 new_key
0 a z True 1.0 1 True a
1 b y False 0.0 2 False b
2 b x False 0.0 2 False b
3 c y False 0.0 1 True c
4 d x False 1.0 2 False b
5 d v True 1.0 2 False b
6 e x False 1.0 2 False b
7 e u True 1.0 2 False b
8 g h True 2.0 2 False g
9 g i True 2.0 2 False g
10 h k False 0.0 1 True h
11 l k False 0.0 1 True l
I kept m1, m2, m3 and m4 so that you can see the progression of the logic. You can drop these helper columns with:
df = df.drop(columns=['m1', 'm2', 'm3', 'm4'])
This looks like a networkx problem; let's try connected components:
import networkx as nx

G = nx.Graph()
# get the first uniqueid per uniqueid2 group as a representative
s = df.groupby('uniqueid2')['uniqueid'].transform('first')
# connect each uniqueid to its representative, then take connected components
G.add_edges_from(df[['uniqueid']].assign(k=s).to_numpy().tolist())
cc = list(nx.connected_components(G))
# [{'a'}, {'b', 'c', 'd', 'e'}, {'g'}, {'h', 'l'}]
# map every node to the index of its component
idx = [dict.fromkeys(y, x) for x, y in enumerate(cc)]
d = {k: v for sub in idx for k, v in sub.items()}
df['new_key'] = s.groupby(s.map(d)).transform('first')
print(df)
uniqueid uniqueid2 new_key
0 a z a
1 b y b
2 b x b
3 c y b
4 d x b
5 d v b
6 e x b
7 e u b
8 g h g
9 g i g
10 h k h
11 l k h

How to compare a string of one column of pandas with rest of the columns and if value is found in any column of the row append a new row?

I want to compare the Category column with all the predicted columns and, if the value matches any column of the row, append a column named rank holding 1 if the value is found, else 0.
Use DataFrame.filter to select the predicted columns, compare them against the Category column with DataFrame.eq, convert the booleans to integers, rename the columns with DataFrame.add_prefix, and finally attach the new columns with DataFrame.join:
import pandas as pd

df = pd.DataFrame({
    'category': list('abcabc'),
    'B': [4, 5, 4, 5, 5, 4],
    'predicted1': list('adadbd'),
    'predicted2': list('cbarac')
})
df1 = df.filter(like='predicted').eq(df['category'], axis=0).astype(int).add_prefix('new_')
df = df.join(df1)
print (df)
category B predicted1 predicted2 new_predicted1 new_predicted2
0 a 4 a c 1 0
1 b 5 d b 0 1
2 c 4 a a 0 0
3 a 5 d r 0 0
4 b 5 b a 1 0
5 c 4 d c 0 1
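If, as the question asks, a single rank column is wanted instead of one indicator per predicted column, the same comparison can be reduced with any(axis=1); a sketch building on the frame above:
```python
# 1 if the category value matches any predicted column in that row, else 0
df['rank'] = (df.filter(like='predicted')
                .eq(df['category'], axis=0)
                .any(axis=1)
                .astype(int))
```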
This solution is much less elegant than the one proposed by @jezrael, but you can try it.
# sample dataframe
d = {'cat': ['comp-el', 'el', 'comp', 'comp-el', 'el', 'comp'],
     'predicted1': ['com', 'al', 'p', 'col', 'el', 'comp'],
     'predicted2': ['a', 'el', 'p', 'n', 's', 't']}
df = pd.DataFrame(data=d)

# iterating through rows
for i, row in df.iterrows():
    # assigning values
    cat = df.loc[i, 'cat']
    predicted1 = df.loc[i, 'predicted1']
    predicted2 = df.loc[i, 'predicted2']
    # condition
    if cat == predicted1 or cat == predicted2:
        df.loc[i, 'rank'] = 1
    else:
        df.loc[i, 'rank'] = 0
Output:
cat predicted1 predicted2 rank
0 comp-el com a 0.0
1 el al el 1.0
2 comp p p 0.0
3 comp-el col n 0.0
4 el el s 1.0
5 comp comp t 1.0
