Filter pandas group with if else condition - python-3.x

I have a pandas dataframe like this:
   ID Tier1 Tier2
 1111    RF     B
 1111    OK     B
 2222    RF     B
 2222    RF     E
 3333    OK     B
 3333    LO     B
I need to cut down the table so the IDs are unique, but do so with the following hierarchy: RF>OK>LO for Tier1. Then B>E for Tier2.
So the expected output after applying the Tier1 hierarchy will be:
   ID Tier1 Tier2
 1111    RF     B
 2222    RF     B
 2222    RF     E
 3333    OK     B
and then, after the Tier2 hierarchy:
   ID Tier1 Tier2
 1111    RF     B
 2222    RF     B
 3333    OK     B
I am struggling to figure out how to do this. My initial attempt is to group the table with grouped = df.groupby('ID') and then:
grouped = df.groupby('ID')
for key, group in grouped:
    check_rf = group['Tier1'] == 'RF'
    check_ok = group['Tier1'] == 'OK'
    if check_rf.any():
        group = group[group['Tier1'] == 'RF']
    elif check_ok.any():
        # and so on
I think this is working to filter each group, but I have no idea how the groups can then relate back to the parent table (df). And I am sure there is a better way to do this.
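For reference, one minimal way to relate the filtered groups back to a single table is to collect each filtered group and concatenate them again. This is only a sketch of the Tier1 pass of the loop approach above (a second pass would handle Tier2); the answers below show more idiomatic solutions:
import pandas as pd

filtered = []
for key, group in df.groupby('ID'):
    # keep only the rows whose Tier1 is the most preferred value present
    for tier in ['RF', 'OK', 'LO']:
        mask = group['Tier1'] == tier
        if mask.any():
            group = group[mask]
            break
    filtered.append(group)

result = pd.concat(filtered)  # one DataFrame again, indexed like the original rows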
Thanks!

Let's use pd.Categorical & drop_duplicates
df['Tier1'] = pd.Categorical(df['Tier1'],['RF','OK','LO'],ordered=True)
df['Tier2'] = pd.Categorical(df['Tier2'],['B','E'],ordered=True)
df1 = df.sort_values(['Tier1','Tier2']).drop_duplicates(subset=['ID'],keep='first')
print(df1)
ID Tier1 Tier2
0 1111 RF B
2 2222 RF B
4 3333 OK B
Looking at Tier1, you can see the ordering.
print(df['Tier1'])
0 RF
1 OK
2 RF
3 RF
4 OK
5 LO
Name: Tier1, dtype: category
Categories (3, object): ['RF' < 'OK' < 'LO']
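If the ordered categorical dtype is only needed for the sort, an optional follow-up (a small sketch using standard pandas) is to convert the columns back to plain strings afterwards:
# restore plain object (string) columns once the prioritised sort is done
df1['Tier1'] = df1['Tier1'].astype(str)
df1['Tier2'] = df1['Tier2'].astype(str)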

You can use two groupby+agg pandas calls. Since the orderings RF>OK>LO and B>E happen to match the (reverse) lexicographic ordering, you can use the trivial min/max functions for the aggregation (otherwise you can write your own custom min-max functions).
Here is how to do that (using a 2-pass filtering):
tmp = df.groupby(['ID', 'Tier2']).agg('max').reset_index()      # Step 1: best Tier1 per (ID, Tier2)
output = tmp.groupby(['ID', 'Tier1']).agg('min').reset_index()  # Step 2: best Tier2 per (ID, Tier1)
Here is the result in output:
ID Tier1 Tier2
0 1111 RF B
1 2222 RF B
2 3333 OK B
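If the priority did not happen to match the lexicographic order, a custom aggregation can rank the values explicitly. A sketch under that assumption (the rank tables and the best helper are illustrative names, not part of any library):
tier1_rank = {'RF': 0, 'OK': 1, 'LO': 2}  # lower rank = more preferred
tier2_rank = {'B': 0, 'E': 1}

def best(series, rank):
    # return the most preferred value present in this group
    return min(series, key=rank.get)

tmp = df.groupby(['ID', 'Tier2'])['Tier1'].agg(lambda s: best(s, tier1_rank)).reset_index()
output = tmp.groupby(['ID', 'Tier1'])['Tier2'].agg(lambda s: best(s, tier2_rank)).reset_index()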

Related

iteratively merging varying number of rows

earlier discussion with the help of @Joe Ferndz here:
merging varying number of rows and columns by multiple conditions in python
How the dataset looks:
connector type q_text a_text var1
1 1111 1 aaaa None xxxx
2 9999 2 None tttt jjjj
3 1111 2 None uuuu None
4 9999 1 bbbb None yyyy
5 9999 1 cccc None zzzz
Logic: merge every row with type = 1 into its corresponding (same value in connector) type = 2 row. Code that does this:
df.loc[df['type'] == 2, 'var1.1'] = df['var1']
my_cols = ['q_text','a_text','var1']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['q_text'], inplace=True)
df.reset_index(drop=True,inplace=True)
How the dataset then looks:
connector q_text a_text var1 var1.1
1 1111 aaaa uuuu xxxx None
2 9999 bbbb tttt yyyy jjjj
3 9999 cccc None zzzz zzzz
Problem: multiple rows have type = 1 but only one row has type = 2 (same connector value), so I eventually need to merge the type = 2 row multiple times.
Question: why does it merge only one row?
How the dataset should look (compare row 3 values and you will see what I mean):
connector q_text a_text var1 var1.1
1 1111 aaaa uuuu xxxx None
2 9999 bbbb tttt yyyy jjjj
3 9999 cccc tttt zzzz jjjj
a_text follows left-join logic, so values can be overridden without adding an extra column. In contrast, var1 values are non-exclusionary with regard to the row's connector value, which is why I want an extra column (var1.1) for those values (jjjj). There are rows with a unique connector value that will never be merged, but I want to keep those.
You want to merge rows with type = 1 to rows having type = 2, but the code/logic you showed doesn't use the pandas.merge method, which will actually do what you want.
First segregate the rows with type = 1 and type = 2 into two different dataframes, df1 and df2. Then simply merge these two dataframes on connector values. It will automatically map multiple rows having type = 1 in df1 to the single row having type = 2 in df2 (with the same connector value). Also, since you want to keep rows with a unique connector value that will never be merged, use the how='outer' parameter to perform an outer merge and keep all values.
After the merge, select the columns you finally want and rename them accordingly:
df1 = df.loc[df.type == 1].copy()
df2 = df.loc[df.type == 2].copy()
merged_df = pd.merge(df1, df2, on='connector', how='outer')
merged_df = merged_df.loc[:,['connector','q_text_x','a_text_y','var1_x','var1_y']]
merged_df.rename(columns={'q_text_x':'q_text','a_text_y':'a_text','var1_x':'var1','var1_y':'var1.1'}, inplace=True)
>>> merged_df
connector q_text a_text var1 var1.1
0 1111 aaaa uuuu xxxx None
1 9999 bbbb tttt yyyy jjjj
2 9999 cccc tttt zzzz jjjj
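A small variation worth knowing (a sketch on the same df1/df2 split): pd.merge accepts a suffixes parameter, so the overlapping column names can be controlled up front and the renaming step shortened:
# control the suffixes during the merge instead of renaming afterwards
merged_df = pd.merge(df1, df2, on='connector', how='outer', suffixes=('', '.1'))
merged_df = merged_df[['connector', 'q_text', 'a_text.1', 'var1', 'var1.1']]
merged_df = merged_df.rename(columns={'a_text.1': 'a_text'})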

If a value in a column has multiple values in another column, how to filter based on priority in pandas

If I have a data frame like this:
id descrip
0 0000 x
1 0000 y
2 0000 z
3 1111 x
4 1111 z
5 2222 z
6 3333 x
7 3333 y
And I basically want to keep rows based on a priority for the descrip column: a z is preferred over a y, which is preferred over an x.
So I basically want this:
id descrip
0 0000 z
1 1111 z
2 2222 z
3 3333 y
Not sure how I would approach this.
df.groupby('id')['descrip'].max().reset_index()  # works because x < y < z lexicographically
id descrip
0 0 z
1 1111 z
2 2222 z
3 3333 y
It's always good to keep track of what exactly is preferred over what.
Let's say the ordering was different, i.e. y < z < x where x is the most preferred. Then we could do:
df['descrip'] = df.descrip.astype('category').cat.reorder_categories(['y', 'z', 'x']).\
    cat.as_ordered()
df.groupby('id')['descrip'].max().reset_index()
id descrip
0 0 x
1 1111 x
2 2222 z
3 3333 x
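Note that id 0000 prints as 0 in these outputs because the ids were parsed as integers. If the zero-padding matters, keep the column as strings; a small construction sketch of the sample data under that assumption:
df = pd.DataFrame({'id': ['0000', '0000', '0000', '1111', '1111', '2222', '3333', '3333'],
                   'descrip': list('xyzxzzxy')})
print(df.groupby('id')['descrip'].max().reset_index())  # ids stay zero-padded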

merging varying number of rows and columns by multiple conditions in python

Updated problem: why does it not merge a_date, a_par, a_cons, a_ment and a_le? These are appended as columns without values, but in the original dataset they have values.
Here is what the dataset looks like:
connector type q_text a_text var1 var2
1 1111 1 aa None xx ps
2 9999 2 None tt jjjj pppp
3 1111 2 None uu None oo
4 9999 1 bb None yy Rt
5 9999 1 cc None zz tR
Goal: the dataset should look like this:
connector q_text a_text var1 var1.1 var2 var2.1
1 1111 aa uu xx None ps oo
2 9999 bb tt yy jjjj Rt pppp
3 9999 cc tt zz jjjj tR pppp
Logic: the column type has either value 1 or 2, with multiple rows having value 1 but only one row (with the same value in connector) having value 2.
Here are the main merging rules:
Merge every row of type=1 with its corresponding (connector) type=2 row.
Since multiple rows of type=1 have the same connector value, I don't want to merge solely one row of type=1 but all of them, each with the sole type==2 row.
Since some columns (e.g. a_text) follow left-join logic, values can be overridden without adding an extra column.
Since var2 values cannot be merged by left-join because they are non-exclusionary with regard to the row's connector value, I want to have extra columns (var1.1, var2.1) for those values (pppp, jjjj).
In summary (bearing in mind that I only speak of rows that have the same connector values): if q_text is None, I first want to replace the values in a_text with the a_text value (see above table: tt and uu) of the corresponding row (same connector value) and, secondly, want to append some other values (var1 and var2) of that very same corresponding row as new columns.
Also, there are rows with a unique connector value that is not going to be matched. I want to keep those rows though.
I only want to "drop" the type=2 rows that get merged with their corresponding type=1 row(s). In other words: I don't want to keep the rows of type=2 that have a match and get merged into their corresponding (connector) type=1 rows. I want to keep all other rows though.
The solution by @victor__von__doom here:
merging varying number of rows by multiple conditions in python
was given when I originally wanted to keep all of the type=2 columns (values).
Code I used (it merges Perso, q_text and a_text):
df.loc[df['type'] == 2, 'a_date'] = df['q_date']
df.loc[df['type'] == 2, 'a_par'] = df['par']
df.loc[df['type'] == 2, 'a_cons'] = df['cons']
df.loc[df['type'] == 2, 'a_ment'] = df['pret']
df.loc[df['type'] == 2, 'a_le'] = df['q_le']
my_cols = ['Perso', 'q_text', 'a_text', 'a_le', 'q_le', 'q_date', 'par', 'cons', 'pret', 'a_date', 'a_par', 'a_cons', 'a_ment']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['a_text', 'Perso'],inplace=True)
df.reset_index(drop=True,inplace=True)
Data: this is a representation of the core dataset. Unfortunately I cannot share the actual data due to privacy laws.
Perso | ID | per | q_le | a_le | pret | par | form | q_date | name | IO_ID | part | area | q_text | a_text | country | cons | dig | connector | type
J Ws | 1-1/4/2001-11-12/1 | 1999-2009 | None | 4325 | 'Mi, h', 'd' | Cew | Thre | 2001-11-12 | None | 345 | rede | s — H | None | wr ede | Terd e | e r | 2001-11-12.1.g9 | 999999999 | 2
S ts | 9-3/6/2003-10-14/1 | 1994-2004 | None | 23 | 'sd, h' | d-g | Thre | 2003-10-14 | None | 34555 | The | l? I | None | Tre | Thr ede | re | 2001-04-16.1.a9 | 333333333 | 2
On d | 6-1/6/2005-09-03/1 | 1992-2006 | None | 434 | 'uu h' | d-g | Thre | 2005-09-03 | None | 7313 | Thde | l? I | None | T e | Th rede | dre | 2001-08-07.1.e4 | 111111111 | 2
None | 3-4/4/2000-07-07/1 | 1992-2006 | 1223 | None | 'uu h' | dfs | Thre | 2000-07-07 | Th r | 7413 | Thde | Tddde | Thd de | None | Thre de | | 2001-07-06.1.j3 | 111111111 | 1
None | 2-1/6/2001-11-12/1 | 1999-2009 | 1444 | None | 'Mi, h', 'd' | d-g | Thre | 2001-11-12 | T rj | 7431 | Thde | l? I | Th dde | None | Thr ede | | 2001-11-12.1.s7 | 999999999 | 1
None | 1-6/4/2007-11-01/1 | 1993-2010 | 2353 | None | None | d-g | Thre | 2007-11-01 | Thrj | 444 | Thed | l. I | Tgg gg | None | Thre de | we e | 2001-06-11.1.g9 | 654982984 | 1
EDIT v2 with additional columns
This version ensures the values in the additional columns are not impacted.
c = ['connector','type','q_text','a_text','var1','var2','cumsum','country','others']
d = [[1111, 1, 'aa', None, 'xx', 'ps', 0, 'US', 'other values'],
     [9999, 2, None, 'tt', 'jjjj', 'pppp', 0, 'UK', 'no values'],
     [1111, 2, None, 'uu', None, 'oo', 1, 'US', 'some values'],
     [9999, 1, 'bb', None, 'yy', 'Rt', 1, 'UK', 'more values'],
     [9999, 1, 'cc', None, 'zz', 'tR', 2, 'UK', 'less values']]
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.DataFrame(d,columns=c)
print (df)
df.loc[df['type'] == 2, 'var1.1'] = df['var1']
df.loc[df['type'] == 2, 'var2.1'] = df['var2']
my_cols = ['q_text','a_text','var1','var2','var1.1','var2.1']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['q_text'],inplace=True)
df.reset_index(drop=True,inplace=True)
print (df)
Original DataFrame:
connector type q_text a_text var1 var2 cumsum country others
0 1111 1 aa None xx ps 0 US other values
1 9999 2 None tt jjjj pppp 0 UK no values
2 1111 2 None uu None oo 1 US some values
3 9999 1 bb None yy Rt 1 UK more values
4 9999 1 cc None zz tR 2 UK less values
Updated DataFrame:
connector type q_text a_text var1 var2 cumsum country others var1.1 var2.1
0 1111 1 aa uu xx ps 0 US other values None oo
1 9999 1 bb tt yy Rt 1 UK more values jjjj pppp
2 9999 1 cc tt zz tR 2 UK less values jjjj pppp

Unable to understand DataFrame method "loc" logic, if we use incorrect names of labels

I am using the loc method to extract columns by their labels. I encountered an issue when using incorrect label names, which produced the output below. Please help me understand the logic behind the loc method in terms of label use.
import pandas as pd
Dic={'empno':(101,102,103,104),'name':('a','b','c','d'),'salary':(3000,5000,8000,9000)}
df=pd.DataFrame(Dic)
print(df)
print()
print(df.loc[0:2,'empsfgsdzfsdfsdaf':'salary'])
print(df.loc[0:2,'empno':'salarysadfsa'])
print(df.loc[0:2,'name':'asdfsdafsdaf'])
print(df.loc[0:2,'sadfsadfsadf':'sasdfsdflasdfsdfsdry'])
print(df.loc[0:2,'':'nasdfsd'])
OUTPUT:
empno name salary
0 101 a 3000
1 102 b 5000
2 103 c 8000
3 104 d 9000
name salary
0 a 3000
1 b 5000
2 c 8000
empno name salary
0 101 a 3000
1 102 b 5000
2 103 c 8000
Empty DataFrame
Columns: []
Index: [0, 1, 2]
salary
0 3000
1 5000
2 8000
empno name
0 101 a
1 102 b
2 103 c
.loc[A : B, C : D] will select:
index (row) labels from (and including) A to (and including) B; and
column labels from (and including) C to (and including) D.
Let's look at the column label slice 'a':'salary'. Since 'a' sorts before the first column label, we get empno, name, salary.
print(df.loc[0:2, 'a':'salary'])
empno name salary
0 101 a 3000
1 102 b 5000
2 103 c 8000
It works the same way at the upper end of the slice:
print(df.loc[0:2, 'name':'z'])
name salary
0 a 3000
1 b 5000
2 c 8000
Here is a list comprehension that shows how the second slice works:
# code
[col for col in df.columns if 'name' <= col <= 'z']
# result
['name', 'salary']
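One caveat worth adding (general pandas behavior, shown as a sketch): slicing with labels that don't exist only works because the column index here ('empno' < 'name' < 'salary') is sorted. On a non-monotonic index, slicing with a missing label raises a KeyError:
df2 = df[['salary', 'empno', 'name']]       # deliberately unsorted column order
print(df2.columns.is_monotonic_increasing)  # False
# df2.loc[0:2, 'a':'z']  # would raise a KeyError on this unsorted index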
There is a good description of the most commonly used subsetting methods here:
https://www.kdnuggets.com/2019/06/select-rows-columns-pandas.html

Is there a way to hide the same values in MultiIndex level 1?

I have the following dataframe (named test) in pandas:
Group 1 Group 2 Species Adj. P-value
0 a b Parabacteroides goldsteinii 7
1 a b Parabacteroides johnsonii 8
2 a b Parabacteroides merdae 9
3 a b Parabacteroides sp 10
4 c d Bacteroides coprocola 1
5 c d Bacteroides dorei 2
I would like to transform this table into LaTeX format, but with the repeated values in Group 1 and Group 2 centred across their rows (the original post showed a figure as an example). In LaTeX this is done with the multirow package, and df.to_latex has a parameter called multirow to enable this (to_latex).
However, a MultiIndex has to be created in order to use the multirow option in to_latex.
So I did this:
test.index = pd.MultiIndex.from_frame(test[["Group 1","Group 2"]])
test = test.drop(["Group 1","Group 2"], axis=1)
test
Species Adj. P-value
Group 1 Group 2
a b Parabacteroides goldsteinii 7
b Parabacteroides johnsonii 8
b Parabacteroides merdae 9
b Parabacteroides sp 10
c d Bacteroides coprocola 1
d Bacteroides dorei 2
And finally I stored the table:
test.to_latex("la_tex_tab.txt", multirow=True, index=True, float_format="{:0.3f}".format)
However, this yields multirow grouping just for level 0 (Group 1), not for level 1 (Group 2) of the MultiIndex. Do you have any suggestions about how to avoid the repetitions of the values b and d in the MultiIndex?
Thank you.
Kind of a hack if you want:
test['Group 2'] = test['Group 2'].mask(test['Group 2'].duplicated(),'')
test.set_index(["Group 1","Group 2"])
Species Adj. P-value
Group 1 Group 2
a b Parabacteroides goldsteinii 7
Parabacteroides johnsonii 8
Parabacteroides merdae 9
Parabacteroides sp 10
c d Bacteroides coprocola 1
Bacteroides dorei 2
We can do it for display only by using assign to add a blank helper column:
test = test.assign(help='').set_index('help',append=True).drop(["Group 1","Group 2"], axis=1)
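Either way, here is a minimal usage sketch for the export step (assuming the mask-based frame from the first snippet); the blanked duplicates keep repeated level-1 labels out of the LaTeX output:
test = test.set_index(["Group 1", "Group 2"])
test.to_latex("la_tex_tab.txt", multirow=True, index=True,
              float_format="{:0.3f}".format)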
