If a value in a column has multiple values in another column, how to filter based on priority in pandas - python-3.x

If I have a data frame like this:
id descrip
0 0000 x
1 0000 y
2 0000 z
3 1111 x
4 1111 z
5 2222 z
6 3333 x
7 3333 y
I want to keep one row per id based on a priority for the descrip column: a z is preferred over a y, which is preferred over an x.
So I basically want this:
id descrip
0 0000 z
1 1111 z
2 2222 z
3 3333 y
I'm not sure how I would approach this.

df.groupby('id')['descrip'].max().reset_index()
id descrip
0 0 z
1 1111 z
2 2222 z
3 3333 y
It's always good to keep track of what exactly is preferred over what.
Let's say the ordering were different, i.e. y < z < x, where x is the most preferred. Then we could do:
df['descrip'] = df.descrip.astype('category').cat.reorder_categories(['y', 'z', 'x']).\
cat.as_ordered()
df.groupby('id')['descrip'].max().reset_index()
id descrip
0 0 x
1 1111 x
2 2222 z
3 3333 x
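For reference, a minimal self-contained sketch of the same idea, assuming the example frame from the question and an explicit least-to-most-preferred priority list (ids are kept as strings so '0000' survives):
import pandas as pd

df = pd.DataFrame({
    'id':      ['0000', '0000', '0000', '1111', '1111', '2222', '3333', '3333'],
    'descrip': ['x', 'y', 'z', 'x', 'z', 'z', 'x', 'y'],
})
# Encode the priority explicitly: x < y < z (least to most preferred)
df['descrip'] = pd.Categorical(df['descrip'], categories=['x', 'y', 'z'], ordered=True)
# The max of an ordered categorical is the most preferred value per id
print(df.groupby('id')['descrip'].max().reset_index())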

Related

Finding max value in another column for each unique value in a column in pandas

I am trying to get the max start for each id. This is the table I have:
id descrip start
0 0000 x 4
1 0000 y 60
2 1111 x 7
3 1111 x 0
4 2222 z 452
5 3333 x 36622
6 3333 t 32
And this is what I want:
id descrip start
0 0000 y 60
1 1111 x 7
2 2222 z 452
3 3333 x 36622
I tried doing this:
df.loc[df.reset_index().groupby(['id'])['start'].idxmax()]
But I have been getting this error:
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'
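A likely cause, hedged since the full frame isn't shown: reset_index() renumbers the rows, so the positions returned by idxmax() need not exist as labels in df's own index (for example if rows were filtered earlier). Two common fixes, as a sketch:
# Let idxmax() return labels from df's own index, so .loc can find them
out = df.loc[df.groupby('id')['start'].idxmax()]

# An index-free alternative: keep the last row per id after sorting by start
out = df.sort_values('start').drop_duplicates('id', keep='last')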

Filter pandas group with if else condition

I have a pandas dataframe like this:
ID    Tier1 Tier2
1111  RF    B
1111  OK    B
2222  RF    B
2222  RF    E
3333  OK    B
3333  LO    B
I need to cut down the table so the IDs are unique, but do so with the following hierarchy: RF>OK>LO for Tier1. Then B>E for Tier2.
So the expected output will be:
ID    Tier1 Tier2
1111  RF    B
2222  RF    B
2222  RF    E
3333  OK    B
then:
ID    Tier1 Tier2
1111  RF    B
2222  RF    B
3333  OK    B
I am struggling to figure out how to do this. My initial attempt is to group the table with grouped = df.groupby('ID') and then:
grouped = df.groupby('ID')
for key, group in grouped:
    check_rf = group['Tier1'] == 'RF'
    check_ok = group['Tier1'] == 'OK'
    if check_rf.any():
        group = group[group['Tier1'] == 'RF']
    elif check_ok.any():
        pass  # and so on
I think this is working to filter each group, but I have no idea how the groups can then relate back to the parent table (df). And I am sure there is a better way to do this.
Thanks!
Let's use pd.Categorical & drop_duplicates
df['Tier1'] = pd.Categorical(df['Tier1'],['RF','OK','LO'],ordered=True)
df['Tier2'] = pd.Categorical(df['Tier2'],['B','E'],ordered=True)
df1 = df.sort_values(['Tier1','Tier2']).drop_duplicates(subset=['ID'],keep='first')
print(df1)
ID Tier1 Tier2
0 1111 RF B
2 2222 RF B
4 3333 OK B
Looking at Tier1 you can see the ordering.
print(df['Tier1'])
0 RF
1 OK
2 RF
3 RF
4 OK
5 LO
Name: Tier1, dtype: category
Categories (3, object): ['RF' < 'OK' < 'LO']
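If the ordered-categorical dtype is not wanted downstream, a small follow-up sketch (assuming df1 from above) converts the columns back to plain strings:
# Optional: restore plain string/object dtype once the filtering is done
df1['Tier1'] = df1['Tier1'].astype(str)
df1['Tier2'] = df1['Tier2'].astype(str)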
You can use two groupby+agg Pandas calls. Since the orderings RF>OK>LO and B>E happen to match the (reverse) lexicographic ordering, you can use the trivial min/max functions for the aggregation (otherwise you could write your own custom min-max functions).
Here is how to do that (using a 2-pass filtering):
tmp = df.groupby(['ID', 'Tier2']).agg(max).reset_index() # Step 1
output = tmp.groupby(['ID', 'Tier1']).agg(min).reset_index() # Step 2
Here is the result in output:
ID Tier1 Tier2
0 1111 RF B
1 2222 RF B
2 3333 OK B

Create a new Id column which starts at 0000 and increments one by one in python

I want to add a new Id column to a data frame which should start from 0000 and increment one by one.
You can use this:
df['id'] = pd.Series(np.arange(len(df))).astype(str).str.zfill(4)
input
Place Number Code
0 X A 1
1 Y B 2
2 X C 3
3 Y D 0
4 X F 1
5 Y G 2
6 X H 5
7 Y I 4
output
Place Number Code id
0 X A 1 0000
1 Y B 2 0001
2 X C 3 0002
3 Y D 0 0003
4 X F 1 0004
5 Y G 2 0005
6 X H 5 0006
7 Y I 4 0007
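One caveat worth hedging: pd.Series(np.arange(len(df))) carries its own fresh 0..n-1 index, so if df does not have a default RangeIndex the assignment can misalign and produce NaN. A sketch that sidesteps index alignment:
# Build the zero-padded ids as a plain list, so assignment is positional
# and does not depend on df's index labels.
df['id'] = [str(i).zfill(4) for i in range(len(df))]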

Using apply function in Pandas while referencing and looping through another df?

I have a drug reference as a df called df_drug_ref (below). There are three drugs (A, B, and C). The corresponding ATC is listed in the second column. However, if a patient has a DIN within the Drug_DIN_Id_Exclusion list, then they would not be considered as using that drug (i.e. 011235 for Drug A).
Drug Drug_ATC_Id Drug_DIN_Id_Exclusion
A N123 [011235]
B B5234 [65413, 654351]
C N32456 []
The following is the other df, called df_row. This captures all the drugs dispensed to each individual, and each individual has their own People_Id.
People_Id Drug_ATC Drug_DIN A B C
1001 N123
1001 N123 011235
1001 N32456 011232
1001 N111
1002 B5234 65413
1002 B5234 654090
1002 N123 011235
I would like to assign '1' for the corresponding drug (looping iteratively to check for A, B, or C and assigning to the corresponding columns) if, in that row, the ATC code matches with the drug reference and the DIN is not contained within the exclusion list. The result should be:
People_Id Drug_ATC Drug_DIN A B C
1001 N123 1 0 0
1001 N123 011235 0 0 0
1001 N32456 011232 0 0 1
1001 N111 0 0 0
1002 B5234 65413 0 0 0
1002 B5234 654090 0 1 0
1002 N123 011235 0 0 0
I understand how to use the apply function within the same df itself, but I don't know how to also use an external df as a reference.
First you can split your lists into several columns with apply(pd.Series) and join them to df_drug_ref:
print (df_drug_ref.join(df_drug_ref['Drug_DIN_Id_Exclusion'].apply(pd.Series)))
Drug Drug_ATC_Id Drug_DIN_Id_Exclusion 0 1
0 A N123 [011235] 011235 NaN
1 B B5234 [65413, 654351] 65413 654351
2 C N32456 [] NaN NaN
Then you can merge the joined dataframe above onto df_row on the column 'Drug_ATC', after doing some cleaning on the columns:
df_merge = df_row.merge(df_drug_ref[['Drug', 'Drug_ATC_Id']]
                        .join(df_drug_ref['Drug_DIN_Id_Exclusion']
                              .apply(pd.Series)
                              .add_prefix('Drug_DIN_'))
                        .rename(columns={'Drug_ATC_Id': 'Drug_ATC'}),
                        how='left')
to get df_merge:
People_Id Drug_ATC Drug_DIN Drug Drug_DIN_0 Drug_DIN_1
0 1001 N123 A 011235 NaN
1 1001 N123 011235 A 011235 NaN
2 1001 N32456 011235 C NaN NaN
3 1001 N111 NaN NaN NaN
4 1002 B5234 65413 B 65413 654351
5 1002 B5234 654090 B 65413 654351
6 1002 N123 011235 A 011235 NaN
Now you can replace the column 'Drug' with NaN where the value in 'Drug_DIN' is in one of the columns 'Drug_DIN_i' with np.any:
mask = np.any(df_merge.filter(like='Drug_DIN').iloc[:, :1].values ==
              df_merge.filter(like='Drug_DIN').iloc[:, 1:].values, axis=1)
df_merge.loc[mask, 'Drug'] = np.nan
Finally, to create the columns A, B, C ... you can use pd.get_dummies with set_index and then reset_index:
new_People_Id = pd.get_dummies(df_merge.set_index(['People_Id','Drug_ATC','Drug_DIN'])['Drug']).reset_index()
print (new_People_Id)
People_Id Drug_ATC Drug_DIN A B C
0 1001 N123 1 0 0
1 1001 N123 011235 0 0 0
2 1001 N32456 011235 0 0 1
3 1001 N111 0 0 0
4 1002 B5234 65413 0 0 0
5 1002 B5234 654090 0 1 0
6 1002 N123 011235 0 0 0
Note that you can also use join, such as:
new_People_Id = df_merge[['People_Id','Drug_ATC','Drug_DIN']].join(df_merge['Drug'].str.get_dummies())
which may be faster.
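One detail to watch with either get_dummies variant: indicator columns are only created for drugs that actually survive the exclusion step, so a drug that never matches would simply be missing. A sketch that guarantees one column per drug in the reference (assuming df_merge and df_drug_ref from above):
# Build the indicator columns, then reindex to the full drug list so that
# A, B and C all exist even if one never matched (missing ones filled with 0).
dummies = pd.get_dummies(df_merge['Drug'])
dummies = dummies.reindex(columns=df_drug_ref['Drug'], fill_value=0)
new_People_Id = df_merge[['People_Id', 'Drug_ATC', 'Drug_DIN']].join(dummies)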
This is a working solution using a function with apply and iterrows:
def check_rx_condition(row):
    # For each drug in the reference, flag the row's drug column with 1 when
    # the ATC code matches and the DIN is not in the exclusion list.
    for index, col in df_drug_ref.iterrows():
        if ((col['Drug_ATC_Id'] in row['Drug_ATC']) &
                (row['Drug_DIN'] not in col['Drug_DIN_Id_Exclusion'])):
            row[col['Drug']] = 1
        else:
            row[col['Drug']] = 0
    return row

df_row = df_row.apply(check_rx_condition, axis=1)

Pandas: filtering recurring payments

I've been stuck on this problem for a while. I want to filter out regular monthly payments from a table where the Beneficient and the Payer are the same and the Amount is equal. I am filtering out salaries.
Date Beneficient Payer Amount
2014-09-10 X A 3000
2014-09-15 X A 4000
2014-10-10 X A 3000
2014-10-11 X A 5500
2014-11-10 X A 3000
2014-09-11 Y B 7000
2014-09-14 Y B 8500
2014-10-11 Y B 7000
2014-10-16 Y B 8900
2014-11-11 Y B 7000
2014-11-17 Y B 8200
The desired result:
Date Beneficient Payer Amount
2014-09-10 X A 3000
2014-10-10 X A 3000
2014-11-10 X A 3000
2014-09-11 Y B 7000
2014-10-11 Y B 7000
2014-11-11 Y B 7000
Use duplicated, specifying the columns to check for dupes and keep=False to return all duplicate rows as a boolean mask, then filter by boolean indexing:
df = df[df.duplicated(subset=['Beneficient','Payer','Amount'], keep=False)]
print (df)
Date Beneficient Payer Amount
0 2014-09-10 X A 3000
2 2014-10-10 X A 3000
4 2014-11-10 X A 3000
5 2014-09-11 Y B 7000
7 2014-10-11 Y B 7000
9 2014-11-11 Y B 7000
Detail:
print (df.duplicated(subset=['Beneficient','Payer','Amount'], keep=False))
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 True
8 False
9 True
10 False
dtype: bool
More general solution:
The idea is to get the differences between consecutive dates within each group, replace the leading NaNs with 30, and compare.
One caveat: months have a different number of days, and February is the worst case, where the difference can be less than 30 or 31.
So in my opinion a general solution that always tolerates a +-1 day difference is not so easy.
df = df[df.duplicated(subset=['Beneficient','Payer','Amount'], keep=False)]
df = df.sort_values(['Beneficient','Payer','Amount','Date'])
cols = [df['Beneficient'], df['Payer'], df['Amount']]
df = df[df['Date'].groupby(cols).diff().dt.days.fillna(30).isin([30,31])]
print (df)
Date Beneficient Payer Amount
0 2014-09-10 X A 3000
2 2014-10-10 X A 3000
4 2014-11-10 X A 3000
5 2014-09-11 Y B 7000
7 2014-10-11 Y B 7000
9 2014-11-11 Y B 7000
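If "monthly" is read as consecutive calendar months rather than an exact 30/31-day gap, one way around the February caveat is to compare month numbers instead of day differences. A sketch of that idea as an alternative last filtering step, reusing the deduplicated, sorted df and the cols list from above (the month-number trick is an assumption, not part of the original answer):
# Convert each date to a running month number (year*12 + month), so that
# consecutive calendar months always differ by exactly 1, even across February.
d = pd.to_datetime(df['Date'])
month_no = d.dt.year * 12 + d.dt.month
df = df[month_no.groupby(cols).diff().fillna(1).eq(1)]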
To filter out those results into their own dataframe, while also keeping the original records, you want to use duplicated():
sub_df = df[df.duplicated(subset=['Beneficient','Payer','Amount'], keep=False)]
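The complement, i.e. the records that are not part of a recurring pattern, is just the negation of the same mask (a small sketch, assuming the same df):
# Rows whose (Beneficient, Payer, Amount) combination appears only once
rest_df = df[~df.duplicated(subset=['Beneficient', 'Payer', 'Amount'], keep=False)]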
