Highlight dataframe cells based on multiple conditions in Python - python-3.x

Given a small dataset as follows:
id room area situation
0 1 A-102 world under construction
1 2 NaN 24 under construction
2 3 B309 NaN NaN
3 4 C·102 25 under decoration
4 5 E_1089 hello under decoration
5 6 27 NaN under plan
6 7 27 NaN NaN
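For reference, here is a hedged reconstruction of this sample frame (assuming the area values are stored as strings and the blanks are real missing values), so the snippets below can be run end to end:
import numpy as np
import pandas as pd

# reconstruction of the table above; 'hello'/'world' are kept as strings so
# the str-accessor checks below behave as shown
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'room': ['A-102', np.nan, 'B309', 'C·102', 'E_1089', '27', '27'],
    'area': ['world', '24', np.nan, '25', 'hello', np.nan, np.nan],
    'situation': ['under construction', 'under construction', np.nan,
                  'under decoration', 'under decoration', 'under plan', np.nan],
})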
Thanks to the code from @jezrael at this link, I'm able to get the result I need:
a = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$', na=False), None,
             'incorrect room name')
b = np.where(df.area.str.contains(r'^\d+$', na=True), None,
             'area is not a numbers')
c = np.where(df.situation.str.contains('under decoration', na=False),
             'decoration is in the content', None)
f = (lambda x: '; '.join(y for y in x if pd.notna(y))
     if any(pd.notna(np.array(x))) else np.nan)
df['check'] = [f(x) for x in zip(a, b, c)]
print(df)
id room area situation \
0 1 A-102 world under construction
1 2 NaN 24 under construction
2 3 B309 NaN NaN
3 4 C·102 25 under decoration
4 5 E_1089 hello under decoration
5 6 27 NaN under plan
6 7 27 NaN NaN
check
0 area is not a numbers
1 incorrect room name
2 NaN
3 incorrect room name;decoration is in the content
4 incorrect room name;area is not a numbers;deco...
5 NaN
6 NaN
But now I would like to go further and highlight the problematic cells in the room, area, and situation columns, then save the dataframe as an Excel file.
How could I do that with Pandas (preferably) or another Python package?
Thanks in advance.

The idea is to create a custom function that returns a DataFrame of styles, reusing the m1, m2, m3 boolean masks:
m1 = df.room.str.match(r'^[a-zA-Z\d\-]*$', na=False)
m2 = df.area.str.contains(r'^\d+$', na=True)
m3 = df.situation.str.contains('under decoration', na=False)

a = np.where(m1, None, 'incorrect room name')
b = np.where(m2, None, 'area is not a numbers')
c = np.where(m3, 'decoration is in the content', None)

f = (lambda x: '; '.join(y for y in x if pd.notna(y))
     if any(pd.notna(np.array(x))) else np.nan)
df['check'] = [f(x) for x in zip(a, b, c)]
print(df)

def highlight(x):
    c1 = 'background-color: yellow'
    # start from an all-empty frame of styles, then fill the three columns
    # from the same masks used for the check column
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    df1['room'] = np.where(m1, '', c1)
    df1['area'] = np.where(m2, '', c1)
    df1['situation'] = np.where(m3, c1, '')
    return df1

df.style.apply(highlight, axis=None).to_excel('test.xlsx', index=False)
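For completeness, a minimal alternative sketch that styles each column separately with Styler.apply(..., subset=...), reusing the same masks; writing the .xlsx assumes openpyxl is installed (and DataFrame.style itself needs jinja2):
# per-column variant: each lambda receives one column and returns a
# same-length array of CSS strings; subset= limits it to that column
c1 = 'background-color: yellow'
styled = (df.style
            .apply(lambda s: np.where(m1, '', c1), subset=['room'])
            .apply(lambda s: np.where(m2, '', c1), subset=['area'])
            .apply(lambda s: np.where(m3, c1, ''), subset=['situation']))
styled.to_excel('test_subset.xlsx', index=False)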

Related

Apply customized function for extracting numbers from string to multiple columns in Python

Given a dataset as follows:
id name value1 value2 value3
0 1 gz 1 6 st
1 2 sz 7-tower aka 278 3
2 3 wh NaN 67 s6.1
3 4 sh x3 45 34
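For reference, a hedged reconstruction of this frame (assuming the value columns hold strings), so the snippets below can be run as-is:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['gz', 'sz', 'wh', 'sh'],
    'value1': ['1', '7-tower aka', np.nan, 'x3'],
    'value2': ['6', '278', '67', '45'],
    'value3': ['st', '3', 's6.1', '34'],
})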
I'd like to write a customized function to extract the numbers from the value columns.
Here is the pseudocode I have written:
def extract_nums(row):
    return row.str.extract(r'(\d*\.?\d+)', expand=True)

df[['value1', 'value2', 'value3']] = df[['value1', 'value2', 'value3']].apply(extract_nums)
It raises an error:
ValueError: If using all scalar values, you must pass an index
Code that manipulates the columns one by one works, but it is not concise:
df['value1'] = df['value1'].str.extract(r'(\d*\.?\d+)', expand=True)
df['value2'] = df['value2'].str.extract(r'(\d*\.?\d+)', expand=True)
df['value3'] = df['value3'].str.extract(r'(\d*\.?\d+)', expand=True)
How could I write the code correctly? Thanks.
You can filter the value-like columns, then stack them and use str.extract to pull out the numbers, followed by unstack to restore the shape. Note that with expand=False, str.extract returns a Series rather than a DataFrame, which is why this works where the expand=True version in the question did not:
c = df.filter(like='value').columns
df[c] = df[c].stack().str.extract(r'(\d*\.?\d+)', expand=False).unstack()
Alternatively you can try str.extract with apply:
c = df.filter(like='value').columns
df[c] = df[c].apply(lambda s: s.str.extract(r'(\d*\.?\d+)', expand=False))
Result:
id name value1 value2 value3
0 1 gz 1 6 NaN
1 2 sz 7 278 3
2 3 wh NaN 67 6.1
3 4 sh 3 45 34
The following code also works out:
cols = ['value1', 'value2', 'value3']
for col in cols:
    df[col] = df[col].str.extract(r'(\d*\.?\d+)', expand=True)
An alternative solution with a function:
def extract_nums(row):
    return row.str.extract(r'(\d*\.?\d+)', expand=False)

df[cols] = df[cols].apply(extract_nums)
Out:
id name value1 value2 value3
0 1 gz 1 6 NaN
1 2 sz 7 278 3
2 3 wh NaN 67 6.1
3 4 sh 3 45 34

pandas groupby and widen dataframe with ordered columns

I have a long-form dataframe that contains multiple samples and time points for each subject. The number of samples and time points can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for the pivoting, sort the MultiIndex columns, flatten them, and finally convert subject_id back into a column:
df = (test_df.set_index(['subject_id', 'time_order'])
.unstack()
.sort_index(level=[1,0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
# positional alternative: within each subject_id the rows already run from the
# highest time_order down to 1, so last()/nth(-2)/nth(-3) pick time points 1-3
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN
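If the incoming row order is not guaranteed, a hedged variant is to sort first; this mirrors the snippet above and assumes a pandas version in which GroupBy.nth indexes its result by the group key, as the output shown implies:
# sort so that, within each subject_id, the last row is time_order 1,
# the second-to-last is time_order 2, and so on
ordered = test_df.sort_values(['subject_id', 'time_order'], ascending=[True, False])
a = ordered.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = ordered.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = ordered.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)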

MELT: multiple values without duplication

Can't be this hard. I have
df = pd.DataFrame({'id': [1, 2, 3], 'name': ['j', 'l', 'm'], 'mnt': ['f', 'p', 'p'],
                   'nt': ['b', 'w', 'e'], 'cost': [20, 30, 80], 'paid': [12, 23, 45]})
I need
import numpy as np
df1 = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3],
                    'name': ['j', 'l', 'm', 'j', 'l', 'm'],
                    't': ['f', 'p', 'p', 'b', 'w', 'e'],
                    'paid': [12, 23, 45, np.nan, np.nan, np.nan],
                    'cost': [20, 30, 80, np.nan, np.nan, np.nan]})
I have 45 columns to invert.
I tried
(df.set_index(['id', 'name'])
.rename_axis(['paid'], axis=1)
.stack().reset_index())
EDIT: I think the simplest approach here is to set the missing values based on the variable column after DataFrame.melt:
df2 = df.melt(['id', 'name','cost','paid'], value_name='t')
df2.loc[df2.pop('variable').eq('nt'), ['cost','paid']] = np.nan
print (df2)
id name cost paid t
0 1 j 20.0 12.0 f
1 2 l 30.0 23.0 p
2 3 m 80.0 45.0 p
3 1 j NaN NaN b
4 2 l NaN NaN w
5 3 m NaN NaN e
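The same result (column order aside) can also be sketched with plain concat, which makes the NaN filling explicit; this assumes the frame df defined in the question:
# keep cost/paid for the mnt block, drop them for the nt block so concat
# fills those rows with NaN, then stack the two blocks
part1 = df[['id', 'name', 'mnt', 'cost', 'paid']].rename(columns={'mnt': 't'})
part2 = df[['id', 'name', 'nt']].rename(columns={'nt': 't'})
df2 = pd.concat([part1, part2], ignore_index=True)
print(df2)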
Use lreshape, which works with a dictionary of lists specifying which columns are 'grouped' together:
df2 = pd.lreshape(df, {'t':['mnt','nt'], 'mon':['cost','paid']})
print (df2)
id name t mon
0 1 j f 20
1 2 l p 30
2 3 m p 80
3 1 j b 12
4 2 l w 23
5 3 m e 45

Copy and Paste Values Based on a Condition in Python

I am trying to populate column 'C' with values from column 'A' based on conditions in column 'B'. Example: if column 'B' is 'nan', then the row in column 'C' takes the value from column 'A'. If column 'B' is NOT 'nan', leave column 'C' as is (i.e. 'nan'). Next, the copied values should be removed from column 'A' (only the values that were copied from A to C).
Original Dataset:
index A B C
0 6 nan nan
1 6 nan nan
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 3 nan nan
8 4 nan nan
Output:
index A B C
0 nan nan 6
1 nan nan 6
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 nan nan 3
8 nan nan 4
Below is what I have tried so far, but it's not working.
def impute_unit(cols):
    Legal_Block = cols[0]
    Legal_Lot = cols[1]
    Legal_Unit = cols[2]

    if pd.isnull(Legal_Lot):
        return 3
    else:
        return Legal_Unit

bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
                                           'Legal_Unit']].apply(impute_unit, axis=1)
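For reproducibility, a hedged reconstruction of the example frame (assuming the 'nan' entries are real missing values rather than the string 'nan'), so the answer snippets below can be run directly:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [6, 6, 9, 9, 2, 2, 3, 3, 4],
    'B': [np.nan, np.nan, 3, 3, 8, 8, 4, np.nan, np.nan],
    'C': np.nan,  # scalar is broadcast to the full column
})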
Seems like you need
df['C'] = np.where(df.B.isna(), df.A, df.C)
df['A'] = np.where(df.B.isna(), np.nan, df.A)
A different, maybe fancier way is to swap the A and C values only where B is np.nan:
m = df.B.isna()
df.loc[m, ['A', 'C']] = df.loc[m, ['C', 'A']].values
In other words, replace
bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
                                           'Legal_Unit']].apply(impute_unit, axis=1)
with
bk_Final_tax['Legal_Unit'] = np.where(bk_Final_tax.Legal_Lot.isna(), bk_Final_tax.Legal_Block, bk_Final_tax.Legal_Unit)
bk_Final_tax['Legal_Block'] = np.where(bk_Final_tax.Legal_Lot.isna(), np.nan, bk_Final_tax.Legal_Block)
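One hedged caveat: if Legal_Lot (or column B in the example) actually stores the literal string 'nan' rather than real missing values, isna() will not catch it; build the mask with equality instead:
# only needed when the column holds the text 'nan' instead of np.nan
m = df.B.isna() | df.B.eq('nan')
df['C'] = np.where(m, df.A, df.C)
df['A'] = np.where(m, np.nan, df.A)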

How to use pandas df column value in if-else expression to calculate additional columns

I am trying to calculate additional metrics from existing pandas dataframe by using an if/else condition on existing column values.
if (df['Sell_Ind'] == 'N').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.CurrentPrice, axis=1).astype(float).round(2)
elif (df['Sell_Ind'] == 'Y').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.Sold_price, axis=1).astype(float).round(2)
else:
    df['MarketValue'] = df.apply(lambda row: 0)
For the if condition the MarketValue is calculated correctly, but for the elif condition it is not giving the correct value.
Can anyone point out what I am doing wrong in this code?
I think you need numpy.select; the if/elif in your code is evaluated only once for the whole frame (because of .any()), so a single branch is applied to every row. apply can be removed and the columns multiplied with mul:
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
Sample:
df = pd.DataFrame({'Sold_price':[7,8,9,4,2,3],
'SharesUnits':[1,3,5,7,1,0],
'CurrentPrice':[5,3,6,9,2,4],
'Sell_Ind':list('NNYYTT')})
#print (df)
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
print (df)
CurrentPrice Sell_Ind SharesUnits Sold_price MarketValue
0 5 N 1 7 5.0
1 3 N 3 8 9.0
2 6 Y 5 9 45.0
3 9 Y 7 4 28.0
4 2 T 1 2 0.0
5 4 T 0 3 0.0
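As a small design note, the same result can be sketched with nested np.where, reusing the masks and series above; np.select just scales better once there are more than two conditions:
# equivalent to np.select([m1, m2], [a, b], default=0) for two conditions
df['MarketValue'] = np.where(m1, a, np.where(m2, b, 0))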
