Apply a customized function for extracting numbers from strings to multiple columns in Python - python-3.x

Given a dataset as follows:
id name value1 value2 value3
0 1 gz 1 6 st
1 2 sz 7-tower aka 278 3
2 3 wh NaN 67 s6.1
3 4 sh x3 45 34
I'd like to write a customized function to extract numbers from the value columns.
Here is the pseudocode I have written:
def extract_nums(row):
    return row.str.extract(r'(\d*\.?\d+)', expand=True)

df[['value1', 'value2', 'value3']] = df[['value1', 'value2', 'value3']].apply(extract_nums)
It raises an error, because with expand=True each call to str.extract returns a one-column DataFrame, and apply cannot assemble those DataFrames into a result:
ValueError: If using all scalar values, you must pass an index
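A quick way to see the difference between the two return types (a minimal sketch on a throwaway Series):
import pandas as pd

s = pd.Series(['7-tower aka 278', 'x3'])
print(type(s.str.extract(r'(\d*\.?\d+)', expand=True)))   # DataFrame
print(type(s.str.extract(r'(\d*\.?\d+)', expand=False)))  # Series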
Code that manipulates the columns one by one works, but it is not concise:
df['value1'] = df['value1'].str.extract(r'(\d*\.?\d+)', expand=True)
df['value2'] = df['value2'].str.extract(r'(\d*\.?\d+)', expand=True)
df['value3'] = df['value3'].str.extract(r'(\d*\.?\d+)', expand=True)
How could I write the code correctly? Thanks.

You can filter the value-like columns, then stack them and use str.extract to extract the numbers, followed by unstack to reshape:
c = df.filter(like='value').columns
df[c] = df[c].stack().str.extract(r'(\d*\.?\d+)', expand=False).unstack()
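To see what the intermediate steps do, here is a minimal sketch on a toy frame: stack flattens the columns into one Series with a (row, column) MultiIndex, extract then runs a single time over it, and unstack restores the original shape:
tmp = pd.DataFrame({'value1': ['a1', 'b22'], 'value2': ['c3.5', 'no digits']})
stacked = tmp.stack()  # Series with a (row, column) MultiIndex
print(stacked.str.extract(r'(\d*\.?\d+)', expand=False).unstack())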
Alternatively, you can try str.extract with apply:
c = df.filter(like='value').columns
df[c] = df[c].apply(lambda s: s.str.extract(r'(\d*\.?\d+)', expand=False))
Result:
id name value1 value2 value3
0 1 gz 1 6 NaN
1 2 sz 7 278 3
2 3 wh NaN 67 6.1
3 4 sh 3 45 34
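Note that str.extract returns strings; if numeric dtypes are needed afterwards, a short follow-up (assuming every extracted value should become a float, with NaN where nothing was found) is:
df[c] = df[c].apply(pd.to_numeric, errors='coerce')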

The following code also works:
cols = ['value1', 'value2', 'value3']
for col in cols:
    df[col] = df[col].str.extract(r'(\d*\.?\d+)', expand=True)
An alternative solution with a function:
def extract_nums(row):
    return row.str.extract(r'(\d*\.?\d+)', expand=False)

df[cols] = df[cols].apply(extract_nums)
Out:
id name value1 value2 value3
0 1 gz 1 6 NaN
1 2 sz 7 278 3
2 3 wh NaN 67 6.1
3 4 sh 3 45 34

Related

loops application in dataframe to find output

I have the following data:
dict={'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238]...................}
and
df = pd.DataFrame(dict)
In this manner I have 20 columns with 5 numerical entries in each column.
I want to have a new column where the value comes from the following logic:
0 A[0]*B[0] + A[0]*C[0] + A[0]*D[0] .......
1 A[1]*B[1] + A[1]*C[1] + A[1]*D[1] .......
2 A[2]*B[2] + A[2]*C[2] + A[2]*D[2] .......
I tried the following, but I cannot manually write out 20 columns, so I wanted to know how to apply a loop to get the desired output:
lst = []
for i in range(0, 5):
    j = df.A[i]*df.B[i] + df.A[i]*df.C[i] + .......
    lst.append(j)
A potential solution is the following. I am only using the example you posted, but it works fine for more columns. Your data df is:
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column D by first subsetting your dataframe:
add = df.loc[:, df.columns != 'A']
and then taking the sum of the remaining columns and multiplying it by column A (this works because A*B + A*C = A*(B + C)) in the following way:
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200
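Since add already holds every column except A, the same two lines scale unchanged to the 20-column case; here is a small sketch with a hypothetical extra column D and a hypothetical result column named total:
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20], 'C': [10, 20], 'D': [5, 6]})
add = df.loc[:, df.columns != 'A']   # every column except A
df['total'] = df['A'] * add.sum(axis=1)
print(df)   # total: 1*(10+10+5) = 25, 2*(20+20+6) = 92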

Highlight dataframe cells based on multiple conditions in Python

Given a small dataset as follows:
id room area situation
0 1 A-102 world under construction
1 2 NaN 24 under construction
2 3 B309 NaN NaN
3 4 C·102 25 under decoration
4 5 E_1089 hello under decoration
5 6 27 NaN under plan
6 7 27 NaN NaN
Thanks to the code from @jezrael at this link, I'm able to get the result I needed:
a = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$', na=False),
             None, 'incorrect room name')
b = np.where(df.area.str.contains(r'^\d+$', na=True),
             None, 'area is not a number')
c = np.where(df.situation.str.contains('under decoration', na=False),
             'decoration is in the content', None)
f = (lambda x: '; '.join(y for y in x if pd.notna(y))
     if any(pd.notna(np.array(x))) else np.nan)
df['check'] = [f(x) for x in zip(a, b, c)]
print(df)
id room area situation \
0 1 A-102 world under construction
1 2 NaN 24 under construction
2 3 B309 NaN NaN
3 4 C·102 25 under decoration
4 5 E_1089 hello under decoration
5 6 27 NaN under plan
6 7 27 NaN NaN
check
0 area is not a number
1 incorrect room name
2 NaN
3 incorrect room name; decoration is in the content
4 incorrect room name; area is not a number; dec...
5 NaN
6 NaN
But now I would like to go further and highlight the problematic cells in the room, area, and situation columns, then save the dataframe as an Excel file.
How could I do that, preferably in Pandas, or with other Python packages?
Thanks in advance.
The idea is to create a customized function that returns a DataFrame of styles, reusing the m1, m2, m3 boolean masks:
m1 = df.room.str.match(r'^[a-zA-Z\d\-]*$', na=False)
m2 = df.area.str.contains(r'^\d+$', na=True)
m3 = df.situation.str.contains('under decoration', na=False)

a = np.where(m1, None, 'incorrect room name')
b = np.where(m2, None, 'area is not a number')
c = np.where(m3, 'decoration is in the content', None)

f = (lambda x: '; '.join(y for y in x if pd.notna(y))
     if any(pd.notna(np.array(x))) else np.nan)
df['check'] = [f(x) for x in zip(a, b, c)]
print(df)

def highlight(x):
    c1 = 'background-color: yellow'
    # start from an empty styles frame with the same shape as the data
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # color the cells where the corresponding mask flags a problem
    df1['room'] = np.where(m1, '', c1)
    df1['area'] = np.where(m2, '', c1)
    df1['situation'] = np.where(m3, c1, '')
    return df1

df.style.apply(highlight, axis=None).to_excel('test.xlsx', index=False)
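One caveat on the export step: Styler.to_excel needs an Excel writer engine such as openpyxl installed; passing the engine explicitly makes that dependency visible (a hedged variant of the last line):
# assumes openpyxl is available (e.g. pip install openpyxl)
df.style.apply(highlight, axis=None).to_excel('test.xlsx', engine='openpyxl', index=False)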

pandas groupby and widen dataframe with ordered columns

I have a long form dataframe that contains multiple samples and time points for each subject. The number of samples and time points can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
                        columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back to a column:
df = (test_df.set_index(['subject_id', 'time_order'])
             .unstack()
             .sort_index(level=[1, 0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print(df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
Another option builds one sample/timepoint pair per time point with GroupBy.last and GroupBy.nth, relying on the rows being ordered from latest to earliest time_order within each subject, and concatenates the pieces side by side:
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN
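For completeness, the asker's pivot attempt can also be finished, starting from the original test_df, by pivoting both value columns at once and flattening the MultiIndex the same way (a sketch assuming a pandas version whose pivot accepts a list for values):
out = test_df.pivot(index='subject_id', columns='time_order',
                    values=['sample', 'timepoint'])
out = out.sort_index(level=[1, 0], axis=1)
out.columns = [f'{name}{order}' for name, order in out.columns]
print(out.reset_index())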

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
# Looking for missing data and then handling it accordingly
def find_missing(data):
    # number of missing values
    count_missing = data.isnull().sum().values
    # total records
    total = data.shape[0]
    # percentage of missing
    ratio_missing = count_missing / total
    # return a dataframe to show: feature name, # of missing and % of missing
    return pd.DataFrame(data={'missing_count': count_missing,
                              'missing_ratio': ratio_missing},
                        index=data.columns.values)

find_missing(data_final).head(5)
What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.
The data set looks like this
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
In the output I would just want to see:
missing_count missing_ratio
C 4 0.66
D 4 0.66
and not the columns A and B as there are no missing values there
Use DataFrame.isna with DataFrame.sum to count missing values by column. We can also use DataFrame.isnull instead of DataFrame.isna.
new_df = (df.isna()
            .sum()
            .to_frame('missing_count')
            .assign(missing_ratio=lambda x: x['missing_count'] / len(df))
            # df.isna().any() is a boolean Series indexed by column name,
            # so .loc keeps only the columns with at least one missing value
            .loc[df.isna().any()])
print(new_df)
We can also use pd.concat instead of DataFrame.assign:
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
                     count.div(len(df)).rename('missing_ratio')],
                    axis=1)
            .loc[count.ne(0)])
Output
missing_count missing_ratio
A 1 0.166667
C 4 0.666667
D 4 0.666667
IIUC, we can assign the missing and total counts to two variables, do some basic math, and assign the result back to a DataFrame.
a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0), 2)
missing_df = pd.DataFrame({'missing_vals': a,
                           'missing_ratio': b})
print(missing_df)
missing_vals missing_ratio
A 1 0.17
B 0 0.00
C 4 0.67
D 4 0.67
You can then filter out columns that don't have any missing values:
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals missing_ratio
A 1 0.17
C 4 0.67
D 4 0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s, s / len(df)], axis=1)
result.columns = ["missing_count", "missing_ratio"]
print(result)
missing_count missing_ratio
A 1 0.166667
B 0 0.000000
C 4 0.666667
D 4 0.666667
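As with the other approaches, the columns without missing values can then be dropped with a one-line filter:
result = result[result['missing_count'].ne(0)]
print(result)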

Checking if the value is there in any specified column of the same table

I wanted to check whether the value in a particular row of a column is present in another column of the same row.
df:
sno id1 id2 id3
1 1,2 7 1,2,7,22
2 2 8,9 2,8,9,15,17
3 1,5 6 1,5,6,17,33
4 4 4,12,18
5 9 9,14
output:
for a particular given row,
for i in sno:
    if id1 in id3:
        score = 50
    elif id2 in id3:
        score = 50
    if id1 in id3 and id2 in id3:
        score = 75
I finally want my score out of that logic.
You can convert all values to sets with split and then compare them with issubset; bool(a) is used to omit empty sets (created from missing values):
print (df)
sno id1 id2 id3
0 1 1,2 7 1,20,70,22
1 2 2 8,9 2,8,9,15,17
2 3 1,5 6 1,5,6,17,33
3 4 4 NaN 4,12,18
4 5 NaN 9 9,14
def convert(x):
    # split comma-separated strings into sets; missing values become empty sets
    return set(x.split(',')) if isinstance(x, str) else set()

cols = ['id1', 'id2', 'id3']
df1 = df[cols].applymap(convert)

m1 = np.array([a.issubset(b) and bool(a) for a, b in zip(df1['id1'], df1['id3'])])
m2 = np.array([a.issubset(b) and bool(a) for a, b in zip(df1['id2'], df1['id3'])])

# np.select evaluates the conditions in order, so the stricter "both match"
# case (75) must come before the "either matches" case (50)
df['new'] = np.select([m1 & m2, m1 | m2], [75, 50], np.nan)
print (df)
sno id1 id2 id3 new
0 1 1,2 7 1,20,70,22 NaN
1 2 2 8,9 2,8,9,15,17 75.0
2 3 1,5 6 1,5,6,17,33 75.0
3 4 4 NaN 4,12,18 50.0
4 5 NaN 9 9,14 50.0
