Populate text in empty rows above in Pandas dataframe - python-3.x

Below is an example dataframe using Pandas. Sorry about the table format.
I would like to take the text 'RF-543' in column 'Test Case' and populate it in the two empty rows above it.
Likewise, for the text 'RT-483' in column 'Test Case', I would like to populate it in the three empty rows above it.
I have over 900 rows of data with varying numbers of empty rows to be populated with the next non-empty row that follows. Any suggestions are appreciated.
Req#   Test Case
745    AB-003
348    AA-006
932
335
675    RF-543
348
988
444
124    RT-483
Regards,
Gus

If the empty cells are NaN, you can do something like the example below.
Create a test dataframe:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": ["apple", np.nan, np.nan, "banana", np.nan, "grape"]})
>>> df
        A
0   apple
1     NaN
2     NaN
3  banana
4     NaN
5   grape
Use the bfill method (equivalent to fillna(method='bfill'); the method argument was deprecated in pandas 2.1):
>>> df["A"] = df["A"].bfill()
>>> df
        A
0   apple
1  banana
2  banana
3  banana
4   grape
5   grape
If the fields are empty strings:
>>> df = pd.DataFrame({"A": ["apple", "", "", "banana", "", "grape"]})
>>> df
        A
0   apple
1
2
3  banana
4
5   grape
Convert the empty strings to NaN first (Series.replace is a simple way to do this):
>>> df["A"] = df["A"].replace("", np.nan)
>>> df
        A
0   apple
1     NaN
2     NaN
3  banana
4     NaN
5   grape
Then, use bfill() as above.
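For completeness, here is the backfill step applied to this frame, the same as in the NaN case:
>>> df["A"] = df["A"].bfill()
>>> df
        A
0   apple
1  banana
2  banana
3  banana
4   grape
5   grape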

Try something like this:
df.loc[df["Test Case"] == "", "Test Case"] = np.nan
df = df.bfill()
Output
Req# Test Case
0 745 AB-003
1 348 AA-006
2 932 RF-543
3 335 RF-543
4 675 RF-543
5 348 RT-483
6 988 RT-483
7 444 RT-483
8 124 RT-483
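If you prefer a one-liner, the two steps can be combined (a sketch, assuming the blanks are empty strings rather than NaN):
df["Test Case"] = df["Test Case"].replace("", np.nan).bfill()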

Related

Apply customized function for extracting numbers from string to multiple columns in Python

Given a dataset as follows:
id name value1 value2 value3
0 1 gz 1 6 st
1 2 sz 7-tower aka 278 3
2 3 wh NaN 67 s6.1
3 4 sh x3 45 34
I'd like to write a customized function to extract the numbers from the value columns.
Here is the pseudocode I have written:
def extract_nums(row):
    return row.str.extract(r'(\d*\.?\d+)', expand=True)
df[['value1', 'value2', 'value3']] = df[['value1', 'value2', 'value3']].apply(extract_nums)
It raises an error:
ValueError: If using all scalar values, you must pass an index
Code that manipulates the columns one by one works, but it is not concise:
df['value1'] = df['value1'].str.extract(r'(\d*\.?\d+)', expand=True)
df['value2'] = df['value2'].str.extract(r'(\d*\.?\d+)', expand=True)
df['value3'] = df['value3'].str.extract(r'(\d*\.?\d+)', expand=True)
How could I write the code correctly? Thanks.
You can filter the value-like columns, then stack them, extract the numbers with str.extract, and unstack to restore the original shape:
c = df.filter(like='value').columns
df[c] = df[c].stack().str.extract(r'(\d*\.?\d+)', expand=False).unstack()
Alternatively you can try str.extract with apply:
c = df.filter(like='value').columns
df[c] = df[c].apply(lambda s: s.str.extract(r'(\d*\.?\d+)', expand=False))
Result:
id name value1 value2 value3
0 1 gz 1 6 NaN
1 2 sz 7 278 3
2 3 wh NaN 67 6.1
3 4 sh 3 45 34
The following loop also works:
cols = ['value1', 'value2', 'value3']
for col in cols:
    df[col] = df[col].str.extract(r'(\d*\.?\d+)', expand=True)
Alternative solution with function:
def extract_nums(row):
    return row.str.extract(r'(\d*\.?\d+)', expand=False)
df[cols] = df[cols].apply(extract_nums)
Out:
id name value1 value2 value3
0 1 gz 1 6 NaN
1 2 sz 7 278 3
2 3 wh NaN 67 6.1
3 4 sh 3 45 34
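Note that str.extract returns strings, so the value columns are still object dtype afterward. If you need actual numbers, a conversion pass with pd.to_numeric should work (a sketch):
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')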

pandas groupby and widen dataframe with ordered columns

I have a long-form dataframe that contains multiple samples and time points for each subject. The number of samples and timepoints can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
                        columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for the pivoting, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back into a column:
df = (test_df.set_index(['subject_id', 'time_order'])
             .unstack()
             .sort_index(level=[1, 0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print(df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
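Because unstack introduces NaN for the missing combinations, the timepoint columns are upcast to float. If you want nullable integers back, something like convert_dtypes (available since pandas 1.0) should work:
df = df.convert_dtypes()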
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN
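Note that this groupby approach relies on each subject's rows already being ordered from latest to earliest time_order, as they happen to be in the sample data (it also relies on GroupBy.nth returning one row per group, which changed in recent pandas versions). If the ordering is not guaranteed, sorting first makes it safe (a sketch):
test_df = test_df.sort_values(['subject_id', 'time_order'], ascending=[True, False])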

Dropping columns with high missing values

I have a situation where I need to drop many of my dataframe columns that have a high proportion of missing values. I have created a new dataframe that gives me the missing counts and the ratio of missing values from my original data set.
My original data set - data_merge2 looks like this :
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
The count dataframe, which gives the missing count and the missing percentage, looks like this:
   missing_count  missing_ratio
C              4          66.67
D              4          66.67
The code that I used to create the count dataframe looks like this:
# Only check columns that have missing values, as there are a lot of columns
new_df = (data_merge2.isna()
                     .sum()
                     .to_frame('missing_count')
                     .assign(missing_ratio=lambda x: x['missing_count'] / len(data_merge2) * 100)
                     .loc[data_merge2.isna().any()])
print(new_df)
Now I want to drop the columns from the original dataframe whose missing ratio is greater than 50%.
How should I achieve this?
Use:
data_merge2.loc[:,data_merge2.count().div(len(data_merge2)).ge(0.5)]
#Alternative
#df[df.columns[df.count().mul(2).gt(len(df))]]
Or use DataFrame.drop with the new_df DataFrame:
data_merge2.drop(columns = new_df.index[new_df['missing_ratio'].gt(50)])
Output
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC
Adding another way with query and XOR (note that the ^ set operation on Index objects was deprecated in pandas 1.x; prefer Index.difference below on newer versions):
data_merge2[data_merge2.columns ^ new_df.query('missing_ratio>50').index]
Or the pandas-idiomatic way, using Index.difference:
data_merge2[data_merge2.columns.difference(new_df.query('missing_ratio>50').index)]
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC
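Yet another option is DataFrame.dropna with a thresh, keeping only columns that have at least 50% non-missing values (a sketch; thresh is the minimum number of non-NA values a column needs to survive):
data_merge2 = data_merge2.dropna(axis='columns', thresh=len(data_merge2) // 2)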

Checking if the value is there in any specified column of the same table

I want to check if the values of a particular row in one column are present in another column of the same table.
df:
sno  id1  id2  id3
1    1,2  7    1,2,7,22
2    2    8,9  2,8,9,15,17
3    1,5  6    1,5,6,17,33
4    4         4,12,18
5         9    9,14
output:
for a particular given row,
for i in sno:
    if id1 in id3:
        score = 50
    elif id2 in id3:
        score = 50
    if id1 in id3 and id2 in id3:
        score = 75
I finally want my score out of that logic.
You can convert all values to sets with split and then compare them with issubset; the extra and bool(a) omits empty sets (created from missing values). Note that in the sample below, row 0's id3 is changed to 1,20,70,22 so that the no-match case is demonstrated as well:
print (df)
sno id1 id2 id3
0 1 1,2 7 1,20,70,22
1 2 2 8,9 2,8,9,15,17
2 3 1,5 6 1,5,6,17,33
3 4 4 NaN 4,12,18
4 5 NaN 9 9,14
def convert(x):
    return set(x.split(',')) if isinstance(x, str) else set()
cols = ['id1', 'id2', 'id3']
df1 = df[cols].applymap(convert)
m1 = np.array([a.issubset(b) and bool(a) for a, b in zip(df1['id1'], df1['id3'])])
m2 = np.array([a.issubset(b) and bool(a) for a, b in zip(df1['id2'], df1['id3'])])
df['new'] = np.select([m1 & m2, m1 | m2], [75, 50], np.nan)
print (df)
sno id1 id2 id3 new
0 1 1,2 7 1,20,70,22 NaN
1 2 2 8,9 2,8,9,15,17 75.0
2 3 1,5 6 1,5,6,17,33 75.0
3 4 4 NaN 4,12,18 50.0
4 5 NaN 9 9,14 50.0
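If you would rather have an integer score with 0 for rows where neither column matches, the default argument of np.select can carry that (a sketch):
df['new'] = np.select([m1 & m2, m1 | m2], [75, 50], default=0)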

Convert string "nan" to numpy NaN in pandas Dataframe that contains Strings [duplicate]

I am working on a large dataset with many columns of different types. There is a mix of numeric values and strings, with some NULL values. I need to change the NULL values to blank or 0 depending on the type.
1 John 2 Doe 3 Mike 4 Orange 5 Stuff
9 NULL NULL NULL 8 NULL NULL Lemon 12 NULL
I want it to look like this,
1 John 2 Doe 3 Mike 4 Orange 5 Stuff
9 0 8 0 Lemon 12
I can do this for each individual, but since I am going to be pulling several extremely large datasets with hundreds of columns, I'd like to do this some other way.
Edit:
Types from Smaller Dataset,
Field1 object
Field2 object
Field3 object
Field4 object
Field5 object
Field6 object
Field7 object
Field8 object
Field9 object
Field10 float64
Field11 float64
Field12 float64
Field13 float64
Field14 float64
Field15 object
Field16 float64
Field17 object
Field18 object
Field19 float64
Field20 float64
Field21 int64
Use DataFrame.select_dtypes to get the numeric columns, fill their missing values with 0, then replace the missing values in all other columns with an empty string:
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 John 2.0 Doe 3 Mike 4.0 Orange 5 Stuff
1 9 NaN NaN NaN 8 NaN NaN Lemon 12 NaN
print (df.dtypes)
0 int64
1 object
2 float64
3 object
4 int64
5 object
6 float64
7 object
8 int64
9 object
dtype: object
c = df.select_dtypes(np.number).columns
df[c] = df[c].fillna(0)
df = df.fillna("")
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 John 2.0 Doe 3 Mike 4.0 Orange 5 Stuff
1 9 0.0 8 0.0 Lemon 12
Another solution is to create a dictionary for the replacement:
num_cols = df.select_dtypes(np.number).columns
d1 = dict.fromkeys(num_cols, 0)
d2 = dict.fromkeys(df.columns.difference(num_cols), "")
d = {**d1, **d2}
print (d)
{0: 0, 2: 0, 4: 0, 6: 0, 8: 0, 1: '', 3: '', 5: '', 7: '', 9: ''}
df = df.fillna(d)
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 John 2.0 Doe 3 Mike 4.0 Orange 5 Stuff
1 9 0.0 8 0.0 Lemon 12
You could try this to substitute a different value for each different column (A to C are numeric, while D is a string):
import pandas as pd
import numpy as np

df_pd = pd.DataFrame([[np.nan, 2, np.nan, '0'],
                      [3, 4, np.nan, '1'],
                      [np.nan, np.nan, np.nan, '5'],
                      [np.nan, 3, np.nan, np.nan]],
                     columns=list('ABCD'))
df_pd.fillna(value={'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': ''})
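For reference, the expected result of the fillna above, computed from the example data:
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0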
