Dropping columns with high missing values - python-3.x

I have a situation where I need to drop a lot of my dataframe columns where there are high missing values. I have created a new dataframe that gives me the missing values and the ratio of missing values from my original data set.
My original data set - data_merge2 looks like this :
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
The count data set looks like this that gives me the missing count and ratio:
missing_count missing_ratio
C 4 0.10
D 4 0.66
The code that I used to create the count dataset looks like :
#Only check those columns where there are missing values as we have got a lot of columns
new_df = (data_merge2.isna()
.sum()
.to_frame('missing_count')
.assign(missing_ratio = lambda x: x['missing_count']/len(data_merge2)*100)
.loc[data_merge2.isna().any()] )
print(new_df)
Now I want to drop the columns from the original dataframe whose missing ratio is >50%
How should I achieve this?

Use:
data_merge2.loc[:,data_merge2.count().div(len(data_merge2)).ge(0.5)]
#Alternative
#df[df.columns[df.count().mul(2).gt(len(df))]]
or DataFrame.drop using new_df DataFrame
data_merge2.drop(columns = new_df.index[new_df['missing_ratio'].gt(50)])
Output
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC

Adding another way with query and XOR:
data_merge2[data_merge2.columns ^ new_df.query('missing_ratio>50').index]
Or pandas way using Index.difference
data_merge2[data_merge2.columns.difference(new_df.query('missing_ratio>50').index)]
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC

Related

pyspark - assign non-null columns to new columns

I have a dataframe of the following scheme in pyspark:
user_id datadate page_1.A page_1.B page_1.C page_2.A page_2.B \
0 111 20220203 NaN NaN NaN NaN NaN
1 222 20220203 5 5 5 5.0 5.0
2 333 20220203 3 3 3 3.0 3.0
page_2.C page_3.A page_3.B page_3.C
0 NaN 1.0 1.0 2.0
1 5.0 NaN NaN NaN
2 4.0 NaN NaN NaN
So it contains columns like user_id, datadate, and few columns for each page (got 3 pages), which are the result of 2 joins. In this example, i have page_1, page_2, page_3, and each has 3 columns: A,B,C. Additionally, for each page columns, for each row, they will either be all null or all full, like in my example.
I don't care about the values of each of the columns per page, I just want to get for each row, the [A,B,C] values that are not null.
example for a wanted result table:
user_id datadate A B C
0 111 20220203 1 1 2
1 222 20220203 5 5 5
2 333 20220203 3 3 3
so the logic will be something like:
df[A] = page_1.A or page_2.A or page_3.A, whichever is not null
df[B] = page_1.B or page_2.B or page_3.B, whichever is not null
df[C] = page_1.C or page_2.C or page_3.C, whichever is not null
for all of the rows..
and of course, I would like to do it in an efficient way.
Thanks a lot.
You can use the sql functions greatest to extract the greatest values in a list of columns.
You can find the documentation here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.greatest.html
from pyspark.sql import functions as F
(df.withColumn('A', F.greates(F.col('page_1.A'), F.col('page_2.A), F.col('page_3.A'))
.withColumn('B', F.greates(F.col('page_1.B'), F.col('page_2.B), F.col('page_3.B'))
.select('userid', 'datadate', 'A', 'B'))

calculate different between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried getting the groupby at Col 1 level , but not getting the intended result. Can anyone help?
Use GroupBy.cumcount for counter pre column col 1 and reshape by DataFrame.set_index with Series.unstack, then use DataFrame.diff, remove first only NaNs columns by DataFrame.iloc, convert timedeltas to days by Series.dt.days per all columns and change columns names by DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1',df.groupby('col 1').cumcount()])['col 2']
.unstack()
.diff(axis=1)
.iloc[:, 1:]
.apply(lambda x: x.dt.days)
.add_prefix('diff_')
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff with counter for new columns by DataFrame.assign, reshape by DataFrame.pivot and remove NaNs by c2 with DataFrame.dropna:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
c1 = df.groupby('col 1')['col 2'].diff().dt.days)
.dropna(subset=['c1'])
.pivot('col 1','g','c1')
.add_prefix('diff_')
.rename_axis(None, axis=1)
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference in nan
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assing cumcount is like this
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

pandas groupby and widen dataframe with ordered columns

I have a long form dataframe that contains multiple samples and time points for each subject. The number of samples and timepoint can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id":[1,1,1,2,2,3],
"sample":["A", "B", "C", "D", "E", "F"],
"timepoint":[19,11,8,6,2,12],
"time_order":[3,2,1,2,1,1]
})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sorting MultiIndex in columns, flatten it and last convert subject_id to column:
df = (test_df.set_index(['subject_id', 'time_order'])
.unstack()
.sort_index(level=[1,0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
a=test_df.iloc[:,:3].groupby('subject_id').last().add_suffix('1')
b=test_df.iloc[:,:3].groupby('subject_id').nth(-2).add_suffix('2')
c=test_df.iloc[:,:3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b,c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
#Looking for missing data and then handling it accordingly
def find_missing(data):
# number of missing values
count_missing = data_final.isnull().sum().values
# total records
total = data_final.shape[0]
# percentage of missing
ratio_missing = count_missing/total
# return a dataframe to show: feature name, # of missing and % of missing
return pd.DataFrame(data={'missing_count':count_missing, 'missing_ratio':ratio_missing},
index=data.columns.values)
find_missing(data_final).head(5)
What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.
The data set looks like this
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
In the output I would just want to see :
missing_count missing_ratio
C 4 0.66
D 4 0.66
and not the columns A and B as there are no missing values there
Use DataFrame.isna with DataFrame.sum
to count by columns. We can also use DataFrame.isnull instead DataFrame.isna.
new_df = (df.isna()
.sum()
.to_frame('missing_count')
.assign(missing_ratio = lambda x: x['missing_count']/len(df))
.loc[df.isna().any()] )
print(new_df)
We can also use pd.concat instead DataFrame.assign
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
count.div(len(df))
.rename('missing_ratio')],axis = 1)
.loc[count.ne(0)])
Output
missing_count missing_ratio
A 1 0.166667
C 4 0.666667
D 4 0.666667
IIUC, we can assign the missing and total count to two variables do some basic math and assign back to a df.
a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0),2)
missing_df = pd.DataFrame({'missing_vals' : a,
'missing_ratio' : b})
print(missing_df)
missing_vals ratio
A 1 0.17
B 0 0.00
C 4 0.67
D 4 0.67
you can filter out columns that don't have any missing vals
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals ratio
A 1 0.17
C 4 0.67
D 4 0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s,s/len(df)],1)
result.columns = ["missing_count","missing_ratio"]
print (result)
missing_count missing_ratio
A 1 0.166667
B 0 0.000000
C 4 0.666667
D 4 0.666667

DataFrame difference between rows based on multiple columns

I am trying to calculate the difference between rows based on multiple columns. The data set is very large and I am pasting dummy data below that describes the problem:
if I want to calculate the daily difference in weight at a pet+name level. So far I have only come up with the solution of concatenating these columns and creating multiindex based on the new column and the date column. But I think there should be a better way. In the real dataset I have more than 3 columns I am using calculate row difference.
df['pet_name']=df.pet + df.name
df.set_index(['pet_name','date'],inplace = True)
df.sort_index(inplace=True)
df['diffs']=np.nan
for idx in t.index.levels[0]:
df.diffs[idx] = df.weight[idx].diff()
Base on your description , you can try groupby
df['pet_name']=df.pet + df.name
df.groupby('pet_name')['weight'].diff()
Use groupby by 2 columns:
df.groupby(['pet', 'name'])['weight'].diff()
All together:
#convert dates to datetimes
df['date'] = pd.to_datetime(df['date'])
#sorting
df = df.sort_values(['pet', 'name','date'])
#get differences per groups
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
Sample:
np.random.seed(123)
N = 100
L = list('abc')
df = pd.DataFrame({'pet': np.random.choice(L, N),
'name': np.random.choice(L, N),
'date': pd.Series(pd.date_range('2015-01-01', periods=int(N/10)))
.sample(N, replace=True),
'weight':np.random.rand(N)})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['pet', 'name','date'])
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
df['pet_name'] = df.pet + df.name
df = df.sort_values(['pet_name','date'])
df['diffs1'] = df.groupby(['pet_name', 'date'])['weight'].diff()
print (df.head(20))
date name pet weight diffs pet_name diffs1
1 2015-01-02 a a 0.105446 NaN aa NaN
2 2015-01-03 a a 0.845533 NaN aa NaN
2 2015-01-03 a a 0.980582 0.135049 aa 0.135049
2 2015-01-03 a a 0.443368 -0.537214 aa -0.537214
3 2015-01-04 a a 0.375186 NaN aa NaN
6 2015-01-07 a a 0.715601 NaN aa NaN
7 2015-01-08 a a 0.047340 NaN aa NaN
9 2015-01-10 a a 0.236600 NaN aa NaN
0 2015-01-01 b a 0.777162 NaN ab NaN
2 2015-01-03 b a 0.871683 NaN ab NaN
3 2015-01-04 b a 0.988329 NaN ab NaN
4 2015-01-05 b a 0.918397 NaN ab NaN
4 2015-01-05 b a 0.016119 -0.902279 ab -0.902279
5 2015-01-06 b a 0.095530 NaN ab NaN
5 2015-01-06 b a 0.894978 0.799449 ab 0.799449
5 2015-01-06 b a 0.365719 -0.529259 ab -0.529259
5 2015-01-06 b a 0.887593 0.521874 ab 0.521874
7 2015-01-08 b a 0.792299 NaN ab NaN
7 2015-01-08 b a 0.313669 -0.478630 ab -0.478630
7 2015-01-08 b a 0.281235 -0.032434 ab -0.032434

Resources