I am loading a dataframe from a csv file using,
df_c = pd.read_csv(os.path.join(dir_w, csv_c))
Which gives me,
SubCase Row1 Row2 Row3 Row4
0 1003001 NaN 0 NaN 10.0
1 1003002 NaN 0 NaN 10.5
2 1003003 NaN 0 NaN 11.3
3 2000001 110001.0 10 1.0 9.81
4 2000002 110001.0 10 1.0 5.06
For columns 'Row1' and 'Row2' I want to remove the decimal points, while keeping the decimal points in 'Row4'. So it looks like this,
SubCase Row1 Row2 Row3 Row4
0 1003001 NaN 0 NaN 10.0
1 1003002 NaN 0 NaN 10.5
2 1003003 NaN 0 NaN 11.3
3 2000001 110001 10 1 9.81
4 2000002 110001 10 1 5.06
I have tried the following code with no luck,
na_mask = df_c['Row1'].notnull()
df_c.loc[na_mask, 'Row1'] = df_c.loc[na_mask, 'Row1'].round(decimals=0)
and
na_mask = df_c['Row1'].notnull()
df_c.loc[na_mask, 'Row1'] = df_c.loc[na_mask, 'Row1'].astype(int)
Any ideas? Thanks in advance for any help.
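Not part of the original question, but for reference: .astype(int) raises on NaN, and assigning integers back into a float64 column upcasts them to float again, which is why the masked assignment keeps the .0. One workaround (a sketch, assuming pandas >= 0.24) is the nullable 'Int64' dtype, which stores whole numbers alongside missing values:

```python
import pandas as pd
import numpy as np

# Small stand-in for the CSV data in the question
df_c = pd.DataFrame({
    'Row1': [np.nan, np.nan, 110001.0, 110001.0],
    'Row4': [10.0, 10.5, 9.81, 5.06],
})

# The nullable 'Int64' dtype keeps the missing values while
# displaying whole numbers without a decimal point.
df_c['Row1'] = df_c['Row1'].astype('Int64')
print(df_c)
```

The same conversion would apply to 'Row2' and 'Row3', while 'Row4' is left as float64.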
I would like to reshape a dataframe whose first column should be used to group the other columns under an additional header row.
Initial dataframe
df = pd.DataFrame(
{
'col1':['A','A','A','B','B','B'],
'col2':[1,2,3,4,5,6],
'col3':[1,2,3,4,5,6],
'col4':[1,2,3,4,5,6],
'colx':[1,2,3,4,5,6]
}
)
Trial:
Using pd.pivot() I can create an example, but it does not match my expected output; the grouping seems to be flipped:
df.pivot(columns='col1', values=['col2','col3','col4','colx'])
col2 col3 col4 colx
col1 A B A B A B A B
0 1.0 NaN 1.0 NaN 1.0 NaN 1.0 NaN
1 2.0 NaN 2.0 NaN 2.0 NaN 2.0 NaN
2 3.0 NaN 3.0 NaN 3.0 NaN 3.0 NaN
3 NaN 4.0 NaN 4.0 NaN 4.0 NaN 4.0
4 NaN 5.0 NaN 5.0 NaN 5.0 NaN 5.0
5 NaN 6.0 NaN 6.0 NaN 6.0 NaN 6.0
Expected output:
A B
col1 col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
Create a counter column with GroupBy.cumcount, then use DataFrame.pivot, swap the levels of the MultiIndex in the columns with DataFrame.swaplevel, sort it, and finally remove the index and column names with DataFrame.rename_axis:
df = (df.assign(g = df.groupby('col1').cumcount())
.pivot(index='g', columns='col1')
.swaplevel(0,1,axis=1)
.sort_index(axis=1)
.rename_axis(index=None, columns=[None, None]))
print(df)
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
As an alternative to the classical pivot, you can concat the outputs of groupby using a dictionary comprehension, ensuring alignment with reset_index:
out = pd.concat({k: d.drop(columns='col1').reset_index(drop=True)
for k,d in df.groupby('col1')}, axis=1)
output:
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
Hello I have data as follows:
Col1 Col2 col3
A 2020-08-01 25
A 2020-10-01 56
A 2020-11-01 26
B 2020-06-01 32
B 2020-08-01 45
I want to create another column (col4) that, for each category in Col1, holds the col3 value from the row at least 2 months prior, as below:
Col1 Col2 col3 col4
A 2020-08-01 25 NaN
A 2020-10-01 56 25
A 2020-11-01 26 NaN
B 2020-06-01 32 NaN
B 2020-08-01 45 32
I tried pd.shift, but it's not working if I have missing months in the data. Can anyone please help?
Use np.where to conditionally fill col4 where the consecutive difference within each Col1 group is greater than or equal to 60 days (Col2 must be a datetime dtype):
df['col4'] = np.where(df.groupby('Col1')['Col2'].diff().dt.days.ge(60), df['col3'].shift(), np.nan)
Col1 Col2 col3 col4
0 A 2020-08-01 25 NaN
1 A 2020-10-01 56 25.0
2 A 2020-11-01 26 NaN
3 B 2020-06-01 32 NaN
4 B 2020-08-01 45 32.0
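For reference, a self-contained run of the above (a sketch; it assumes Col2 has been parsed to datetime, e.g. with pd.to_datetime):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'B'],
    'Col2': ['2020-08-01', '2020-10-01', '2020-11-01',
             '2020-06-01', '2020-08-01'],
    'col3': [25, 56, 26, 32, 45],
})
df['Col2'] = pd.to_datetime(df['Col2'])  # diff().dt.days needs datetimes

# Fill col4 with the previous col3 value only where the gap within
# each Col1 group is at least 60 days.
df['col4'] = np.where(df.groupby('Col1')['Col2'].diff().dt.days.ge(60),
                      df['col3'].shift(), np.nan)
print(df)
```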
I just could not figure this one out:
df.dropna(axis = 1, how="all").dropna(axis= 0 ,how="all")
All headers have data. How can I exclude the headers from a df.dropna(how="all") command?
I am afraid this is going to be trivial, but help me out guys.
Thanks,
Levi
Okay, as I understand it, what you want is as follows:
drop any column where all rows contain NaN
drop any row where all columns contain NaN
So for example, given a dataframe df like:
Id Col1 Col2 Col3 Col4
0 1 25.0 A NaN 6
1 2 15.0 B NaN 7
2 3 23.0 C NaN 8
3 4 5.0 D NaN 9
4 5 NaN E NaN 10
convert the dataframe by:
df.dropna(axis=1, how="all", inplace=True)
df.dropna(axis=0, how="all", inplace=True)
which yields:
Id Col1 Col2 Col4
0 1 25.0 A 6
1 2 15.0 B 7
2 3 23.0 C 8
3 4 5.0 D 9
4 5 NaN E 10
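The same can also be written without inplace by chaining the two calls (a minimal runnable sketch of the data above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Id':   [1, 2, 3, 4, 5],
    'Col1': [25.0, 15.0, 23.0, 5.0, np.nan],
    'Col2': list('ABCDE'),
    'Col3': [np.nan] * 5,
    'Col4': [6, 7, 8, 9, 10],
})

# how="all" only drops a column/row when *every* value in it is NaN,
# so the headers and partially filled rows are untouched.
out = df.dropna(axis=1, how="all").dropna(axis=0, how="all")
print(out)
```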
I have a dataframe with several columns, some of which contain NaN values. For each row, I would like to create another column containing the total number of value columns minus the number of NaN values before the first non-NaN value.
Original dataframe:
ID Value0 Value1 Value2 Value3
1 10 10 8 15
2 NaN 45 52 NaN
3 NaN NaN NaN NaN
4 NaN NaN 100 150
The extra column would look like:
ID NewColumn
1 4
2 3
3 0
4 2
Thanks in advance!
Set the index to ID
Attach a non-null column to stop/catch the argmax
Use argmax to find the first non-null value
Subtract those values from the length of the relevant columns
df.assign(
    NewColumn=df.shape[1] - 1 -
              df.set_index('ID').assign(notnull=1).notnull().values.argmax(1)
)
ID Value0 Value1 Value2 Value3 NewColumn
0 1 10.0 10.0 8.0 15.0 4
1 2 NaN 45.0 52.0 NaN 3
2 3 NaN NaN NaN NaN 0
3 4 NaN NaN 100.0 150.0 2
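A self-contained version of the same idea (the notnull sentinel column only exists so that argmax is well-defined for all-NaN rows):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Value0': [10, np.nan, np.nan, np.nan],
    'Value1': [10, 45, np.nan, np.nan],
    'Value2': [8, 52, np.nan, 100],
    'Value3': [15, np.nan, np.nan, 150],
})

values = df.set_index('ID')
# Append an all-True sentinel so argmax finds a hit even on all-NaN rows,
# then argmax returns the position of the first non-null value per row.
first_valid = values.assign(notnull=1).notnull().to_numpy().argmax(axis=1)
df['NewColumn'] = values.shape[1] - first_valid
print(df)
```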