How to reshape dataframe by creating a grouping multiheader from specific column? - python-3.x

I would like to reshape a dataframe whose first column should be used to group the other columns under an additional header row.
Initial dataframe
df = pd.DataFrame(
{
'col1':['A','A','A','B','B','B'],
'col2':[1,2,3,4,5,6],
'col3':[1,2,3,4,5,6],
'col4':[1,2,3,4,5,6],
'colx':[1,2,3,4,5,6]
}
)
Trial:
Using pd.pivot() I can get close, but the result does not match my expected output; the grouping seems to be flipped:
df.pivot(columns='col1', values=['col2','col3','col4','colx'])
col2 col3 col4 colx
col1 A B A B A B A B
0 1.0 NaN 1.0 NaN 1.0 NaN 1.0 NaN
1 2.0 NaN 2.0 NaN 2.0 NaN 2.0 NaN
2 3.0 NaN 3.0 NaN 3.0 NaN 3.0 NaN
3 NaN 4.0 NaN 4.0 NaN 4.0 NaN 4.0
4 NaN 5.0 NaN 5.0 NaN 5.0 NaN 5.0
5 NaN 6.0 NaN 6.0 NaN 6.0 NaN 6.0
Expected output:
A B
col1 col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6

Create a counter column with GroupBy.cumcount, then use DataFrame.pivot, swap the levels of the column MultiIndex with DataFrame.swaplevel, sort the columns, and finally remove the index and column names with DataFrame.rename_axis:
df = (df.assign(g = df.groupby('col1').cumcount())
        .pivot(index='g', columns='col1')
        .swaplevel(0, 1, axis=1)
        .sort_index(axis=1)
        .rename_axis(index=None, columns=[None, None]))
print(df)
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6

As an alternative to the classical pivot, you can concat the output of groupby with a dictionary comprehension, ensuring alignment with reset_index:
out = pd.concat({k: d.drop(columns='col1').reset_index(drop=True)
                 for k, d in df.groupby('col1')}, axis=1)
output:
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
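Both approaches should produce the same frame; a quick cross-check, as a runnable sketch using the sample df from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col2': [1, 2, 3, 4, 5, 6],
    'col3': [1, 2, 3, 4, 5, 6],
    'col4': [1, 2, 3, 4, 5, 6],
    'colx': [1, 2, 3, 4, 5, 6],
})

# pivot-based reshape: counter column -> pivot -> swap/sort column levels
piv = (df.assign(g=df.groupby('col1').cumcount())
         .pivot(index='g', columns='col1')
         .swaplevel(0, 1, axis=1)
         .sort_index(axis=1)
         .rename_axis(index=None, columns=[None, None]))

# concat-based reshape: one sub-frame per group, stitched side by side
cat = pd.concat({k: d.drop(columns='col1').reset_index(drop=True)
                 for k, d in df.groupby('col1')}, axis=1)

# same labels and values in both results
pd.testing.assert_frame_equal(piv, cat, check_names=False, check_dtype=False)
```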

Related

Drop columns that have a header but all rows are empty Python 3 & Pandas

I just could not figure this one out:
df.dropna(axis = 1, how="all").dropna(axis= 0 ,how="all")
All headers have data. How can I exclude the headers from a df.dropna(how="all") command?
I am afraid this is going to be trivial, but help me out guys.
Thanks,
Levi
Okay, as I understand it, what you want is as follows:
drop any column where all rows contain NaN
drop any row where all values are NaN
So for example, given a dataframe df like:
Id Col1 Col2 Col3 Col4
0 1 25.0 A NaN 6
1 2 15.0 B NaN 7
2 3 23.0 C NaN 8
3 4 5.0 D NaN 9
4 5 NaN E NaN 10
convert the dataframe by:
df.dropna(axis=1, how="all", inplace=True)
df.dropna(axis=0, how="all", inplace=True)
which yields:
Id Col1 Col2 Col4
0 1 25.0 A 6
1 2 15.0 B 7
2 3 23.0 C 8
3 4 5.0 D 9
4 5 NaN E 10
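As a runnable sketch of the two calls on the sample data above (Col3 is dropped because every row in it is NaN; all rows survive because no row is entirely NaN):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Id':   [1, 2, 3, 4, 5],
    'Col1': [25.0, 15.0, 23.0, 5.0, np.nan],
    'Col2': ['A', 'B', 'C', 'D', 'E'],
    'Col3': [np.nan] * 5,
    'Col4': [6, 7, 8, 9, 10],
})

# drop all-NaN columns, then all-NaN rows; headers are never affected,
# since dropna only inspects the data rows
out = df.dropna(axis=1, how='all').dropna(axis=0, how='all')
```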

Replacing constant values with nan

import pandas as pd
data={'col1':[1,3,3,1,2,3,2,2]}
df=pd.DataFrame(data,columns=['col1'])
print(df)
col1
0 1
1 3
2 3
3 1
4 2
5 3
6 2
7 2
Expected result:
col1 newCol1
0 1 1
1 3 3
2 3 NaN
3 1 1
4 2 2
5 3 3
6 2 2
7 2 NaN
Try where combined with shift:
df['col2'] = df.col1.where(df.col1.ne(df.col1.shift()))
df
Out[191]:
col1 col2
0 1 1.0
1 3 3.0
2 3 NaN
3 1 1.0
4 2 2.0
5 3 3.0
6 2 2.0
7 2 NaN
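The mask marks positions where col1 differs from the previous row; spelling the one-liner out step by step, as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 3, 1, 2, 3, 2, 2]})

prev = df['col1'].shift()               # previous row's value (NaN for the first row)
changed = df['col1'].ne(prev)           # True where the value differs from the previous one
df['col2'] = df['col1'].where(changed)  # keep changed values, NaN out the repeats
```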

Getting average in pandas dataframe when there are NaN values

I have a pandas dataframe and I would like to add a row at the end of dataframe to show the average of each column; however, due to NaN values in Col2, Col3, and Col4, the mean function cannot return the correct average of the columns. How can I fix this issue?
Col1 Col2 Col3 Col4
1 A 11 10 NaN
2 B 14 NaN 15
3 C 45 16 0
4 D NaN 16 NaN
5 E 12 23 5
P.S. This is the dataframe after getting average (df.loc["mean"] = df.mean()):
Col1 Col2 Col3 Col4
1 A 11 10 NaN
2 B 14 NaN 15
3 C 45 16 0
4 D NaN 16 NaN
5 E 12 23 5
Mean NaN NaN NaN NaN
The problem is that the columns contain string representations rather than numbers, so first convert them to numeric with DataFrame.astype:
cols = ['Col2','Col3','Col4']
df[cols] = df[cols].astype(float)
df.loc["mean"] = df.mean()
print (df)
Col1 Col2 Col3 Col4
1 A 11.0 10.00 NaN
2 B 14.0 NaN 15.000000
3 C 45.0 16.00 0.000000
4 D NaN 16.00 NaN
5 E 12.0 23.00 5.000000
mean NaN 20.5 16.25 6.666667
Or, if there are some non-numeric values, use to_numeric with errors='coerce':
cols = ['Col2','Col3','Col4']
df[cols] = df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))
df.loc["mean"] = df.mean()
You can set skipna=True (the default) when calculating the mean:
df = df.mean(axis=0, skipna=True).rename('Mean').pipe(df.append)
print(df)
Col1 Col2 Col3 Col4
0 A 11.0 10.00 NaN
1 B 14.0 NaN 15.000000
2 C 45.0 16.00 0.000000
3 D NaN 16.00 NaN
4 E 12.0 23.00 5.000000
Mean NaN 20.5 16.25 6.666667
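Note that DataFrame.append was removed in pandas 2.0; an equivalent with pd.concat (using numeric_only=True so the string column Col1 is skipped) would be:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Col1': list('ABCDE'),
    'Col2': [11, 14, 45, np.nan, 12],
    'Col3': [10, np.nan, 16, 16, 23],
    'Col4': [np.nan, 15, 0, np.nan, 5],
})

# mean skips NaN by default; numeric_only=True ignores the string column
mean_row = df.mean(numeric_only=True).rename('Mean')
df = pd.concat([df, mean_row.to_frame().T])
```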

shifting a column down in a pandas dataframe

I have data in the following way
A B C
1 2 3
2 5 6
7 8 9
I want to change the dataframe into
A B C
2 3
1 5 6
2 8 9
3
One way would be to add a blank row to the dataframe and then use shift
# input df:
A B C
0 1 2 3
1 2 5 6
2 7 8 9
df.loc[len(df.index), :] = None
df['A'] = df.A.shift(1)
print (df)
A B C
0 NaN 2.0 3.0
1 1.0 5.0 6.0
2 2.0 8.0 9.0
3 7.0 NaN NaN
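Since the appended blank row and the shift both introduce NaN, the integer columns are upcast to float; if integer display matters, a sketch of the same steps followed by a conversion to the nullable Int64 dtype:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

df.loc[len(df.index), :] = None   # append a blank row (upcasts columns to float)
df['A'] = df['A'].shift(1)        # push column A down one row

# optional: keep integers with missing-value support instead of floats
df = df.astype('Int64')
```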

Number of NaN values before first non NaN value Python dataframe

I have a dataframe with several columns, some of which contain NaN values. For each row, I would like to create another column containing the total number of value columns minus the number of NaN values before the first non-NaN value.
Original dataframe:
ID Value0 Value1 Value2 Value3
1 10 10 8 15
2 NaN 45 52 NaN
3 NaN NaN NaN NaN
4 NaN NaN 100 150
The extra column would look like:
ID NewColumn
1 4
2 3
3 0
4 2
Thanks in advance!
Set the index to ID
Attach a non-null column to stop/catch the argmax
Use argmax to find the first non-null value
Subtract those values from the length of the relevant columns
df.assign(
    NewColumn=df.shape[1] - 1 -
              df.set_index('ID').assign(notnull=1).notnull().values.argmax(1)
)
ID Value0 Value1 Value2 Value3 NewColumn
0 1 10.0 10.0 8.0 15.0 4
1 2 NaN 45.0 52.0 NaN 3
2 3 NaN NaN NaN NaN 0
3 4 NaN NaN 100.0 150.0 2
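An equivalent cross-check without argmax: cummax over notna marks everything from the first non-null onward, so the leading-NaN run can be summed directly. A runnable sketch on the sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Value0': [10, np.nan, np.nan, np.nan],
    'Value1': [10, 45, np.nan, np.nan],
    'Value2': [8, 52, np.nan, 100],
    'Value3': [15, np.nan, np.nan, 150],
})

vals = df.set_index('ID')
# cummax turns each row into False...False True...True at the first non-null;
# summing the inverted mask counts the leading NaNs
leading_nan = (~vals.notna().cummax(axis=1)).sum(axis=1)
df['NewColumn'] = (vals.shape[1] - leading_nan).to_numpy()
```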
