How to stop sort_values sorting by column names alphabetically? - python-3.x

I am working with a pandas DataFrame in which some of the columns have no entries. I want to move all the empty columns to the end, and I manage to do it (see code below), but I also noticed that after sorting, the remaining columns were sorted by column name in descending alphabetical order. Can I prevent this from happening?
Input dataframe:
,colA,colB,colC,colD,colF
rowA,X,nan,nan,X,nan
rowB,nan,X,nan,nan,X
rowC,X,nan,nan,X,X
rowD,X,nan,nan,nan,nan
rowE,nan,X,nan,nan,X
Code:
import pandas as pd
df = pd.read_csv(r'q1.csv', dtype='str', index_col=0, na_values='nan')
ind = df.notnull().astype('int').any().sort_values(ascending= False).index
out = df.loc[:,ind]
out.to_csv(r'out.csv', na_rep= 'nan')
Output dataframe:
,colF,colD,colB,colA,colC
rowA,nan,X,nan,X,nan
rowB,X,nan,X,nan,nan
rowC,X,X,nan,X,nan
rowD,nan,nan,nan,X,nan
rowE,X,nan,X,nan,nan
Essentially, I want to keep the original order for all the other columns.
Thanks.

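If I understand correctly, you can try this. The sort key is only True/False (is the column entirely NaN?), so many columns tie. The default sort kind, quicksort, is not stable and may reorder tied columns, whereas kind='mergesort' is a stable sort that keeps them in their original order.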
m = df.isna().all().sort_values(kind='mergesort')
df_new = df[m.index]
Out[243]:
colA colB colD colF colC
rowA X NaN X NaN NaN
rowB NaN X NaN X NaN
rowC X NaN X X NaN
rowD X NaN NaN NaN NaN
rowE NaN X NaN X NaN
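The answer above can be reproduced end to end with the sample data from the question. This is a minimal sketch: it reads the CSV from an in-memory string (an assumption made so the example runs without `q1.csv`), and the names `csv_text`, `all_empty`, and `order` are illustrative only.

```python
import io

import pandas as pd

# Sample data from the question, read from a string instead of q1.csv.
csv_text = """,colA,colB,colC,colD,colF
rowA,X,nan,nan,X,nan
rowB,nan,X,nan,nan,X
rowC,X,nan,nan,X,X
rowD,X,nan,nan,nan,nan
rowE,nan,X,nan,nan,X
"""
df = pd.read_csv(io.StringIO(csv_text), dtype="str", index_col=0, na_values="nan")

# True for columns that contain only NaN, False for every other column.
all_empty = df.isna().all()

# False sorts before True; the stable sort keeps tied columns in their original order.
order = all_empty.sort_values(kind="mergesort").index
df_new = df[order]

print(list(df_new.columns))
```

Only `colC` is entirely empty here, so it is the only column that moves.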

Related

Assign array values to NaN Dataframe Pandas

I am trying to fill a DataFrame that initially contains only NaN values with the same number of values taken from an array. All the values in the dictionary leagueList (NFL, NBA, etc.) are individual DataFrames.
Sorry, I can't place them here as the post would become too long.
The idea behind the loop below is to run a paired t-test for every pair of leagues, comparing them on the column 'win_loss_ratio', and collect the resulting p-values (p_value).
The resulting p-values, which are exactly as many as the cells in the empty DataFrame, should replace the NaN values in it, but I am stuck on this part. How could this be accomplished?
leagueList={'NFL':NFL,'NBA':NBA,'NHL':NHL,'MLB':MLB}
df = pd.DataFrame(columns = leagueList, index = leagueList)
print(df)
NFL NBA NHL MLB
NFL NaN NaN NaN NaN
NBA NaN NaN NaN NaN
NHL NaN NaN NaN NaN
MLB NaN NaN NaN NaN
#Double loop for making all possible league combinations
for a in leagueList.values():
    for b in leagueList.values():
        df_comb=pd.merge(a,b,left_index=True,right_index=True,how='inner')
        teststat,p_value=stats.ttest_rel(df_comb[['win_loss_ratio_x']],df_comb[['win_loss_ratio_y']])
        print(p_value)
[nan]
[0.94179205]
[0.03088317]
[0.80206949]
[0.94179205]
[nan]
[0.02229705]
[0.95053998]
[0.03088317]
[0.02229705]
[nan]
[0.00070784]
[0.80206949]
[0.95053998]
[0.00070784]
[nan]
Put the p-values into a list, then either use .fillna or just construct the DataFrame straight away:
import pandas as pd
from scipy import stats
#some sample data
NFL = pd.DataFrame([.5,.6,.7], columns=['win_loss_ratio'])
NBA = pd.DataFrame([.7,.5,.3], columns=['win_loss_ratio'])
NHL = pd.DataFrame([.4,.3,.2], columns=['win_loss_ratio'])
MLB = pd.DataFrame([.9,.8,.9], columns=['win_loss_ratio'])
leagueList={'NFL':NFL,'NBA':NBA,'NHL':NHL,'MLB':MLB}
#Double loop for making all possible league combinations
rows = []
for a in leagueList.values():
    for b in leagueList.values():
        df_comb=pd.merge(a,b,left_index=True,right_index=True,how='inner')
        teststat,p_value=stats.ttest_rel(df_comb[['win_loss_ratio_x']],df_comb[['win_loss_ratio_y']])
        rows.append(p_value[0])
n=len(leagueList)
data = [rows[i * n:(i + 1) * n] for i in range((len(rows) + n - 1) // n )]
df = pd.DataFrame(data, columns = leagueList, index = leagueList)
Output:
print (df.to_string())
NFL NBA NHL MLB
NFL NaN 0.622036 0.12169 0.057191
NBA 0.622036 NaN 0.07418 0.092735
NHL 0.121690 0.074180 NaN 0.013560
MLB 0.057191 0.092735 0.01356 NaN
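As an alternative to slicing the flat list into chunks, the p-values can be reshaped into a square array with NumPy. The sketch below assumes the 16 p-values were collected row by row, in the same order the double loop produces them; it reuses the values printed in the question rather than recomputing them, and the names `leagues`, `p_values`, and `matrix` are illustrative.

```python
import numpy as np
import pandas as pd

leagues = ["NFL", "NBA", "NHL", "MLB"]
nan = np.nan

# p-values in the order printed by the question's double loop (row by row).
p_values = [
    nan, 0.94179205, 0.03088317, 0.80206949,
    0.94179205, nan, 0.02229705, 0.95053998,
    0.03088317, 0.02229705, nan, 0.00070784,
    0.80206949, 0.95053998, 0.00070784, nan,
]

# Reshape the flat list into an n x n array, one row per outer-loop league.
n = len(leagues)
matrix = np.array(p_values).reshape(n, n)

df = pd.DataFrame(matrix, index=leagues, columns=leagues)
print(df.to_string())
```

The reshape only works when the list holds exactly n * n values, so it also acts as a cheap sanity check on the loop.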

Summing up two columns of pandas dataframe ignoring NaN

I have a pandas dataframe as below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A"], 'col1':[np.nan, np.nan], 'col2':[np.nan, 5]})
df
ORDER col1 col2
0 A NaN NaN
1 A NaN 5.0
I want to create a column 'new' as sum(col1, col2) that ignores NaN when only one of the two columns is NaN.
If both columns are NaN, it should return NaN, as shown below.
I tried the code below and it works fine. Is there any way to achieve the same with just one line of code?
df['new'] = df[['col1', 'col2']].sum(axis = 1)
df['new'] = np.where(pd.isnull(df['col1']) & pd.isnull(df['col2']), np.nan, df['new'])
df
ORDER col1 col2 new
0 A NaN NaN NaN
1 A NaN 5.0 5.0
Use sum with min_count=1, which returns NaN for a row unless it contains at least one non-NaN value:
df['new'] = df[['col1','col2']].sum(axis=1,min_count=1)
Out[78]:
0 NaN
1 5.0
dtype: float64
Use the add function on the two columns, which takes a fill_value argument that lets you replace NaN. When both values are NaN, the result is still NaN:
df['col1'].add(df['col2'], fill_value=0)
0 NaN
1 5.0
dtype: float64
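Is this ok? Note that it also turns a genuine sum of 0 into NaN, so it is only safe if the real values can never add up to zero.
df['new'] = df[['col1', 'col2']].sum(axis = 1).replace(0,np.nan)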
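The three suggestions above can be compared side by side. This is a small sketch, not a benchmark: it adds a third row where both columns are 0.0 (an assumption, not part of the original data) to show where the approaches differ.

```python
import numpy as np
import pandas as pd

# Rows: both NaN, one NaN, both zero (the last row is added for illustration).
df = pd.DataFrame({"col1": [np.nan, np.nan, 0.0], "col2": [np.nan, 5.0, 0.0]})

# Requires at least one non-NaN value per row, otherwise returns NaN.
with_min_count = df[["col1", "col2"]].sum(axis=1, min_count=1)

# Treats a single NaN as 0; stays NaN when both values are NaN.
with_add = df["col1"].add(df["col2"], fill_value=0)

# Sums with NaN treated as 0, then turns every 0 result into NaN.
with_replace = df[["col1", "col2"]].sum(axis=1).replace(0, np.nan)

print(pd.DataFrame({"min_count": with_min_count, "add": with_add, "replace": with_replace}))
```

The first two agree on every row, while the `replace` version loses the genuine zero in the last row.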

Trying to append a single row of data to a pandas DataFrame, but instead adds rows for each field of input

I am trying to add a row of data to a pandas DataFrame, but it keeps adding a separate row for each piece of data. I feel I am missing something very simple and obvious, but what it is I do not know.
import pandas
colNames = ["ID", "Name", "Gender", "Height", "Weight"]
df1 = pandas.DataFrame(columns = colNames)
df1.set_index("ID", inplace=True, drop=False)
i = df1.shape[0]
person = [{"ID":i},{"Name":"Jack"},{"Gender":"Male"},{"Height":177},{"Weight":75}]
df1 = df1.append(pandas.DataFrame(person, columns=colNames))
print(df1)
Output:
ID Name Gender Height Weight
0 0.0 NaN NaN NaN NaN
1 NaN Jack NaN NaN NaN
2 NaN NaN Male NaN NaN
3 NaN NaN NaN 177.0 NaN
4 NaN NaN NaN NaN 75.0
You are using too many curly braces. Each pair of braces creates a separate Python dictionary, and each dictionary in the list becomes its own row. Put all of the data inside one pair of braces so that it forms a single dictionary. Change that line to:
person = [{"ID":i,"Name":"Jack","Gender":"Male","Height":177,"Weight":75}]
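On recent pandas versions the question's code needs one more change, because `DataFrame.append` was removed in pandas 2.0. A minimal sketch of the same idea using `pd.concat` follows; the second person ("Jill") and the variable names `rows` and `new_row` are made up for illustration.

```python
import pandas as pd

col_names = ["ID", "Name", "Gender", "Height", "Weight"]

# One dictionary per row: all fields of a person live in the same dict.
rows = [{"ID": 0, "Name": "Jack", "Gender": "Male", "Height": 177, "Weight": 75}]
df1 = pd.DataFrame(rows, columns=col_names)

# Append another single row with pd.concat instead of DataFrame.append.
# The values below are hypothetical sample data.
new_row = pd.DataFrame(
    [{"ID": df1.shape[0], "Name": "Jill", "Gender": "Female", "Height": 165, "Weight": 60}],
    columns=col_names,
)
df1 = pd.concat([df1, new_row], ignore_index=True)

print(df1)
```

When many rows are added in a loop, collecting the dictionaries in a list and building the DataFrame once is usually cheaper than concatenating repeatedly.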

How reindex_like function works with method "ffill" & "bfill"?

I have two DataFrames of shape (6,3) and (2,3). I want to reindex the second DataFrame like the first one and fill the resulting NaN values with either the ffill or the bfill method. My code is as follows:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randn(6,3),columns = ['Col1','Col2','Col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns = ['Col1','Col2','Col3'])
df2 = df2.reindex_like(df1,method='ffill')
But this code does not fill the missing rows; I get the following result:
Col1 Col2 Col3
0 0.578282 -0.199872 0.468505
1 1.086811 -0.707933 -0.924984
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
Any suggestions would be appreciated.
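One possible workaround, offered as a sketch rather than an explanation of why `reindex_like` behaved as shown: reindex on the row index only with `DataFrame.reindex` and `method='ffill'`, or reindex first and forward-fill afterwards. Both rely on the index being sorted, which holds for the default integer index used here. The names `filled` and `filled_alt` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 3), columns=["Col1", "Col2", "Col3"])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=["Col1", "Col2", "Col3"])

# Reindex the rows only; new labels take the value of the last earlier label.
filled = df2.reindex(df1.index, method="ffill")

# Equivalent two-step version: reindex to NaN rows, then forward-fill them.
filled_alt = df2.reindex(df1.index).ffill()

print(filled)
```

Using `method='bfill'` instead would leave rows 2 to 5 as NaN in this example, because there is no later label in `df2` to fill from.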

Pandas take value from columns if not NaN

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['One','Two',np.nan],
                   'B':[np.nan,np.nan,'Three'],
                  })
df
A B
0 One NaN
1 Two NaN
2 NaN Three
I'd like to create one column ('C') that takes the value of either 'A' or 'B', whichever is not NaN, like this:
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
Thanks in advance!
You can use combine_first:
df['C'] = df.A.combine_first(df.B)
print(df)
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
Or fillna:
df['C']= df.A.fillna(df.B)
print(df)
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
Or use nested np.where calls, which also let you supply a fallback value (e.g. 1) for rows where both A and B are NaN:
df['C'] = np.where(df.A.notnull(), df.A,np.where(df.B.notnull(), df.B, 1))
print(df)
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
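The difference between the approaches only shows up when both columns are NaN in the same row. The sketch below adds such a row (an assumption, not part of the original data) and uses a hypothetical fallback string `"missing"` in place of the `1` used above.

```python
import numpy as np
import pandas as pd

# Original data plus one extra row where both A and B are NaN.
df = pd.DataFrame({
    "A": ["One", "Two", np.nan, np.nan],
    "B": [np.nan, np.nan, "Three", np.nan],
})

# Take A where present, otherwise B; stays NaN when both are missing.
via_combine = df["A"].combine_first(df["B"])
via_fillna = df["A"].fillna(df["B"])

# Nested np.where: A, else B, else an explicit fallback value.
via_where = pd.Series(
    np.where(df["A"].notnull(), df["A"],
             np.where(df["B"].notnull(), df["B"], "missing")),
    index=df.index,
)

print(pd.DataFrame({"combine_first": via_combine, "fillna": via_fillna, "where": via_where}))
```

`combine_first` and `fillna` leave the last row as NaN, while the `np.where` version fills it with the fallback.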
