I want to find the 3rd percentile of the following data and then average the data. Given below is the data structure:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
96927 NaN
96928 NaN
96929 NaN
96930 NaN
96931 NaN
The relevant (non-NaN) data lies exactly in the rows 13240:61156.
Given below is my code:
import pandas as pd
import numpy as np

# Load the spreadsheet and select the column of interest
load_var = pd.read_excel(r'path\file name.xlsx')
a = load_var['column whose percentile is to be found']
print(a)

# Percentile ignoring NaN values
b = np.nanpercentile(a, 3)
print(b)
Please suggest the changes in the code.
Thank you.
Use Series.quantile with mean in Series.agg:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col': [7, 8, 9, 4, 2, 3, np.nan],
})

# Give the lambda a name so it shows up with a readable label in the result
f = lambda x: x.quantile(0.03)
f.__name__ = 'q'

s = df['col'].agg(['mean', f])
print(s)
mean 5.50
q 2.15
Name: col, dtype: float64
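Applied to the question's data, a minimal sketch (reusing the placeholder path and column name from the question) would be:
import pandas as pd

load_var = pd.read_excel(r'path\file name.xlsx')

f = lambda x: x.quantile(0.03)
f.__name__ = 'q'

# Both mean and quantile skip NaN values by default
s = load_var['column whose percentile is to be found'].agg(['mean', f])
print(s)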
I am trying to fill a DataFrame, which originally has NaN values, with the same number of values taken from an array. All the values in the dictionary leagueList (NFL, NBA, etc.) are individual DataFrames.
Sorry, I can't place them here as the post would become too long.
The idea behind the loop below is to get the series of paired t-tests (p-values) between all leagues, comparing them on a column called 'win_loss_ratio'.
The resulting array, which has the same number of values as the empty DataFrame, should be used to replace the NaN values in the DataFrame, but I am stuck on this part. How could this be accomplished?
leagueList = {'NFL': NFL, 'NBA': NBA, 'NHL': NHL, 'MLB': MLB}
df = pd.DataFrame(columns=leagueList, index=leagueList)
print(df)
NFL NBA NHL MLB
NFL NaN NaN NaN NaN
NBA NaN NaN NaN NaN
NHL NaN NaN NaN NaN
MLB NaN NaN NaN NaN
# Double loop for making all possible league combinations
for a in leagueList.values():
    for b in leagueList.values():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']], df_comb[['win_loss_ratio_y']])
        print(p_value)
[nan]
[0.94179205]
[0.03088317]
[0.80206949]
[0.94179205]
[nan]
[0.02229705]
[0.95053998]
[0.03088317]
[0.02229705]
[nan]
[0.00070784]
[0.80206949]
[0.95053998]
[0.00070784]
[nan]
Put the p-values into a list and either use .fillna on the empty frame or just construct the DataFrame directly:
import pandas as pd
from scipy import stats
# Some sample data
NFL = pd.DataFrame([.5, .6, .7], columns=['win_loss_ratio'])
NBA = pd.DataFrame([.7, .5, .3], columns=['win_loss_ratio'])
NHL = pd.DataFrame([.4, .3, .2], columns=['win_loss_ratio'])
MLB = pd.DataFrame([.9, .8, .9], columns=['win_loss_ratio'])

leagueList = {'NFL': NFL, 'NBA': NBA, 'NHL': NHL, 'MLB': MLB}

# Double loop for making all possible league combinations
rows = []
for a in leagueList.values():
    for b in leagueList.values():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']], df_comb[['win_loss_ratio_y']])
        rows.append(p_value[0])

# Chunk the flat list of p-values into an n x n grid
n = len(leagueList)
data = [rows[i * n:(i + 1) * n] for i in range((len(rows) + n - 1) // n)]
df = pd.DataFrame(data, columns=leagueList, index=leagueList)
Output:
print (df.to_string())
NFL NBA NHL MLB
NFL NaN 0.622036 0.12169 0.057191
NBA 0.622036 NaN 0.07418 0.092735
NHL 0.121690 0.074180 NaN 0.013560
MLB 0.057191 0.092735 0.01356 NaN
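The .fillna route mentioned above is also viable; here is a minimal sketch that instead writes each p-value straight into the pre-built empty frame with .loc, reusing the sample data above:
# Fill the empty frame in place instead of building it from a list
df = pd.DataFrame(columns=leagueList, index=leagueList)
for name_a, a in leagueList.items():
    for name_b, b in leagueList.items():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']], df_comb[['win_loss_ratio_y']])
        df.loc[name_a, name_b] = p_value[0]
df = df.astype(float)  # .loc assignment leaves object dtype behind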
I have a pandas DataFrame as below:
import pandas as pd
import numpy as np

df = pd.DataFrame({'ORDER': ["A", "A"], 'col1': [np.nan, np.nan], 'col2': [np.nan, 5]})
df
df
ORDER col1 col2
0 A NaN NaN
1 A NaN 5.0
I want to create a column 'new' as the sum of col1 and col2, ignoring NaN when only one of the columns is NaN.
If both columns are NaN, it should return NaN, as below.
I tried the code below and it works fine. Is there any way to achieve the same with just one line of code?
df['new'] = df[['col1', 'col2']].sum(axis=1)
df['new'] = np.where(pd.isnull(df['col1']) & pd.isnull(df['col2']), np.nan, df['new'])
df
ORDER col1 col2 new
0 A NaN NaN NaN
1 A NaN 5.0 5.0
Use sum with min_count=1, which returns NaN unless at least one of the values is non-NA:
df['new'] = df[['col1', 'col2']].sum(axis=1, min_count=1)
Out[78]:
0 NaN
1 5.0
dtype: float64
Use the add method on the two columns; its fill_value argument treats a NaN in one operand as 0 before adding, while a NaN in both operands still yields NaN:
df['col1'].add(df['col2'], fill_value=0)
0 NaN
1 5.0
dtype: float64
Is this ok?
df['new'] = df[['col1', 'col2']].sum(axis=1).replace(0, np.nan)
(Beware that this also turns a genuine sum of 0, e.g. col1=-5 and col2=5, into NaN.)
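A quick check of the three approaches on a frame that contains a legitimate zero sum (values invented here for illustration) shows where the replace(0, np.nan) shortcut diverges:
import pandas as pd
import numpy as np

demo = pd.DataFrame({'col1': [np.nan, np.nan, -5.0], 'col2': [np.nan, 5.0, 5.0]})

print(demo[['col1', 'col2']].sum(axis=1, min_count=1))       # NaN, 5.0, 0.0
print(demo['col1'].add(demo['col2'], fill_value=0))          # NaN, 5.0, 0.0
print(demo[['col1', 'col2']].sum(axis=1).replace(0, np.nan)) # NaN, 5.0, NaN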
How do I remove NaN values from a DataFrame in Python? I already tried dropna(), but that did not work for me. Also, is NaN different from nan? I am using pandas.
When printing the DataFrame, the values are shown as nan rather than NaN.
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
The values are literal 'nan' strings rather than real missing values, which is why dropna() did not work. You can replace them with np.nan using replace() and then use dropna():
import numpy as np

df = df.replace('nan', np.nan)  # turn the 'nan' strings into real missing values
df = df.dropna()
Update:
Original dataframe:
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
Applied df.replace('nan', np.nan):
1 2.11358 0.649067060588935
2 NaN 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 NaN
Applied df.dropna():
1 2.11358 0.649067060588935
3 2.10066 0.3653980276694516
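If the columns are meant to be numeric, an alternative worth noting (not from the original answer) is pd.to_numeric with errors='coerce', which turns any unparseable string, not just 'nan', into NaN in one pass:
import pandas as pd

# Coerce every column to numeric; entries that fail to parse become NaN
df = df.apply(pd.to_numeric, errors='coerce').dropna()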
I am trying to create a DataFrame whose columns come from two different DataFrames.
import pandas as pd
import numpy as np
from statsmodels import api as sm
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2016,12,2)
end = datetime.datetime.today()
df = web.get_data_yahoo(['F'], start, end)
df1 = web.get_data_yahoo(['^GSPC'], start, end)
df3 = pd.concat([df['Adj Close'], df1['Adj Close']])
With this I wanted to get df3 with two columns containing the Adj Close data. What I got instead is:
F ^GSPC
Date
2016-12-01 10.297861 NaN
2016-12-02 10.140451 NaN
2016-12-05 10.306145 NaN
2016-12-06 10.405562 NaN
2016-12-07 10.819797 NaN
... ... ...
2019-11-22 NaN 3110.290039
2019-11-25 NaN 3133.639893
2019-11-26 NaN 3140.520020
2019-11-27 NaN 3153.629883
2019-11-29 NaN 3140.979980
1508 rows × 2 columns
What do I need to do to get rid of the NaN values, and why are they there?
Add the parameter axis=1 to concatenate by columns in concat. By default concat stacks the frames vertically (axis=0), so each column is NaN on the rows that came from the other frame, which is exactly the pattern in your output:
df3 = pd.concat([df['Adj Close'], df1['Adj Close']], axis=1)
But I think your solution can be simplified by passing both tickers as a list to get_data_yahoo:
df3 = web.get_data_yahoo(['F', '^GSPC'], start, end)
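With multiple tickers, get_data_yahoo returns a frame with MultiIndex columns (at least in recent pandas-datareader versions; worth verifying on yours), so selecting the Adj Close level should give the two-column frame directly:
adj_close = df3['Adj Close']  # columns: F and ^GSPC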
Given is a Pandas DataFrame with a numerical index representing the relative number of months:
df = pd.DataFrame(columns=['A', 'B'], index=np.arange(1,100))
df
A B
1 NaN NaN
2 NaN NaN
3 NaN NaN
...
How can the index be converted to a DateTimeIndex by specifying a start date (e.g., 2018-11-01)?
magic_function(df, start='2018-11-01', delta='month')
A B
2018-11-01 NaN NaN
2018-12-01 NaN NaN
2019-01-01 NaN NaN
...
I would favor a general solution that also works with arbitrary deltas, e.g. daily or yearly series.
Use date_range:
idx = pd.date_range(start='2018-11-01', periods=len(df), freq='MS')
df.index = idx
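The same approach covers the arbitrary-delta requirement, since the freq string controls the step ('D' daily, 'MS' month start, 'YS' year start). A sketch of a hypothetical helper in the spirit of the question's magic_function:
import pandas as pd

def with_datetime_index(df, start, freq='MS'):
    # Replace a relative integer index with dates starting at `start`;
    # freq follows pandas offset aliases, e.g. 'D', 'MS', 'YS'.
    out = df.copy()
    out.index = pd.date_range(start=start, periods=len(out), freq=freq)
    return out

df_monthly = with_datetime_index(df, '2018-11-01', freq='MS')
df_daily = with_datetime_index(df, '2018-11-01', freq='D')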
I'm not sure with pandas, but with plain datetime can't you just do this? (The month arithmetic needs care so that a multiple of 12 doesn't produce month=0.)
import datetime

start = datetime.date(2018, 1, 1)
months = 15

# Count months from zero, then split the offset into whole years and a month
month_index = start.month - 1 + months
adjusted = start.replace(year=start.year + month_index // 12, month=month_index % 12 + 1)
# adjusted == datetime.date(2019, 4, 1)