I have calculated the CDF for a data set in a pandas df and want to determine the respective percentile from the CDF chart.
Code for cdf:
import pandas as pd
import numpy as np

def cdf(x):
    df_1 = pmf(x)
    df1 = pd.DataFrame()
    df1['pmf'] = df_1['pmf'].sort_index()
    df1['x'] = df_1['x']
    df1['cdf'] = np.cumsum(df1['pmf'])
    return df1
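For reference, the cdf function above relies on a pmf helper that returns a dataframe with 'x' and 'pmf' columns. A minimal sketch of such a helper (my assumption; the original pmf is not shown) could be:
def pmf(x):
    # hypothetical helper: count each distinct value and normalise to probabilities
    counts = pd.Series(x).value_counts().sort_index()
    return pd.DataFrame({'x': counts.index, 'pmf': (counts / counts.sum()).values})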
This is the generated cdf df:
Now I want to write some simple logic to fetch the "x" value corresponding to a given cdf value, in order to determine the percentile.
I'd appreciate any help in this regard.
You can do it as below (use your dataframe's name in place of df):
df.loc[df['cdf'] == 0.999083, 'x']
output:
12.375
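Note that an exact equality match only works when the cdf value you pass matches a row exactly, which is fragile with floats. A more robust sketch (assuming the df is sorted by 'x' so that 'cdf' is non-decreasing) returns the smallest x whose cdf reaches the target probability:
import numpy as np

def percentile_from_cdf(df, p):
    # first row whose cumulative probability reaches p
    idx = np.searchsorted(df['cdf'].values, p)
    idx = min(idx, len(df) - 1)   # clamp in case p exceeds the last cdf value
    return df['x'].iloc[idx]

x_at_p = percentile_from_cdf(df, 0.999)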
I have a simple Pandas DataFrame with 3 columns. I am trying to transpose it and then rename the columns of the new dataframe, and I am having a bit of trouble.
df = pd.DataFrame({'TotalInvoicedPrice': [123],
'TotalProductCost': [18],
'ShippingCost': [5]})
I tried using
df =df.T
which transposes the DataFrame into:
TotalInvoicedPrice,123
TotalProductCost,18
ShippingCost,5
So now I have to add the column names "Metrics" and "Values" to this data frame.
I tried using
df.columns["Metrics","Values"]
but I'm getting errors.
What I need to get is DataFrame that looks like:
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
Let's reset the index, then set the column labels:
df.T.reset_index().set_axis(['Metrics', 'Values'], axis=1)
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
Maybe you can avoid the transpose operation (it carries a little performance overhead):
# your dataframe
df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})

# form lists from your column names and the first row's values
l1 = df.columns.values.tolist()
l2 = df.iloc[0].tolist()

# create the new dataframe
df2 = pd.DataFrame(list(zip(l1, l2)), columns=['Metrics', 'Values'])
print(df2)
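Another option that avoids both the transpose and the manual list building is DataFrame.melt, which produces the long format directly (a small sketch on the same df, with the column names you asked for):
df2 = df.melt(var_name='Metrics', value_name='Values')
print(df2)
#               Metrics  Values
# 0  TotalInvoicedPrice     123
# 1    TotalProductCost      18
# 2        ShippingCost       5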
I am new to Python and doing a time series analysis of stocks. I created a data frame of the rolling average of 5 stocks according to the percentage change in their close price, so this df has 5 columns, and I have another df with the rolling average of the percentage change in the index's closing price. I want to plot each individual stock column of the df against the index df. I wrote this code:
fig.add_subplot(5,1,1)
plt.plot(pctchange_RA['HUL'])
plt.plot(N50_RA)
fig.add_subplot(5,1,2)
plt.plot(pctchange_RA['IRCON'])
plt.plot(N50_RA)
fig.add_subplot(5,1,3)
plt.plot(pctchange_RA['JUBLFOOD'])
plt.plot(N50_RA)
fig.add_subplot(5,1,4)
plt.plot(pctchange_RA['PVR'])
plt.plot(N50_RA)
fig.add_subplot(5,1,5)
plt.plot(pctchange_RA['VOLTAS'])
plt.plot(N50_RA)
NOTE: pctchange_RA is a pandas df of 5 stocks and N50_RA is an index df with one column.
You can put your column names in a list and then just loop over it and create the subplots dynamically. Pseudocode would look like the following:
cols = ['HUL', 'IRCON', 'JUBLFOOD', 'PVR', 'VOLTAS']
for i, col in enumerate(cols):
    ax = fig.add_subplot(5, 1, i+1)
    ax.plot(pctchange_RA[col])
    ax.plot(N50_RA)
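Fleshing that out into a runnable sketch (figure size, labels and legend placement are my own choices, not requirements):
import matplotlib.pyplot as plt

cols = ['HUL', 'IRCON', 'JUBLFOOD', 'PVR', 'VOLTAS']
fig = plt.figure(figsize=(8, 12))
for i, col in enumerate(cols):
    ax = fig.add_subplot(5, 1, i + 1)
    ax.plot(pctchange_RA[col], label=col)   # the stock's rolling average
    ax.plot(N50_RA, label='N50')            # the index rolling average
    ax.legend(loc='upper right')
fig.tight_layout()
plt.show()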
I am importing columns from Excel and finding the correlation coefficients between these columns.
import pandas as pd
data = pd.read_excel('ExcelFileName.xlsx')
df = pd.DataFrame(data)
df.corr()
I need to show only the correlation coefficients falling between +0.6 and +1, or between -0.5 and -1.
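One way to do that (a sketch, assuming you want the out-of-range coefficients blanked out as NaN in the correlation matrix):
# keep coefficients in [0.6, 1] or [-1, -0.5]; blank out everything else
corr = df.corr()
mask = ((corr >= 0.6) & (corr <= 1)) | ((corr <= -0.5) & (corr >= -1))
filtered = corr.where(mask)
print(filtered)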
I am working on the titanic dataset.
I compute the mean of df['Age'] based on 'Sex' and 'Pclass', to fill NaNs in the df['Age'].
The code is the following:
import pandas as pd
df = pd.read_csv('train.csv')
df['Age'] = df.groupby(['Sex','Pclass'])['Age'].transform(lambda x:x.fillna(x.mean()))
This works fine but now, on the test set, I want to fill NaNs with the values of mean Age grouped by 'Sex' and 'Pclass' from the training set.
I can easily get the values with df.groupby(['Sex', 'Pclass'])['Age'].mean(), but I cannot figure out how to reuse these values to fill NaNs in the test dataframe.
Can anyone help me?
Use DataFrame.merge with a left join and then replace missing values via Series.fillna with DataFrame.pop:
mean = df1.groupby(['Sex', 'Pclass'], as_index=False)['Age'].mean()
df2 = df2.merge(mean, on=['Sex','Pclass'], how='left', suffixes=('','_'))
df2['Age'] = df2['Age'].fillna(df2.pop('Age_'))
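For concreteness, a sketch of how that slots into your setup, where df1 is the training set and df2 the test set ('test.csv' is an assumed filename):
import pandas as pd

df1 = pd.read_csv('train.csv')
df2 = pd.read_csv('test.csv')   # assumed filename for the test set

# group means computed on the training set only
mean = df1.groupby(['Sex', 'Pclass'], as_index=False)['Age'].mean()

# bring the training means onto the test rows, then use them to fill the NaNs
df2 = df2.merge(mean, on=['Sex', 'Pclass'], how='left', suffixes=('', '_'))
df2['Age'] = df2['Age'].fillna(df2.pop('Age_'))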
Problem: Attempting a groupby on a simple dataframe (downloadable csv) and then an agg to return aggregate values for the columns (size, sum, mean, std deviation). What seems like a simple task is giving an unexpectedly challenging error.
Top15.groupby('Continent')['Pop Est'].agg(np.mean, np.std...etc)
# returns
ValueError: No axis named <function std at 0x7f16841512f0> for object type <class 'pandas.core.series.Series'>
What I am trying to get is a df with the index set to continents and the columns ['size', 'sum', 'mean', 'std'].
Example Code
import pandas as pd
import numpy as np
# Create df
df = pd.DataFrame({'Country':['Australia','China','America','Germany'],'Pop Est':['123','234','345','456'],'Continent':['Asia','Asia','North America','Europe']})
# group and agg
df = df.groupby('Continent')['Pop Est'].agg('size','sum','np.mean','np.std')
Aggregations like mean and std only work on numeric values, so when you create your dataframe don't input your numbers as strings:
df = pd.DataFrame({'Country':['Australia','China','America','Germany'],'PopEst':[123,234,345,456],'Continent':['Asia','Asia','North America','Europe']})
I think this will get you what you want. (Your ValueError comes from passing the functions as separate positional arguments: pandas interprets the second one as the axis argument. Pass them to agg as a list instead.)
grouped = df.groupby('Continent')
grouped['PopEst'].agg(['size','sum','mean','std'])
size sum mean std
Continent
Asia 2 357 178.5 78.488853
Europe 1 456 456.0 NaN
North America 1 345 345.0 NaN
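If the column has already arrived as strings and you can't change how the dataframe is built, one option (a small sketch using pd.to_numeric) is to convert it before grouping:
# convert the string column to numbers, then aggregate as above
df['Pop Est'] = pd.to_numeric(df['Pop Est'])
result = df.groupby('Continent')['Pop Est'].agg(['size', 'sum', 'mean', 'std'])
print(result)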