Calculate 95th percentile of values with grouping variable - excel

I'm trying to calculate the 95th percentile for multiple water quality values grouped by watershed, for example:
Watershed WQ
50500101 62.370661
50500101 65.505046
50500101 58.741477
50500105 71.220034
50500105 57.917249
I reviewed this related question - Percentile for Each Observation w/r/t Grouping Variable. It seems very close to what I want to do, but it computes a percentile for EACH observation; I need one per grouping variable. So ideally:
Watershed WQ - 95th
50500101 x
50500105 y

This can be achieved using the plyr library. We specify the grouping variable Watershed and ask for the 95% quantile of WQ.
library(plyr)
#Random seed
set.seed(42)
#Sample data
dat <- data.frame(Watershed = sample(letters[1:2], 100, TRUE), WQ = rnorm(100))
#plyr call
ddply(dat, "Watershed", summarise, WQ95 = quantile(WQ, .95))
and the results
Watershed WQ95
1 a 1.353993
2 b 1.461711

I hope I understand your question correctly. Is this what you're looking for?
my.df <- data.frame(group = gl(3, 5), var = runif(15))
aggregate(my.df$var, by = list(my.df$group), FUN = function(x) quantile(x, probs = 0.95))
Group.1 x
1 1 0.6913747
2 2 0.8067847
3 3 0.9643744
EDIT
Based on Vincent's answer,
aggregate(my.df$var, by = list(my.df$group), FUN = quantile, probs = 0.95)
also works (you can skin a cat 1001 ways, I've been told). As a side note, you can specify a vector of desired probabilities, say c(0.1, 0.2, 0.3, ...), to get deciles. Or you can try the summary function for some predefined statistics.
aggregate(my.df$var, by = list(my.df$group), FUN = summary)

Use a combination of the tapply and quantile functions. For example, if your dataset looks like this:
DF <- data.frame('watershed'=sample(c('a','b','c','d'), 1000, replace=T), wq=rnorm(1000))
Use this:
with(DF, tapply(wq, watershed, quantile, probs=0.95))

In Excel, you're going to want to use an array formula to make this easy. I suggest the following:
{=PERCENTILE(IF($A$2:$A$6 = WatershedID, $B$2:$B$6), 0.95)}
Here column A holds the Watershed IDs, column B holds the WQ values, and WatershedID is a reference to the cell containing the watershed you want the percentile for.
Be sure to enter the formula as an array formula by pressing Ctrl+Shift+Enter.

Using the data.table package you can do:
library(data.table)
#Random seed
set.seed(42)
#Sample data
dt <- data.table(Watershed = sample(letters[1:2], 100, TRUE), WQ = rnorm(100))
dt[, .(WQ95 = quantile(WQ, 0.95, na.rm = TRUE)), by = Watershed]

Related

Implementing a cointegration portfolio in Python for 3 ETFs (EWA, EWC, IGE)

I'm trying to implement a mean-reverting portfolio using the strategies described in "Algorithmic Trading" by Dr. E.P. Chan. However, since the examples he uses are programmed in MATLAB, I'm having trouble translating them correctly to Python. I'm completely stuck trying to create a cointegrating portfolio of 3 ETFs. I think my problems begin when trying to determine the hedge ratios and then building the desired portfolio.
Any help or tips would be enormously useful.
So, I start by downloading the Adjusted prices and creating the W, X and Y Data Series. The time period I selected is 2007/07/22 through 2012/3/28.
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
import datetime
start = datetime.datetime(2007, 7, 22)
end = datetime.datetime(2012, 3, 28)
EWA = web.DataReader('EWA', 'yahoo', start, end)
EWC = web.DataReader('EWC', 'yahoo', start, end)
IGE = web.DataReader('IGE', 'yahoo', start, end)
w = IGE['Adj Close']
x = EWA['Adj Close']
y = EWC['Adj Close']
df = pd.DataFrame([w,x,y]).transpose()
df.columns = ['W','X','Y']
df.plot(figsize=(20,12))
from statsmodels.tsa.vector_ar.vecm import coint_johansen
y3 = df
j_results = coint_johansen(y3,0,1)
print(j_results.lr1)
print(j_results.cvt)
print(j_results.eig)
print(j_results.evec)
print(j_results.evec[:,0])
So then I'm supposed to build a portfolio by multiplying the eigenvector [0.30.., 1.36.., -1.35..] by the share prices of each instrument to get the y_port value. Afterwards I regress the daily change of this portfolio against the previous day's portfolio value, to be able to determine the half-life for the series.
I did this by just multiplying the eigenvector by the close prices; I don't know if this is where I went wrong.
hedge_ratios = j_results.evec[:,0]
y_port = (hedge_ratios * df).sum(axis=1)
y_port.plot(figsize=(20,12))
y_port_lag = y_port.shift(1)
y_port_lag[0]= 0
delta_y = y_port-y_port_lag
X = y_port_lag
Y = delta_y
X = sm.add_constant(X)
model = sm.OLS(Y, X)
regression_results = model.fit()
regression_results.summary()
So then I calculate the half-life, which is around 19 days.
halflife = -np.log(2)/regression_results.params[0]
halflife
And I define the number of units to hold based on the instructions on the book (the -Z value of the portfolio value, with a lookback window of 19 days based on the half-life).
num_units = -(y_port-y_port.rolling(19).mean())/y_port.rolling(19).std()
num_units.plot(figsize=(20,12))
So the next steps I take are:
Check to see if the dataframe is still correct.
Add the "Number of units to hold", which was calculated previously and is the negative Z score of the y_port value.
There was probably an easier way to multiply or do this, but I calculated the amount of $ I should hold for each instrument by multiplying the instrument price, by the hedge ratio given by the eigenvector, by the number of portfolio units to hold.
Finally I calculated each instrument's PNL by multiplying the daily change * the number of units I was holding.
The results are abysmal. Just losing all the way from beginning to end.
Where did I mess up? How can I properly multiply the values in the eigenvector, determine the number of positions to hold, and create the portfolio correctly?
Any assistance would be massively appreciated.
I don't know why but the num_units series was "Horizontal" and I had to transpose it before attaching it to the DataFrame.
num_units = num_units.transpose()
df['Portfolio Units'] = num_units
df
df['W $ Units'] = df['W']*hedge_ratios[0]*df['Portfolio Units']
df['X $ Units'] = df['X']*hedge_ratios[1]*df['Portfolio Units']
df['Y $ Units'] = df['Y']*hedge_ratios[2]*df['Portfolio Units']
positions = df[['W $ Units','X $ Units','Y $ Units']]
positions
pnl = pd.DataFrame()
pnl['W Pnl'] = (df['W']/df['W'].shift(1)-1)*df['W $ Units']
pnl['X Pnl'] = (df['X']/df['X'].shift(1)-1)*df['X $ Units']
pnl['Y Pnl'] = (df['Y']/df['Y'].shift(1)-1)*df['Y $ Units']
pnl['Total PNL'] = pnl.sum(axis=1)
pnl['Total PNL'].cumsum().plot(figsize=(20,12))
I know that if I just reverse my positions (not use -1 in the y_port), the results will change and I'll get a positive return. However, I want to know what I did wrong. Using -Z for a mean-reversion strategy makes sense, and I would like to know where I made the mistake so I can keep up with the rest of the book.
I think that you need to shift df['W $ Units'], df['X $ Units'] and df['Y $ Units'] by 1 as well, i.e. use df['Y $ Units'].shift(1) instead of df['Y $ Units'], for example.
The result you receive is not abysmal - it is unrealistic. Without shifting df['... $ Units'] you are looking ahead, using position sizes computed from data that is not yet available.
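Following that advice, a minimal sketch of the corrected PnL calculation, reusing the DataFrame and column names from the question (only the .shift(1) on the $ Units columns is new):
#Use yesterday's $ positions against today's returns to avoid look-ahead bias
pnl = pd.DataFrame()
pnl['W Pnl'] = (df['W']/df['W'].shift(1)-1)*df['W $ Units'].shift(1)
pnl['X Pnl'] = (df['X']/df['X'].shift(1)-1)*df['X $ Units'].shift(1)
pnl['Y Pnl'] = (df['Y']/df['Y'].shift(1)-1)*df['Y $ Units'].shift(1)
pnl['Total PNL'] = pnl.sum(axis=1)
pnl['Total PNL'].cumsum().plot(figsize=(20,12))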
I found some problems in part 4 and changed it as below:
positions = df[['W $ Units','X $ Units','Y $ Units']]
df5=df.iloc[:,0:3]
pnl=np.sum((positions.shift().values)*(df5.pct_change().values), axis=1)
ret=pnl/np.sum(np.abs(positions.shift()), axis=1)
plt.figure(figsize=(8,5))
plt.plot(np.cumprod(1+ret)-1)
print('APR=%f Sharpe=%f' % (np.prod(1+ret)**(252/len(ret))-1, np.sqrt(252)*np.mean(ret)/np.std(ret)))
As a result we get APR=0.130122 and Sharpe=1.518595 (daily returns plot not shown).

Functions to prevent repetition of code (Pandas for Python using Jupyter Notebook)

I am new to programming and would appreciate your help.
Trying to avoid repetition of code for querying on a pandas dataframe.
x1 is the dataframe with various column names such as Hypertension, Diabetes, Alcoholism, Handicap, Age_Group, Date_Appointment
Each of the disease columns listed above contains 0 (the patient does not have the disease) or 2/3/4 (different stages of the disease).
So when I filter on '!= 0' it lists the records for patients with that specific disease; each disease therefore filters out a different set of records.
I wrote the query below 4 times, replacing the word Hypertension with each of the other diseases, to get 4 different graphs, one per disease.
But it is not clean coding. I need help understanding which function could be used, and how to use it, to write just 1 query instead of 4.
hyp1 = x1.query('Hypertension != 0')
i1 = hyp1.groupby('Age_Group')['Hypertension'].value_counts().plot(kind = 'bar',label = 'Hypertension',figsize=(6, 6))
plt.title('Appointments Missed by Patients with Hypertension')
plt.xlabel('Hypertension Age_Group')
plt.ylabel('Appointments missed');
Below is another set I don't know how to condense.
print('Details of all appointments')
print('')
print(df.Date_Appointment.value_counts().sort_index())
print('')
print(df.Date_Appointment.describe())
print('')
print(df.Date_Appointment.value_counts().describe())
print('')
print('Mean = ', (round(df.Date_Appointment.value_counts().mean())))
print('Median = ', (round(df.Date_Appointment.value_counts().median())))
print('Mode = ', (df.Date_Appointment.value_counts().mode()))
Would appreciate your detailed response. Thank you in advance.
Create a dict of the desired columns (mapped to plot colors here)
Iterate through them
Use f-strings (e.g. f'{disease}')
diseases = {'Hypertension': 'red', 'Diabetes': 'blue', 'Alcoholism': 'green', 'Handicap': 'yellow'}
for disease, color in diseases.items():
    subset = x1.query(f'{disease} != 0')
    i1 = subset.groupby('Age_Group')[f'{disease}'].value_counts().plot(kind='bar', label=f'{disease}', figsize=(6, 6), color=color)
    plt.title(f'Appointments Missed by Patients with {disease}')
    plt.xlabel(f'{disease} Age Group')
    plt.ylabel('Appointments missed')
    plt.show()
Incidentally, this would be easier with sample data to work with.
For the second half, it's not clear what you want to condense or what you want to replace Date_Appointment with.
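That said, here is a minimal sketch of one way to condense the second block, assuming the goal is just to print the same set of summary statistics for an arbitrary column (the helper name summarize_column is hypothetical):
#Hypothetical helper: prints the same summary statistics for any column of a dataframe
def summarize_column(frame, column, title):
    counts = frame[column].value_counts()
    print(title)
    print('')
    print(counts.sort_index())
    print('')
    print(frame[column].describe())
    print('')
    print(counts.describe())
    print('')
    print('Mean = ', round(counts.mean()))
    print('Median = ', round(counts.median()))
    print('Mode = ', counts.mode())

summarize_column(df, 'Date_Appointment', 'Details of all appointments')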

Get feature names for dataframe.corr

I am using the cancer data set from sklearn and I need to find the correlations between features. I am able to find the correlated columns, but I am not able to present them in a "nice" way, so that they will be an input for Dataframe.drop.
Here is my code:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
corr = df.corr()
#filter to find correlations above 0.6
corr_triu = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
corr_triu = corr_triu.stack()
corr_result = corr_triu[corr_triu > 0.6]
print(corr_result)
df.drop(columns=[?])
IIUC, you want the columns that correlate with some other column in the dataset, i.e. drop the columns that don't appear in corr_result. So you'll want to get the unique variables from each level of the index of corr_result. There may be repeats, so take care of that as well, for example with sets:
corr_result.index = corr_result.index.remove_unused_levels()
corr_vars = set()
corr_vars.update(corr_result.index.unique(level=0))
corr_vars.update(corr_result.index.unique(level=1))
all_vars = set(df.columns)
df.drop(columns=all_vars - corr_vars)
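Equivalently, as a small usage note, you could keep the correlated columns directly instead of dropping the rest (df_reduced is just an illustrative name):
#Select only the columns that appear somewhere in corr_result
df_reduced = df[sorted(corr_vars)]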

Plot the distance between every two points in 2D

I have a table with three columns: the first column is the name of each point, the second column is numerical data (a mean), and the last column is the second column plus a fixed number. (Example table not shown.)
I want to plot this table so that I get a figure like the one below (figure not shown).
If it is possible, how can I plot it using either Microsoft Excel, Python (Bokeh), or R?
Alright, I only know how to do it in ggplot2, so I will answer for R here.
This method only works if the data frame is in the format you provided above.
I renamed your columns to Name.of.Method, Mean, and Mean.2.2.
Preparation
Loading csv data into R
df <- read.csv('yourdata.csv', sep = ',')
Change the column names (do this if you don't want to change the code below; otherwise you will need to go through each parameter to match your column names):
names(df) <- c("Name.of.Method", "Mean", "Mean.2.2")
Method 1 - Using geom_segment()
ggplot() +
geom_segment(data=df,aes(x = Mean,
y = Name.of.Method,
xend = Mean.2.2,
yend = Name.of.Method))
So as you can see, geom_segment allows us to specify the end position of the line (hence xend and yend).
However, it does not look similar to the image you have above.
The line shape seems to represent an error bar, and ggplot provides an error-bar function for that.
Method 2 - Using geom_errorbarh()
ggplot(df, aes(y = Name.of.Method, x = Mean)) +
geom_errorbarh(aes(xmin = Mean, xmax = Mean.2.2), linetype = 1, height = .2)
Usually we don't use this method just to draw a line, but its functionality fits your requirement. You can see that we use xmin and xmax to specify the two ends of the line.
The height argument adjusts the height of the caps at both ends of the line.
I would use hbar for this:
from bokeh.io import show, output_file
from bokeh.plotting import figure
output_file("intervals.html")
names = ["SMB", "DB", "SB", "TB"]
p = figure(y_range=names, plot_height=350)
p.hbar(y=names, left=[4,3,2,1], right=[6.2, 5.2, 4.2, 3.2], height=0.3)
show(p)
However, Whisker would also be an option if you really want whiskers instead of interval bars.
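For completeness, a minimal sketch of the Whisker variant, assuming the Whisker annotation from bokeh.models and the same data as the hbar example above (not verified against a specific Bokeh version):
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure
output_file("whiskers.html")
names = ["SMB", "DB", "SB", "TB"]
source = ColumnDataSource(data=dict(base=names, lower=[4, 3, 2, 1], upper=[6.2, 5.2, 4.2, 3.2]))
p = figure(y_range=names, plot_height=350)
#dimension="width" draws the whiskers horizontally (lower/upper are x values)
p.add_layout(Whisker(source=source, base="base", lower="lower", upper="upper", dimension="width"))
show(p)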

Spark - Optimize calculation time over a data frame, by using groupBy() instead of filter()

I have a data frame which contains different columns ('features').
My goal is to calculate column X statistical measures:
Mean, Standard Deviation, Variance
But I need to calculate all of those grouped by column Y.
E.g. get all rows where Y = 1 and calculate their mean, stddev, and variance,
then do the same for all rows where Y = 2, and so on.
My current implementation is:
print "For CONGESTION_FLAG = 0:"
log_df.filter(log_df[flag_col] == 0).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 1:"
log_df.filter(log_df[flag_col] == 1).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 2:"
log_df.filter(log_df[flag_col] == 2).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
I was told that the filter() approach is wasteful in terms of computation time, and was advised that to make these calculations run faster (I'm running this on a 1 GB data file) it would be better to use the groupBy() method.
Can someone please help me transform those lines to do the same calculations by using groupBy instead?
I got mixed up with the syntax and didn't manage to do so correctly.
Thanks.
Filter by itself is not wasteful. The problem is that you are calling it multiple times (once for each value), meaning you are scanning the data three times. The operation you are describing is best achieved with groupBy, which aggregates the data per value of the grouping column.
You could do something like this:
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"), pow(stddev(size_col),2).alias("pow"))
You might also get better performance by calculating stddev^2 after the aggregation (you should try it on your data):
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"))
agg_df2 = agg_df.withColumn("pow", agg_df["stddev"] * agg_df["stddev"])
You can:
log_df.groupBy(log_df[flag_col]).agg(
mean(size_col), stddev(size_col), pow(stddev(size_col), 2)
)
