PySpark Kernel Density Estimation over multiple groups in parallel

Given a dataset consisting of money transactions, I am trying to use kernel density estimation to form clusters of transactions by their transaction amount. To do this, I identify the local minima of the density and use these as boundaries for the different clusters. I am able to do this on the whole dataset.
However, now I want to again use KDE, but apply it to groups of data. That is, I want to estimate a separate kernel density for each group of transactions. The transactions are grouped on the basis of the counterparty bank account from which they are sent. Currently, I use a naïve approach where I just loop over all counterparties. However, this is very inefficient, and since I am using Spark I would like to be able to do this in parallel. I am not sure how to do this, as I am quite new to PySpark.
Any suggestions on how to do this?
Code that executes KDE over all data
import numpy as np
from pyspark.mllib.stat import KernelDensity
from pyspark.sql import functions as f
from scipy.signal import argrelextrema
from matplotlib.pyplot import plot
from bisect import bisect

# extract the transaction amounts as an RDD of floats
dat_rdd = sdf_pos.select("amount").rdd
dat_rdd_amounts = dat_rdd.map(lambda x: float(x[0]))

# fit the kernel density estimate and evaluate it on a fixed grid
kd = KernelDensity()
kd.setBandwidth(10.0)
kd.setSample(dat_rdd_amounts)
s = np.linspace(0, 3000, num=50)
e = kd.estimate(s)

# local minima of the density become the cluster boundaries
mi = argrelextrema(e, np.less)[0]
print("Minima:", s[mi])

# assign each transaction to the interval its amount falls into
minima_array = f.array([f.lit(i) for i in s[mi]])
user_func = f.udf(bisect)
sdf_pos = sdf_pos.withColumn("amount_group",
                             user_func(minima_array, f.col("amount")).cast('integer'))
Code that executes KDE separately for each group
counter_parties = sdf_pos.select("CP").distinct().collect()
sdf_pos = sdf_pos.withColumn("minima_array", f.array(f.lit(-1)))
dat_rdd = sdf_pos.select(["amount", "CP"]).rdd

for cp in counter_parties:
    # estimate the density for this counterparty's transactions only
    dat_rdd_amounts = dat_rdd.filter(lambda y: y[1] == cp[0]).map(lambda x: float(x[0]))
    kd = KernelDensity()
    kd.setBandwidth(10.0)
    kd.setSample(dat_rdd_amounts)
    s = np.linspace(0, 3000, num=50)
    e = kd.estimate(s)
    mi = argrelextrema(e, np.less)[0]

    # store this group's minima on every row belonging to the counterparty
    minima_array = f.array([f.lit(i) for i in s[mi]])
    sdf_pos = sdf_pos.withColumn("minima_array",
                                 f.when(f.col("CP") == cp[0], minima_array)
                                  .otherwise(f.col("minima_array")))

user_func = f.udf(bisect)
sdf_pos = sdf_pos.withColumn("amount_group", user_func(f.col("minima_array"), f.col("amount")))

Related

Implementing a cointegration portfolio in Python for 3 ETFs (EWA, EWC, IGE)

I'm trying to implement a mean-reverting portfolio using the strategies described in "Algorithmic Trading" by Dr. P.E. Chan. However, since the examples he uses are programmed in MATLAB, I'm having trouble translating them correctly to Python. I'm completely stuck trying to create a cointegrating portfolio using 3 ETFs. I think my problems begin when trying to determine the hedges, and then building the desired portfolio.
Any help or tips would be enormously useful.
So, I start by downloading the Adjusted prices and creating the W, X and Y Data Series. The time period I selected is 2007/07/22 through 2012/3/28.
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
import datetime
start = datetime.datetime(2007, 7, 22)
end = datetime.datetime(2012, 3, 28)
EWA = web.DataReader('EWA', 'yahoo', start, end)
EWC = web.DataReader('EWC', 'yahoo', start, end)
IGE = web.DataReader('IGE', 'yahoo', start, end)
w = IGE['Adj Close']
x = EWA['Adj Close']
y = EWC['Adj Close']
df = pd.DataFrame([w,x,y]).transpose()
df.columns = ['W','X','Y']
df.plot(figsize=(20,12))
from statsmodels.tsa.vector_ar.vecm import coint_johansen
y3 = df
j_results = coint_johansen(y3,0,1)
print(j_results.lr1)
print(j_results.cvt)
print(j_results.eig)
print(j_results.evec)
print(j_results.evec[:,0])
So then I'm supposed to build a portfolio by multiplying the eigenvector [0.30.., 1.36.., -1.35..] times the share prices of each instrument to get the y_port value. Afterwards I run a correlation test to determine the correlation between daily change in price of this portfolio and the last day's price change, to be able to determine the half-life for the series.
I did this by just multiplying the eigenvector times the close prices; I don't know if this is where I went wrong.
hedge_ratios = j_results.evec[:,0]
y_port = (hedge_ratios * df).sum(axis=1)
y_port.plot(figsize=(20,12))
y_port_lag = y_port.shift(1)
y_port_lag[0]= 0
delta_y = y_port-y_port_lag
X = y_port_lag
Y = delta_y
X = sm.add_constant(X)
model = sm.OLS(Y, X)
regression_results = model.fit()
regression_results.summary()
So then I calculate the half-life, which is around 19 days.
halflife = -np.log(2)/regression_results.params[0]
halflife
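For reference, the regression above fits the daily change on the lagged portfolio level, and the half-life follows from the estimated mean-reversion rate $\lambda$; this is the relation the code relies on:

$$\Delta y_t = c + \lambda\, y_{t-1} + \varepsilon_t, \qquad \text{half-life} = -\frac{\ln 2}{\lambda}$$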
And I define the number of units to hold based on the instructions on the book (the -Z value of the portfolio value, with a lookback window of 19 days based on the half-life).
num_units = -(y_port-y_port.rolling(19).mean())/y_port.rolling(19).std()
num_units.plot(figsize=(20,12))
So the next steps I take are:
Check to see if the dataframe is still correct.
Add the "Number of units to hold", which was calculated previously and is the negative Z score of the y_port value.
There was probably an easier way to multiply or do this, but I calculated the amount of $ I should hold for each instrument by multiplying the instrument price, by the hedge ratio given by the eigenvector, by the number of portfolio units to hold.
Finally I calculated each instrument's PNL by multiplying the daily change * the number of units I was holding.
The results are abysmal. Just losing all the way from beginning to end.
Where did I mess up? How can I properly multiply the values in the eigenvector, determine the number of positions to hold, and create the portfolio correctly?
Any assistance would be massively appreciated.
I don't know why but the num_units series was "Horizontal" and I had to transpose it before attaching it to the DataFrame.
num_units = num_units.transpose()
df['Portfolio Units'] = num_units
df
df['W $ Units'] = df['W']*hedge_ratios[0]*df['Portfolio Units']
df['X $ Units'] = df['X']*hedge_ratios[1]*df['Portfolio Units']
df['Y $ Units'] = df['Y']*hedge_ratios[2]*df['Portfolio Units']
positions = df[['W $ Units','X $ Units','Y $ Units']]
positions
pnl = pd.DataFrame()
pnl['W Pnl'] = (df['W']/df['W'].shift(1)-1)*df['W $ Units']
pnl['X Pnl'] = (df['X']/df['X'].shift(1)-1)*df['X $ Units']
pnl['Y Pnl'] = (df['Y']/df['Y'].shift(1)-1)*df['Y $ Units']
pnl['Total PNL'] = pnl.sum(axis=1)
pnl['Total PNL'].cumsum().plot(figsize=(20,12))
I know that if I just reverse my positions (not use -1 in the y_port), the results will change and I'll get a positive return. However, I want to know what I did wrong. Using -Z for a mean-reversion strategy makes sense, and I would like to know where I made the mistake, so I can keep up with the rest of the book.
I think that you need to shift df['W $ Units'], df['X $ Units'] and df['Y $ Units'] by 1 as well, i.e. use df['Y $ Units'].shift(1) instead of df['Y $ Units'], for example.
The result you receive is not abysmal - it is unrealistic. Without shifting df['... $ Units'] you are looking ahead and using data that is not yet available.
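For illustration, a minimal sketch of that change applied to the PnL lines from the question (the column names are the question's own):
pnl = pd.DataFrame()
# shift the dollar positions by one day so today's PnL uses yesterday's holdings
pnl['W Pnl'] = (df['W'] / df['W'].shift(1) - 1) * df['W $ Units'].shift(1)
pnl['X Pnl'] = (df['X'] / df['X'].shift(1) - 1) * df['X $ Units'].shift(1)
pnl['Y Pnl'] = (df['Y'] / df['Y'].shift(1) - 1) * df['Y $ Units'].shift(1)
pnl['Total PNL'] = pnl.sum(axis=1)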
I found some problems in part 4 and changed it as below:
positions = df[['W $ Units','X $ Units','Y $ Units']]
df5=df.iloc[:,0:3]
pnl=np.sum((positions.shift().values)*(df5.pct_change().values), axis=1)
ret=pnl/np.sum(np.abs(positions.shift()), axis=1)
plt.figure(figsize=(8,5))
plt.plot(np.cumprod(1+ret)-1)
print('APR=%f Sharpe=%f' % (np.prod(1+ret)**(252/len(ret))-1, np.sqrt(252)*np.mean(ret)/np.std(ret)))
As a result we have APR=0.130122 and Sharpe=1.518595.

Efficient way of calculating Monte Carlo results for different impact assessment methods in Brightway

I am trying to do a comparative Monte Carlo calculation with Brightway2 using different impact assessment methods. I thought about using the switch_method method to be more efficient, since the technosphere matrix is the same for a given iteration. However, I am getting an assertion error. Code to reproduce it could be something like this:
import brightway2 as bw

bw.projects.set_current('ei35')  # project with ecoinvent 3.5
db = bw.Database("ei_35cutoff")

# select two different transport activities to compare
activity_name = 'transport, freight, lorry >32 metric ton, EURO4'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE4 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE4['name'])
        break

activity_name = 'transport, freight, lorry >32 metric ton, EURO6'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE6 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE6['name'])
        break

demands = [{truckE4: 1}, {truckE6: 1}]

# impact assessment method:
recipe_midpoint = [method for method in bw.methods.keys()
                   if method[0] == "ReCiPe Midpoint (H)"]

mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
next(mc_mm)
If I try switch_method I get the assertion error:
mc_mm.switch_method(recipe_midpoint[1])
assert mc_mm.method==recipe_midpoint[1]
mc_mm.redo_lcia()
next(mc_mm)
Am I doing something wrong here?
I usually store characterization factor matrices in a temporary dict and multiply these CFs with the LCI resulting from MonteCarloLCA directly.
import brightway2 as bw
import numpy as np
# Generate objects for analysis
bw.projects.set_current("my_mcs")
my_db = bw.Database('db')
my_act = my_db.random()
my_demand = {my_act:1}
my_methods = [bw.methods.random() for _ in range(2)]
I wrote this simple function to get characterization factor matrices for the product system I will generate in the MonteCarloLCA. It uses a temporary "sacrificial LCA" object that will have the same A and B matrices as the MonteCarloLCA.
This may seem like a waste of time, but it is only done once, and it makes the Monte Carlo loop quicker and simpler.
def get_C_matrices(demand, list_of_methods):
    """Return a dict with {method tuple: cf_matrix} for a list of methods.

    Uses a "sacrificial LCA" with exactly the same demand as will be used
    in the MonteCarloLCA.
    """
    C_matrices = {}
    sacrificial_LCA = bw.LCA(demand)
    sacrificial_LCA.lci()
    for method in list_of_methods:
        sacrificial_LCA.switch_method(method)
        C_matrices[method] = sacrificial_LCA.characterization_matrix
    return C_matrices
Then:
# Create array that will store mc results.
# Shape is (number of methods, number of iteration)
my_iterations = 10
mc_scores = np.empty(shape=[len(my_methods), my_iterations])
# Instantiate MonteCarloLCA object
my_mc = bw.MonteCarloLCA(my_demand)
# Get characterization factor matrices
my_C_matrices = get_C_matrices(my_demand, my_methods)
# Generate results
for iteration in range(my_iterations):
    lci = next(my_mc)
    for i, m in enumerate(my_methods):
        mc_scores[i, iteration] = (my_C_matrices[m] * my_mc.inventory).sum()
All your results are in mc_scores. Each row corresponds to a method, each column to an MC iteration.
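If it helps to inspect them, a small follow-on sketch (using pandas, which is an assumption and not part of the original answer):
import pandas as pd

# one column per method, one row per Monte Carlo iteration
results_df = pd.DataFrame(mc_scores.T, columns=[str(m) for m in my_methods])
print(results_df.describe())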
Not very elegant, but try this:
iterations = 10
simulations = []
for _ in range(iterations):
    mc_mm = MonteCarloLCA(demands[0], recipe_midpoint[0])
    next(mc_mm)
    mcresults = []
    for i in demands:
        print(i)
        for m in recipe_midpoint[0:3]:
            mc_mm.switch_method(m)
            print(mc_mm.method)
            mc_mm.redo_lcia(i)
            print(mc_mm.score)
            mcresults.append(mc_mm.score)
    simulations.append(mcresults)
CC_truckE4 = [i[1] for i in simulations] # Climate Change, truck E4
CC_truckE6 = [i[1+3] for i in simulations] # Climate Change, truck E6
from matplotlib import pyplot as plt
plt.plot(CC_truckE4 , CC_truckE6, 'o')
If you then run the simulation twice for the same demand vector, by setting demands = [{truckE4: 1}, {truckE4: 1}], and plot the result, you should get a straight line. This means that you are doing dependent sampling and re-using the same tech matrix for each demand vector and for each LCIA. I am not 100% sure of this, but I hope it answers your question.

Using weighted adjacency matrices to calculate global efficiency of said matrix using networkx

I have been trying to study the impact on a network by looking at deletions of different combinations of nodes.
To study this I have used the networkx graph theory metric global efficiency. However, I found that the networkx code ignores weight when calculating global efficiency. So I went into the source code and added weight as a metric. It seems to be working and gives me different values than the non-weighted approach, but it is exceptionally slow (about 20 times slower).
How can I speed up these computations?
##The code I am running
import networkx
import numpy as np
from networkx import algorithms
from networkx.algorithms import efficiency
from networkx.algorithms.efficiency import global_efficiency
import pandas
data=pandas.read_csv("ones.csv")
lol = data.values.tolist()
data=pandas.read_csv("twos.csv")
lol2 = data.values.tolist()
combo=[["10pp", "10d"]]
GE_list=[]
for row in combo:
    values = row
    datasafe = pandas.read_csv("b1.csv", index_col=0)
    datasafe.loc[values, :] = 0
    datasafe[values] = 0
    g = networkx.from_pandas_adjacency(datasafe)
    ge = global_efficiency(g)
    GE_list.append(ge)
extra=[""]
extra2=["full"]
combo.append(extra)
combo.append(extra2)
datasafe=pandas.read_csv("b1.csv", index_col=0)
g=networkx.from_pandas_adjacency(datasafe)
ge=global_efficiency(g)
GE_list.append(ge)
values = ["s6-8","p9-46v","p47r","p10p","IFSp","IFSa",'IFJp','IFJa','i6-8','a9-46v','a47r','a10p','9p','9a','9-46d','8C','8BL','8AV','8AD','47s','47L','10pp','10d','46','45','44']
datasafe=pandas.read_csv("b1.csv", index_col=0)
datasafe.loc[values, :] = 0
datasafe[values] = 0
g=networkx.from_pandas_adjacency(datasafe)
ge=global_efficiency(g)
GE_list.append(ge)
output=pandas.DataFrame(list(zip(combo, GE_list)))
output.to_csv('delete 1.csv',index=None)
##The change I made to the original networkx code
try:
    eff = 1 / nx.shortest_path_length(G, u, v)
## changed to
try:
    eff = 1 / nx.shortest_path_length(G, u, v, weight='weight')
Previously, with my unweighted graphs, I was able to process my data in 2 hours; currently it's taking the same time to do a twentieth of the data. Please do suggest any improvements to my code or any other pieces of code that I can run.
PS: I don't have a great understanding of Python, so please bear with me :)
Using weights, you exchange breadth-first search for Dijkstra's algorithm, which increases the runtime by a factor of log|V|; see the second comment of https://stackoverflow.com/a/25449911
If you have a problem with the runtime, you should rather replace networkx, which is implemented in Python, with a C implementation such as graph-tool or igraph; see e.g. https://graph-tool.skewed.de/performance for a (probably biased) performance comparison.
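If you do stay with networkx, note that patching the pairwise call runs Dijkstra once per node pair; running it once per source node is much cheaper. A minimal sketch of that idea (this is not the library's own implementation, and the 'weight' attribute name is an assumption):
import networkx as nx

def weighted_global_efficiency(G, weight="weight"):
    """Average inverse shortest-path length, with one Dijkstra run per source."""
    n = len(G)
    if n < 2:
        return 0.0
    inv_sum = 0.0
    for u, dist in nx.all_pairs_dijkstra_path_length(G, weight=weight):
        for v, d in dist.items():
            if u != v and d > 0:
                inv_sum += 1.0 / d   # unreachable pairs simply contribute nothing
    return inv_sum / (n * (n - 1))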

Reprojecting Xarray Dataset

I'm trying to reproject a Lambert Conformal dataset to Plate Carree. I know that this can easily be done visually using cartopy. However, I'm trying to create a new dataset rather than just show a reprojected image. Below is the methodology I have mapped out, but I'm unable to subset the dataset properly (Python 3.5, macOS).
from siphon.catalog import TDSCatalog
import xarray as xr
from xarray.backends import NetCDF4DataStore
import numpy as np
import cartopy.crs as ccrs
from scipy.interpolate import griddata
import numpy.ma as ma
from pyproj import Proj, transform
import metpy
# Declare bounding box
min_lon = -78
min_lat = 36
max_lat = 40
max_lon = -72
boundinglat = [min_lat, max_lat]
boundinglon = [min_lon, max_lon]
# Load the dataset
cat = TDSCatalog('https://thredds.ucar.edu/thredds/catalog/grib/NCEP/HRRR/CONUS_2p5km/latest.xml')
dataset_name = sorted(cat.datasets.keys())[-1]
dataset = cat.datasets[dataset_name]
ds = dataset.remote_access(service='OPENDAP')
ds = NetCDF4DataStore(ds)
ds = xr.open_dataset(ds)
# parse the temperature at 850 hPa and the 0z reftime
tempiso = ds.metpy.parse_cf('Temperature_isobaric')
t850 = tempiso[0][2]
# transform bounding lat/lons to src_proj
src_proj = tempiso.metpy.cartopy_crs #aka lambert conformal conical
extents = src_proj.transform_points(ccrs.PlateCarree(), np.array(boundinglon), np.array(boundinglat))
# subset the data using the indexes of the closest values to the src_proj extents
t850_subset = t850[(np.abs(tempiso.y.values - extents[1][0])).argmin():
                   (np.abs(tempiso.y.values - extents[1][1])).argmin()][
                   (np.abs(tempiso.x.values - extents[0][1])).argmin():
                   (np.abs(tempiso.x.values - extents[0][0])).argmin()]
# t850_subset should be a small, reshaped dataset, but its shape is 0x2145
# now use np.linspace, np.meshgrid & scipy.interpolate to reproject
My "transform points, then find the nearest value" subsetting isn't working: it claims the closest points are outside the extent of the dataset. As noted, I plan to use np.linspace, np.meshgrid and scipy.interpolate to create a new, square lat/lon dataset from t850_subset.
Is there an easier way to resize & reproject an xarray dataset?
Your easiest path forward is to take advantage of xarray's ability to do pandas-like data selection; this is IMO the best part of xarray. Replace your last two lines with:
# By transposing the result of transform_points, we can unpack the
# x and y coordinates into individual arrays.
x_lim, y_lim, _ = src_proj.transform_points(ccrs.PlateCarree(),
                                            np.array(boundinglon),
                                            np.array(boundinglat)).T
t850_subset = t850.sel(x=slice(*x_lim), y=slice(*y_lim))
You can find more information in the documentation on xarray's selection and indexing functionality. You would probably also be interested in xarray's built-in support for interpolation. And if interpolation methods beyond SciPy's are of interest, MetPy also has a suite of other interpolation methods.
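As a rough illustration of that interpolation route (a sketch, assuming t850 keeps 1-D x and y coordinates in the source projection's units, as the .sel call above implies), one could build a regular lat/lon target grid, transform it into the native projection, and let xarray interpolate onto it:
# Sketch: pointwise interpolation of t850 onto a regular lat/lon grid
import numpy as np
import xarray as xr
import cartopy.crs as ccrs

lats = np.linspace(min_lat, max_lat, 100)
lons = np.linspace(min_lon, max_lon, 100)
lon2d, lat2d = np.meshgrid(lons, lats)

# express the target grid in the data's native (Lambert Conformal) coordinates
pts = src_proj.transform_points(ccrs.PlateCarree(), lon2d, lat2d)
x_t = xr.DataArray(pts[..., 0], dims=("lat", "lon"), coords={"lat": lats, "lon": lons})
y_t = xr.DataArray(pts[..., 1], dims=("lat", "lon"), coords={"lat": lats, "lon": lons})

# xarray broadcasts the 2-D target coordinates and interpolates pointwise
t850_latlon = t850.interp(x=x_t, y=y_t)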
We have various "regridding" methods in Iris, if that isn't too much of a context switch for you.
Xarray explains its relationship to Iris in its documentation, and provides a to_iris() method.

Problems regarding Pyomo provided math functions

I am trying to solve a maximization problem using Pyomo which has a recursive relationship. I am trying to maximize the revenue from a battery and it involves updating the state of charge of the battery every hour (which is the recursive relationship here). I am using the following code:
import pyomo
import numpy as np
from pyomo.environ import *
import pandas as pd
model = ConcreteModel()
N = 24 #number of hours
lmpdata = np.random.randint(1,10,24) #LMP Data (to be imported from MISO/PJM)
R = 0 #discount
eta_s = 0.99 #self-discharge efficiency
eta_c = 0.95 #round-trip efficiency
gammas_min = 0.1 #fraction of energy capacity to reserve for discharging
gammas_max = 0.05 #fraction of energy capacity to reserve for charging
S_bar = 50 #energy capacity
Q_bar = 50 #energy charge/discharge rating
model.qd = Var(range(N), within = NonNegativeReals) #variables for energy sold at time t
model.qr = Var(range(N), within = NonNegativeReals) #variables for energy purchased at time t
model.obj = Objective(expr = sum((model.qd[i]-model.qr[i])*lmpdata[i]*np.exp(-R*(i+1)) for i in range(N)), sense = maximize) #objective function
model.SOC = np.zeros(N) #state of charge (s(t) in Sandia's Model)
model.SOC[0] = 25 #SOC at hour 0
#recursion relation describing the SOC
def con_rule1(model,i):
    model.SOC[i] = eta_s*model.SOC[i-1] + eta_c*model.qr[i-1] - model.qd[i-1]
    return (eta_s*model.SOC[i-1] + eta_c*model.qr[i-1] - model.qd[i-1] == model.SOC[i])
#def con_rule1(model,i):
model.con1 = Constraint(range(1,N), rule = con_rule1)
#model.con2 = Constraint(expr = eta_s*SOC[N-1] + eta_c*model.qr[N-1] - model.qd[N-1] == SOC[0]) #SOC relation for the last hour
#SOC boundaries
def con_rule2(model,i):
    return (gammas_min*S_bar <= eta_s*model.SOC[i] + eta_c*model.qr[i] - model.qd[i] <= (1-gammas_max)*S_bar)
model.con3 = Constraint(range(N), rule = con_rule2)
#limits the total energy charged over each time step to the energy
#charge limit (derived from the power limit)
#It restricts the throughput based on the power rating
def con_rule3(model,i):
    return (0 <= model.qr[i]+model.qd[i] <= Q_bar)
model.con4 = Constraint(range(N),rule = con_rule3)
def pyomo_postprocess(options=None, instance=None, results=None):
    model.qd.display()
    model.qr.display()
model.pprint()
However, when I try to run the code, I am getting the following error:
Implicit conversion of Pyomo NumericValue type `<class 'pyomo.core.kernel.expr_coopr3._SumExpression'>' to a float is
disabled. This error is often the result of using Pyomo components as
arguments to one of the Python built-in math module functions when
defining expressions. Avoid this error by using Pyomo-provided math
functions.
I could not find any reference to Pyomo's math functions in its documentation. It would be great if anyone could help me solve this problem!
Pyomo defines its own set of math module functions for operations like exp, log, sin, etc. If you want to use any of these functions in your Pyomo expressions you should make sure they are the ones provided by Pyomo and not from some other Python package. I think the issue with your model is that you are using np.exp in your Objective function. The Pyomo math functions are automatically imported when you import pyomo.environ so you should be able to replace np.exp with exp to get the Pyomo-defined function.
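For illustration, a minimal sketch of that change applied to the question's objective (exp is available because the question already does from pyomo.environ import *):
# Sketch: use Pyomo's exp instead of np.exp in the objective expression
model.obj = Objective(
    expr=sum((model.qd[i] - model.qr[i]) * lmpdata[i] * exp(-R * (i + 1))
             for i in range(N)),
    sense=maximize)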
