I am trying to develop a fit for a given SIR model. Following the suggestions from various sources, notably https://scipython.com/book/chapter-8-scipy/additional-examples/the-sir-epidemic-model/ and https://stackoverflow.com/a/34425290/3588242, I put together a test script, since I was having huge difficulties fitting anything coherent with my model.
It is shown below.
Basically, I generate data that I know fit my model, by definition of how they are generated. This gives me an infectious compartment over time (ydata and xdata). I then want to recover the parameters (beta, gamma, mu) I used to generate the data, using the scipy function curve_fit.
However, it does not work, and I'm very unclear as to why it does not seem to do anything.
import numpy as np
from scipy import integrate, optimize
import matplotlib.pyplot as plt

# The SIR model differential equations.
def sir_model(y, t, N, beta, gamma, mu):
    S, I, R = y
    dSdt = -beta * S * I / N + mu * I
    dIdt = beta * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt

# The fit integration.
def fit_odeint(x, beta, gamma, mu):
    return integrate.odeint(sir_model, (S0, I0, R0), x, args=(N, beta, gamma, mu))[:,1]
if __name__ == "__main__":
    ########################################
    # Get data to be fitted from the model #
    ########################################

    # Total population, N.
    N = 1
    # Initial number of infected and recovered individuals, I0 and R0.
    I0, R0 = 1e-6, 0
    # Everyone else, S0, is susceptible to infection initially.
    S0 = N - I0 - R0
    # Contact rate, beta, and mean recovery rate, gamma, (in 1/deltaT).
    beta, gamma, mu = -0.001524766068089, 1.115130184090387, -0.010726414041332
    # A grid of time points (in deltaT increment)
    t = np.linspace(0, 305, 306)
    # Initial conditions vector
    y0 = S0, I0, R0
    # Integrate the SIR equations over the time grid, t.
    ret = integrate.odeint(sir_model, y0, t, args=(N, beta, gamma, mu))
    S, I, R = ret.T

    # Plot the data on three separate curves for I(t) and R(t)
    fig = plt.figure(facecolor='w')
    ax = fig.add_subplot(111, facecolor='#dddddd', axisbelow=True)
    # ax.plot(t, S/N, 'b', alpha=0.5, lw=2, label='Susceptible')
    ax.plot(t, I/N, 'r', alpha=0.5, lw=2, label='Infected')
    ax.plot(t, R/N, 'g', alpha=0.5, lw=2, label='Recovered with immunity')
    ax.set_xlabel('Time /days')
    ax.set_ylabel('Number (1000s)')
    # ax.set_ylim(0,1.2)
    ax.yaxis.set_tick_params(length=0)
    ax.xaxis.set_tick_params(length=0)
    ax.grid(b=True, which='major', c='w', lw=2, ls='-')
    legend = ax.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax.spines[spine].set_visible(False)
    plt.show()

    #####################################
    # Fit the data using the same model #
    #####################################

    # Define the "experimental" data
    ydata = I
    xdata = t

    # Define the initial conditions vector, note that this will be equal to y0 above.
    I0, R0 = ydata[0], 0
    S0 = N - I0 - R0

    # Try with default p0 = [1, 1, 1] (most of the times I will have no clue where the values lie)
    print('p0 default values...')
    popt, pcov = optimize.curve_fit(fit_odeint, xdata, ydata, bounds=([-0.1, 0., -np.inf], np.inf))
    psig = np.sqrt(np.diag(pcov))
    print(f'beta : {popt[0]} (+/- {psig[0]})\ngamma: {popt[1]} (+/- {psig[1]})\nmu : {popt[2]} (+/- {psig[2]})\n\n')

    # Try with p0 close to the known solution (kind of defeats the purpose of curve fit if it's too close...)
    print('p0 close to the known solution...')
    popt, pcov = optimize.curve_fit(fit_odeint, xdata, ydata, bounds=([-0.1, 0., -np.inf], np.inf), p0=[-0.01,1.,-0.01])
    psig = np.sqrt(np.diag(pcov))
    print(f'beta : {popt[0]} (+/- {psig[0]})\ngamma: {popt[1]} (+/- {psig[1]})\nmu : {popt[2]} (+/- {psig[2]})\n\n')

    # Try with p0 equal to the known solution (kind of defeats the purpose of curve fit if it's too close...)
    print('p0 equal to the known solution...')
    popt, pcov = optimize.curve_fit(fit_odeint, xdata, ydata, bounds=([-0.1, 0., -np.inf], np.inf), p0=[-0.001524766068089, 1.115130184090387, -0.010726414041332])
    psig = np.sqrt(np.diag(pcov))
    print(f'beta : {popt[0]} (+/- {psig[0]})\ngamma: {popt[1]} (+/- {psig[1]})\nmu : {popt[2]} (+/- {psig[2]})\n\n')
This code gives me the correct expected plot, and then:
p0 default values...
beta : 0.9 (+/- 13.202991356641752)
gamma: 1.0 (+/- 13.203507667858469)
mu : 1.0 (+/- 50.75556985555176)
p0 close to the known solution...
beta : -0.01 (+/- 2.0204502661168218)
gamma: 1.0 (+/- 2.0182998608106186)
mu : -0.01 (+/- 7.701149479142956)
p0 equal to the known solution...
beta : -0.001524766068089 (+/- 0.0)
gamma: 1.115130184090387 (+/- 0.0)
mu : -0.010726414041332 (+/- 0.0)
So it would seem that curve_fit essentially decides the initial guess is close enough and stops, after doing barely anything. This may be due to an epsilon value somewhere (absolute versus relative tolerance, or something like that; I ran into a similar issue with a different scipy solver before). However, the curve_fit documentation doesn't make it obvious how to change the tolerance. Or it may simply be that I'm misunderstanding how something in this code works.
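Digging a bit further, extra keyword arguments to curve_fit do appear to be forwarded to the underlying least-squares solver, so the tolerances and the finite-difference step used for the numerical Jacobian can in principle be set there. A sketch of what I mean (untested; the p0 values are arbitrary guesses, and diff_step is just a coarser step so the integration noise from odeint doesn't swamp the Jacobian):

popt, pcov = optimize.curve_fit(
    fit_odeint, xdata, ydata,
    p0=[0.01, 1.0, 0.01],                  # arbitrary starting point within the bounds
    bounds=([-0.1, 0., -np.inf], np.inf),
    method='trf',                          # bounded problems go through least_squares
    diff_step=1e-3,                        # coarser finite-difference step for the Jacobian
    ftol=1e-12, xtol=1e-12, gtol=1e-12)    # tighter stopping tolerances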
If anyone has any suggestions, I would love that.
I tried looking into the lmfit package, with little success getting it to work (I get some ValueError about a.any() that I can't figure out).
EDIT:
I looked around and decided to use a stochastic evolution algorithm (differential evolution) to get a better initial guess automatically. The code works well, but again, curve_fit does essentially nothing and simply returns the initial guess.
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize, integrate

# The SIR model differential equations.
def sir_model(y, t, N, beta, gamma, mu):
    S, I, R = y
    dSdt = -beta * S * I / N + mu * I
    dIdt = beta * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt

# The fit integration.
def fit_odeint(x, beta, gamma, mu):
    return integrate.odeint(sir_model, (S0, I0, R0), x, args=(N, beta, gamma, mu))[:,1]

# function for genetic algorithm to minimize (sum of squared error)
# bounds on parameters are set in generate_Initial_Parameters() below
def sumOfSquaredError(parameterTuple):
    return np.sum((ydata - fit_odeint(xdata, *parameterTuple)) ** 2)

def generate_Initial_Parameters():
    parameterBounds = []
    parameterBounds.append([-0.1, 10.0])  # parameter bounds for beta
    parameterBounds.append([-0.1, 20.0])  # parameter bounds for gamma
    parameterBounds.append([-0.1, 0.1])   # parameter bounds for mu

    # "seed" the numpy random number generator for repeatable results
    result = optimize.differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x
if __name__ == "__main__":
    ########################################
    # Get data to be fitted from the model #
    ########################################

    # Total population, N.
    N = 1
    # Initial number of infected and recovered individuals, I0 and R0.
    I0, R0 = 1e-6, 0
    # Everyone else, S0, is susceptible to infection initially.
    S0 = N - I0 - R0
    # Contact rate, beta, and mean recovery rate, gamma, (in 1/deltaT).
    beta, gamma, mu = -0.001524766068089, 1.115130184090387, -0.010726414041332
    # A grid of time points (in deltaT increment)
    t = np.linspace(0, 305, 306)
    # Initial conditions vector
    y0 = S0, I0, R0
    # Integrate the SIR equations over the time grid, t.
    ret = integrate.odeint(sir_model, y0, t, args=(N, beta, gamma, mu))
    S, I, R = ret.T

    # Plot the data on three separate curves for I(t) and R(t)
    fig = plt.figure(facecolor='w')
    ax = fig.add_subplot(111, facecolor='#dddddd', axisbelow=True)
    # ax.plot(t, S/N, 'b', alpha=0.5, lw=2, label='Susceptible')
    ax.plot(t, I/N, 'r', alpha=0.5, lw=2, label='Infected')
    ax.plot(t, R/N, 'g', alpha=0.5, lw=2, label='Recovered with immunity')
    ax.set_xlabel('Time /deltaT')
    ax.set_ylabel('Number (normalized)')
    # ax.set_ylim(0,1.2)
    ax.yaxis.set_tick_params(length=0)
    ax.xaxis.set_tick_params(length=0)
    ax.grid(b=True, which='major', c='w', lw=2, ls='-')
    legend = ax.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax.spines[spine].set_visible(False)
    plt.show()

    #####################################
    # Fit the data using the same model #
    #####################################

    # Define the "experimental" data
    ydata = I
    xdata = t

    # Define the initial conditions vector, note that this will be equal to y0 above.
    I0, R0 = ydata[0], 0
    S0 = N - I0 - R0

    # generate initial parameter values
    initialParameters = generate_Initial_Parameters()

    # curve fit the test data
    fittedParameters, pcov = optimize.curve_fit(fit_odeint, xdata, ydata, p0=tuple(initialParameters),
                                                bounds=([-0.1, 0., -np.inf], np.inf))

    # create values for display of fitted peak function
    b, g, m = fittedParameters
    ret = integrate.odeint(sir_model, y0, t, args=(N, b, g, m))
    S, I, R = ret.T

    plt.plot(xdata, ydata)              # plot the raw data
    plt.plot(xdata, I, linestyle='--')  # plot the equation using the fitted parameters
    plt.show()

    psig = np.sqrt(np.diag(pcov))

    print('Initial parameters:')
    print(f'beta : {initialParameters[0]}\n'
          f'gamma: {initialParameters[1]}\n'
          f'mu : {initialParameters[2]}\n\n')
    print('Fitted parameters:')
    print(f'beta : {fittedParameters[0]} (+/- {psig[0]})\n'
          f'gamma: {fittedParameters[1]} (+/- {psig[1]})\n'
          f'mu : {fittedParameters[2]} (+/- {psig[2]})\n\n')
This gives me, along with the correct and expected figures:
Initial parameters:
beta : -0.039959661364345145
gamma: 1.0766953272292845
mu : -0.040321969786292024
Fitted parameters:
beta : -0.039959661364345145 (+/- 5.6469679624489775e-12)
gamma: 1.0766953272292845 (+/- 5.647099056919525e-12)
mu : -0.040321969786292024 (+/- 5.720259134770649e-12)
So curve_fit did almost absolutely nothing. It's not shocking in this case since the initial guess is pretty good.
But since the parameters are what I actually care about, I am "surprised" by how little curve_fit does in my case; I was expecting more from it. That likely stems from a misunderstanding on my part of what it actually does, though. I will note that, as expected, if the initial parameter bounds are large, the genetic algorithm has trouble finding the minimum, and this comes at a heavy computational cost.
Any enlightenment or suggestions welcome!
I have posted this question on the Data Science StackExchange site, since StackOverflow does not support LaTeX. I am linking it here because this site is probably more appropriate.
The question with correctly rendered LaTeX is here: https://datascience.stackexchange.com/questions/48062/pytorch-does-not-seem-to-be-optimizing-correctly
The idea is that I am considering sums of sine waves with different phases. The waves are sampled at some sample rate s in the interval [0, 2pi]. I need to select the phases in such a way that the sum of the waves at any sample point is minimized.
Below is the Python code. Optimization does not seem to be computed correctly.
import numpy as np
import torch
def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-3
    theta = torch.zeros([n, 1], requires_grad=True)
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    T = t + theta
    for jj in range(nsteps):
        loss = T.sin().sum(0).pow(2).sum() / s
        loss.backward()
        theta.data -= learning_rate * theta.grad.data

    print('Optimal theta: \n\n', theta.data)
    print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())
Below is a sample output.
phaseOptimize(5, nsteps=100)
Optimal theta:
tensor([[1.2812e-07],
        [1.2812e-07],
        [1.2812e-07],
        [1.2812e-07],
        [1.2812e-07]], requires_grad=True)
Maximum value: 5.0
I am assuming this has something to do with broadcasting in
T = t + theta
and/or the way I am computing the loss function.
One way to verify that the optimization is incorrect is to simply evaluate the loss function at random values of the array $\theta_1, \dots, \theta_n$, say uniformly distributed in $[0, 2\pi]$. The maximum value in this case is almost always much lower than the maximum value reported by phaseOptimize(). Even easier is to consider the case $n = 2$ and simply evaluate at $\theta_1 = 0$ and $\theta_2 = \pi$. In that case we get:
phaseOptimize(2, nsteps=100)
Optimal theta:
tensor([[2.8599e-08],
        [2.8599e-08]])
Maximum value: 2.0
On the other hand,
theta = torch.FloatTensor([[0], [np.pi]])
l = torch.linspace(0, 2 * np.pi, 48000)
t = torch.stack([l] * 2)
T = t + theta
T.sin().sum(0).abs().max().item()
produces
3.2782554626464844e-07
You have to move the computation of T inside the loop, or it will always have the same constant value, and thus a constant loss.
Another thing is to initialize theta to different values at different indices; otherwise, because of the symmetric nature of the problem, the gradient is the same for every index.
Another thing is that you need to zero the gradient, because backward just accumulates gradients.
This seems to work:
def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-1
    theta = torch.zeros([n, 1], requires_grad=True)
    theta.data[0][0] = 1
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    for jj in range(nsteps):
        T = t + theta
        loss = T.sin().sum(0).pow(2).sum() / s
        loss.backward()
        theta.data -= learning_rate * theta.grad.data
        theta.grad.zero_()
You're being bitten by both PyTorch and math. Firstly, you need to do two things:
1. Zero out the gradient by setting theta.grad = None before each backward step. Otherwise the gradients accumulate instead of overwriting the previous ones.
2. Recalculate T at each step. PyTorch is not symbolic, unlike TensorFlow: T = t + theta means "T equals the sum of the current t and the current theta", not "T equals the sum of t and theta, whatever their values may be at any time in the future".
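A tiny demonstration of the second point (just a sketch to illustrate the eager evaluation):

import torch

theta = torch.zeros(1, requires_grad=True)
t = torch.ones(1)
T = t + theta        # computed now, with the current values of t and theta
theta.data += 5.0    # a later change to theta...
print(T)             # ...is not reflected: T still holds tensor([1.])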
With those fixes you get the following code:
def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-3
    theta = torch.zeros(n, 1, requires_grad=True)
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    T = t + theta
    for jj in range(nsteps):
        T = t + theta
        loss = T.sin().sum(0).pow(2).sum() / s
        theta.grad = None
        loss.backward()
        theta.data -= learning_rate * theta.grad.data

    T = t + theta
    print('Optimal theta: \n\n', theta.data)
    print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())
which will still not work as you expect because of math.
One can easily see that the minimum of your loss function is when the theta are themselves uniformly spaced over [0, 2pi). The problem is that you are initializing your parameters as torch.zeros, which leads to all those values being equal (this is the polar opposite of equispaced!). Since your loss function is symmetric with respect to permutations of theta, the computed gradients are equal and the gradient descent algorithm can never differentiate between them. In more mathematical terms, you're unlucky enough to initialize your algorithm exactly on a saddle point, so it cannot continue. If you add any noise, it will converge. For instance with
theta = torch.zeros(n, 1) + 0.001 * torch.randn(n, 1)
theta.requires_grad_(True)
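Putting the two fixes and the noisy initialization together, a complete version could look like the following sketch (the learning rate here is arbitrary):

import numpy as np
import torch

def phaseOptimize(n, s=48000, nsteps=1000):
    learning_rate = 1e-1                              # arbitrary choice
    # break the symmetry: start from small random phases instead of all zeros
    theta = torch.zeros(n, 1) + 0.001 * torch.randn(n, 1)
    theta.requires_grad_(True)
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    for jj in range(nsteps):
        T = t + theta                                 # recomputed with the current theta
        loss = T.sin().sum(0).pow(2).sum() / s
        theta.grad = None                             # reset accumulated gradients
        loss.backward()
        theta.data -= learning_rate * theta.grad.data
    T = t + theta
    print('Optimal theta: \n\n', theta.data)
    print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())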
EDIT: I already made significant progress. My current question is written after my last edit below and can be answered without the context.
I currently follow Andrew Ng's Machine Learning Course on Coursera and tried to implement logistic regression today.
Notation:
X is a (m x n)-matrix with vectors of input variables as rows (m training samples of n-1 variables, the entries of the first column are equal to 1 everywhere to represent a constant).
y is the corresponding vector of expected output samples (column vector with m entries equal to 0 or 1)
theta is the vector of model coefficients (row vector with n entries)
For an input row vector x the model will predict the probability sigmoid(x * theta.T) for a positive outcome.
This is my Python3/numpy implementation:
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

vec_sigmoid = np.vectorize(sigmoid)

def logistic_cost(X, y, theta):
    summands = np.multiply(y, np.log(vec_sigmoid(X*theta.T))) + np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T)))
    return - np.sum(summands) / len(y)

def gradient_descent(X, y, learning_rate, num_iterations):
    num_parameters = X.shape[1]                              # dim theta
    theta = np.matrix([0.0 for i in range(num_parameters)])  # init theta
    cost = [0.0 for i in range(num_iterations)]

    for it in range(num_iterations):
        error = np.repeat(vec_sigmoid(X * theta.T) - y, num_parameters, axis=1)
        error_derivative = np.sum(np.multiply(error, X), axis=0)
        theta = theta - (learning_rate / len(y)) * error_derivative
        cost[it] = logistic_cost(X, y, theta)

    return theta, cost
This implementation seems to work fine, but I encountered a problem when calculating the logistic cost. At some point the gradient descent algorithm converges to a pretty well-fitting theta, and then the following happens:
For some input row X_i with expected outcome 1, X_i * theta.T will become positive with a good margin (for example 23.207). This will lead to sigmoid(X_i * theta.T) becoming exactly 1.0000 (because of lost floating-point precision, I think). This is a good prediction (since the expected outcome is equal to 1), but it breaks the calculation of the logistic cost, since np.log(1 - vec_sigmoid(X*theta.T)) blows up there. That shouldn't be a problem by itself, since the term is multiplied by 1 - y = 0, but once a value of NaN occurs, the whole calculation is broken (0 * NaN = NaN).
How should I handle this in the vectorized implementation, since np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T))) is calculated for every row of X (not only where y = 0)?
Example input:
X = np.matrix([[1. , 0. , 0. ],
               [1. , 1. , 0. ],
               [1. , 0. , 1. ],
               [1. , 0.5, 0.3],
               [1. , 1. , 0.2]])

y = np.matrix([[0],
               [1],
               [1],
               [0],
               [1]])
Then theta, _ = gradient_descent(X, y, 10000, 10000) (yes, in this case we can set the learning rate this large) will set theta as:
theta = np.matrix([[-3000.04008972, 3499.97995514, 4099.98797308]])
This will lead to vec_sigmoid(X * theta.T) to be the really good prediction of:
np.matrix([[0.00000000e+00],    # 0
           [1.00000000e+00],    # 1
           [1.00000000e+00],    # 1
           [1.95334953e-09],    # nearly zero
           [1.00000000e+00]])   # 1
but logistic_cost(X, y, theta) evaluates to NaN.
EDIT:
I came up with the following solution. I just replaced the logistic_cost function with:
def new_logistic_cost(X, y, theta):
    term1 = vec_sigmoid(X*theta.T)
    term1[y == 0] = 1
    term2 = 1 - vec_sigmoid(X*theta.T)
    term2[y == 1] = 1
    summands = np.multiply(y, np.log(term1)) + np.multiply(1 - y, np.log(term2))
    return - np.sum(summands) / len(y)
By using the mask I just compute log(1) at the places where the result will be multiplied by zero anyway. Now log(0) will only occur if the gradient descent implementation is actually wrong.
Open questions: How can I make this solution more clean? Is it possible to achieve a similar effect in a cleaner way?
If you don't mind using SciPy, you could import expit and xlog1py from scipy.special:
from scipy.special import expit, xlog1py
and replace the expression
np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T)))
with
xlog1py(1 - y, -expit(X*theta.T))
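For what it's worth, scipy.special also provides xlogy, which treats the first term the same way, so the whole cost function could be written as in this sketch (logistic_cost_stable is just an illustrative name):

import numpy as np
from scipy.special import expit, xlogy, xlog1py

def logistic_cost_stable(X, y, theta):
    p = expit(X * theta.T)                    # predicted probabilities
    # xlogy(a, b) and xlog1py(a, b) return 0 wherever a == 0, even if the log
    # term would be -inf, so 0 * log(0) never turns into NaN.
    summands = xlogy(y, p) + xlog1py(1 - y, -p)
    return -np.sum(summands) / len(y)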
I know it is an old question, but I ran into the same problem, and maybe this can help others in the future: I actually solved it by normalizing the data before appending X0.
def normalize_data(X):
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std
After this all worked well!
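A quick sketch of how I wired it in (the feature values below are made up; the point is to normalize first and only then prepend the X0 column of ones):

import numpy as np

X_raw = np.array([[2104., 3.],
                  [1600., 3.],
                  [2400., 4.]])                          # made-up raw features
X_norm = normalize_data(X_raw)                           # normalize_data from above
X = np.hstack([np.ones((X_norm.shape[0], 1)), X_norm])   # prepend X0 = 1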
I created a linear regression algorithm following a tutorial and applied it to the dataset provided, and it works fine. However, the same algorithm does not work on another, similar dataset. Can somebody tell me why this happens?
def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    params = int(theta.ravel().shape[1])
    cost = np.zeros(iters)

    for i in range(iters):
        err = (X * theta.T) - y

        for j in range(params):
            term = np.multiply(err, X[:,j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))

        theta = temp
        cost[i] = computeCost(X, y, theta)

    return theta, cost
alpha = 0.01
iters = 1000
g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g)
Running the algorithm on this dataset, I get matrix([[ nan, nan]]) as output and the following warnings:
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: RuntimeWarning: overflow encountered in power
from ipykernel import kernelapp as app
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:11: RuntimeWarning: invalid value encountered in double_scalars
However this data set works just fine and outputs matrix([[-3.24140214, 1.1272942 ]])
Both datasets are similar. I have been over them many times but can't figure out why it works on one dataset but not on the other. Any help is welcome.
Edit: Thanks Mark_M for editing tips :-)
[Much better question, btw]
It's hard to know exactly what's going on here, but basically your cost is going in the wrong direction and spiraling out of control, which results in an overflow when you try to square the value.
I think in your case it boils down to your step size (alpha) being too big, which can cause gradient descent to go the wrong way. You need to watch the cost during gradient descent and make sure it is always going down; if it's not, either something is broken or alpha is too large.
Personally, I would reevaluate the code and try to get rid of the loops. It's a matter of preference, but I find it easier to work with X and Y as column vectors. Here is a minimal example:
import numpy as np
from numpy import genfromtxt

# this is your 'bad' data set from github
my_data = genfromtxt('testdata.csv', delimiter=',')

def computeCost(X, y, theta):
    inner = np.power(((X @ theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    for i in range(iters):
        # you don't need the extra loop - this can be vectorized,
        # making it much faster and simpler
        theta = theta - (alpha/len(X)) * np.sum((X @ theta.T - y) * X, axis=0)
        cost = computeCost(X, y, theta)
        if i % 10 == 0:  # just look at cost every ten loops for debugging
            print(cost)
    return (theta, cost)

# notice small alpha value
alpha = 0.0001
iters = 100

# here x is columns
X = my_data[:, 0].reshape(-1,1)
ones = np.ones([X.shape[0], 1])
X = np.hstack([ones, X])

# theta is a row vector
theta = np.array([[1.0, 1.0]])

# y is a column vector
y = my_data[:, 1].reshape(-1,1)

g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g, cost)
Another useful technique is to normalize your data before doing regression. This is especially useful when you have more than one feature you're trying to minimize.
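For this particular dataset that would look something like the sketch below (same my_data layout as above; keep in mind the fitted coefficients are then in the normalized feature scale):

# z-score the single feature column before adding the bias column
X_feat = my_data[:, 0].reshape(-1, 1)
X_feat = (X_feat - X_feat.mean(axis=0)) / X_feat.std(axis=0)
X = np.hstack([np.ones([X_feat.shape[0], 1]), X_feat])
y = my_data[:, 1].reshape(-1, 1)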
As a side note - if your step size is right, you shouldn't get overflows no matter how many iterations you do, because the cost will decrease with every iteration and the rate of decrease will slow.
After 1000 iterations I arrived at a theta and cost of:
[[ 1.03533399 1.45914293]] 56.041973778
after 100:
[[ 1.01166889 1.45960806]] 56.0481988054
You can use this to look at the fit in an iPython notebook:
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(my_data[:, 0].reshape(-1,1), y)
axes = plt.gca()
x_vals = np.array(axes.get_xlim())
y_vals = g[0][0] + g[0][1]* x_vals
plt.plot(x_vals, y_vals, '--')
I have a problem fitting with LinearRegressionWithSGD in Spark's MLlib. I used their example for fitting from here https://spark.apache.org/docs/latest/mllib-linear-methods.html (using Python interface).
In their example, all features are essentially already scaled, with mean around 0 and standard deviation around 1. Now, if I un-scale one of them by a factor of 10, the regression breaks (it gives NaNs or very large coefficients):
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from numpy import array
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    # UN-SCALE one of the features by a factor of 10
    values[3] *= 10
    return LabeledPoint(values[0], values[1:])
data = sc.textFile(spark_home+"data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)
# Build the model
model = LinearRegressionWithSGD.train(parsedData)
# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print "Model coefficients:", str(model)
So, I guess I need to do the feature scaling. If I do pre-scaling it works (because I'm back at scaled features). However now I don't know how to get coefficients in the original space.
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from pyspark.mllib.linalg import Vectors
from numpy import array
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.feature import StandardScalerModel
# Load and parse the data
def parseToDenseVector(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    # UN-SCALE one of the features by a factor of 10
    values[3] *= 10
    return Vectors.dense(values[0:])

# Load and parse the data
def parseToLabel(values):
    return LabeledPoint(values[0], values[1:])
data = sc.textFile(spark_home+"data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parseToDenseVector)
scaler = StandardScaler(True, True)
scaler_model = scaler.fit(parsedData)
parsedData_scaled = scaler_model.transform(parsedData)
parsedData_scaled_transformed = parsedData_scaled.map(parseToLabel)
# Build the model
model = LinearRegressionWithSGD.train(parsedData_scaled_transformed)
# Evaluate the model on training data
valuesAndPreds = parsedData_scaled_transformed.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print "Model coefficients:", str(model)
So, here I have all the coefficients in the transformed space. Now, how do I get back to the original space? I also have scaler_model, which is a StandardScalerModel object, but I can't get either the means or the variances out of it. The only public method this class has is transform, which maps points from the original space to the transformed space, but I can't reverse it.
I just ran into this problem. The models cannot even learn f(x) = x if x is high (>3) in the training data. So terrible.
I think that rather than scaling the data, another option is to change the step size. This is discussed in SPARK-1859. To paraphrase from there:
The step size should be smaller than 1 over the Lipschitz constant L.
For quadratic loss and GD, the best convergence happens at stepSize = 1/(2L). Spark has a (1/n) multiplier on the loss function.
Let's say you have n = 5 data points and the largest feature value is 1500. So L = 1500 * 1500 / 5. The best convergence happens at stepSize = 1/(2L) = 10 / (1500 ^ 2).
The last equality doesn't even make sense (how did we get a 2 in the numerator?), but I've never heard of a Lipschitz constant before, so I am not qualified to fix it. Anyway, I think we can just try different step sizes until it starts to work.
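If you just want to experiment, LinearRegressionWithSGD.train accepts step and iterations arguments, so a crude sweep is easy to set up (a sketch, reusing parsedData from the question's first snippet):

# try a few step sizes and keep the one with the lowest training MSE
best = None
for step in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    model = LinearRegressionWithSGD.train(parsedData, iterations=100, step=step)
    mse = parsedData.map(lambda p: (p.label - model.predict(p.features)) ** 2).mean()
    print("step = %g, training MSE = %g" % (step, mse))
    if best is None or mse < best[1]:
        best = (step, mse)
print("best step = %g (MSE = %g)" % best)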
To rephrase your question, you want to find the intercept I and coefficients C_1 and C_2 that solve the equation: Y = I + C_1 * x_1 + C_2 * x_2 (where x_1 and x_2 are unscaled).
Let i be the intercept that mllib returns. Likewise let c_1 and c_2 be the coefficients (or weights) that mllib returns.
Let m_1 be the unscaled mean of x_1 and m_2 be the unscaled mean of x_2.
Let s_1 be the unscaled standard deviation of x_1 and s_2 be the unscaled standard deviation of x_2.
Then C_1 = (c_1 / s_1), C_2 = (c_2 / s_2), and
I = i - c_1 * m_1 / s_1 - c_2 * m_2 / s_2
This can easily be extended to 3 input variables:
C_3 = (c_3 / s_3) and I = i - c_1 * m_1 / s_1 - c_2 * m_2 / s_2 - c_3 * m_3 / s_3
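In code, that back-transformation is just elementwise division plus a dot product; a sketch (the function and argument names here are mine):

import numpy as np

def unscale_coefficients(i, c, m, s):
    # i: intercept from mllib, c: weights from mllib (fitted on standardized features),
    # m: per-feature means, s: per-feature standard deviations (original space)
    c, m, s = np.asarray(c), np.asarray(m), np.asarray(s)
    C = c / s                      # coefficients in the original space
    I = i - np.sum(c * m / s)      # intercept in the original space
    return I, C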
As you pointed out, the StandardScalerModel object in pyspark doesn't expose the std and mean attributes. There is an open issue for that: https://issues.apache.org/jira/browse/SPARK-6523
You can easily calculate them yourself:
import numpy as np
from pyspark.mllib.stat import Statistics
summary = Statistics.colStats(features)
mean = summary.mean()
std = np.sqrt(summary.variance())
These are the same mean and std that your scaler uses. You can verify this by reaching into the underlying Java model:
print scaler_model.__dict__.get('_java_model').std()
print scaler_model.__dict__.get('_java_model').mean()