Extra parameters to be updated in back propagation of Neural Network in Keras - keras

I'm working on a MIDAS regression approach to neural networks. I use the exponential Almon lag function to preprocess my high-frequency data and feed the processed data to a neural network built with Keras. The exponential Almon lag takes two parameters, theta_1 and theta_2, which determine the shape of the polynomial. I want to incorporate this function into the backpropagation scheme of the network. Is there a way to update these parameters while the model is training? The function I use for the exponential Almon lag looks as follows:
import numpy as np
def exp_almon_lag(data, theta_1, theta_2):
    # Load the data
    df = data
    # Transform the data to 4 weekly sets
    data_transposed = []
    for i in range(int(len(df)/4)):
        array = [df.iloc[(i*4),0], df.iloc[(i*4+1),0], df.iloc[(i*4+2),0], df.iloc[(i*4+3),0]]
        data_transposed.append(array)
    # For k = 4 the exponential almon lag values are
    lag_1 = np.exp(theta_1 * 1 + theta_2 * 1 ** 2)
    lag_2 = np.exp(theta_1 * 2 + theta_2 * 2 ** 2)
    lag_3 = np.exp(theta_1 * 3 + theta_2 * 3 ** 2)
    lag_4 = np.exp(theta_1 * 4 + theta_2 * 4 ** 2)
    denominator = lag_1 + lag_2 + lag_3 + lag_4
    # Store lag operator values in an array
    almon_values = [(lag_1 / denominator), (lag_2 / denominator),
                    (lag_3 / denominator), (lag_4 / denominator)]
    # Weighted sum of each block of 4 observations
    processed_data = []
    for i in range(len(data_transposed)):
        processed_data.append(np.multiply(data_transposed[i], almon_values).sum())
    return processed_data
At the moment I use this function to preprocess the data, but I want the theta values to be updated by the network as well.
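For reference, the same preprocessing can be written in vectorized form, which makes it easier to see that the weights are a softmax over theta_1 * k + theta_2 * k ** 2 and therefore a smooth function of the two thetas. This is only a NumPy sketch of the forward computation, not Keras code; the names almon_weights and almon_aggregate are made up for the illustration, and it assumes the input is a plain 1-D sequence rather than a DataFrame.

```python
import numpy as np

def almon_weights(theta_1, theta_2, n_lags=4):
    """Normalized exponential Almon weights for lags k = 1..n_lags."""
    k = np.arange(1, n_lags + 1)
    scores = theta_1 * k + theta_2 * k ** 2
    scores = scores - scores.max()  # shift for numerical stability; result is unchanged
    w = np.exp(scores)
    return w / w.sum()

def almon_aggregate(values, theta_1, theta_2, n_lags=4):
    """Collapse each block of n_lags high-frequency values into one weighted sum."""
    values = np.asarray(values, dtype=float)
    n_blocks = len(values) // n_lags
    blocks = values[: n_blocks * n_lags].reshape(n_blocks, n_lags)
    return blocks @ almon_weights(theta_1, theta_2, n_lags)

weights = almon_weights(0.1, -0.05)
aggregated = almon_aggregate(np.arange(8.0), 0.1, -0.05)
flat = almon_aggregate(np.arange(8.0), 0.0, 0.0)  # equal weights of 0.25 each
```

Because the output is differentiable in theta_1 and theta_2, the same arithmetic could in principle live inside the model as two trainable scalars instead of being applied beforehand, which is what the question is asking about.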

Related

Skewed random sample from Numpy random generator sample (numpy.random.Generator.choice)

I have made a piece of Python to generate mixture of normal distributions and I would want to sample from it. As the result is my probability density function I would want the sample to be representative of the original distribution.
So I have developed the function to create the pdf:
import numpy as np
import matplotlib.pyplot as plt
def gaussian_pdf(amplitude, mean, std, sample_int):
    coeff = (amplitude / std) / np.sqrt(2 * np.pi)
    if np.size(amplitude) > 1:
        # create mixture distribution
        # get distribution support
        absciss_array = np.linspace(np.min(mean) - 4 * std[np.argmin(mean)],
                                    np.max(mean) + 4 * std[np.argmax(mean)],
                                    sample_int)
        normal_array = np.zeros(len(absciss_array))
        for index in range(0, len(amplitude)):
            # Gaussian exponent is -0.5 * ((x - mean) / std) ** 2
            normal_array += coeff[index] * np.exp(-0.5 * ((absciss_array - mean[index]) / std[index]) ** 2)
    else:
        # create simple gaussian distribution
        absciss_array = np.linspace(mean - 4*std, mean + 4*std, sample_int)
        normal_array = coeff * np.exp(-0.5 * ((absciss_array - mean) / std) ** 2)
    # normalize so the values sum to 1 and can be used as probabilities
    return np.ascontiguousarray(normal_array / np.sum(normal_array))
And I have tested the sampling with the main part of the script:
def main():
    amplitude = np.asarray([1, 2, 1])
    mean = np.asarray([0.5, 1, 2.5])
    std = np.asarray([0.1, 0.2, 0.3])
    no_sample = 10000
    # create mixture gaussian array
    gaussian_array = gaussian_pdf(amplitude, mean, std, no_sample)
    # plot data
    fig, ax = plt.subplots()
    absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
    ax.plot(absciss, gaussian_array)
    # create random generator to sample from distribution
    rng = np.random.default_rng(424242)
    # sample from distribution
    sample = rng.choice(a=gaussian_array, size=100, replace=True, p=gaussian_array)
    # plot results
    ax.plot(sample, np.full_like(sample, -0.00001), '|k', markeredgewidth=1)
    plt.show()
    return None
I then have the result :
You can see with the dark lines the samples that have been extracted from the distribution. The problem is that, even if I specify to use the probability array in the numpy function, the sampling is skewed towards the end of the distribution. I have tried several times with other seeds but the result does not change...
I expect to have more samples in the area where the probability density is greater...
Would someone please help me ? Am I missing something here ?
Thanks in advance.
Well, actually the answer was to sample from a uniformly spaced array of x values (passed as a) and to use the density only as the probability weights p. Thanks to #amzon-ex for pointing it out.
The code is then :
absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
sample_other = rng.choice(a=absciss, size=100, replace=True, p=gaussian_array)
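A self-contained sketch of that idea is below. It assumes the values to draw should come from the support of the mixture (the x axis) rather than from the density values, so it builds the grid from the means and standard deviations; normal_pdf, x_grid and prob are names chosen for this illustration only.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

amplitude = np.array([1.0, 2.0, 1.0])
mean = np.array([0.5, 1.0, 2.5])
std = np.array([0.1, 0.2, 0.3])

# Uniformly spaced support: these are the values that can be drawn.
x_grid = np.linspace(mean.min() - 4 * std[mean.argmin()],
                     mean.max() + 4 * std[mean.argmax()], 10_000)

# Mixture density on the grid, normalized so it can be used as p.
density = sum(a * normal_pdf(x_grid, m, s) for a, m, s in zip(amplitude, mean, std))
prob = density / density.sum()

rng = np.random.default_rng(424242)
sample = rng.choice(a=x_grid, size=5_000, replace=True, p=prob)

# Mean of the discretized distribution, for comparison with the sample mean.
expected_mean = float(np.sum(x_grid * prob))
```

With a drawn from the grid, the samples cluster where the density is high, and the sample mean should land close to expected_mean.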

Neural Network Python Gradient Descent

I am new to machine learning and trying to understand it (self-learning). So I grabbed a book (this one if interested: https://www.amazon.com/Neural-Networks-Unity-Programming-Windows/dp/1484236726) and started to read the first chapter. While reading, there are a few things I did not understand so I went to research online.
However, I still have trouble with a few points that I cannot understand even after so much reading and research:
How are we calculating l2_delta and l1_delta? (marked with #what is this part doing? in code below)
How does gradient descent relate? (I looked up the formula and tried to read a bit about it but I could not relate the one line code to the code I have down there)
Is that a network with 3 layers (layer 1: 3 input nodes, layer 2: not sure ,layer 3: 1 output node )
Neural Network Full Code:
# trying to write my first neural network!
import numpy as np
#activation function (sigmoid , maps value between 0 and 1)
def sigmoid(x):
    return 1/(1+np.exp(-x))
#derivative of the sigmoid, written in terms of the sigmoid output x
def derivative(x):
    return x*(1-x)
#initialize input (4 training data (row), 3 features (col))
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
#initialize output for training data (4 training data (rows), 1 output for each (col))
Y = np.array([[0],[1],[1],[0]])
np.random.seed(1)
#synapses
syn0 = 2* np.random.random((3,4)) - 1
syn1 = 2* np.random.random((4,1)) - 1
for iter in range(60000):
    #layers
    l0 = X
    l1 = sigmoid(np.dot(l0,syn0))
    l2 = sigmoid(np.dot(l1,syn1))
    #error
    l2_error = Y - l2
    if(iter % 10000 == 0): #only print error every 10000 steps to save time and limit the amount of output
        print("Error L2: " + str (np.mean(np.abs(l2_error))))
    #what is this part doing?
    l2_delta = l2_error * derivative(l2)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * derivative(l1)
    if(iter % 10000 == 0): #only print error every 10000 steps to save time and limit the amount of output
        print("Error L1: " + str (np.mean(np.abs(l1_error))))
    #update weights
    syn1 = syn1 + l1.T.dot(l2_delta) # derivative with respect to cost function
    syn0 = syn0 + l0.T.dot(l1_delta)
print(l2)
Thank you!
In general, a layerwise computation (hence the notation l1 and l2 above) simply takes the dot product of an input vector $x \in \mathbb{R}^n$ with a weight vector of the same dimension, once per unit in the layer, and then applies the sigmoid function to each component of the result.
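To connect this to the lines marked "#what is this part doing?", here is a small numerical check. It assumes the cost being minimized is 0.5 * sum((Y - l2) ** 2), which the question's code does not state explicitly, and compares the update terms from the question with a finite-difference gradient of that cost. The helper names forward, loss and numerical_gradient are made up for this sketch. The weight shapes (3, 4) and (4, 1) also show the network has 3 inputs, 4 hidden units and 1 output.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, syn0, syn1):
    l1 = sigmoid(X @ syn0)   # hidden layer, 4 units
    l2 = sigmoid(l1 @ syn1)  # output layer, 1 unit
    return l1, l2

def loss(X, Y, syn0, syn1):
    _, l2 = forward(X, syn0, syn1)
    return 0.5 * np.sum((Y - l2) ** 2)

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
rng = np.random.default_rng(1)
syn0 = 2 * rng.random((3, 4)) - 1
syn1 = 2 * rng.random((4, 1)) - 1

# Update terms computed the same way as in the question.
l1, l2 = forward(X, syn0, syn1)
l2_delta = (Y - l2) * l2 * (1 - l2)              # output error times sigmoid slope
l1_delta = l2_delta.dot(syn1.T) * l1 * (1 - l1)  # error pushed back through syn1
step_syn1 = l1.T.dot(l2_delta)
step_syn0 = X.T.dot(l1_delta)

def numerical_gradient(f, W, eps=1e-6):
    """Central finite differences of f() with respect to each entry of W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

grad_syn1 = numerical_gradient(lambda: loss(X, Y, syn0, syn1), syn1)
grad_syn0 = numerical_gradient(lambda: loss(X, Y, syn0, syn1), syn0)
```

Under that assumption, step_syn1 and step_syn0 match the negative gradients, so adding them to the weights is one gradient descent step with a learning rate of 1.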
Gradient descent: imagine, in two dimensions, the graph of $f(x) = x^2$, and suppose we don't know how to find its minimum. Gradient descent evaluates $f'(x)$ at the current point, takes a small step in the direction opposite to the sign of $f'(x)$, and repeats until $f'(x)$ is close to zero.
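As a minimal sketch of that idea on $f(x) = x^2$, where $f'(x) = 2x$: the starting point, learning rate and tolerance below are arbitrary choices for the illustration.

```python
def gradient_descent(df, x0, learning_rate=0.1, tol=1e-8, max_steps=10_000):
    """Step against the derivative df until it is close to zero."""
    x = x0
    for _ in range(max_steps):
        g = df(x)
        if abs(g) < tol:
            break
        x -= learning_rate * g  # move opposite to the slope
    return x

x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
```

Each step multiplies x by 0.8 here, so x_min ends up very close to the true minimizer 0.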

PyTorch doesn't seem to be optimizing correctly

I have posted this question on Data Science StackExchange site since StackOverflow does not support LaTeX. Linking it here because this site is probably more appropriate.
The question with correctly rendered LaTeX is here: https://datascience.stackexchange.com/questions/48062/pytorch-does-not-seem-to-be-optimizing-correctly
The idea is that I am considering sums of sine waves with different phases. The waves are sampled with some sample rate s in the interval [0, 2pi]. I need to select phases in such a way, that the sum of the waves at any sample point is minimized.
Below is the Python code. Optimization does not seem to be computed correctly.
import numpy as np
import torch
def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-3
    theta = torch.zeros([n, 1], requires_grad=True)
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    T = t + theta
    for jj in range(nsteps):
        loss = T.sin().sum(0).pow(2).sum() / s
        loss.backward()
        theta.data -= learning_rate * theta.grad.data
    print('Optimal theta: \n\n', theta.data)
    print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())
Below is a sample output.
phaseOptimize(5, nsteps=100)
Optimal theta:
tensor([[1.2812e-07],
[1.2812e-07],
[1.2812e-07],
[1.2812e-07],
[1.2812e-07]], requires_grad=True)
Maximum value: 5.0
I am assuming this has something to do with broadcasting in
T = t + theta
and/or the way I am computing the loss function.
One way to verify that optimization is incorrect, is to simply evaluate the loss function at random values for the array $\theta_1, \dots, \theta_n$, say uniformly distributed in $[0, 2\pi]$. The maximum value in this case is almost always much lower than the maximum value reported by phaseOptimize(). Much easier in fact is to consider the case with $n = 2$, and simply evaluate at $\theta_1 = 0$ and $\theta_2 = \pi$. In that case we get:
phaseOptimize(2, nsteps=100)
Optimal theta:
tensor([[2.8599e-08],
[2.8599e-08]])
Maximum value: 2.0
On the other hand,
theta = torch.FloatTensor([[0], [np.pi]])
l = torch.linspace(0, 2 * np.pi, 48000)
t = torch.stack([l] * 2)
T = t + theta
T.sin().sum(0).abs().max().item()
produces
3.2782554626464844e-07
You have to move computing T inside the loop, or it will always have the same constant value, thus constant loss.
Another thing is to initialize theta to different values at indices, otherwise because of the symmetric nature of the problem the gradient is the same for every index.
Another thing is that you need to zero gradient, because backward just accumulates them.
This seems to work:
def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-1
    theta = torch.zeros([n, 1], requires_grad=True)
    theta.data[0][0] = 1
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    for jj in range(nsteps):
        T = t + theta
        loss = T.sin().sum(0).pow(2).sum() / s
        loss.backward()
        theta.data -= learning_rate * theta.grad.data
        theta.grad.zero_()
You're being bitten by both PyTorch and math. Firstly, you need to:
1. Zero out the gradient by setting theta.grad = None before each backward step. Otherwise the gradients accumulate instead of overwriting the previous ones.
2. Recalculate T at each step. PyTorch is not symbolic, unlike TensorFlow, and T = t + theta means "T equals the sum of current t and current theta" and not "T equals the sum of t and theta, whatever their values may be at any time in the future".
With those fixes you get the following code:
def phaseOptimize(n, s = 48000, nsteps = 1000):
    learning_rate = 1e-3
    theta = torch.zeros(n, 1, requires_grad=True)
    l = torch.linspace(0, 2 * np.pi, s)
    t = torch.stack([l] * n)
    T = t + theta
    for jj in range(nsteps):
        T = t + theta
        loss = T.sin().sum(0).pow(2).sum() / s
        theta.grad = None
        loss.backward()
        theta.data -= learning_rate * theta.grad.data
    T = t + theta
    print('Optimal theta: \n\n', theta.data)
    print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())
which will still not work as you expect because of math.
One can easily see that the minimum to your loss function is when theta are also uniformly spaced over [0, 2pi). The problem is that you are initializing your parameters as torch.zeros, which leads to all those values being equal (this is the polar opposite of equispaced!). Since your loss function is symmetrical with respect to permutations of theta, the computed gradients are equal and the gradient descent algorithm can never "differentiate them". In more mathematical terms, you're unlucky enough to initialize your algorithm exactly on a saddle point, so it cannot continue. If you add any noise, it will converge. For instance with
theta = torch.zeros(n, 1) + 0.001 * torch.randn(n, 1)
theta.requires_grad_(True)
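The symmetry argument can also be checked without autograd. The sketch below is plain NumPy, not PyTorch: the gradient is written out by hand, the grid is shortened to 1000 points to keep it quick, and loss_and_grad, optimize and max_abs_sum are names invented for the illustration.

```python
import numpy as np

def loss_and_grad(theta, grid):
    """Loss (1/s) * sum_k (sum_i sin(l_k + theta_i))^2 and its gradient in theta."""
    angles = grid[None, :] + theta[:, None]   # shape (n, s)
    wave_sum = np.sin(angles).sum(axis=0)     # shape (s,)
    loss = np.sum(wave_sum ** 2) / grid.size
    grad = 2.0 * (np.cos(angles) * wave_sum).sum(axis=1) / grid.size
    return loss, grad

def optimize(theta0, grid, learning_rate=0.1, nsteps=500):
    theta = np.array(theta0, dtype=float)
    for _ in range(nsteps):
        _, grad = loss_and_grad(theta, grid)  # recomputed from the current theta
        theta -= learning_rate * grad
    return theta

def max_abs_sum(theta, grid):
    return np.abs(np.sin(grid[None, :] + theta[:, None]).sum(axis=0)).max()

grid = np.linspace(0.0, 2.0 * np.pi, 1000)

# All-equal start: every component receives the same gradient, so they stay equal.
_, grad_equal = loss_and_grad(np.zeros(2), grid)
theta_equal = optimize(np.zeros(2), grid)

# Symmetry broken at initialization: the phases can now move apart.
theta_broken = optimize(np.array([1.0, 0.0]), grid)

max_equal = max_abs_sum(theta_equal, grid)
max_broken = max_abs_sum(theta_broken, grid)
```

For n = 2 the equal start stays stuck with a maximum near 2, while the perturbed start drives the two phases towards a difference of pi and the maximum towards zero.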

Improper cost function outputs for Vectorized Logistic Regression

I'm trying to implement vectorized logistic regression on the Iris dataset. This is the implementation from Andrew Ng's YouTube series on deep learning. My best predictions using this method have reached 81% accuracy, while sklearn's implementation achieves 100% with completely different values for the coefficients and bias. Also, I can't seem to get proper outputs from my cost function. I suspect the issue is in computing the gradients of the cost function with respect to the weights and bias, although the course provides all of the necessary equations (unless something is left out that appears only in the actual exercise, which I don't have access to). My code is as follows.
n = 4
m = 150
y = y.reshape(1, 150)
X = X.reshape(4, 150)
W = np.zeros((4, 1))
b = np.zeros((1,1))
for epoch in range(1000):
    Z = np.dot(W.T, X) + b
    A = sigmoid(Z) # 1/(1 + e **(-Z))
    J = -1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A))) #cost function
    dz = A - y
    dw = 1/m * np.dot(X, dz.T)
    db = np.sum(dz)
    W = W - 0.01 * dw
    b = b - 0.01 * db
    if epoch % 100 == 0:
        print(J)
My output looks something like this.
-1.6126604413879289
-1.6185960074767125
-1.6242504226045396
-1.6296400635926438
-1.6347800862216104
-1.6396845400653066
-1.6443664703028427
-1.648838008214648
-1.653110451818512
-1.6571943378913891
W and b values are:
array([[-0.68262679, -1.56816916, 0.12043066, 1.13296948]])
array([[0.53087131]])
Whereas sklearn outputs:
(array([[ 0.41498833, 1.46129739, -2.26214118, -1.0290951 ]]),
array([0.26560617]))
I understand sklearn uses L2 regularization but even when turned off it's still far from the correct values. Any help would be appreciated. Thanks
You are likely getting strange results because you are trying to use logistic regression where y is not a binary choice. Categorizing the iris data is a multiclass problem, y can be one of three values:
> np.unique(iris.target)
> array([0, 1, 2])
The cross entropy cost function expects y to either be one or zero. One way to handle this is the one vs all method.
You can check each class by making y a boolean that says whether the iris is in that class or not. For example, here you can make y a data set of either class 1 or not:
y = (iris.target == 1).astype(int)
With that your cost function and gradient descent should work, but you'll need to run it multiple times and pick the best score for each example. Andrew Ng's class talks about this method.
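A rough sketch of the one-vs-all procedure is below. To stay self-contained it uses made-up, well-separated 2-D clusters instead of the Iris data, and the helper fit_binary, the step size and the iteration count are all choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, learning_rate=0.5, n_iter=1000):
    """Batch gradient descent for a single binary logistic regression."""
    m, n = X.shape
    W = np.zeros((n, 1))
    b = 0.0
    y = y.reshape(-1, 1)
    for _ in range(n_iter):
        A = sigmoid(X @ W + b)
        dz = A - y
        W -= learning_rate * (X.T @ dz) / m
        b -= learning_rate * dz.mean()
    return W, b

# Stand-in data: three clusters of 50 points each (not the Iris measurements).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])
labels = np.repeat(np.arange(3), 50)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the features

# One binary model per class: "is this sample class k or not?"
models = [fit_binary(X, (labels == k).astype(float)) for k in range(3)]

# Score every sample with every model and keep the most confident class.
scores = np.hstack([sigmoid(X @ W + b) for W, b in models])
predicted = scores.argmax(axis=1)
accuracy = (predicted == labels).mean()
```

Each row of scores holds the three "class k or not" probabilities for one sample, and the predicted class is simply the column with the highest value.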
EDIT:
It's not clear what you are starting with for data. When I do this, I don't reshape the inputs, so you should double-check that all your multiplications deliver the shapes you want. One thing I notice that's a little odd is the last term in your cost function. I generally do this:
cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
not:
-1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))
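To see the difference between the two expressions, both can be evaluated at the starting point of training, where W and b are zero and the sigmoid therefore returns 0.5 for every sample. The labels below are made up for the illustration, and cost_correct and cost_as_posted are hypothetical names.

```python
import numpy as np

def cost_correct(A, Y):
    """Binary cross-entropy: the last term is log(1 - A)."""
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

def cost_as_posted(A, Y):
    """The expression from the question: the last term is 1 - log(A)."""
    return -np.mean(Y * np.log(A) + (1 - Y) * (1 - np.log(A)))

Y = np.array([0.0, 1.0, 0.0, 1.0])
A = np.full(4, 0.5)  # sigmoid output when W and b are all zeros

correct = cost_correct(A, Y)   # equals log(2), about 0.693
posted = cost_as_posted(A, Y)  # equals -0.5
```

Cross-entropy cannot be negative for labels in {0, 1}, so a negative value like the ones printed in the question points at the cost expression itself.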
Here's code that converges for me using the dataset from sklearn:
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
# Iris is a multiclass problem. Here, just calculate the probability that
# the class is `iris_class`
iris_class = 0
Y = np.expand_dims((iris.target == iris_class).astype(int), axis=1)
# Y is now a data set of booleans indicating whether the sample is or isn't a member of iris_class
# initialize w and b
W = np.random.randn(4, 1)
b = np.random.randn(1, 1)
a = 0.1 # learning rate
m = Y.shape[0] # number of samples
def sigmoid(Z):
    return 1/(1 + np.exp(-Z))
for i in range(1000):
    Z = np.dot(X ,W) + b
    A = sigmoid(Z)
    dz = A - Y
    dw = 1/m * np.dot(X.T, dz)
    db = np.mean(dz)
    W -= a * dw
    b -= a * db
    cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
    if i%100 == 0:
        print(cost)

Spark's LinearRegressionWithSGD is very sensitive to feature scaling

I have a problem fitting with LinearRegressionWithSGD in Spark's MLlib. I used their example for fitting from here https://spark.apache.org/docs/latest/mllib-linear-methods.html (using Python interface).
In their example all features are roughly standardized, with mean around 0 and standard deviation around 1. Now if I un-scale one of them by a factor of 10, the regression breaks (it gives NaNs or very large coefficients):
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from numpy import array
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    # UN-SCALE one of the features by a factor of 10
    values[3] *= 10
    return LabeledPoint(values[0], values[1:])
data = sc.textFile(spark_home+"data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)
# Build the model
model = LinearRegressionWithSGD.train(parsedData)
# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print("Model coefficients: " + str(model))
So, I guess I need to do the feature scaling. If I do pre-scaling it works (because I'm back at scaled features). However now I don't know how to get coefficients in the original space.
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from numpy import array
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.feature import StandardScalerModel
# Load and parse the data
def parseToDenseVector(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    # UN-SCALE one of the features by a factor of 10
    values[3] *= 10
    return Vectors.dense(values[0:])
# Convert a scaled vector into a LabeledPoint
def parseToLabel(values):
    return LabeledPoint(values[0], values[1:])
data = sc.textFile(spark_home+"data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parseToDenseVector)
scaler = StandardScaler(True, True)
scaler_model = scaler.fit(parsedData)
parsedData_scaled = scaler_model.transform(parsedData)
parsedData_scaled_transformed = parsedData_scaled.map(parseToLabel)
# Build the model
model = LinearRegressionWithSGD.train(parsedData_scaled_transformed)
# Evaluate the model on training data
valuesAndPreds = parsedData_scaled_transformed.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print("Model coefficients: " + str(model))
So, here I have all the coefficients in the transformed space. Now how do I get back to the original space? I also have scaler_model, which is a StandardScalerModel object, but I can't get either the means or the variances out of it. The only public method this class has is transform, which maps points from the original space to the transformed one. I can't reverse it.
I just ran into this problem. The models cannot even learn f(x) = x if x is high (>3) in the training data. So terrible.
I think rather than scaling the data another option is to change the step size. This is discussed in SPARK-1859. To paraphrase from there:
The step size should be smaller than 1 over the Lipschitz constant L.
For quadratic loss and GD, the best convergence happens at stepSize = 1/(2L). Spark has a (1/n) multiplier on the loss function.
Let's say you have n = 5 data points and the largest feature value is 1500. So L = 1500 * 1500 / 5. The best convergence happens at stepSize = 1/(2L) = 10 / (1500 ^ 2).
The last equality doesn't even make sense (how did we get a 2 in the numerator?) but I've never heard of a Lipschitz constant before, so I am not qualified to fix it. Anyway I think we can just try different step sizes until it starts to work.
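Taking the quoted definitions at face value (L = max_feature ** 2 / n and stepSize = 1 / (2L)), the arithmetic can be checked directly. This only redoes the numbers from the paraphrase above and says nothing about what Spark does internally.

```python
n = 5                 # number of data points in the quoted example
max_feature = 1500.0  # largest feature value in the quoted example

L = max_feature ** 2 / n                 # Lipschitz constant as given in the quote
step_from_formula = 1 / (2 * L)          # the quote's own rule, stepSize = 1/(2L)
step_as_quoted = 10 / max_feature ** 2   # the value the quote arrives at

ratio = step_as_quoted / step_from_formula
```

With these numbers 1/(2L) works out to 5 / (2 * 1500 ** 2), so the quoted 10 / (1500 ^ 2) is four times larger than its own formula gives.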
To rephrase your question, you want to find the intercept I and coefficients C_1 and C_2 that solve the equation: Y = I + C_1 * x_1 + C_2 * x_2 (where x_1 and x_2 are unscaled).
Let i be the intercept that mllib returns. Likewise let c_1 and c_2 be the coefficients (or weights) that mllib returns.
Let m_1 be the unscaled mean of x_1 and m_2 be the unscaled mean of x_2.
Let s_1 be the unscaled standard deviation of x_1 and s_2 be the unscaled standard deviation of x_2.
Then C_1 = (c_1 / s_1), C_2 = (c_2 / s_2), and
I = i - c_1 * m_1 / s_1 - c_2 * m_2 / s_2
This can easily be extended to 3 input variables:
C_3 = (c_3 / s_3) and I = i - c_1 * m_1 / s_1 - c_2 * m_2 / s_2 - c_3 * m_3 / s_3
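The formulas above can be wrapped in a small helper for any number of features. This sketch assumes that only the features were standardized as (x - mean) / std and that the label was left unscaled; the function name unscale and the numbers are invented for the check.

```python
def unscale(intercept, coefs, means, stds):
    """Map a model fitted on (x - mean) / std back to the original feature units."""
    original_coefs = [c / s for c, s in zip(coefs, stds)]
    original_intercept = intercept - sum(
        c * m / s for c, m, s in zip(coefs, means, stds)
    )
    return original_intercept, original_coefs

# Made-up values, only to confirm that both forms give the same prediction.
i, c = 2.0, [0.5, -1.5]            # intercept and weights in the scaled space
m, s = [10.0, 200.0], [2.0, 50.0]  # unscaled means and standard deviations
I_orig, C_orig = unscale(i, c, m, s)

x = [13.0, 120.0]  # one unscaled input point
pred_scaled = i + sum(cj * (xj - mj) / sj for cj, xj, mj, sj in zip(c, x, m, s))
pred_original = I_orig + sum(Cj * xj for Cj, xj in zip(C_orig, x))
```

Both predictions agree, which is the property the formulas for C_1, C_2 and I are meant to guarantee.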
As you pointed out StandardScalerModel object in pyspark doesn't expose std and mean attributes. There is an issue https://issues.apache.org/jira/browse/SPARK-6523
You can easily calculate them yourself
import numpy as np
from pyspark.mllib.stat import Statistics
# features is the RDD of feature vectors that was passed to the scaler
summary = Statistics.colStats(features)
mean = summary.mean()
std = np.sqrt(summary.variance())
These are the same mean and std that your scaler uses. You can verify this by reaching into the Python object's __dict__:
print(scaler_model.__dict__.get('_java_model').std())
print(scaler_model.__dict__.get('_java_model').mean())

Resources