Keras loss functions: how to round? - keras

I'm trying to recognize turning points in sequences, i.e. the points after which some process behaves differently. I use a Keras model to do this. The input is the sequence (always the same length) and the output should be 0 before the turning point and 1 after it.
I want the loss function to depend on the distance between the actual turning point and the predicted turning point.
I tried rounding the predictions (to obtain the labels 0 or 1) and then summing the number of 1's to get the "index" of the turning point. This assumes that the model gives just one turning point, as the (synthetically produced) data also has just one turning point. What I tried is:
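from keras import backend as K

def dist_loss(yTrue, yPred):
    # The number of 1's in the labels encodes the true turning point
    turningPointTrue = K.sum(yTrue)
    # Round the predictions to 0/1, then count the 1's
    turningPointPred = K.sum(K.round(yPred))
    return K.abs(turningPointTrue - turningPointPred)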
This does not work, the following error is given:
ValueError: An operation has None for gradient. Please make sure
that all of your ops have a gradient defined (i.e. are
differentiable). Common ops without gradient: K.argmax, K.round,
K.eval.
I think this means that K.round(yPred) gives a singular value, instead of a vector/tensor. Does anyone know how to solve this issue?

The round operation has no defined gradient, so it cannot be used inside a loss function: to train a neural network, the gradient of the loss with respect to the weights has to be computed, which means that every part of the network and of the loss must be differentiable (or a differentiable approximation must be available).
In your case you should try to find an approximation to round that is differentiable, but unfortunately I don't know if there is one. One example of such an approximation is the softmax function, which serves as a differentiable approximation of the argmax function.
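As an illustration of what a differentiable stand-in for rounding could look like, here is a minimal sketch in plain Python. It assumes the predictions lie in [0, 1] and replaces round with a steep sigmoid centred at 0.5. The names soft_round, soft_dist_loss and the sharpness parameter are hypothetical, not part of Keras; in a Keras loss the same expression would presumably be built from K.sigmoid and K.sum, which do have gradients.

```python
import math

def soft_round(p, sharpness=10.0):
    """Smooth stand-in for round(p) on [0, 1]: a steep sigmoid centred at 0.5."""
    return 1.0 / (1.0 + math.exp(-sharpness * (p - 0.5)))

def soft_dist_loss(y_true, y_pred, sharpness=10.0):
    """|sum(y_true) - sum(soft_round(y_pred))|, a smooth surrogate for dist_loss."""
    count_true = sum(y_true)
    count_pred = sum(soft_round(p, sharpness) for p in y_pred)
    return abs(count_true - count_pred)

y_true = [0, 0, 0, 1, 1]
good_pred = [0.1, 0.1, 0.2, 0.9, 0.9]  # turning point in the right place
bad_pred = [0.9, 0.9, 0.9, 0.9, 0.9]   # turning point predicted far too early

print(soft_dist_loss(y_true, good_pred))  # small loss
print(soft_dist_loss(y_true, bad_pred))   # much larger loss
```

A larger sharpness makes the surrogate closer to true rounding, but also makes the gradient vanish away from 0.5, so the value is a trade-off to tune rather than a fixed choice.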

Related

Does equal probabilities not summing to one in torch.utils.data.WeightedRandomSampler still make it uniform?

In PyTorch, there is a sampler class called WeightedRandomSampler (https://pytorch.org/docs/stable/data.html#torch.utils.data.WeightedRandomSampler). Its 'weights' parameter expects probabilities for N samples. For a uniform distribution, I believe it expects an array in which every value is 1/N.
But if I put, say, 0.5 for each sample, so that N*0.5 is not equal to 1, is the sampling still uniform, given that every sample has the same probability?
Yes, the sampling will still be uniform. Only the relative magnitude of the weights with respect to each other matters, not their absolute magnitude, because the weights are normalized when sampling.
If we look under the hood of WeightedRandomSampler, it makes a call to torch.multinomial, which treats its input as unnormalized weights: the values do not need to sum to one, they only need to be non-negative and finite with a non-zero sum.
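The effect of that normalization can be sketched without PyTorch. The snippet below is only an illustration in plain Python: normalize is a hypothetical helper, and random.choices (which also accepts relative weights) stands in for the sampler.

```python
import random
from collections import Counter

def normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

n = 4
# Equal weights of 0.5 normalize to the same distribution as equal weights of 1/n.
print(normalize([0.5] * n))
print(normalize([1 / n] * n))

# Sampling with the unnormalized weights: each index should appear about equally often.
random.seed(0)
draws = random.choices(range(n), weights=[0.5] * n, k=10_000)
print(Counter(draws))
```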

How can r-squared be negative when the correlation between prediction and truth is positive?

I am trying to understand how the r-squared (and also explained variance) metrics can be negative (thus indicating non-existent forecasting power) when, at the same time, the correlation between prediction and truth (as well as the slope in a linear regression of truth on prediction) is positive.
R Squared can be negative in a rare scenario.
R squared = 1 – (SSR/SST)
Here, SST stands for the total sum of squares, which measures how much the observed data points vary around the mean of the target variable. The mean plays the role of the regression line here.
SST = Sum (Square (Each data point- Mean of the target variable))
For example,
If we want to build a regression model to predict the height of a student with weight as the independent variable, then one possible prediction that takes little effort is to calculate the mean height of all current students and use it as the prediction.
In the above diagram, the red line is the regression line, which is nothing but the mean of all heights. This mean is calculated without much effort and can be considered one of the worst methods of prediction, with poor accuracy. In the diagram itself we can see that the prediction is nowhere near the original data points.
Now come to SSR,
SSR stands for the sum of squared residuals. The residuals are calculated from the model that we build with our chosen mathematical approach (linear regression, Bayesian regression, polynomial regression or any other approach). If we use a sophisticated approach rather than a naive one like the mean, our accuracy will usually increase.
SSR = Sum (Square (Each data point - Each corresponding data point in the regression line))
In the above diagram, let's say that the blue line represents a sophisticated model built with extensive mathematical analysis. We can see that it has clearly higher accuracy than the red line.
Now come to the formula,
R Squared = 1- (SSR/SST)
Here,
SST will be a large number, because the mean is a very poor model (red line).
SSR will be a small number, because it comes from the best model we developed after much mathematical analysis (blue line).
So SSR/SST will be a very small number (it becomes smaller whenever SSR decreases).
So 1 - (SSR/SST) will be close to 1.
So we can infer that the higher R squared is, the better the model fits.
This is the generic case, but it does not carry over to many situations where multiple independent variables are present. In the example we had only one independent variable and one target variable, but in a real case we may have hundreds of independent variables for a single dependent variable. The actual problem is that, out of those hundreds of independent variables:
Some variables will have a very high correlation with the target variable.
Some variables will have a very small correlation with the target variable.
Some independent variables will have no correlation at all.
So, R squared is calculated on the assumption that the mean line of the target, a horizontal line perpendicular to the y axis, is the worst fit a model can reasonably have. SST is the sum of squared differences between this mean line and the original data points. Similarly, SSR is the sum of squared differences between the predicted data points (from the model plane) and the original data points.
SSR/SST is a ratio that tells how bad the model is relative to the mean baseline. If your model builds a plane that is even somewhat better than that baseline, which is what happens in the vast majority of cases, then SSR < SST, and R squared is positive when you substitute into the equation.
But what if SSR > SST? This means that your regression plane is worse than the mean line. In this case, R squared will be negative. This happens only in a small minority of cases.
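To connect this back to the question, here is a small worked example in plain Python, using made-up numbers chosen only for illustration. The predictions follow the truth perfectly in shape (correlation of 1) but are shifted by a constant offset, so SSR exceeds SST and R squared comes out negative. The helper names mean, r_squared and pearson are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def r_squared(y_true, y_pred):
    """R squared = 1 - SSR/SST, with SST measured around the mean of y_true."""
    y_bar = mean(y_true)
    ssr = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ssr / sst

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [t + 10.0 for t in y_true]  # right shape, wrong level

print(pearson(y_true, y_pred))    # correlation is 1
print(r_squared(y_true, y_pred))  # SSR = 500, SST = 10, so R squared = -49
```

Correlation ignores offset and scale, whereas R squared penalizes every squared deviation from the truth, which is why the two can disagree in sign.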
This answer was originally written by me on Quora:
https://qr.ae/pNsLU8
https://qr.ae/pNsLUr

How does ReLu work with zero-centered output domain?

In the problem I am trying to solve, my output domain is zero-centered, between -1 and 1. When looking up activation functions I noticed that ReLU outputs only non-negative values, which basically would mean that your output is all negative or all positive.
This can be mapped back to the appropriate domain through inverse normalization, but ReLU is designed to determine the "strength" of a neuron in a single direction, whereas in my problem I need to determine the strength of a neuron in one of two directions. If I use tanh, I have to worry about vanishing/exploding gradients, but if I use ReLU, my output will always be "biased" towards positive or negative values, because essentially really small values would have to be mapped to the positive domain and large values to the negative domain, or vice versa.
Other info: I've used ReLU and it works well, but I fear that it is for the wrong reasons. The reason I say this is that, for either the positive or the negative domain, approaching smaller values seems to mean a stronger connection up to a point, after which the neuron will not be activated at all. Yes, the network can technically work (probably harder than it needs to) to keep the entire domain of training outputs in the positive space, but if a value happens to exceed the bounds of the training set, will it be non-existent, when in reality it should be even more active?
What is the appropriate way to deal with zero centered output domains?
I think you have to use the sign function. It is zero-centered and has -1 and 1 as its outputs.
Sign function:
https://helloacm.com/wp-content/uploads/2016/10/math-sgn-function-in-cpp.jpg
You could go with variations of ReLU whose outputs have a mean closer to zero or equal to zero (ELU, CELU, PReLU and others) and which have other interesting specific traits. Furthermore, these would help with the dying neurons problem in ReLU.
Anyway, I'm not aware of any hard research proving the usefulness of one over the other; it is still in the experimentation phase and really problem dependent from what I recall (please correct me if I'm wrong).
And you should really check whether the activation function is problematic in your case; it might be totally fine to go with ReLU.
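To make the "mean closer to zero" remark concrete, here is a small sketch in plain Python comparing ReLU with ELU on a handful of symmetric inputs. The input values are made up for illustration and alpha=1.0 is an assumed default, so this is not a statement about any particular network.

```python
import math

def relu(x):
    """ReLU: negative inputs are cut to zero."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, smooth saturation towards -alpha for x <= 0."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]

mean_relu = sum(relu(x) for x in inputs) / len(inputs)
mean_elu = sum(elu(x) for x in inputs) / len(inputs)

print(mean_relu)  # all negative inputs contribute 0, so the mean is pulled upwards
print(mean_elu)   # negative inputs contribute negative outputs, so the mean is closer to 0
```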
First, you don't have to put an activation function after the last layer in your neural network. An activation function is required between layers to introduce non-linearity, so it's not required in the last layer.
You're free to experiment with various options:
Use tanh. Vanishing/exploding gradients are sometimes not a problem in practice, depending on the network architecture and on whether you initialize the weights properly.
Do nothing. The NN should be trained to output values between -1 and 1 for "typical" inputs. You can clip the value in the application layer.
Clip the output in the network. E.g. out = tf.clip_by_value(out, -1.0, 1.0)
Be creative and try your other ideas.
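The difference between the tanh option and the clipping option can be seen on a few raw output values. This is only a sketch in plain Python with made-up inputs; tanh_output and clipped_output are hypothetical helper names, and clipped_output mimics what tf.clip_by_value(out, -1.0, 1.0) does to a single number.

```python
import math

def tanh_output(z):
    """Squash any real value smoothly into the open interval (-1, 1)."""
    return math.tanh(z)

def clipped_output(z, low=-1.0, high=1.0):
    """Leave values inside [low, high] untouched and cut off everything outside."""
    return max(low, min(high, z))

for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(z, tanh_output(z), clipped_output(z))
```

Clipping keeps the output linear inside the range but has zero gradient outside it, while tanh bends every value and only approaches the bounds.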
In the end, ML is a process of trial and error. Try different things and find something that works for you. Good luck.

Calculate gradient of neural network

I am reading about adversarial images and breaking neural networks. I am trying to work through the article step by step, but due to my inexperience I am having a hard time understanding the following instructions.
At the moment, I have a logistic regression model for the MNIST data set. If you give an image, it will predict the number that it most likely is...
saver.restore(sess, "/tmp/model.ckpt")
# image of number 7
x_in = np.expand_dims(mnist.test.images[0], axis=0)
classification = sess.run(tf.argmax(pred, 1), feed_dict={x:x_in})
print(classification)
Now, the article states that in order to break this image, the first thing we need to do is get the gradient of the neural network. In other words, this will tell me the direction needed to make the image look more like a number 2 or 3, even though it is a 7.
The article states that this is relatively simple to do using back propagation. So you may define a function...
compute_gradient(image, intended_label)
...and this basically tells us what kind of shape the neural network is looking for at that point.
This may seem easy to implement to those more experienced but the logic evades me.
From the parameters of the function compute_gradient, I can see that you feed it an image and an array of labels where the value of the intended label is set to 1.
But I do not see how this is supposed to return the shape of the neural network.
Anyways, I want to understand how I should implement this back propagation algorithm to return the gradient of the neural network. If the answer is not very straightforward, I would like some step-by-step instructions as to how I may get my back propagation to work as the article suggests it should.
In other words, I do not need someone to just give me some code that I can copy but I want to understand how I may implement it as well.
Back propagation involves calculating the error in the network's output (the cost function) as a function of the inputs and the parameters of the network, then computing the partial derivative of the cost function with respect to each parameter. It's too complicated to explain in detail here, but this chapter from a free online book explains back propagation in its usual application as the process for training deep neural networks.
Generating images that fool a neural network simply involves extending this process one step further, beyond the input layer, to the image itself. Instead of adjusting the weights in the network slightly to reduce the error, we adjust the pixel values slightly to increase the error, or to reduce the error for the wrong class.
There's an easy (though computationally intensive) way to approximate the gradient with a technique from Calc 101: for a small enough e, df/dx is approximately (f(x + e) - f(x)) / e.
Similarly, to calculate the gradient with respect to an image with this technique, calculate how much the loss/cost changes after adding a small change to a single pixel, save that value as the approximate partial derivative with respect to that pixel, and repeat for each pixel.
Then the gradient with respect to the image is approximately:
(
  (cost(x1+e, x2, ..., xn) - cost(x1, x2, ..., xn)) / e,
  (cost(x1, x2+e, ..., xn) - cost(x1, x2, ..., xn)) / e,
  ...,
  (cost(x1, x2, ..., xn+e) - cost(x1, x2, ..., xn)) / e
)
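The per-pixel procedure above can be sketched in a few lines of plain Python. This is an illustration only: numerical_gradient and toy_cost are hypothetical names, and toy_cost is a stand-in for the real network loss, which would take the image pixels and the intended label.

```python
def numerical_gradient(cost, x, eps=1e-6):
    """Approximate d cost / d x[i] for every i with forward differences."""
    base = cost(x)
    grad = []
    for i in range(len(x)):
        nudged = list(x)   # copy so that only one coordinate changes at a time
        nudged[i] += eps
        grad.append((cost(nudged) - base) / eps)
    return grad

def toy_cost(x):
    """Stand-in cost: sum of squares, whose exact gradient is 2 * x."""
    return sum(v * v for v in x)

print(numerical_gradient(toy_cost, [1.0, -2.0, 0.5]))  # close to [2.0, -4.0, 1.0]
```

For an image with n pixels this needs n + 1 evaluations of the cost, which is why back propagation, which gets all partial derivatives in one backward pass, is preferred in practice.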

Is linear regression the same thing as ordinary least squares in SPSS?

I want to use a linear regression model, but I want to use ordinary least squares, which I think is a type of linear regression. The software I use is SPSS. It only has linear regression, partial least squares and 2-stage least squares. I have no idea which one is ordinary least squares (OLS).
Yes. Although 'linear regression' refers to any approach to modelling a linear relationship between a dependent variable and one or more independent variables, OLS is the standard method used to fit a linear regression to a set of data.
Linear regression is a broad term that just says we are finding a linear relationship between the dependent and independent variable(s), no matter what technique we are using.
OLS is just one of the techniques for doing linear regression.
Let's say,
error(e) = (observed value - predicted value)
Observed values - blue dots in the picture
Predicted values - points on the line (vertically below the observed values)
The vertical lines represent 'e'. We square them, add them up and get the total error, which we try to reduce.
For OLS, as the name says (ordinary least squares), we minimize the sum of all e^2, i.e. we try to make the total squared error as small as possible.
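For a single independent variable, the line that minimizes the sum of e^2 has a well-known closed form, sketched below in plain Python. The data points are made up for illustration, and ols_fit and sum_squared_error are hypothetical helper names, not SPSS functionality.

```python
def ols_fit(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """Sum of e^2, where e = observed value - predicted value."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

slope, intercept = ols_fit(xs, ys)
print(slope, intercept)
print(sum_squared_error(xs, ys, slope, intercept))        # error of the OLS line
print(sum_squared_error(xs, ys, slope + 0.1, intercept))  # any other line does worse
```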
