Why is the gradient calculation of y not disabled in the following piece of code?
import torch

x = torch.randn(3, requires_grad=True)
print(x.requires_grad)
print((x ** 2).requires_grad)
y = x ** 2
print(y.requires_grad)
with torch.no_grad():
    print((x ** 2).requires_grad)
    print(y.requires_grad)
Which gives the following output:
True
True
True
False
True
The official documentation says that results have requires_grad=False even when the inputs have requires_grad=True:
Disabling gradient calculation is useful for inference, when you are sure
that you will not call Tensor.backward(). It will reduce memory
consumption for computations that would otherwise have requires_grad=True.
In this mode, the result of every computation will have
requires_grad=False, even when the inputs have requires_grad=True.
I don't know the specific implementation of torch.no_grad(), but the documentation says "the result of every computation", which means it only affects results computed inside the context, not tensors that already exist.
Run the code below:
with torch.no_grad():
    print(x.requires_grad)
which gives the output:
True
The same applies to y, which was not computed inside the torch.no_grad() context.
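As a minimal sketch of that distinction (the name z is introduced here for illustration only), the snippet below creates one tensor outside the context and one inside it, then compares their flags:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2                      # created outside no_grad: tracked by autograd

with torch.no_grad():
    z = x ** 2                  # created inside no_grad: not tracked
    inside = (x.requires_grad, y.requires_grad, z.requires_grad)

print(inside)                   # (True, True, False)
print(z.grad_fn is None)        # True: z has no graph attached
print(y.grad_fn is not None)    # True: y keeps its graph
```

Only z, the tensor produced inside the context, loses gradient tracking; x and y are unchanged.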
What's the correct way to do gradient descent on an arbitrary function with no input using Pytorch?
x = torch.tensor(x_init, requires_grad=True)
opt = torch.optim.Adam([x])
cost_fnx = cost(x)
for iteration_count in range(100):
    opt.zero_grad()
    cost_fnx.backward()
    opt.step()
When I tried the above, I got this error:
RuntimeError: Trying to backward through the graph a second time (or directly access saved variables after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved variables after calling backward.
The error occurs because you are trying to backpropagate on the same graph multiple times. You most likely need to recompute the cost value (your regularizer function since it only has the model's parameters as input) to backpropagate again. Something like:
x = x_init.requires_grad_(True)
opt = torch.optim.Adam([x])
for iteration_count in range(2):
    cost_fnx = cost(x)
    opt.zero_grad()
    cost_fnx.backward()
    opt.step()
I am trying to do something like this without redefining a = f(x,y):
a = f(x,y)
find gradient of a with respect to x
change x
find gradient of a with respect to x
find gradient of a with respect to y
I tried a partial example below, but it just gives me an error. Does anyone know how I can do this without redefining the original function every time?
>>> x = torch.tensor([2.], requires_grad=True)
>>> y = 10*x**2
>>> torch.autograd.grad(y,x, retain_graph=True)
(tensor([40.]),)
>>> x = torch.tensor([1.], requires_grad=True)
>>> torch.autograd.grad(y,x, retain_graph=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Philip/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 157, in grad
inputs, allow_unused)
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
How do I recalculate the gradient after changing the input?
In general you need to recompute the output with the new input.
Consider the way that the backpropagation algorithm works. Depending on the form of f, different intermediate results will need to be saved for later use by the backpropagation algorithm. These intermediate results may or may not depend on the original value of x, even when computing the gradient w.r.t. y.
For example, if f(x,y) = g(h(x,y)) then by the chain rule df/dy = dg/dh * dh/dy. To make this a little more concrete, let's consider the case where g is some non-linear function and h(x,y) = x*y. Then we have df/dy = g'(h(x,y))*x. Backpropagation is efficient here because it caches the intermediate value of h(x,y) during the forward pass, so all it needs to do is plug that value into g' during the backward pass. If you change the value of x, the cached value of h(x,y) is no longer the value needed to compute the gradient you are interested in (which should use h computed from the new value of x). Therefore you must rerun the forward pass to store the correct cached values.
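The caching argument can be sketched in plain Python. This is a hand-written illustration, not how PyTorch implements autograd; the class SinOfProduct and the choice g = sin are assumptions made only for the example:

```python
import math

class SinOfProduct:
    """f(x, y) = sin(x * y), i.e. g = sin and h(x, y) = x * y."""

    def forward(self, x, y):
        # Cache what the backward pass will need: the input x and h.
        self.x = x
        self.h = x * y
        return math.sin(self.h)

    def grad_y(self):
        # df/dy = g'(h) * x, evaluated with the cached values.
        return math.cos(self.h) * self.x

f = SinOfProduct()
f.forward(2.0, 0.5)
stale = f.grad_y()                    # gradient at x = 2.0

# Picking a new x without rerunning forward leaves the cache untouched,
# so grad_y() still describes the old point.
x_new = 1.0
still_stale = f.grad_y() == stale

f.forward(x_new, 0.5)                 # recompute the forward pass
fresh = f.grad_y()                    # gradient at x = 1.0
print(still_stale, stale, fresh)
```

The gradient only reflects the new x after forward has been called again, which is the same reason the output has to be recomputed in the autograd case.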
I am trying to fit a linear model, and my dataset is normalized so that each feature is divided by its maximum possible value, so the values range from 0 to 1. From my previous post, "Linear Regression vs Closed form Ordinary least squares in Python", I learned that linear regression in scikit-learn produces the same result as closed-form OLS when the fit_intercept parameter is set to False. I don't quite understand how fit_intercept works.
For any linear problem, if y is the predicted value.
y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p
Across the module, the vector w = (w_1, ..., w_p) is denoted as coef_ and w_0 as intercept_
In closed-form OLS we also have a bias value w_0: we introduce the vector X_0 = [1...1] before computing the dot product, and solve using matrix multiplication and the inverse.
w = np.dot(X.T, X)
w1 = np.dot(np.linalg.pinv(w), np.dot(X.T, Y))
When fit_intercept is True, scikit-learn's linear regression solves the problem where y is the predicted value:
y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p + b where b is the intercept term.
How does using fit_intercept change the model, and when should one set it to True or False? I was looking at the source code, and it seems the coefficients are divided by some scale:
if self.fit_intercept:
    self.coef_ = self.coef_ / X_scale
    self.intercept_ = y_offset - np.dot(X_offset, self.coef_.T)
else:
    self.intercept_ = 0
What does this scaling do exactly? I want to interpret the coefficients in both approaches (LinearRegression, closed-form OLS), but since just setting fit_intercept to True or False gives different results for linear regression, I can't quite settle on the intuition behind them. Which one is better, and why?
Let's take a step back and consider the following sentence you said:
since just setting fit_intercept True/False gives different result for Linear Regression
That is not entirely true. It may or may not be different, and it depends entirely on your data. It would help to understand what goes into the calculation of regression weights. I mean this somewhat literally: what does your input (x) data look like?
Understanding your input data, and understanding why it matters, will help you realize why you sometimes get different results and why at other times the results are the same.
Data setup
Let's set up some test data:
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(1243)
x = np.random.randint(0,100,size=10)
y = np.random.randint(0,100,size=10)
Our x and y variables look like this:
X Y
51 29
3 73
7 77
98 29
29 80
90 37
49 9
42 53
8 17
65 35
No-intercept model
Recall that the calculation of regression weights has a closed-form solution, which we can obtain using the normal equations:
w = (X^T X)^{-1} X^T y
Using this method, we get a single regression coefficient because we only have 1 predictor variable:
x = x.reshape(-1,1)
w = np.dot(x.T, x)
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))
print(w1)
[ 0.53297593]
Now, let's look at scikit-learn when we set fit_intercept = False:
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 0.53297593]
What happens when we set fit_intercept = True instead?
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[-0.35535884]
It would seem that setting fit_intercept to True and False gives different answers, and that the "correct" answer occurs only when we set it to False, but this is not entirely correct...
Intercept model
At this point we have to consider what our input data actually is. In the models above, our data matrix (also called a feature matrix, or design matrix in statistics) is just a single vector containing our x values. The y variable is not included in the design matrix. If we want to add an intercept to our model, one common approach is to add a column of 1's to the design matrix, so x becomes:
x_vals = x.flatten()
x = np.zeros((10, 2))
x[:,0] = 1
x[:,1] = x_vals
intercept x
0 1.0 51.0
1 1.0 3.0
2 1.0 7.0
3 1.0 98.0
4 1.0 29.0
5 1.0 90.0
6 1.0 49.0
7 1.0 42.0
8 1.0 8.0
9 1.0 65.0
Now, when we use this as our design matrix, we can try the closed form solution again:
w = np.dot(x.T, x)
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))
print(w1)
[ 59.60686058 -0.35535884]
Notice 2 things:
1. We now have 2 coefficients. The first is our intercept and the second is the regression coefficient for the x predictor variable.
2. The coefficient for x matches the coefficient from the scikit-learn output above when we set fit_intercept = True.
So in the scikit-learn models above, why was there a difference between True and False? Because in one case no intercept was modeled. In the other case the underlying model included an intercept, which is confirmed when you manually add an intercept term/column when solving the normal equations
If you were to use this new design matrix in scikit-learn, it doesn't matter whether you set fit_intercept to True or False: the coefficient for the predictor variable will not change (the intercept value will be different due to centering, but that's irrelevant for this discussion):
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 59.60686058 -0.35535884]
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[ 0. -0.35535884]
Summing up
The output (i.e. coefficient values) you get will be entirely dependent on the matrix that you input into these calculations (whether it's normal equations, scikit-learn, or any other).
How does it differ to use fit_intercept in a model and when should one set it to True/False
If your design matrix does not contain a 1's column, then normal equations and scikit-learn (fit_intercept = False) will give you the same answer (as you noted). However, if you set the parameter to True, the answer you get will actually be the same as normal equations if you calculated that with a 1's column.
When should you set True/False? As the name suggests, you set False when you don't want to include an intercept in your model. You set True when you do want an intercept, with the understanding that the coefficient values will change, but will match the normal equations approach when your data includes a 1's column
So True/False doesn't actually give you different results (compared to normal equations) when considering the same underlying model. The difference you observe is because you're looking at two different statistical models (one with an intercept term, and one without). The reason the fit_intercept parameter exists is so you can create an intercept model without the hassle of manually adding that 1's column. It effectively allows you to toggle between the two underlying statistical models.
Without going into the details of the mathematical formulation: when fit_intercept is set to False, the estimator deliberately sets the intercept to zero, and this in turn affects the other regressors, because the 'responsibility' for reducing the error falls onto them. As a result, the two fits can differ substantially if the data are sensitive to the presence of an intercept term. The scaling shifts the origin, thereby allowing the same closed-form solution to serve both intercept and intercept-free models.
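The "shifting the origin" idea can be sketched with NumPy on the x and y values from the data setup above. This is an illustration under the assumption that no feature scaling is applied (so X_scale plays no role); the names xc and yc are introduced here, while x_offset and y_offset mirror the names in the quoted source snippet:

```python
import numpy as np

# Data from the example above.
x = np.array([51, 3, 7, 98, 29, 90, 49, 42, 8, 65], dtype=float).reshape(-1, 1)
y = np.array([29, 73, 77, 29, 80, 37, 9, 53, 17, 35], dtype=float)

# Center x and y, which moves the origin to the data means.
x_offset = x.mean(axis=0)
y_offset = y.mean()
xc = x - x_offset
yc = y - y_offset

# Solve the intercept-free normal equations on the centered data.
coef = np.linalg.pinv(xc.T @ xc) @ (xc.T @ yc)

# Recover the intercept in the original coordinates.
intercept = y_offset - x_offset @ coef

print(coef, intercept)
```

The slope and intercept obtained this way agree with the values from the design matrix with a 1's column (about -0.355 and 59.607), even though the solve itself never sees a column of ones.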
For diagnostic purposes, I am grabbing the gradients of the network periodically. One way to do this is to return the gradients as output of the theano function. However, copying the gradients from the GPU to CPU memory every time may be costly so I would prefer to do it only periodically. At the moment, I am achieving this by creating two function objects, one which returns the gradient and one which doesn't.
However, I do not know whether this is optimal and am looking for a more elegant way to achieve the same thing.
Your first function obviously executes a training step and updates all your parameters.
The second function must return the gradients of your parameters.
The fastest way to do what you are asking is to add the updates for the training step to the second function and when logging the gradients, don't call the first function, but only the second.
gradients = [ ... ]
train_f = theano.function([x, y], [], updates=updates)
train_grad_f = theano.function([x, y], gradients, updates=updates)
num_iters = 1000
grad_array = []
for i in range(num_iters):
    # every 10 training steps keep log of gradients
    if i % 10 == 0:
        grad_array.append(train_grad_f(...))
    else:
        train_f(...)
Update
If you wish to have a single function to do this, you can do the following:
from theano.ifelse import ifelse
no_grad = T.iscalar('no_grad')
example_gradient = T.grad(example_cost, example_variable)
# if no_grad is > 0 then return the gradient, otherwise return zeros array
out_grad = ifelse(T.gt(no_grad,0), example_gradient, T.zeros_like(example_variable))
train_f = theano.function([x, y, no_grad], [out_grad], updates=updates)
So when you want to retrieve the gradients you call
train_f(x_data, y_data, 1)
otherwise
train_f(x_data, y_data, 0)
I just applied the log loss in sklearn for logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
My code looks something like this:
def perform_cv(clf, X, Y, scoring):
    kf = KFold(X.shape[0], n_folds=5, shuffle=True)
    kf_scores = []
    for train, _ in kf:
        X_sub = X[train, :]
        Y_sub = Y[train]
        # Apply 'log_loss' as a loss function
        scores = cross_validation.cross_val_score(clf, X_sub, Y_sub, cv=5, scoring='log_loss')
        kf_scores.append(scores.mean())
    return kf_scores
However, I'm wondering why the resulting logarithmic losses are negative. I'd expect them to be positive since in the documentation (see my link above) the log loss is multiplied by a -1 in order to turn it into a positive number.
Am I doing something wrong here?
Yes, this is supposed to happen. It is not a 'bug' as others have suggested. The actual log loss is simply the positive version of the number you're getting.
scikit-learn's unified scoring API always maximizes the score, so scores that need to be minimized are negated for the API to work correctly. The returned score is therefore negated when it is a quantity that should be minimized, and left positive when it is a quantity that should be maximized.
This is also described in "sklearn GridSearchCV with Pipeline" and in "scikit-learn cross validation, negative values with mean squared error".
A similar discussion can be found here.
In this way, a higher score means better performance (less loss).
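The sign convention can be sketched in plain Python. This is a hand-rolled illustration rather than scikit-learn's code; the function log_loss and the sample probabilities are made up for the example:

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss: never negative, lower is better."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]   # confident and mostly right
bad = [0.6, 0.5, 0.4, 0.5]    # hesitant or wrong

loss_good, loss_bad = log_loss(y_true, good), log_loss(y_true, bad)
score_good, score_bad = -loss_good, -loss_bad   # "higher is better" form

print(loss_good < loss_bad)    # True: the better model has the lower loss
print(score_good > score_bad)  # True: and therefore the higher (negated) score
```

Negating the loss keeps the ranking of models intact while letting a maximizer pick the best one, which is why the reported values are negative.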
I cross-checked the sklearn implementation with several other methods. It seems to be an actual bug within the framework. Instead, consider the following code for calculating the log loss:
import numpy as np

def llfun(act, pred):
    act = np.asarray(act, dtype=float)
    epsilon = 1e-15
    # clip predictions away from 0 and 1 so the logs stay finite
    pred = np.clip(pred, epsilon, 1 - epsilon)
    ll = np.sum(act * np.log(pred) + (1 - act) * np.log(1 - pred))
    return -ll / len(act)
Also take into account that act and pred have to be Nx1 column vectors.