How can I pool gradients in Theano? - theano

I'm using Theano for the first time to build a large statistical model. I'm performing a kind of stochastic gradient descent, but for each sample in the minibatch I need to perform a sampling procedure to compute the gradient. Is there a way in Theano to pool the gradients while I perform the sampling procedure for each datapoint in a minibatch, and only afterward perform the gradient update?

I don't understand what you mean by "pool".
When you compute the gradient of your cost wrt some variables, the cost has to be a scalar. So, when using minibatches, you have to combine the individual costs for the examples in the minibatch. That can be done by a sum, a mean, a weighted sum... And then that cost is backpropagated.
The gradient of that cost wrt parameters will correspond (mathematically) to the sum/mean/weighted sum of the individual gradients (on each of the examples), but that is not the way it is computed.
The gradient of that cost wrt intermediate variables that are functions of the inputs (hidden representations, etc.) will have the same format as the original minibatch, with the gradient wrt each of the examples in a different row.
So, maybe what you want is expressing your final cost as a result of your sampling procedure, and then backpropagate the gradient of that cost.
Or maybe you do not want to backpropagate the gradient of the true cost all the way, and backpropagate something that depends on the gradient instead.
In that case, you can do something like:
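import theano as th
import theano.tensor as tt
# f, g, sampling_procedure and params are placeholders for your own model
# minibatch of inputs
inputs = tt.matrix()
interm_result = f(inputs)
cost = g(interm_result).sum()
# backpropagate only as far as the intermediate result
grad_wrt_interm_result = th.grad(cost, interm_result)
sampled_grad = sampling_procedure(grad_wrt_interm_result)
# known_grads replaces the gradient wrt interm_result with sampled_grad
grad_wrt_params = th.grad(cost, params,
                          known_grads={interm_result: sampled_grad})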
That way, you can perform some of the backpropagation to interm_result, then change the gradient wrt interm_result to sampled_grad, and then finish the backpropagation towards the parameters.
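As a framework-independent illustration of the point that the gradient of a summed minibatch cost equals the sum of the per-example gradients, here is a minimal NumPy sketch. It assumes a hypothetical linear model with squared error; the names `per_example_grad`, `X`, `y` and `w` are illustrative and not taken from the question. It accumulates ("pools") the gradients one example at a time and only applies the update afterward.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # minibatch of 8 examples, 3 features
y = rng.normal(size=8)        # targets
w = np.zeros(3)               # parameters of a hypothetical linear model

def per_example_grad(w, x_i, y_i):
    # Gradient of the single-example cost 0.5 * (x_i @ w - y_i)**2 wrt w
    return (x_i @ w - y_i) * x_i

# Pool (sum) the gradients one example at a time, without updating w yet
pooled_grad = np.zeros_like(w)
for x_i, y_i in zip(X, y):
    pooled_grad += per_example_grad(w, x_i, y_i)

# Gradient of the summed minibatch cost, computed in one vectorized step
batch_grad = X.T @ (X @ w - y)

# A single parameter update after the whole minibatch has been processed
learning_rate = 0.1
w_new = w - learning_rate * pooled_grad
```

In this sketch the per-example step is where a sampling procedure could be plugged in; the update itself only ever sees the pooled gradient.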

Related

Difference between autograd.grad and autograd.backward?

Suppose I have my custom loss function and I want to fit the solution of some differential equation with help of my neural network. So in each forward pass, I am calculating the output of my neural net and then calculating the loss by taking the MSE with the expected equation to which I want to fit my perceptron.
Now my doubt is: should I use grad(loss) or should I do loss.backward() for backpropagation to calculate and update my gradients?
I understand that while using loss.backward() I have to wrap my tensors with Variable and have to set the requires_grad = True for the variables w.r.t which I want to take the gradient of my loss.
So my questions are :
Does grad(loss) also requires any such explicit parameter to identify the variables for gradient computation?
How does it actually compute the gradients?
Which approach is better?
what is the main difference between the two in a practical scenario.
It would be better if you could explain the practical implications of both approaches because whenever I try to find it online I am just bombarded with a lot of stuff that isn't much relevant to my project.
TLDR; Both are two different interfaces to perform gradient computation: torch.autograd.grad is non-mutable while torch.autograd.backward is.
Descriptions
The torch.autograd module is the automatic differentiation package for PyTorch. As described in the documentation it only requires minimal change to code base in order to be used:
you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword.
The two main functions torch.autograd provides for gradient computation are torch.autograd.backward and torch.autograd.grad:
torch.autograd.backward (source)
Description: Computes the sum of gradients of given tensors with respect to graph leaves.
Header: torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None, inputs=None)
Parameters:
- tensors – Tensors of which the derivative will be computed.
- grad_tensors – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.
- retain_graph – If False, the graph used to compute the grad will be freed. [...]
- inputs – Inputs w.r.t. which the gradient will be accumulated into .grad. All other Tensors will be ignored. If not provided, the gradient is accumulated into all the leaf Tensors that were used [...].
torch.autograd.grad (source)
Description: Computes and returns the sum of gradients of outputs with respect to the inputs.
Header: torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)
Parameters:
- outputs – outputs of the differentiated function.
- inputs – Inputs w.r.t. which the gradient will be returned (and not accumulated into .grad).
- grad_outputs – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of the corresponding outputs.
- retain_graph – If False, the graph used to compute the grad will be freed. [...].
Usage examples
In terms of high-level usage, you can look at torch.autograd.grad as a non-mutating function. As mentioned in the documentation excerpts above, it does not accumulate the gradients in the grad attribute but instead returns the computed partial derivatives. In contrast, torch.autograd.backward mutates the tensors by updating the grad attribute of leaf nodes, and the function itself returns nothing. This makes the latter more convenient when computing gradients for a large number of parameters.
In the following, we will take two inputs (x1 and x2), calculate a tensor y with them, and then compute the partial derivatives of the result w.r.t. both inputs, i.e. dL/dx1 and dL/dx2:
>>> x1 = torch.rand((), requires_grad=True)  # 0-dim (scalar) leaf tensor
>>> x2 = torch.rand((), requires_grad=True)
>>> x1, x2
(tensor(0.3939, requires_grad=True),
 tensor(0.7965, requires_grad=True))
Inference:
>>> y = x1**2 + 5*x2
>>> y
tensor(4.1377, grad_fn=<AddBackward0>)
Since y was computed using tensor(s) requiring gradients (i.e. with requires_grad=True), and outside of a torch.no_grad context, it has a grad_fn function attached. This callback is used to backpropagate through the computation graph to compute the gradients of preceding tensor nodes.
torch.autograd.grad:
Here we provide torch.ones_like(y) as the grad_outputs.
>>> torch.autograd.grad(y, (x1, x2), torch.ones_like(y))
(tensor(0.7879), tensor(5.))
The above output is a tuple containing the two partial derivatives w.r.t. the provided inputs, in order of appearance, i.e. dL/dx1 and dL/dx2.
This corresponds to the following computation:
# dL/dx1 = dL/dy * dy/dx1 = grad_outputs * 2*x1
# dL/dx2 = dL/dy * dy/dx2 = grad_outputs * 5
torch.autograd.backward: in contrast it will mutate the provided tensors by updating the grad of the tensors which have been used to compute the output tensor and that require gradients. It is equivalent to the torch.Tensor.backward API. Here, we go through the same example by defining x1, x2, and y again. We call backward:
>>> # y.backward(torch.ones_like(y))
>>> torch.autograd.backward(y, torch.ones_like(y))
None
Then you can retrieve the gradients on x1.grad and x2.grad:
>>> x1.grad, x2.grad
(tensor(0.7879), tensor(5.))
In conclusion: both perform the same operation. They are two different interfaces to interact with the autograd library and perform gradient computations. The latter, torch.autograd.backward (equivalent to torch.Tensor.backward), is generally used in neural networks training loops to compute the partial derivative of the loss w.r.t each one of the model's parameters.
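To make the practical difference concrete, here is a small sketch (the variable names are illustrative, not from the answer above) showing that torch.autograd.grad leaves .grad untouched, while repeated backward calls accumulate into it, which is why training loops reset gradients between steps.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# torch.autograd.grad returns the gradient and leaves x.grad untouched
(g,) = torch.autograd.grad(x ** 2, x)
grad_attr_after_grad = x.grad          # nothing was accumulated

# backward() accumulates into x.grad: two calls add up
(x ** 2).backward()
first = x.grad.clone()                 # d(x^2)/dx at x = 3
(x ** 2).backward()
second = x.grad.clone()                # the same gradient added a second time

# Training loops therefore reset gradients between steps
x.grad.zero_()
```

For a typical training loop, loss.backward() followed by an optimizer step is the usual choice; torch.autograd.grad is handy when the derivative itself is needed as a value, for instance inside a loss that contains derivatives of the network output.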
You can read more about how torch.autograd.grad works by reading through this other answer I made on: Meaning of grad_outputs in PyTorch's torch.autograd.grad.
In addition to Ivan's answer: since torch.autograd.grad does not accumulate gradients into .grad, it can avoid race conditions in multi-threaded scenarios.
Quoting PyTorch doc https://pytorch.org/docs/stable/notes/autograd.html#non-determinism
If you are calling backward() on multiple thread concurrently but with shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in racing condition and the result might be invalid to use.
But this is expected pattern if you are using the multithreading approach to drive the whole training process but using shared parameters, user who use multithreading should have the threading model in mind and should expect this to happen. User could use the functional API torch.autograd.grad() to calculate the gradients instead of backward() to avoid non-determinism.
implementation details https://github.com/pytorch/pytorch/blob/7e3a694b23b383e38f5e39ef960ba8f374d22404/torch/csrc/autograd/functions/accumulate_grad.h

How does logistic regression build Sigmoid curve from categorical dependent variable?

I'm exploring the Scikit-learn logistic regression algorithm. I understand that as part of the training, the algorithm builds a regression curve where the y-variable ranges from 0 to 1 (sigmoid S-curve). The y-variable is a continuous variable here (although in reality it is a discrete variable).
How is the algorithm able to learn the S-curve, when the training dataset reflects reality and includes the y-variable as a discrete variable? There is no probability estimate in the training, so I'm wondering how is the algorithm able to learn the S-curve.
There is no probability estimate in the training
Sure, but we pretend there is for modeling purposes. We want to maximize the probability of, as you call it, “reality”—if the observed response (the discrete value you refer to) is a 0, we want to predict that with probability 1; similarly, if the response is a 1, we want to predict that with probability 1.
Fitting the model to one data point, getting the right answer with probability 1, would be easy. Of course, we have more than one data point. We have to balance concerns between these. We want the predicted value sigmoid(weights * features) to be close to the true response (0 or 1) for all of the data points, but there may not be a way to set the parameters of the model to achieve this. (That is, the data may not be linearly separable.)
Good question! The fitting process in logistic regression is a search procedure that seeks the beta coefficients that minimize the error in the probabilities predicted by the model (continuous values) and the data (discrete values).
In logistic regression, you model probabilities using a logistic function (also known as a sigmoid function):
XB = B0 + B1 * X1 + B2 * X2 + ... + BN * XN
p(X) = e^(XB) / (1 + e^(XB))
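The algorithm finds the beta coefficients by Maximum Likelihood estimation: it looks for the betas under which the observed 0/1 labels are most probable. Equivalently, it minimizes a cost function, the negative log-likelihood (also called log loss or cross-entropy), to which scikit-learn adds a regularization penalty by default:
-sum [ y_i * log(p(X_i)) + (1 - y_i) * log(1 - p(X_i)) ]
Other error measures between the predicted probabilities and the labels, such as sum (p(X_i) - y_i)^2 or sum |p(X_i) - y_i|, are sometimes used to illustrate the idea, but they are not what maximum likelihood optimizes.
An initial set of betas is chosen, the cost is calculated, and the algorithm then picks a new set of betas that results in a lower cost. The algorithm stops searching for new betas when the improvement falls below a given threshold (set by the tol parameter in sklearn).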
The way the model converges to the final set of coefficients depends on the solver parameter. Each solver has a different way of converging to the final set of betas, but they usually converge to the same results.
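The search described above can be sketched with plain gradient descent on the log loss. This is only an illustration under simplifying assumptions (no regularization, a fixed number of iterations, hypothetical toy data and helper names); scikit-learn's solvers are more sophisticated. It shows how purely discrete 0/1 labels still yield a smooth S-curve of probabilities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, n_iter=5000):
    # Add an intercept column so that XB = B0 + B1 * X1 + ...
    Xb = np.column_stack([np.ones(len(X)), X])
    betas = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ betas)               # predicted probabilities
        grad = Xb.T @ (p - y) / len(y)        # gradient of the mean log loss
        betas -= learning_rate * grad
    return betas

# Toy data: labels are only 0 or 1, with overlap so the classes are not separable
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

betas = fit_logistic(X, y)
probs = sigmoid(betas[0] + betas[1] * X[:, 0])   # continuous values in (0, 1)
```

The fitted probabilities rise monotonically with the feature even though no probability was ever observed: the curve is the one that makes the observed labels most likely.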

Is it acceptable to scale target values for regressors?

I am getting very high RMSE and MAE for MLPRegressor, ForestRegression and Linear regression when only the input variables are scaled (30,000+). However, when I scale the target values as well, I get an RMSE of 0.2. I would like to know if that is an acceptable thing to do.
Secondly, is it normal to have a better R squared value for test than for train (i.e. 0.98 for test and 0.85 for train)?
Thank You
Answering your first question, I think you are being misled by the performance measures you have chosen to evaluate your model with. Both RMSE and MAE are sensitive to the range in which you measure your target variable; if you scale down your target variable, the values of RMSE and MAE will certainly go down. Let's take an example to illustrate that.
import numpy as np
def rmse(y_true, y_pred):
    return np.sqrt(np.mean(np.square(y_true - y_pred)))
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
I have written two functions for computing RMSE and MAE. Now let's plug in some values and see what happens:
y_true = np.array([2,5,9,7,10,-5,-2,2])
y_pred = np.array([3,4,7,9,8,-3,-2,1])
For the time being, let's assume that the true and the predicted values are as shown above. Now we are ready to compute RMSE and MAE for this data.
rmse(y_true,y_pred)
1.541103500742244
mae(y_true, y_pred)
1.375
Now let's scale down our target variable by a factor of 10 and compute the same measure again.
y_scaled_true = np.array([2,5,9,7,10,-5,-2,2])/10
y_scaled_pred = np.array([3,4,7,9,8,-3,-2,1])/10
rmse(y_scaled_true,y_scaled_pred)
0.15411035007422444
mae(y_scaled_true,y_scaled_pred)
0.1375
We can now clearly see that just by scaling our target variable, our RMSE and MAE scores have dropped, creating the illusion that our model has improved when it actually has not. When we scale the model's predictions back, we are in the same state as before.
So, coming to the point, MAPE (Mean Absolute Percentage Error) could be a better way to measure the performance of your model, as it is insensitive to the scale in which the variables are measured. If you compute MAPE for both sets of values, you see that they are the same:
def mape(y, y_pred):
    return np.mean(np.abs((y - y_pred)/y))
mape(y_true,y_pred)
0.28849206349206347
mape(y_scaled_true,y_scaled_pred)
0.2884920634920635
So it is better to rely on MAPE over MAE or RMSE if you want your performance measure to be independent of the scale in which the targets are measured (it is invariant to multiplying the targets by a constant, though not to shifting them). Keep in mind that MAPE is undefined when a true value is 0.
Answering your second question: since you are dealing with fairly complex models like MLPRegressor and ForestRegression, which have hyper-parameters that need to be tuned to avoid overfitting, the best way to find good hyper-parameter values is to divide the data into train, validation and test sets and use techniques like K-Fold Cross Validation to find the optimal setting. It is quite difficult to say whether the above values are acceptable just by looking at this one case.
It is actually a common practice to scale target values in many cases.
For example, a highly skewed target may give better results if a log or log1p transform is applied to it. I don't know the characteristics of your data, but there could be a transformation that decreases your RMSE.
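A minimal sketch of the log1p idea, on synthetic data invented purely for illustration (the data-generating process, the least-squares fit and all names are assumptions, not the asker's setup). The key point is that predictions are mapped back with expm1 so that both errors are measured on the original scale and can be compared fairly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=200)
# Hypothetical right-skewed target: exponential in x with multiplicative noise
y = np.exp(1.0 + 0.8 * x + rng.normal(scale=0.1, size=200))

A = np.column_stack([np.ones_like(x), x])      # design matrix with intercept

def rmse(y_true, y_pred):
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

# Fit 1: linear least squares directly on the raw target
coef_raw, *_ = np.linalg.lstsq(A, y, rcond=None)
pred_raw = A @ coef_raw

# Fit 2: least squares on log1p(y), then map predictions back with expm1
coef_log, *_ = np.linalg.lstsq(A, np.log1p(y), rcond=None)
pred_log = np.expm1(A @ coef_log)

# Both errors are measured on the ORIGINAL scale, so they are comparable
rmse_raw = rmse(y, pred_raw)
rmse_log = rmse(y, pred_log)
```

On this synthetic data the transformed fit has the lower error because the target really is exponential in x; on real data the transform may or may not help, so it is worth checking both.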
Secondly, Test set is meant to be a sample of unseen data, to give a final estimate of your model's performance. When you see the unseen data and tune to perform better on it, it becomes a cross validation set.
You should try to split your data into three parts: train, cross-validation and test sets. Train on the training set and tune parameters according to the model's performance on the cross-validation set. After you are done tuning, run the model on the test set to get an estimate of how it works on unseen data, and report that as the accuracy of your model.
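The three-way split can be sketched as follows. The helper `train_val_test_split` and its fraction parameters are hypothetical names chosen for this example; scikit-learn's `train_test_split` applied twice would achieve the same thing.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle indices once, then cut them into three disjoint parts
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
# Tune hyper-parameters using val_idx only; evaluate once on test_idx at the end
```

Because the test indices are never consulted during tuning, the final score on them remains an estimate for unseen data.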

Scikit-learn learning curve strongly dependent on batch size of MLPClassifier? Or: how to diagnose bias/variance for NN?

I am currently working on a classification problem with two classes in scikit-learn, using the solver adam and activation relu. To explore whether my classifier suffers from high bias or high variance, I plotted the learning curve with scikit-learn's built-in function:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
I am using a Group-K_Fold crossvalidation with 8 splits.
However, I found that my learning curve is strongly dependent on the batch size of my classifier:
https://imgur.com/a/FOaWKN1
Is it supposed to be like this? I thought learning curves track the accuracy score as a function of the portion of training data used, independent of any batches/epochs. Can I actually use this built-in function for batch methods? If yes, which batch size should I choose (full batch or batch size = number of training examples or something in between) and what diagnosis do I get from this? Or how do you usually diagnose bias/variance problems of a neural network classifier?
Help would be really appreciated!
Yes, the learning curve depends on the batch size.
The optimal batch size depends on the type of data and the total volume of the data.
In the ideal case a batch size of 1 would be best, but in practice, with big volumes of data, this approach is not feasible.
I think you have to find it through experimentation, because you can't easily calculate the optimal value.
Moreover, when you change the batch size you might want to change the learning rate as well, so you need to keep control over the process.
But indeed having a tool to find the optimal (memory and time-wise) batch size is quite interesting.
What is Stochastic Gradient Descent?
Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.
The update of the model for each training example means that stochastic gradient descent is often called an online machine learning algorithm.
What is Batch Gradient Descent?
Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.
What is Mini-Batch Gradient Descent?
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the mini-batch or take the average of the gradient which further reduces the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.
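The three variants differ only in how many examples feed each update, which the following sketch makes explicit with a single `batch_size` parameter. It assumes a hypothetical noiseless linear regression problem and averages the gradient over each batch; the function name and defaults are illustrative.

```python
import numpy as np

def gradient_descent(X, y, batch_size, learning_rate=0.1, n_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    n_updates = 0
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)        # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            grad = X[batch].T @ residual / len(batch)   # mean gradient over the batch
            w -= learning_rate * grad
            n_updates += 1
    return w, n_updates

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w                                    # noiseless linear targets

w_sgd, updates_sgd = gradient_descent(X, y, batch_size=1)       # stochastic
w_mini, updates_mini = gradient_descent(X, y, batch_size=16)    # mini-batch
w_full, updates_full = gradient_descent(X, y, batch_size=64)    # batch
```

With the same number of epochs, a smaller batch size means many more (noisier) updates, which is one reason a learning curve computed at a fixed epoch budget can look different for different batch sizes.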
Source: https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/

Re-weight the input to a neural network

For example, I feed a set of images into a CNN. And the default weight of these images is 1. How can I re-weight some of these images so that they have different weights? Can 'DataLoader' achieve this goal in pytorch?
I learned two other possibilities:
Defining a custom loss function, providing weights for each sample as I require.
Repeating samples in the training set, which will result in more frequent samples having a higher weight in the final loss.
Is there any other way, we can achieve that? Any suggestion would be appreciated.
I can think of two ways to achieve this.
Pass on the weight explicitly, when you backpropagate the gradients.
After you have computed the loss, and when you're about to backpropagate, you can pass a Tensor to backward() and all the subsequent gradients will be scaled by the corresponding element, i.e. do something like
loss = torch.pow(out-target,2)
loss.backward(my_weights) # my_weights is a Tensor of the same shape as loss
However, if you want to assign individual weights in a batch, you can't use the built-in loss functions from torch.nn with their default reduction, which aggregates the loss over all samples in a batch (unless you pass reduction='none').
Use torch.utils.data.sampler.WeightedRandomSampler
If you use PyTorch's torch.utils.data anyway, this is simpler than duplicating samples in your training set. However, it doesn't assign exact weights, since it's stochastic. But if you're iterating over your training set a sufficient number of times, it's probably close enough.
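Both approaches can be sketched on toy data. The tensors, the weights and the linear stand-in model below are made up for illustration; option 1 uses an unreduced loss so each sample's weight applies exactly, while option 2 uses WeightedRandomSampler so that heavier samples are simply drawn more often.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical toy data: 6 samples, 4 features, 2 classes
features = torch.randn(6, 4)
labels = torch.tensor([0, 1, 0, 1, 0, 1])
sample_weights = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0, 5.0])  # last sample counts 5x

model = torch.nn.Linear(4, 2)   # stand-in for a CNN

# Option 1: exact per-sample weights applied to an unreduced loss
logits = model(features)
per_sample_loss = F.cross_entropy(logits, labels, reduction="none")  # one loss per sample
weighted_loss = (per_sample_loss * sample_weights).sum() / sample_weights.sum()
weighted_loss.backward()

# Option 2: stochastic weighting, drawing samples in proportion to their weight
sampler = WeightedRandomSampler(sample_weights, num_samples=6000, replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=32, sampler=sampler)
batch_features, batch_labels = next(iter(loader))

# Empirical draw frequency per sample index, to compare against the weights
drawn = torch.tensor(list(sampler))
frequency = torch.bincount(drawn, minlength=6).float() / len(drawn)
```

The sampler only reproduces the weights on average, as noted above, whereas the weighted loss applies them exactly in every batch.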
