I have sequential data and I declared an LSTM model in Keras which predicts y from x. If I call model.predict(x1) and then model.predict(x2), is it correct to call model.reset_states explicitly between the two predict() calls? And model.reset_states clears the history of inputs, not the weights, right?
# data1
x1 = [2,4,2,1,4]
y1 = [1,2,3,2,1]
# data2
x2 = [5,3,2,4,5]
y2 = [5,3,2,3,2]
And in my actual code, I use model.evaluate(). In evaluate(), is reset_states called implicitly for each data sample?
model.evaluate(dataX, dataY)
reset_states clears only the hidden states of your network; the weights are untouched. It's worth mentioning that the behaviour depends on whether the option stateful=True was set in your network. If it's not set, all states are automatically reset after every batch computation (so also after calling fit, predict and evaluate). If it is set, you should call reset_states every time you want to make consecutive model calls independent.
If you explicitly use either of:
model.reset_states()
to reset the states of all layers in the model, or
layer.reset_states()
to reset the states of a specific stateful RNN layer (including an LSTM layer), implemented here:
def reset_states(self, states=None):
    if not self.stateful:
        raise AttributeError('Layer must be stateful.')
this means your layer(s) must be stateful.
For a stateful LSTM you need to:
explicitly specify the batch size you are using, by passing a batch_size argument (or a batch_input_shape argument) to the first layer in your model,
set stateful=True,
specify shuffle=False when calling fit().
The benefits of using stateful models are probably best explained here.
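To make the stateless/stateful distinction above concrete, here is a minimal sketch in plain Python. It is not Keras code: ToyRecurrentLayer, its weights w_x and w_h, and the tanh recurrence are all made up for illustration. It only mimics the behaviour described above, namely that a stateless layer starts from a zero state on every call, while a stateful one carries its last state over until reset_states is called, and that resetting never touches the weights.

```python
import math


class ToyRecurrentLayer:
    """Toy recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1}); NOT Keras code."""

    def __init__(self, stateful=False, w_x=0.1, w_h=0.5):
        self.stateful = stateful
        self.w_x, self.w_h = w_x, w_h  # "weights": never touched by reset_states
        self.h = 0.0                   # hidden state carried between calls

    def reset_states(self):
        if not self.stateful:
            raise AttributeError('Layer must be stateful.')
        self.h = 0.0

    def predict(self, sequence):
        # A stateless layer always starts from a zero state.
        h = self.h if self.stateful else 0.0
        for x in sequence:
            h = math.tanh(self.w_x * x + self.w_h * h)
        if self.stateful:
            self.h = h  # remember the final state for the next call
        return h


x1 = [2, 4, 2, 1, 4]
x2 = [5, 3, 2, 4, 5]

stateless = ToyRecurrentLayer(stateful=False)
fresh_x2 = stateless.predict(x2)
stateless.predict(x1)
after_x1_stateless = stateless.predict(x2)  # identical to fresh_x2

stateful = ToyRecurrentLayer(stateful=True)
stateful.predict(x1)
carried_x2 = stateful.predict(x2)           # influenced by the state left by x1
stateful.reset_states()
reset_x2 = stateful.predict(x2)             # independent of x1 again
```

In this toy, carried_x2 differs from fresh_x2, while reset_x2 matches it, which is the effect calling reset_states between two predict() calls is meant to achieve.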
Suppose I have my custom loss function and I want to fit the solution of some differential equation with help of my neural network. So in each forward pass, I am calculating the output of my neural net and then calculating the loss by taking the MSE with the expected equation to which I want to fit my perceptron.
Now my doubt is: should I use grad(loss) or should I do loss.backward() for backpropagation to calculate and update my gradients?
I understand that while using loss.backward() I have to wrap my tensors with Variable and have to set the requires_grad = True for the variables w.r.t which I want to take the gradient of my loss.
So my questions are :
Does grad(loss) also require any such explicit parameter to identify the variables for gradient computation?
How does it actually compute the gradients?
Which approach is better?
What is the main difference between the two in a practical scenario?
It would be better if you could explain the practical implications of both approaches because whenever I try to find it online I am just bombarded with a lot of stuff that isn't much relevant to my project.
TL;DR: They are two different interfaces for gradient computation: torch.autograd.grad is non-mutating, while torch.autograd.backward mutates the .grad attribute of leaf tensors.
Descriptions
The torch.autograd module is the automatic differentiation package for PyTorch. As described in the documentation it only requires minimal change to code base in order to be used:
you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword.
The two main functions torch.autograd provides for gradient computation are torch.autograd.backward and torch.autograd.grad:
torch.autograd.backward (source)
Description: Computes the sum of gradients of given tensors with respect to graph leaves.
Header: torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None, inputs=None)
Parameters:
- tensors – Tensors of which the derivative will be computed.
- grad_tensors – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.
- retain_graph – If False, the graph used to compute the grad will be freed. [...]
- inputs – Inputs w.r.t. which the gradient will be accumulated into .grad. All other Tensors will be ignored. If not provided, the gradient is accumulated into all the leaf Tensors that were used [...].
torch.autograd.grad (source)
Description: Computes and returns the sum of gradients of outputs with respect to the inputs.
Header: torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)
Parameters:
- outputs – outputs of the differentiated function.
- inputs – Inputs w.r.t. which the gradient will be returned (and not accumulated into .grad).
- grad_outputs – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.
- retain_graph – If False, the graph used to compute the grad will be freed. [...].
Usage examples
In terms of high-level usage, you can look at torch.autograd.grad as a non-mutating function. As mentioned in the documentation excerpt above, it will not accumulate the gradients on the grad attribute but instead return the computed partial derivatives. In contrast, torch.autograd.backward mutates the tensors by updating the grad attribute of leaf nodes, and the function itself does not return any value. This makes the latter more suitable when computing gradients for a large number of parameters.
In the following, we will take two inputs (x1 and x2), calculate a tensor y with them, and then compute the partial derivatives of the result w.r.t. both inputs, i.e. dL/dx1 and dL/dx2:
>>> x1 = torch.rand(1, requires_grad=True)
>>> x2 = torch.rand(1, requires_grad=True)
>>> x1, x2
(tensor(0.3939, grad_fn=<UnbindBackward>),
tensor(0.7965, grad_fn=<UnbindBackward>))
Inference:
>>> y = x1**2 + 5*x2
>>> y
tensor(4.1377, grad_fn=<AddBackward0>)
Since y was computed using tensor(s) requiring gradients (i.e. with requires_grad=True), outside of a torch.no_grad context, it will have a grad_fn function attached. This callback is used to backpropagate through the computation graph to compute the gradients of preceding tensor nodes.
torch.autograd.grad:
Here we provide torch.ones_like(y) as the grad_outputs.
>>> torch.autograd.grad(y, (x1, x2), torch.ones_like(y))
(tensor(0.7879), tensor(5.))
The above output is a tuple containing the two partial derivatives w.r.t. the provided inputs, in order of appearance, i.e. dL/dx1 and dL/dx2.
This corresponds to the following computation:
# dL/dx1 = dL/dy * dy/dx1 = grad_outputs @ 2*x1
# dL/dx2 = dL/dy * dy/dx2 = grad_outputs @ 5
torch.autograd.backward: in contrast it will mutate the provided tensors by updating the grad of the tensors which have been used to compute the output tensor and that require gradients. It is equivalent to the torch.Tensor.backward API. Here, we go through the same example by defining x1, x2, and y again. We call backward:
>>> # y.backward(torch.ones_like(y))
>>> torch.autograd.backward(y, torch.ones_like(y))
None
Then you can retrieve the gradients on x1.grad and x2.grad:
>>> x1.grad, x2.grad
(tensor(0.7879), tensor(5.))
In conclusion: both perform the same operation. They are two different interfaces to interact with the autograd library and perform gradient computations. The latter, torch.autograd.backward (equivalent to torch.Tensor.backward), is generally used in neural networks training loops to compute the partial derivative of the loss w.r.t each one of the model's parameters.
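As a small complement to the example above, the following sketch illustrates the practical consequence of "accumulated into .grad" versus "returned". The tensor x and the function y = x**2 are chosen only for illustration. It shows that torch.autograd.grad leaves x.grad untouched, while repeated backward calls add up in x.grad until it is cleared, which is why training loops normally zero the gradients at every step.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# Functional interface: the gradient is returned, x.grad is left alone.
y = x ** 2
(dy_dx,) = torch.autograd.grad(y, x, torch.ones_like(y))
grad_attr_after_functional = x.grad          # still None

# Mutating interface: the gradient is written into x.grad ...
y = x ** 2
y.backward(torch.ones_like(y))
first = x.grad.clone()                       # dy/dx = 2 * x = 4

# ... and a second backward call accumulates on top of it.
y = x ** 2
y.backward(torch.ones_like(y))
second = x.grad.clone()                      # 4 + 4 = 8

# Clear the accumulated gradient before the next step.
x.grad.zero_()
```

For a custom loss the choice is mostly one of convenience: backward fits the usual optimizer workflow, whereas torch.autograd.grad is handy when you need the derivative itself as a value.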
You can read more about how torch.autograd.grad works by reading through this other answer I made on: Meaning of grad_outputs in PyTorch's torch.autograd.grad.
In addition to Ivan's answer, the fact that torch.autograd.grad does not accumulate gradients into .grad can avoid race conditions in multi-threaded scenarios.
Quoting PyTorch doc https://pytorch.org/docs/stable/notes/autograd.html#non-determinism
If you are calling backward() on multiple thread concurrently but with shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in racing condition and the result might be invalid to use.
But this is expected pattern if you are using the multithreading approach to drive the whole training process but using shared parameters, user who use multithreading should have the threading model in mind and should expect this to happen. User could use the functional API torch.autograd.grad() to calculate the gradients instead of backward() to avoid non-determinism.
implementation details https://github.com/pytorch/pytorch/blob/7e3a694b23b383e38f5e39ef960ba8f374d22404/torch/csrc/autograd/functions/accumulate_grad.h
I am new to PyTorch and I would like to add a mean-variance normalization layer to my network that will normalize features to zero mean and unit standard deviation. I got a bit confused reading the documentation, could anyone give me some leads?
As @Ivan commented, the normalization can be done on many levels. However, as you say
normalize features to zero mean and unit standard deviation
I suppose you just want to feed unbiased data to the network. If that's the case, you should treat it as a data preprocessing step rather than a layer of your model and basically do:
X = (X - torch.mean(X, dim=0))/torch.std(X, dim=0)
As an alternative, you can use torchvision.transforms:
preprocess = torchvision.transforms.Normalize(mean=torch.mean(X, dim=0), std=torch.std(X, dim=0))
X = preprocess(X)
as in this ResNet native example. Note how it is reasonably assumed that the future data would always have roughly the same mean and std_dev as the set that is used for their initial calculation (supposedly the training set). For this reason, we should preserve the initially calculated values and use them for preprocessing in any future inference scenario.
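If you would still like the normalization to travel with the model, one possible sketch is a small module that stores the training-set statistics as buffers, so they are saved in the state_dict but never trained. FixedNormalization, the eps value and the synthetic data below are hypothetical choices for illustration, not an existing PyTorch layer.

```python
import torch
from torch import nn


class FixedNormalization(nn.Module):
    """Hypothetical layer: applies precomputed per-feature mean/std."""

    def __init__(self, mean, std, eps=1e-8):
        super().__init__()
        # Buffers are saved with the model but are not trainable parameters.
        self.register_buffer("mean", mean)
        self.register_buffer("std", std)
        self.eps = eps  # guards against division by zero for constant features

    def forward(self, x):
        return (x - self.mean) / (self.std + self.eps)


torch.manual_seed(0)
X_train = torch.randn(100, 3) * 5.0 + 2.0   # synthetic training features

# Statistics come from the training set only and are reused afterwards.
norm = FixedNormalization(X_train.mean(dim=0), X_train.std(dim=0))
X_train_normed = norm(X_train)

X_new = torch.randn(10, 3) * 5.0 + 2.0      # "future" data at inference time
X_new_normed = norm(X_new)
```

Such a module can be placed first in an nn.Sequential, which keeps the preserved statistics and the network in one object.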
I am printing a tensorflow.keras.Model instance summary. The type is tensorflow.python.keras.engine.functional.Functional object.
This model has layers with activations and batch normalization associated. When I print the list of parameters, I see
weights
bias
4 items co-dimensional with the bias
These four items are (I guess) the batch normalization and activations.
My question is: why do we have parameters associated with batch normalization and activations? And what could be the other two items?
My aim is to transpose this Keras model to a PyTorch counterpart, so I need to know the order of the parameters and what these parameters represent
There are no parameters associated with activations; they are simply element-wise nonlinear functions. So no matter how many activations you have, they don't add to the parameter count. However, your guess is right: there are parameters associated with a BatchNorm layer, 2 trainable parameters in each BatchNorm layer to be precise (gamma and beta). So BatchNorm layers do add parameters to your network. The other two items are the layer's non-trainable moving mean and moving variance, which are updated during training and used for normalization at inference.
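For the porting goal mentioned in the question, it may help to see how PyTorch names the same four quantities. The sketch below only inspects a fresh nn.BatchNorm1d with an arbitrary feature size of 4. The mapping in the comments assumes that the Keras layer lists its weights as gamma, beta, moving mean, moving variance, which is worth verifying on your own model via the weight names.

```python
from torch import nn

bn = nn.BatchNorm1d(4)  # 4 features, chosen arbitrarily for illustration

# Trainable parameters: 'weight' plays the role of gamma, 'bias' of beta.
trainable = [name for name, _ in bn.named_parameters()]

# Non-trainable buffers: the running statistics (moving mean / variance),
# plus a batch counter that has no Keras counterpart to copy.
non_trainable = [name for name, _ in bn.named_buffers()]

print(trainable)
print(non_trainable)
```

When copying values across, also compare the epsilon and momentum settings of both layers, since the two libraries use different defaults and different momentum conventions.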
I am new to modeling with Keras. I am trying to evaluate appropriate parameters for setting up the model. How do I know when to use a bias and when to turn it off?
The short answer is, always use bias variables when your model is small. Otherwise, it is still recommended to keep using bias in all neural network architectures.
This is because each neuron behaves like a simple logistic regression. In each neuron, the input values are multiplied by the weights, and the bias shifts the input to the sigmoid function, which lets the neuron produce the desired non-linearity.
For example, if you have a zero input in your training data like X = [[0,0,...], [0,0,...],... ] , Y = 1, in a sigmoid function, the output will always be exactly Y=0.5 since X*W is zero. However, in large networks, each node can make a bias node out of the average activation of all of its inputs.
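The zero-input case above can be checked with a few lines of plain Python. The neuron helper and the weight values are made up for illustration: without a bias, an all-zero input gives 0.5 whatever the weights are, while a bias lets the output move towards the target.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b=0.0):
    # Single sigmoid unit: weighted sum of the inputs plus an optional bias.
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)


x_zero = [0.0, 0.0, 0.0]

# Without a bias the weights cannot change the output for a zero input.
no_bias_a = neuron(x_zero, w=[1.0, -2.0, 3.0])
no_bias_b = neuron(x_zero, w=[10.0, 20.0, -30.0])

# With a bias the unit can approach the target Y = 1.
with_bias = neuron(x_zero, w=[1.0, -2.0, 3.0], b=4.0)
```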
One common task in DL is that you normalize input samples to zero mean and unit variance. One can "manually" perform the normalization using code like this:
mean = np.mean(X, axis = 0)
std = np.std(X, axis = 0)
X = [(x - mean)/std for x in X]
However, then one must keep the mean and std values around, to normalize the testing data, in addition to the Keras model being trained. Since the mean and std are learnable parameters, perhaps Keras can learn them? Something like this:
m = Sequential()
m.add(SomeKerasLayzerForNormalizing(...))
m.add(Conv2D(20, (5, 5), input_shape = (21, 100, 3), padding = 'valid'))
... rest of network
m.add(Dense(1, activation = 'sigmoid'))
I hope you understand what I'm getting at.
Add BatchNormalization as the first layer and it works as expected, though not exactly like the OP's example. You can see the detailed explanation here.
Both the OP's example and batch normalization use a learned mean and standard deviation of the input data during inference. But the OP's example uses a simple mean that gives every training sample equal weight, while the BatchNormalization layer uses a moving average that gives recently-seen samples more weight than older samples.
Importantly, batch normalization works differently from the OP's example during training. During training, the layer normalizes its output using the mean and standard deviation of the current batch of inputs.
A second distinction is that the OP's code produces an output with a mean of zero and a standard deviation of one. Batch Normalization instead learns a mean and standard deviation for the output that improves the entire network's loss. To get the behavior of the OP's example, Batch Normalization should be initialized with the parameters scale=False and center=False.
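The difference between the simple mean and the moving average mentioned above can be sketched without Keras. The batch means below are invented, the batches are assumed to be of equal size, and the momentum of 0.5 is deliberately small for illustration (the Keras default is much closer to 1). The update rule is the usual exponential moving average, starting from zero.

```python
# Per-batch means seen during training; the last batch is very different.
batch_means = [0.0, 0.0, 0.0, 10.0]

# Simple mean: every batch counts equally.
simple_mean = sum(batch_means) / len(batch_means)

# Exponential moving average: recent batches count more.
momentum = 0.5      # illustrative value only
moving_mean = 0.0   # starts at zero
for m in batch_means:
    moving_mean = momentum * moving_mean + (1.0 - momentum) * m

print(simple_mean, moving_mean)
```

Here the simple mean is 2.5 while the moving average ends at 5.0, because the most recent batch dominates the moving estimate.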
There's now a Keras layer for this purpose, Normalization. At time of writing it is in the experimental module, keras.layers.experimental.preprocessing.
https://keras.io/api/layers/preprocessing_layers/core_preprocessing_layers/normalization/
Before you use it, you call the layer's adapt method with the data X you want to derive the scale from (i.e. mean and standard deviation). Once you do this, the scale is fixed (it does not change during training). The scale is then applied to the inputs whenever the model is used (during training and prediction).
import keras
from keras.layers.experimental.preprocessing import Normalization
norm_layer = Normalization()
norm_layer.adapt(X)
model = keras.Sequential()
model.add(norm_layer)
# ... Continue as usual.
Maybe you can use sklearn.preprocessing.StandardScaler to scale your data.
This object lets you keep the scaling parameters in an object.
Then you can feed mixed types of inputs into your model, let's say:
Your_model
[param1_scaler, param2_scaler]
Here is a link https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
https://keras.io/getting-started/functional-api-guide/
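As a rough sketch of the StandardScaler suggestion above: fit the scaler on the training data only, reuse it for test data, and persist it next to the model. The synthetic arrays and the use of pickle are assumptions made for illustration; any serialization you already use for the model would do.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=3.0, scale=2.0, size=(200, 2))  # synthetic data
X_test = rng.normal(loc=3.0, scale=2.0, size=(20, 2))

# Learn the mean and standard deviation from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # same statistics reused

# Keep the fitted scaler around, e.g. serialized alongside the model.
blob = pickle.dumps(scaler)
restored = pickle.loads(blob)
```

The learned statistics are available as scaler.mean_ and scaler.scale_ if you prefer to store just the numbers.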
There's BatchNormalization, which learns mean and standard deviation of the input. I haven't tried using it as the first layer of the network, but as I understand it, it should do something very similar to what you're looking for.