Constructing parameter groups in pytorch - pytorch

In the torch.optim documentation, it is stated that model parameters can be grouped and optimized with different optimization hyperparameters. It says that
For example, this is very useful when one wants to specify per-layer
learning rates:
optim.SGD([
{'params': model.base.parameters()},
{'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
This means that model.base’s parameters will use the default
learning rate of 1e-2, model.classifier’s parameters will use a
learning rate of 1e-3, and a momentum of 0.9 will be used for all
parameters.
I was wondering how to define such groups that have parameters() attribute. What came to my mind was something in the form of
class MyModel(nn.Module):
def __init__(self):
super(MyModel, self).__init__()
self.base()
self.classifier()
self.relu = nn.ReLU()
def base(self):
self.fc1 = nn.Linear(1, 512)
self.fc2 = nn.Linear(512, 264)
def classifier(self):
self.fc3 = nn.Linear(264, 128)
self.fc4 = nn.Linear(128, 964)
def forward(self, y0):
y1 = self.relu(self.fc1(y0))
y2 = self.relu(self.fc2(y1))
y3 = self.relu(self.fc3(y2))
return self.fc4(y3)
How should I modify the snippet above to be able to get model.base.parameters()? Is the only way to define a nn.ParameterList and explicitly add weights and biases of the desired layers to that list? What is the best practice?

I will show three approaches to solving this. In the end though, it comes down to personal preference.
- Grouping parameters with nn.ModuleDict.
I noticed here an answer using nn.Sequential to group the layers which allow to
target different sections of the model using the parameters attribute of nn.Sequential. Indeed base and classifier might be more than sequential layers. I believe a more general approach is to leave the module as is, but instead, initialize an additional nn.ModuleDict module which will contain all parameters ordered by the optimization group in separate nn.ModuleLists:
class MyModel(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(1, 512)
self.fc2 = nn.Linear(512, 264)
self.fc3 = nn.Linear(264, 128)
self.fc4 = nn.Linear(128, 964)
self.params = nn.ModuleDict({
'base': nn.ModuleList([self.fc1, self.fc2]),
'classifier': nn.ModuleList([self.fc3, self.fc4])})
def forward(self, y0):
y1 = self.relu(self.fc1(y0))
y2 = self.relu(self.fc2(y1))
y3 = self.relu(self.fc3(y2))
return self.fc4(y3)
Then you can define your optimizer with:
optim.SGD([
{'params': model.params.base.parameters()},
{'params': model.params.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
Do note MyModel's parameters' generator won't contain duplicate parameters.
- Creating an interface for accessing parameter groups.
A different solution is to provide an interface in the nn.Module to separate the parameters into groups:
class MyModel(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(1, 512)
self.fc2 = nn.Linear(512, 264)
self.fc3 = nn.Linear(264, 128)
self.fc4 = nn.Linear(128, 964)
def forward(self, y0):
y1 = self.relu(self.fc1(y0))
y2 = self.relu(self.fc2(y1))
y3 = self.relu(self.fc3(y2))
return self.fc4(y3)
def base_params(self):
return chain(m.parameters() for m in [self.fc1, self.fc2])
def classifier_params(self):
return chain(m.parameters() for m in [self.fc3, self.fc4])
Having imported itertools.chain as chain.
Then define your optimizer with:
optim.SGD([
{'params': model.base_params()},
{'params': model.classifier_params(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
- Using child nn.Modules.
Lastly, you can define your module sections as submodules (here it comes down as the method as the nn.Sequential one, yet you can generalize this to any submodules).
class Base(nn.Sequential):
def __init__(self):
super().__init__(nn.Linear(1, 512),
nn.ReLU(),
nn.Linear(512, 264),
nn.ReLU())
class Classifier(nn.Sequential):
def __init__(self):
super().__init__(nn.Linear(264, 128),
nn.ReLU(),
nn.Linear(128, 964))
class MyModel(nn.Module):
def __init__(self):
super().__init__()
self.base = Base()
self.classifier = Classifier()
def forward(self, y0):
features = self.base(y0)
out = self.classifier(features)
return out
Here again you can use the same interface as the first method:
optim.SGD([
{'params': model.base.parameters()},
{'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
I would argue this is the best practice. However, it forces you to define each of your components into separate nn.Module, which can be a hassle when experimenting with more complex models.

You can use torch.nn.Sequential to define base and classifier. Your class definition can then be:
class MyModel(nn.Module):
def __init__(self):
super(MyModel, self).__init__()
self.base = nn.Sequential(nn.Linear(1, 512), nn.ReLU(), nn.Linear(512,264), nn.ReLU())
self.classifier = nn.Sequential(nn.Linear(264,128), nn.ReLU(), nn.Linear(128,964))
def forward(self, y0):
return self.classifier(self.base(y0))
Then, you can access parameters using model.base.parameters() and model.classifier.parameters().

Related

How does `optimizer.step()` perform an in-place operation?

Here is a simple example that results in an in-place operation error.
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import OrderedDict
from torch import optim
torch.autograd.set_detect_anomaly(True)
class Loss(nn.Module):
def __init__(self):
super(Loss, self).__init__()
def forward(self, x, target):
return x[0,0,0,0]
def block(in_channels, features, name):
return nn.Conv2d(in_channels=in_channels,
out_channels=features,
kernel_size=3,
padding=1,
bias=False)
class SharedNetwork(nn.Module):
def __init__(self):
super().__init__()
self.shared_layer = block(in_channels=3, features=1, name="wow")
def forward(self, x):
x = self.shared_layer(x)
return x
class Network1(nn.Module):
def __init__(self):
super().__init__()
self.conv = block(in_channels=1, features=1, name="wow-1")
def forward(self, x):
return self.conv(x)
class Network2(nn.Module):
def __init__(self):
super().__init__()
self.conv = block(in_channels=1, features=1, name="wow-2")
def forward(self, x):
return torch.sigmoid(self.conv(x))
shared_net = SharedNetwork()
net_1 = Network1()
segmentor = Network2()
optimizer = optim.Adam(list(shared_net.parameters()) + list(segmentor.parameters()), lr=1e-6)
optimizer_conf = optim.Adam(list(shared_net.parameters()), lr=1e-6)
loss_fn = Loss()
# 2. Run a forward pass
fake_data = torch.randint(0,255,(1, 3, 256, 256))/255
target_data_1 = torch.randint(0,255,(1, 3, 256, 256))/255
target_data_2 = torch.randint(0,255,(1, 3, 256, 256))/255
optimizer.zero_grad()
optimizer_conf.zero_grad()
features = shared_net(fake_data)
segmented = segmentor(features)
s_loss = loss_fn(segmented, target_data_2)
s_loss.backward(retain_graph=True)
optimizer.step()
out_1 = net_1(features)
loss = loss_fn(out_1, target_data_1)
loss.backward(retain_graph=False)
optimizer_conf.step()
Error message:
UserWarning: Error detected in ConvolutionBackward0. No forward pass information available. Enable detect anomaly during forward pass for more information. (Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\autograd\python_anomaly_mode.cpp:97.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 3, 3, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I was able to solve the problem by changing the order of running the step function of optimizers.
optimizer_conf.zero_grad()
optimizer.zero_grad()
features = shared_net(fake_data)
segmented = segmentor(features)
s_loss = loss_fn(segmented, target_data_2)
s_loss.backward(retain_graph=True)
out_1 = net_1(features)
loss = loss_fn(out_1, target_data_1)
loss.backward(retain_graph=False)
optimizer_conf.step()
optimizer.step()
The following questions, however, remain:
How does the step method cause an inplace operation in convolution?
Why does moving the steps to the end of the file resolve this error?
NOTE: The loss function is used for simplicity, using dice-loss also results in the same error!
Before answering the question, I have to mention that it seems having multiple optimizers for one set of parameters is anti-pattern and it's better to be avoided.
How does the step method cause an inplace operation in convolution?
A: step method adds the gradients to the weights, so it does something like the following:
param.weight += param.grad
which can be interpreted as an in place operation
Why does moving the steps to the end of the file resolve this error?
A: Obviously, by moving the step method after the second backward method, the above-mentioned operation is not executed. As a result, there are no in-place operations and no errors raised due to their existence.
To sum up, it's best to have only one optimizer for one set of parameters, the previous example could coded in the following way:
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import OrderedDict
from torch import optim
torch.autograd.set_detect_anomaly(True)
class Loss(nn.Module):
def __init__(self):
super(Loss, self).__init__()
def forward(self, x, target):
return x[0,0,0,0]
def block(in_channels, features, name):
return nn.Conv2d(in_channels=in_channels,
out_channels=features,
kernel_size=(3,3),
padding=1,
bias=False)
class SharedNetwork(nn.Module):
def __init__(self):
super().__init__()
self.shared_layer = block(in_channels=3, features=1, name="wow")
def forward(self, x):
x = self.shared_layer(x)
return x
class Network1(nn.Module):
def __init__(self):
super().__init__()
self.conv = block(in_channels=1, features=1, name="wow-1")
def forward(self, x):
return self.conv(x)
class Network2(nn.Module):
def __init__(self):
super().__init__()
self.conv = block(in_channels=1, features=1, name="wow-2")
def forward(self, x):
return torch.sigmoid(self.conv(x))
torch.manual_seed(0)
shared_net = SharedNetwork()
net_1 = Network1()
net_2 = Network2()
shared_optimizer = optim.Adam(list(shared_net.parameters()), lr=1e-6)
net_1_optimizer = optim.Adam(list(net_1.parameters()), lr=1e-6)
net_2_optimizer = optim.Adam(list(segmentor.parameters()), lr=1e-6)
loss_fn = Loss()
# 2. Run a forward pass
fake_data = torch.randint(0,255,(1, 3, 256, 256))/255
target_data_1 = torch.randint(0,255,(1, 3, 256, 256))/255
target_data_2 = torch.randint(0,255,(1, 3, 256, 256))/255
net_2_optimizer.zero_grad()
features = shared_net(fake_data)
net_2_out = net_2(features)
s_loss = loss_fn(net_2_out, target_data_2)
s_loss.backward(retain_graph=True)
net_2_optimizer.step()
net_1_optimizer.zero_grad()
shared_optimizer.zero_grad()
out_1 = net_1(features)
loss = loss_fn(out_1, target_data_1)
loss.backward(retain_graph=False)
net_1_optimizer.step()
shared_optimizer.step()
Note: If you want to have two different learning rates for different losses applied to one set of parameters, you can multiply the losses based on their importance by a value. For example, you can multiply loss_1 by 0.1 and loss_1 by 0.5. Or, you can use backward hooks as mentioned in this comment:
backward-hook

Where and how are the parameters initialised to calculate the gradients during loss.backwards() in Pytorch?

I have two Neural networks Child1 and Child2. The output of Child1 is fed to the child2. The output of child2 is fed to the child1 again. Each one has their respective cost function. I have created a parent module to pack both the modules in a single module. I have done this because I have a requirement that the back propagation(derivative calculation) of each of the loss function of the neural networks(child1 and child2) should be done with respect to the weights of both the neural networks combined(child1 + child2). Could I do it this way? Could I backpropagate through my Parent module and get the gradient of each of the loss functions with repect to the combined weights?
class Parent(nn.Module):
def __init__(self,in_features,z_dim):
super().__init__()
self.my_child1 = Child1(z_dim)
self.my_child2 = Child2 (in_features)
def forward(self,input):
input=self.my_child1(input)
input=self.my_child2(input)
return input
class Child1(nn.Module):
def __init__(self, in_features):
super().__init__()
self.child1 = nn.Sequential(
nn.Linear(in_features, 128),
nn.LeakyReLU(0.01),
nn.Linear(128, 1),
nn.Sigmoid(),
)
def forward(self, x):
return self.child1(x)
class Child2(nn.Module):
def __init__(self, z_dim, img_dim):
super().__init__()
self.child2 = nn.Sequential(
nn.Linear(z_dim, 256),
nn.LeakyReLU(0.01),
nn.Linear(256, img_dim),
nn.Tanh(),
)
def forward(self, x):
return self.child2(x)

Implementing 1D self attention in PyTorch

I'm trying to implement the 1D self-attention block below using PyTorch:
proposed in the following paper. Below you can find my (provisional) attempt:
import torch.nn as nn
import torch
#INPUT shape ((B), CH, H, W)
class Self_Attention1D(nn.Module):
def __init__(self, in_channels=1, out_channels=3):
super().__init__()
self.pointwise_conv1 = nn.Conv1d(in_channels=in_channels, out_channels=out_channels, kernel_size=(1,1))
self.pointwise_conv2 = nn.Conv1d(in_channels=out_channels, out_channels=in_channels, kernel_size=(1,1))
self.phi = MLP(in_size = out_channels, out_size=32)
self.psi = MLP(in_size = out_channels, out_size=32)
self.gamma = MLP(in_size=32, out_size=out_channels)
def forward(self, x):
x = self.pointwise_conv1(x)
phi = self.phi(x.transpose(1,3))
psi = self.psi(x.transpose(1,3))
delta = phi-psi
gamma = self.gamma(delta).transpose(3,1)
out = self.pointwise_conv2(torch.mul(gamma,x))
return out
class MLP(nn.Module):
def __init__(self, in_size, out_size):
super().__init__()
self.in_size = in_size
self.out_size = out_size
self.layers = nn.Sequential(
nn.Linear(in_size, 64),
nn.ReLU(),
nn.Linear(64,128),
nn.ReLU(),
nn.Linear(128,64),
nn.ReLU(),
nn.Linear(64,out_size))
def forward(self, x):
out = self.layers(x)
return out
I'm not sure at all that this is correct, as the operations in my implementation are happening globally while as displayed in the image we should compute some operation between each entry and its neighbours one at a time. I was initially tempted to instantiate a for loop to iteratively compute the neural networks delta,phi,psi for each entry, but I felt that it wasn't the right way to do that.
Apologies if this is trivial but I still don't have a huge experience in PyTorch.

Use torch.square() inside a nn.Sequential layer in PyTorch

I want to square the result of a maxpool layer.
I tried the following:
class CNNClassifier(Classifier): # nn.Module
def __init__(self, in_channels):
super().__init__()
self.save_hyperparameters('in_channels')
self.cnn = nn.Sequential(
# maxpool
nn.MaxPool2d((1, 5), stride=(1, 5)),
torch.square(),
# layer1
nn.Conv2d(in_channels=in_channels, out_channels=32, kernel_size=5,
)
Which to the experienced PyTorch user for sure makes no sense.
Indeed, the error is quite clear:
TypeError: square() missing 1 required positional arguments: "input"
How can I feed in to square the tensor from the preceding layer?
You can't put a PyTorch function in a nn.Sequential pipeline, it needs to be a nn.Module.
You could wrap it like this:
class Square(nn.Module):
def forward(self, x):
return torch.square(x)
Then use it inside your sequential layer like so:
class CNNClassifier(Classifier): # nn.Module
def __init__(self, in_channels):
super().__init__()
self.save_hyperparameters('in_channels')
self.cnn = nn.Sequential(
nn.MaxPool2d((1, 5), stride=(1, 5)),
Square(),
nn.Conv2d(in_channels=in_channels, out_channels=32, kernel_size=5))

How to use custom activation on a custom keras layer

I want to create a gaussian activation function where I want to train the mean and standard deviation. So, using a custom layer, I've returned the mean and std, but then unable to extract the learned parameters and feed to activation unit.
class MyLayer(Layer):
def __init__(self, output_dim,m,st, **kwargs):
self.units = output_dim
self.m=m
self.st=st
super(MyLayer, self).__init__(**kwargs)
def build(self, input_shape):
# Create a trainable weight variable for this layer.
self.kernel = self.add_weight(name='kernel',
shape=(input_shape[1], self.output_dim),
initializer='uniform',
trainable=True)
self.m = self.add_weight(name='mean',
shape=(1, 1),
initializer='uniform',
trainable=True)
self.std = self.add_weight(name='std',
shape=(1, 1),
initializer='uniform',
trainable=True)
super(MyLayer, self).build(input_shape) # Be sure to call this at the end
def call(self, x):
return K.dot(x, self.kernel)
def compute_output_shape(self, input_shape):
return (input_shape[0], self.output_dim,m,std)
How to use a custom activation function along with this custom layer

Resources