I am trying to use OneCycleLR, or at least CyclicLR, from torch.optim.lr_scheduler.
Suppose I have the following:
param_list = []
for lr, block in zip(lrs, blocks):
    param_list.extend([{'params': p, 'lr': lr} for n, p in model.named_parameters() if n.startswith(block)])
optimizer = torch.optim.Adam(param_list)
where blocks = ["base", "fc"] (in my use case there are ~20 blocks) and lrs = [1e-4, 1e-3].
It is easy enough to control the learning rates manually using a function, e.g.:
lr_sched = lambda batch: 1.1**batch
scheduler = LambdaLR(optimizer, lr_lambda=[lr_sched]*len(param_list))
The above example increases the learning rate.
However, what I would like to do is change the learning rate as well as the momentum parameters, as OneCycleLR offers. So my questions are:
Is it possible?
If not, is there a way to manipulate the momentum while training? Then I could write the cyclic learning rate function myself.
Is it possible to use a list of optimizers instead of one? If so, is that slower?
Minimal example:
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR
class Model(nn.Module):
    def __init__(self):
        super().__init__()  # required before assigning submodules
        self.base = nn.Linear(10, 5)
        self.fc = nn.Linear(5, 1)
        self.relu = nn.ReLU()
    def forward(self, x):
        return self.fc(self.relu(self.base(x)))

model = Model()
blocks = ["base", "fc"]
lrs = [1e-4, 1e-3]
param_list = []
for lr, block in zip(lrs, blocks):
    param_list.extend([{'params': p, 'lr': lr} for n, p in model.named_parameters() if n.startswith(block)])
optimizer = torch.optim.Adam(param_list)
lr_sched = lambda batch: 1.1**batch
scheduler = LambdaLR(optimizer, lr_lambda=[lr_sched]*len(param_list))
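For reference, here is a minimal sketch of what I have in mind (assuming, per my reading of the docs, that OneCycleLR accepts a list for max_lr, one entry per parameter group, and that with Adam it cycles betas[0] when cycle_momentum=True; total_steps below is a hypothetical step budget):

from torch.optim.lr_scheduler import OneCycleLR

scheduler = OneCycleLR(
    optimizer,
    max_lr=[g['lr'] for g in optimizer.param_groups],  # one peak LR per group
    total_steps=1000,
    cycle_momentum=True,  # with Adam this should cycle betas[0]
)

# Manipulating momentum manually also seems possible via the param groups:
for g in optimizer.param_groups:
    g['betas'] = (0.9, g['betas'][1])  # set beta1 directly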
Related
I'm trying to get my toy network to learn a sine wave.
I output (via tanh) a number between -1 and 1, and I want the network to minimise the following loss, where self(x) are the predictions.
loss = -torch.mean(self(x)*y)
This should be equivalent to trading a stock with a sinusoidal price, where self(x) is our desired position, and y are the returns of the next time step.
The issue I'm having is that the network doesn't learn anything. It does work if I change the loss function to be torch.mean((self(x)-y)**2) (MSE), but this isn't what I want. I'm trying to focus the network on 'making a profit', not making a prediction.
I think the issue may be related to the convexity of the loss function, but I'm not sure, and I'm not certain how to proceed. I've experimented with differing learning rates, but alas nothing works.
What should I be thinking about?
Actual code:
%load_ext tensorboard
import matplotlib.pyplot as plt; plt.rcParams["figure.figsize"] = (30,8)
import torch;from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F;import pytorch_lightning as pl
from torch import nn, tensor
def piecewise(x): return 2*(x>0)-1
class TsDs(torch.utils.data.Dataset):
    def __init__(self, s, l=5): super().__init__();self.l,self.s=l,s
    def __len__(self): return self.s.shape[0] - 1 - self.l
    def __getitem__(self, i): return self.s[i:i+self.l], torch.log(self.s[i+self.l+1]/self.s[i+self.l])
    def plt(self): plt.plot(self.s)

class TsDm(pl.LightningDataModule):
    def __init__(self, length=5000, batch_size=1000): super().__init__();self.batch_size=batch_size;self.s = torch.sin(torch.arange(length)*0.2) + 5 + 0*torch.rand(length)
    def train_dataloader(self): return DataLoader(TsDs(self.s[:3999]), batch_size=self.batch_size, shuffle=True)
    def val_dataloader(self): return DataLoader(TsDs(self.s[4000:]), batch_size=self.batch_size)
dm = TsDm()
class MyModel(pl.LightningModule):
    def __init__(self, learning_rate=0.01):
        super().__init__();self.learning_rate = learning_rate
        self.conv1 = nn.Conv1d(1,5,2)
        self.lin1 = nn.Linear(20,3);self.lin2 = nn.Linear(3,1)
        # self.network = nn.Sequential(nn.Conv1d(1,5,2),nn.ReLU(),nn.Linear(20,3),nn.ReLU(),nn.Linear(3,1), nn.Tanh())
        # self.network = nn.Sequential(nn.Linear(5,5),nn.ReLU(),nn.Linear(5,3),nn.ReLU(),nn.Linear(3,1), nn.Tanh())
    def forward(self, x):
        out = x.unsqueeze(1)
        out = self.conv1(out)
        out = out.reshape(-1,20)
        out = nn.ReLU()(out)
        out = self.lin1(out)
        out = nn.ReLU()(out)
        out = self.lin2(out)
        return nn.Tanh()(out)
    def step(self, batch, batch_idx, stage):
        x, y = batch
        loss = -torch.mean(self(x)*y)
        # loss = torch.mean((self(x)-y)**2)
        print(loss)
        self.log("loss", loss, prog_bar=True)
        return loss
    def training_step(self, batch, batch_idx): return self.step(batch, batch_idx, "train")
    def validation_step(self, batch, batch_idx): return self.step(batch, batch_idx, "val")
    def configure_optimizers(self): return torch.optim.SGD(self.parameters(), lr=self.learning_rate)

#logger = pl.loggers.TensorBoardLogger(save_dir="/content/")
mm = MyModel(0.1);trainer = pl.Trainer(max_epochs=10)
# trainer.tune(mm, dm)
trainer.fit(mm, datamodule=dm)
If I understand you correctly, I think that you were trying to maximize the unnormalized correlation between the network's prediction, self(x), and the target value y.
As you mention, the problem is the convexity of the loss with respect to the model weights. One way to see the problem is to consider that the model is a simple linear predictor w'*x, where w is the model weights, w' its transpose, and x the input feature vector (assume a scalar prediction for now). Then, if you look at the derivative of the loss with respect to the weight vector (i.e., the gradient), you'll find that it no longer depends on w!
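Spelling that step out (my own addition): with a linear model the loss is L(w) = -mean((w'x) * y), so the gradient is dL/dw = -mean(y * x), a constant direction that does not involve w at all; gradient descent just pushes w without bound along that fixed direction, with no stationary point to converge to.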
One way to fix this is change the loss to,
loss = -torch.mean(torch.square(self(x)*y))
or
loss = -torch.mean(torch.abs(self(x)*y))
You will have another big problem, however: these loss functions encourage unbounded growth of the model weights. In the linear case, one solves this by a Lagrangian relaxation of a hard constraint on, for example, the norm of the model weight vector. I'm not sure how this would be done with neural networks, as each layer would need its own Lagrangian parameter...
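As a crude stand-in for that relaxation (my own sketch, not a verified recipe), you could add an explicit weight-norm penalty to the loss, with a single hypothetical penalty strength lam shared across all layers:

lam = 1e-3  # hypothetical penalty strength; needs tuning
loss = -torch.mean(torch.abs(self(x) * y)) \
       + lam * sum(p.pow(2).sum() for p in self.parameters())

Optimizer-level weight_decay would have a similar effect for the squared-norm case.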
I trained a model on some images. Now, to fit a similar dataset but with different colors, I want to load this model, but I also want to drop all running stats from the BatchNorm layers (reset them to their default values, as if totally untrained). What parameters should I reset? A simple model looks like this:
import torch
import torch.nn as nn
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv0 = nn.Conv2d(3, 3, 3, padding = 1)
        self.norm = nn.BatchNorm2d(3)
        self.conv = nn.Conv2d(3, 3, 3, padding = 1)
    def forward(self, x):
        x = self.conv0(x)
        x = self.norm(x)
        return self.conv(x)
net = Net()
##or for pretrained it will be
##net = torch.load('net.pth')
def drop_to_default():
    for m in net.modules():
        if type(m) == nn.BatchNorm2d:
            ####???####

drop_to_default()
The simplest way to do that is to run the reset_running_stats() method on the BatchNorm objects:
def drop_to_default():
    for m in net.modules():
        if type(m) == nn.BatchNorm2d:
            m.reset_running_stats()
Below is this method's source code:
def reset_running_stats(self) -> None:
    if self.track_running_stats:
        # running_mean/running_var/num_batches... are registered at runtime depending
        # if self.track_running_stats is on
        self.running_mean.zero_()  # Zero (neutral) mean
        self.running_var.fill_(1)  # One (neutral) variance
        self.num_batches_tracked.zero_()  # Number of batches tracked
You can see the source code here, _NormBase class.
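Note that reset_running_stats() leaves the learnable affine parameters (weight and bias) untouched. If you also want those back at their defaults, a sketch using reset_parameters() (which, as far as I can tell from the same source, resets the running stats and then reinitializes weight/bias) would be:

def drop_to_default():
    for m in net.modules():
        if type(m) == nn.BatchNorm2d:
            m.reset_parameters()  # resets running stats AND reinitializes weight/bias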
I need to visualize the output of a VGG16 model which classifies 14 different classes.
I load the trained model and I replaced the classifier layer with an Identity layer, but it doesn't separate the output into categories.
Here is the snippet:
The number of samples here is 1000 images.
import numpy
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

epoch = 800
PATH = 'vgg16_epoch{}.pth'.format(epoch)
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
class Identity(nn.Module):
    def __init__(self):
        super(Identity, self).__init__()
    def forward(self, x):
        return x
model.classifier._modules['6'] = Identity()
model.eval()
logits_list = numpy.empty((0,4096))
targets = []
with torch.no_grad():
    for step, (t_image, target, classess, image_path) in enumerate(test_loader):
        t_image = t_image.cuda()
        target = target.cuda()
        target = target.data.cpu().numpy()
        targets.append(target)
        logits = model(t_image)
        print(logits.shape)
        logits = logits.data.cpu().numpy()
        print(logits.shape)
        logits_list = numpy.append(logits_list, logits, axis=0)
        print(logits_list.shape)
tsne = TSNE(n_components=2, verbose=1, perplexity=10, n_iter=1000)
tsne_results = tsne.fit_transform(logits_list)
target_ids = range(len(targets))
plt.scatter(tsne_results[:,0],tsne_results[:,1],c = target_ids ,cmap=plt.cm.get_cmap("jet", 14))
plt.colorbar(ticks=range(14))
plt.legend()
plt.show()
Here is what this script produced: I am not sure why I have all colors in each cluster!
The VGG16 outputs over 25k features to the classifier. I believe that's too much for t-SNE. It's a good idea to include a new nn.Linear layer to reduce this number, so t-SNE may work better. In addition, I'd recommend two different ways to get the features from the model:
The best way to get them, regardless of the model, is by using the register_forward_hook method. You may find a notebook here with an example.
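For illustration, a minimal sketch of the hook approach (my own example, not from the linked notebook; it assumes a torchvision VGG16, where classifier[6] is the final Linear layer, and a hypothetical features dict for storage):

import torch
from torchvision import models

model = models.vgg16(pretrained=True).eval()
features = {}  # hypothetical storage for the captured activations

def hook(module, inputs, output):
    features['fc_in'] = inputs[0].detach()  # the 4096-d input to the last Linear

handle = model.classifier[6].register_forward_hook(hook)
with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))
handle.remove()
print(features['fc_in'].shape)  # expected: torch.Size([1, 4096])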
If you don't want to use the register, I'd suggest this one. After loading your model, you may use the following class to extract the features:
class FeatNet(nn.Module):
    def __init__(self, vgg):
        super(FeatNet, self).__init__()
        self.features = nn.Sequential(*list(vgg.children())[:-1])  # drop the classifier
    def forward(self, img):
        return self.features(img)
Now, you just need to call FeatNet(img) to get the features.
To include the feature reducer, as I suggested before, you need to retrain your model doing something like:
class FeatNet(nn.Module):
    def __init__(self, vgg):
        super(FeatNet, self).__init__()
        self.features = nn.Sequential(*list(vgg.children())[:-1])  # drop the classifier
        self.feat_reducer = nn.Sequential(
            nn.Linear(25088, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU()
        )
        self.classifier = nn.Linear(1024, 14)
    def forward(self, img):
        x = self.features(img)
        x = torch.flatten(x, 1)  # flatten (N, 512, 7, 7) -> (N, 25088) before the Linear
        x_r = self.feat_reducer(x)
        return self.classifier(x_r)
Then, you can run your model returning x_r, that is, the reduced features. As I told you, 25k features are too much for t-SNE. Another method to reduce this number is by using PCA instead of nn.Linear. In this case, you send the 25k features to PCA and then train t-SNE on the PCA's output. I prefer using nn.Linear, but you need to test to see which one gives you a better result.
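For completeness, a sketch of the PCA route (my own, assuming scikit-learn and the logits_list array built above):

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

reduced = PCA(n_components=50).fit_transform(logits_list)  # project features down to 50 dims
tsne_results = TSNE(n_components=2, perplexity=10).fit_transform(reduced)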
I implemented a very simple custom recurrent layer in PyTorch using PackedSequence. The layer slows my network down by a factor of roughly 20. I have read about slowdowns for custom layers that don't use the JIT, but on the order of 1.7x, which is something I could live with.
I am simply indexing the packed sequences per sequence and performing a recursion.
I suspect some of the code is not being executed on the GPU?
I'm also grateful for any other tips on how to implement this type of RNN (essentially one without a dense layer, without any mixing between features).
import torch
import torch.nn as nn
from torch.nn.utils.rnn import PackedSequence
def getPackedSequenceIndices(batch_sizes):
    """input: batch_sizes from PackedSequence object
    requires length-sorted sequences!
    """
    nBatches = batch_sizes[0]
    seqIdx = []
    for ii in range(nBatches):
        seqLen = torch.sum((batch_sizes - ii) > 0).item()
        idx = torch.LongTensor(seqLen)
        idx[0] = ii
        idx[1:] = batch_sizes[0:seqLen-1]
        seqIdx.append(torch.cumsum(idx, dim=0))
    return seqIdx
class LinearRecursionLayer(nn.Module):
    """Linear recursive smoothing layer with trainable smoothing constants."""
    def __init__(self, feat_dim, alpha_smooth=0.5):
        super(LinearRecursionLayer, self).__init__()
        self.feat_dim = feat_dim
        # trainable parameters
        self.alpha_smooth = nn.Parameter(alpha_smooth*torch.ones(self.feat_dim))
        self.wx = nn.Parameter(torch.ones(self.feat_dim))
        self.activ = nn.Tanh()  # must be instantiated; bare nn.Tanh is the class
    def forward(self, x):
        if isinstance(x, PackedSequence):
            seqIdx = getPackedSequenceIndices(x.batch_sizes)
            ydata = torch.zeros_like(x.data)
            for idx in seqIdx:
                y_frame = x.data[idx[0]]  # init with first frame
                # iterate over sequence ('t', not 'nn', which would shadow torch.nn)
                for t in idx:
                    x_frame = x.data[t]
                    y_frame = self.alpha_smooth*y_frame + (1-self.alpha_smooth)*x_frame  # smoothing recurrence
                    ydata[t, :] = self.activ(self.wx*y_frame)
            y = PackedSequence(ydata, x.batch_sizes)  # pack
        else:  # tensor
            raise ValueError('not implemented')
        return y
I'm new to PyTorch and I'm trying to explore the feasibility of using it with Spark (for now I'm working in Spark standalone mode).
For now I'm struggling with a very specific topic.
Let's start with a very simple model:
# linmodel.py
import torch
import torch.nn as nn
import numpy as np
def standardize(x):
    return (x - np.mean(x)) / np.std(x)

def add_noise(y):
    rnd = np.random.randn(y.shape[0])
    return y + rnd

def cost(target, predicted):
    cost = torch.sum((torch.t(target) - predicted) ** 2)
    return cost

class LinModel(nn.Module):
    def __init__(self, in_size, out_size):
        super(LinModel, self).__init__()  # always call parent's init
        self.linear = nn.Linear(in_size, out_size, bias=False)  # layer parameters
    def forward(self, x):
        return self.linear(x)
Which instantiates a basic linear model, along with some utility functions.
The goal is to approximate a target matrix, and to keep track of how the gradients behave.
I'm trying to achieve the following:
1. create my target matrix
2. split the inputs on the workers
3. instantiate models and optimizers on the workers
4. compute the approximation on subsets of the input
5. retrieve the gradients for further analysis
And everything works fine until point 5.
Here's the code:
#test.py
import torch
import torch.nn as nn
import numpy as np
import torch.optim
from torch.autograd import Variable
from pyspark import SparkContext
import linmodel
def prepare_input(nsamples=400):
    Xold = np.linspace(0, 1000, nsamples).reshape([nsamples, 1])
    X = linmodel.standardize(Xold)
    W = np.random.randint(1, 10, size=(5, 1))
    Y = W.dot(X.T)  # target
    for i in range(Y.shape[1]):
        Y[:, i] = linmodel.add_noise(Y[:, i])
    x = Variable(torch.from_numpy(X), requires_grad=False).type(torch.FloatTensor)
    y = Variable(torch.from_numpy(Y), requires_grad=False).type(torch.FloatTensor)
    print("created torch variables {} {}".format(x.size(), y.size()))
    return x, y, W

def initialize(tup):
    x, y = tup[0]  # data
    m, o = tup[1]  # model and optimizer
    model, optimizer = torch_step(x, y, m, o)
    # here we have the gradients
    print('gradient: {}'.format([param.grad.data for param in model.parameters()]))
    return (x, y), (model, optimizer)

def create_model():
    model = linmodel.LinModel(1, 5)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    return model, optimizer

def torch_step(x, y, model, optimizer):
    prediction = model(x)
    loss = linmodel.cost(y, prediction)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return model, optimizer

def main(sc, num_partitions=4):
    x, y, W = prepare_input()
    parts_x = list(torch.split(x, int(x.size()[0] / num_partitions)))
    parts_y = list(torch.split(y, int(x.size()[0] / num_partitions), 1))
    rdd_models = sc.parallelize([create_model() for _ in range(num_partitions)]).repartition(num_partitions)
    rdd_x = sc.parallelize(parts_x).repartition(num_partitions)
    rdd_y = sc.parallelize(parts_y).repartition(num_partitions)
    parts = rdd_x.zip(rdd_y)  # [((100x1), (5x100)), ...]
    full = parts.zip(rdd_models).map(initialize).cache()
    models_out = full.map(lambda x: x[1][0]).collect()
    test_model = models_out[0]
    print(type(test_model))
    print('gradient: {}'.format([param.grad.data for param in test_model.parameters()]))

if __name__ == '__main__':
    sc = SparkContext(appName='test')
    main(sc)
As you can see in the comments, when the function initialize is mapped over the full RDD, the executor logs show that the gradients are computed.
When I collect the result and try to access the very same attribute on the driver, I receive an AttributeError: 'NoneType' object has no attribute 'data',
meaning that all the param.grad attributes are set to None.
I'm sure I'm missing something big here, but I cannot see it.
Any hint is appreciated.
Thanks a lot.
There are two major mistakes in your approach (in my view):
Since you want distributed training, your approach of instantiating the model separately in all of the executors is wrong. You should instantiate the model on the head node (the node where the Spark driver is located) and then distribute that model to all the executors. Each executor then independently does a forward pass and calculates the gradients on its portion of the data, and passes the gradients to the head node for the weight update (the weight update has to be serialized). The updated network is then scattered to the executors again for the next iteration.
A much bigger concern is that I am not sure the gradient buffers are copied from the executors to the head node when you perform .collect(), which would explain why model.grad is set to None. To begin debugging, I suggest you use only one executor (and one partition) and then perform a .collect() to see whether the gradient buffers are being copied. Or, if you are good at Java or Scala, you can look at the collect() method's implementation.
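One debugging sketch along those lines (my own suggestion, untested on a real cluster; it mirrors the names in your main()): return the gradients from the executors explicitly as plain tensors, so you don't depend on grad buffers surviving the collect, and average them on the driver:

def step_and_get_grads(tup):
    (x, y), (model, optimizer) = tup
    model, optimizer = torch_step(x, y, model, optimizer)
    # copy the grad buffers into ordinary tensors before serialization
    return [param.grad.data.clone() for param in model.parameters()]

grads_per_executor = parts.zip(rdd_models).map(step_and_get_grads).collect()

# average the gradients on the driver and apply them to a driver-side model
driver_model, driver_optimizer = create_model()
driver_optimizer.zero_grad()
for i, param in enumerate(driver_model.parameters()):
    param.grad = torch.stack([g[i] for g in grads_per_executor]).mean(dim=0)
driver_optimizer.step()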
Hope this helps.