pytorch simple custom recurrent layer extremely slow - pytorch

I implemented a very simple custom recurrent layer in pytorch using PackedSequence. The layer slows down my network in the order of x20. I read about slow down on custom layers without using JIT, but in the order of x1.7, which is something I could live with.
I am simply indexing the packed sequences per sequence and performing a recursion.
I have the suspicion some of the code is not executed on the GPU?
I'm also grateful for any other tips how to implement this type of RNN (essentially not having a dense layer, without any mixing between features).
import torch
import torch.nn as nn
from torch.nn.utils.rnn import PackedSequence
def getPackedSequenceIndices(batch_sizes):
"""input: batch_sizes from PackedSequence object
requires length-sorted sequences!
"""
nBatches = batch_sizes[0]
seqIdx = []
for ii in range(nBatches):
seqLen = torch.sum((batch_sizes - ii) > 0).item()
idx = torch.LongTensor(seqLen)
idx[0] = ii
idx[1:] = batch_sizes[0:seqLen-1]
seqIdx.append( torch.cumsum(idx, dim=0) )
return seqIdx
class LinearRecursionLayer(nn.Module):
"""Linear recursive smoothing layer with trainable smoothing constants."""
def __init__(self, feat_dim, alpha_smooth=0.5):
super(LinearRecursionLayer, self).__init__()
self.feat_dim = feat_dim
# trainable parameters
self.alpha_smooth = nn.Parameter(alpha_smooth*torch.ones(self.feat_dim))
self.wx = nn.Parameter(torch.ones(self.feat_dim))
self.activ = nn.Tanh
def forward(self, x):
if isinstance(x, PackedSequence):
seqIdx = getPackedSequenceIndices(x.batch_sizes)
ydata = torch.zeros_like(x.data)
for idx in seqIdx:
y_frame = x.data[idx[0]] # init with first frame
# iterate over sequence
for nn in idx:
x_frame = x.data[nn]
y_frame = self.alpha_smooth*y_frame + (1-self.alpha_smooth)*x_frame # smoothing recurrence
ydata[nn,:] = self.activ(self.wx*(y_frame))
y = PackedSequence(ydata, x.batch_sizes) # pack
else: # tensor
raise ValueError('not implemented')
return y

Related

Efficiently training a network simultaneously on labels and partial derivaties

I'm trying to train a network in pytorch along the lines of this idea.
The author creates a simple MLP (4 hidden layers) and then explicitly works out what the partial derivatives of the output is wrt the inputs. He then trains the network on the training labels as well as the gradients of the output wrt the input data (which is also part of the training data).
To replicate the idea in pytorch, my training loop looks like this:
import torch
import torch.nn.functional as F
class vanilla_net(torch.nn.Module):
def __init__(self,
input_dim, # dimension of inputs, e.g. 10
hidden_units, # units in hidden layers, assumed constant, e.g. 20
hidden_layers): # number of hidden layers, e.g. 4):
super(vanilla_net, self).__init__()
self.input = torch.nn.Linear(input_dim, hidden_units)
self.hidden = torch.nn.ModuleList()
for hl in range(hidden_layers):
layer = torch.nn.Linear(hidden_units, hidden_units)
self.hidden.append(layer)
self.output = torch.nn.Linear(hidden_units, 1)
def forward(self, x):
x = self.input(x)
x = F.softplus(x)
for h in self.hidden:
x = h(x)
x = F.softplus(x)
x = self.output(x)
return x
....
def lossfn(x, y, dx, dy):
# some loss function involving both sets of training data (y and dy)
# the network outputs x and what's needed is an efficient way of calculating dx - the partial
# derivatives of x wrt the batch inputs.
pass
def train(net, x_train, y_train, dydx_train, batch_size=256)
m, n = x_train.shape
first = 0
last = min(batch_size, m)
while first < m:
xi = x_train[first:last]
yi = y_train[first:last]
zi = dydx_train[first:last]
xi.requires_grad_()
# Perform forward pass
outputs = net(xi)
minimizer.zero_grad()
outputs.backward(torch.ones_like(outputs), create_graph=True)
xi_grad = xi.grad
# Compute loss
loss = lossfn(outputs, yi, xi_grad, zi)
minimizer.zero_grad()
# Perform backward pass
loss.backward()
# Perform optimization
minimizer.step()
first = last
last = min(first + batch_size, m)
net = vanilla_net(4, 10, 4)
minimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
...
This seems to work but is there a more elegant/efficient way to achieve the same thing? Also - not sure I know where the best place to put the minimizer.zero_grad()
Thanks

How to drop running stats to default value for Norm layer in pyTorch?

I trained model on some images. Now to fit similar dataset but with another colors I want to load this model but also i want to drop all running stats from Batchnorm layers (set them to default value, like totally untrained). What parameters should i reset? Simple model looks like this
import torch
import torch.nn as nn
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv0 = nn.Conv2d(3, 3, 3, padding = 1)
self.norm = nn.BatchNorm2d(3)
self.conv = nn.Conv2d(3, 3, 3, padding = 1)
def forward(self, x):
x = self.conv0(x)
x = self.norm(x)
return self.conv(x)
net = Net()
##or for pretrained it will be
##net = torch.load('net.pth')
def drop_to_default():
for m in net.modules():
if type(m) == nn.BatchNorm2d:
####???####
drop_to_default()
Simplest way to do that is to run reset_running_stats() method on BatchNorm objects:
def drop_to_default():
for m in net.modules():
if type(m) == nn.BatchNorm2d:
m.reset_running_stats()
Below is this method's source code:
def reset_running_stats(self) -> None:
if self.track_running_stats:
# running_mean/running_var/num_batches... are registered at runtime depending
# if self.track_running_stats is on
self.running_mean.zero_() # Zero (neutral) mean
self.running_var.fill_(1) # One (neutral) variance
self.num_batches_tracked.zero_() # Number of batches tracked
You can see the source code here, _NormBase class.

how to get the real shape of batch_size which is none in keras

When implementing a custom layer in Keras, I need to know the real size of batch_size. my shape is (?,20).
questions:
1. What is the best way to change (?,20) to (batch_size,20).
I have looked into this but it can not adjust to my problem.
I can pass the batch_size to this layer. In that case, I need to reshape (?,20) to (batch_size,20), how can I do that?
2. Is it the best way to that, or is there any builtin function that can get the real batch_size while building and running the model?
This is my layer:
from scipy.stats import entropy
from keras.engine import Layer
import keras.backend as K
import numpy as np
class measure(Layer):
def __init__(self, beta, **kwargs):
self.beta = beta
self.uses_learning_phase = True
self.supports_masking = True
super(measure, self).__init__(**kwargs)
def call(self, x):
return K.in_train_phase(self.rev_entropy(x, self.beta), x)
def get_config(self):
config = {'beta': self.beta}
base_config = super(measure, self).get_config()
return dict(list(base_config.items()) + list(config.items()))
def rev_entropy(self, x, beta):
entropy_p_t_w = np.apply_along_axis(entropy, 1, x)
con = (beta / (1 + entropy_p_t_w)) ** 1.5
new_f_w_t = x * (con.reshape(con.shape[0], 1))
norm_const = 1e-30 + np.sum(new_f_w_t, axis=0)
for t in range(norm_const.shape[0]):
new_f_w_t[:, t] /= norm_const[t]
return new_f_w_t
And here is where I call this layer:
encoded = measure(beta=0.08)(encoded)
I am also using fit_generator if it can help at all:
autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS,
validation_data=test_gen, validation_steps=num_test_steps, callbacks=[checkpoint])
The dimension of the x passed to the layer is (?,20) and that's why I can not do my calculation.
Thanks:)

Implementing RNN and LSTM into DQN Pytorch code

I have some troubles finding some example on the great www to how i implement a recurrent neural network with LSTM layer into my current Deep q-network in Pytorch so it become a DRQN.. Bear with me i am just getting started..
Futhermore, I am NOT working with images processing, thereby CNN so do not worry about this. My states are purely temperatures values.
Here is my code that i am currently train my DQN with:
# Importing the libraries
import numpy as np
import random # random samples from different batches (experience replay)
import os # For loading and saving brain
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim # for using stochastic gradient descent
import torch.autograd as autograd # Conversion from tensor (advanced arrays) to avoid all that contains a gradient
# We want to put the tensor into a varaible taht will also contain a
# gradient and to this we need:
from torch.autograd import Variable
# to convert this tensor into a variable containing the tensor and the gradient
# Creating the architecture of the Neural Network
class Network(nn.Module): #inherinting from nn.Module
#Self - refers to the object that will be created from this class
# - self here to specify that we're referring to the object
def __init__(self, input_size, nb_action): #[self,input neuroner, output neuroner]
super(Network, self).__init__() #inorder to use modules in torch.nn
# Input and output neurons
self.input_size = input_size
self.nb_action = nb_action
# Full connection between different layers of NN
# In this example its one input layer, one hidden layer and one output layer
# Using self here to specify that fc1 is a variable of my object
self.fc1 = nn.Linear(input_size, 40)
self.fc2 = nn.Linear(40, 30)
#Example of adding a hiddenlayer
# self.fcX = nn.Linear(30,30)
self.fc3 = nn.Linear(30, nb_action) # 30 neurons in hidden layer
# For function that will activate neurons and perform forward propagation
def forward(self, state):
# rectifier function
x = F.relu(self.fc1(state))
x = F.relu(self.fc2(x))
q_values = self.fc3(x)
return q_values
# Implementing Experience Replay
# We know that RL is based on MDP
# So going from one state(s_t) to the next state(s_t+1)
# We gonna put 100 transition between state into what we call the memory
# So we can use the distribution of experience to make a decision
class ReplayMemory(object):
def __init__(self, capacity):
self.capacity = capacity #100 transitions
self.memory = [] #memory to save transitions
# pushing transitions into memory with append
#event=transition
def push(self, event):
self.memory.append(event)
if len(self.memory) > self.capacity: #memory only contain 100 events
del self.memory[0] #delete first transition from memory if there is more that 100
# taking random sample
def sample(self, batch_size):
#Creating variable that will contain the samples of memory
#zip =reshape function if list = ((1,2,3),(4,5,6)) zip(*list)= (1,4),(2,5),(3,6)
# (state,action,reward),(state,action,reward)
samples = zip(*random.sample(self.memory, batch_size))
#This is to be able to differentiate with respect to a tensor
#and this will then contain the tensor and gradient
#so for state,action and reward we will store the seperately into some
#bytes which each one will get a gradient
#so that eventually we'll be able to differentiate each one of them
return map(lambda x: Variable(torch.cat(x, 0)), samples)
# Implementing Deep Q Learning
class Dqn():
def __init__(self, input_size, nb_action, gamma, lrate, T):
self.gamma = gamma #self.gamma gets assigned to input argument
self.T = T
# Sliding window of the evolving mean of the last 100 events/transitions
self.reward_window = []
#Creating network with network class
self.model = Network(input_size, nb_action)
#creating memory with memory class
#We gonna take 100000 samples into memory and then we will sample from this memory to
#to get a snakk number of random transitions
self.memory = ReplayMemory(100000)
#creating optimizer (stochastic gradient descent)
self.optimizer = optim.Adam(self.model.parameters(), lr = lrate) #learning rate
#input vector which is batch of input observations
#by unsqeeze we create a fake dimension to this is
#what the network expect for its inputs
#have to be the first dimension of the last_state
self.last_state = torch.Tensor(input_size).unsqueeze(0)
#Inilizing
self.last_action = 0
self.last_reward = 0
def select_action(self, state):
#Q value depends on state
#Temperature parameter T will be a positive number and the closer
#it is to ze the less sure the NN will when taking an action
#forexample
#softmax((1,2,3))={0.04,0.11,0.85} ==> softmax((1,2,3)*3)={0,0.02,0.98}
#to deactivate brain then set T=0, thereby it is full random
probs = F.softmax((self.model(Variable(state, volatile = True))*self.T),dim=1) # T=100
#create a random draw from the probability distribution created from softmax
action = probs.multinomial()
print(probs.multinomial())
return action.data[0,0]
# See section 5.3 in AI handbook
def learn(self, batch_state, batch_next_state, batch_reward, batch_action):
outputs = self.model(batch_state).gather(1, batch_action.unsqueeze(1)).squeeze(1)
#next input for target see page 7 in attached AI handbook
next_outputs = self.model(batch_next_state).detach().max(1)[0]
target = self.gamma*next_outputs + batch_reward
#Using hubble loss inorder to obtain loss
td_loss = F.smooth_l1_loss(outputs, target)
#using lass loss/error to perform stochastic gradient descent and update weights
self.optimizer.zero_grad() #reintialize the optimizer at each iteration of the loop
#This line of code that backward propagates the error into the NN
#td_loss.backward(retain_variables = True) #userwarning
td_loss.backward(retain_graph = True)
#And this line of code uses the optimizer to update the weights
self.optimizer.step()
def update(self, reward, new_signal):
#Updated one transition and we have dated the last element of the transition
#which is the new state
new_state = torch.Tensor(new_signal).float().unsqueeze(0)
self.memory.push((self.last_state, new_state, torch.LongTensor([int(self.last_action)]), torch.Tensor([self.last_reward])))
#After ending in a state its time to play a action
action = self.select_action(new_state)
if len(self.memory.memory) > 100:
batch_state, batch_next_state, batch_action, batch_reward = self.memory.sample(100)
self.learn(batch_state, batch_next_state, batch_reward, batch_action)
self.last_action = action
self.last_state = new_state
self.last_reward = reward
self.reward_window.append(reward)
if len(self.reward_window) > 1000:
del self.reward_window[0]
return action
def score(self):
return sum(self.reward_window)/(len(self.reward_window)+1.)
def save(self):
torch.save({'state_dict': self.model.state_dict(),
'optimizer' : self.optimizer.state_dict(),
}, 'last_brain.pth')
def load(self):
if os.path.isfile('last_brain.pth'):
print("=> loading checkpoint... ")
checkpoint = torch.load('last_brain.pth')
self.model.load_state_dict(checkpoint['state_dict'])
self.optimizer.load_state_dict(checkpoint['optimizer'])
print("done !")
else:
print("no checkpoint found...")
I hope there is someone out there that can help me and could implement a RNN and a LSTM layer into my code! I believe in you stackflow!
Best regards Søren Koch
From my point of view, I think you could add RNN, LSTM layer to the Network#__init__,Network#forward; shape of data should be reshaped into sequences...
For more detail, I think you should read these two following articles; after that implementing RNN, LSTM not hard as it seem to be.
http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py
http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

pyspark and pytorch: unable to retrieve gradients on the driver

I'm new to pytorch and I'm trying to explore the feasibility of its usage with spark (for now I'm working in spark standalone).
As for now I'm struggling on a very specific topic.
Let's start with a very simple model:
# linmodel.py
import torch
import torch.nn as nn
import numpy as np
def standardize(x):
return (x - np.mean(x)) / np.std(x)
def add_noise(y):
rnd = np.random.randn(y.shape[0])
return y + rnd
def cost(target, predicted):
cost = torch.sum((torch.t(target) - predicted) ** 2)
return cost
class LinModel(nn.Module):
def __init__(self, in_size, out_size):
super(LinModel, self).__init__() # always call parent's init
self.linear = nn.Linear(in_size, out_size, bias=False) # layer parameters
def forward(self, x):
return self.linear(x)
Which instantiates a basic linear model, along with some utility functions.
The goal is to approximate a target matrix, and to keep track of how the
gradients behave.
I'm trying to achieve the following:
create my target matrix
split the inputs on the workers
instantiate models and optimizer on the workers
compute the approximation on subsets of input
retrieve the gradients for further analysis
And everything works fine until point 5.
Here's the code:
#test.py
import torch
import torch.nn as nn
import numpy as np
import torch.optim
from torch.autograd import Variable
from pyspark import SparkContext
import linmodel
def prepare_input(nsamples=400):
Xold = np.linspace(0, 1000, nsamples).reshape([nsamples, 1])
X = linmodel.standardize(Xold)
W = np.random.randint(1, 10, size=(5, 1))
Y = W.dot(X.T) # target
for i in range(Y.shape[1]):
Y[:, i] = linmodel.add_noise(Y[:, i])
x = Variable(torch.from_numpy(X), requires_grad=False).type(torch.FloatTensor)
y = Variable(torch.from_numpy(Y), requires_grad=False).type(torch.FloatTensor)
print("created torch variables {} {}".format(x.size(), y.size()))
return x, y, W
def initialize(tup):
x, y = tup[0] # data
m, o = tup[1] # model and optimizer
model, optimizer = torch_step(x, y, m, o)
# here we have the gradients
print('gradient: {}'.format([param.grad.data for param in model.parameters()]))
return (x, y), (model, optimizer)
def create_model():
model = linmodel.LinModel(1, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
return model, optimizer
def torch_step(x, y, model, optimizer):
prediction = model(x)
loss = linmodel.cost(y, prediction)
optimizer.zero_grad()
loss.backward()
optimizer.step()
return model, optimizer
def main(sc, num_partitions=4):
x, y, W = prepare_input()
parts_x = list(torch.split(x, int(x.size()[0] / num_partitions)))
parts_y = list(torch.split(y, int(x.size()[0] / num_partitions), 1))
rdd_models = sc.parallelize([create_model() for _ in range(num_partitions)]).repartition(num_partitions)
rdd_x = sc.parallelize(parts_x).repartition(num_partitions)
rdd_y = sc.parallelize(parts_y).repartition(num_partitions)
parts = rdd_x.zip(rdd_y) # [((100x1), (5x100)), ...]
full = parts.zip(rdd_models).map(initialize).cache()
models_out = full.map(lambda x: x[1][0]).collect()
test_model = models_out[0]
print(type(test_model))
print('gradient: {}'.format([param.grad.data for param in test_model.parameters()]))
if __name__ == '__main__':
sc = SparkContext(appName='test')
main(sc)
As you can see in the comments , when the function initialize is mapped on the full rdd, if you inspect the logs of the executors you'll find the gradients to be computed.
When I collect the result and try to access the very same attribute on the driver I receive a AttributeError: 'NoneType' object has no attribute 'data'
meaning that all the model.grad attribute are set to None.
I'm sure I'm missing something big here, but I cannot see it.
Any hint is appreciated.
Thanks a lot.
There are two major mistakes in your approach (according to me):
Since you want a distributed training, your approach of instantiating the model separately in all of the executors is wrong. You should instantiate the model in the head node (node where spark driver is located) and then distribute that model to all the executors. So each executor independently does forward pass and calculates the gradients on its portion of data and passes the gradients to the head node for weight update (weight update has to be serialized). Then the updated network is again scattered to the executors for the next iteration.
A much bigger concern is that I am not very sure if the gradient buffers are copied to the head node from the executors when you perform .collect(). Due to which model.grad can be set to None. To begin debugging, I suggest you have only one executor (and 1 partition) and then perform a .collect() to see if the gradient buffers are being copied. Or if you are good at Java or Scala, you can look at the collect() method's implementation.
Hope this helps.....

Resources