PyTorch DataParallel with custom model

I want to train my model on multiple GPUs. I'm using the following code:
model = load_model(path)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
model.to(device)
It works well, except that the DataParallel wrapper doesn't expose the functions defined on the original model. Is there a way around this? Thank you.

The nn.Module passed to nn.DataParallel will end up being wrapped by the class to handle data parallelism. You can still access your model with the module attribute.
>>> p_model = nn.DataParallel(model)
>>> p_model.module # <- model
For instance, to access your underlying model's quantize attribute, you would do:
>>> p_model.module.quantize
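As a minimal sketch (assuming model, device, and a batch tensor are already defined, and assuming quantize is a method on the original model), forward passes still go through the wrapper while custom methods and the state dict are reached through .module:
import torch
import torch.nn as nn

p_model = nn.DataParallel(model).to(device)

out = p_model(batch)                     # forward pass is split across the GPUs
p_model.module.quantize()                # custom method defined on the wrapped model
torch.save(p_model.module.state_dict(), "model.pt")  # saves keys without the "module." prefix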

Related

Transferring a pretrained PyTorch model to ONNX

I am trying to convert a PyTorch model to ONNX in order to use it later with TensorRT. I followed this tutorial: https://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html, but my kernel dies every time.
This is the code that I implemented.
# Some standard imports
import io
import numpy as np
import torch
from torch import nn
import torch.onnx
from deepformer.nets.quicknat import quickNAT

param = {
    'num_channels': 64,
    'num_filters': 64,
    'kernel_h': 5,
    'kernel_w': 5,
    'kernel_c': 1,
    'stride_conv': 1,
    'pool': 2,
    'stride_pool': 2,
    'num_classes': 1,
    'padding': 'reflection'
}

net = quickNAT(param)
checkpoint_path = 'checkpoint_epoch36_loss0.78.t7'

# load the checkpoint, mapping storages to CPU unless a GPU is available
map_location = lambda storage, loc: storage
if torch.cuda.is_available():
    map_location = None
checkpoints = torch.load(checkpoint_path, map_location=map_location)
net.load_state_dict(checkpoints['net'])
net.train(False)

# Input to the model
x = torch.rand(1, 64, 256, 1600, requires_grad=True)

# Export the model
torch_out = torch.onnx._export(net,               # model being run
                               x,                 # model input (or a tuple for multiple inputs)
                               "quicknat.onnx",   # where to save the model (can be a file or file-like object)
                               export_params=True)  # store the trained parameter weights inside the model file
What output do you get? It seems SuperResolution is supported by the export operators in PyTorch, as mentioned in the documentation.
Are you sure the input to your model is:
x = torch.rand(1, 64, 256, 1600, requires_grad=True)
That could be the shape you used for training. Since for deployment you run the network on one or more images, the dummy input used to export to ONNX is usually something like:
dummy_input = torch.randn(1, 3, 720, 1280, device='cuda')
Here 1 is the batch size, 3 is the number of image channels (RGB), and 720x1280 is the image size. Double-check that input; I guess you don't have a 64-channel image as input, right?
Also, it would be helpful if you posted the terminal output to see where it fails.
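A minimal export sketch along those lines, assuming net is the loaded model in eval mode and that it really expects 3-channel 720x1280 images (adjust the dummy shape to whatever the network was actually trained on):
import torch

net.eval()
dummy_input = torch.randn(1, 3, 720, 1280)    # batch, channels, height, width
torch.onnx.export(net,                         # model being run
                  dummy_input,                 # example input that defines the traced shapes
                  "quicknat.onnx",             # output file
                  export_params=True)          # embed the trained weights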
Good luck!

Create a tree structure of Keras layers in Python

I'm trying to obtain the structure of a Keras model's layers. With a simple network, this is possible by just iterating through model.layers. However, if the network is more complex (e.g. it concatenates different layers), this approach is not sufficient.
This is an example:
from keras.layers import Input, Dense
from keras.models import Model
import keras

FEATURES = ['A', 'B', 'C', 'D']
IMPORTANT_FEATURES = [0, 3]
NORMAL_FEATURES = [1, 2]

inputLayer = [Input(shape=(1,)) for i, f in enumerate(FEATURES)]
importantInput = keras.layers.Concatenate(axis=-1)([inputLayer[i] for i in IMPORTANT_FEATURES])
layer1 = Dense(8)(importantInput)  # units chosen arbitrarily for the example
normalInput = keras.layers.Concatenate(axis=-1)([layer1] + [inputLayer[i] for i in NORMAL_FEATURES])
layer2 = Dense(8)(normalInput)
model = Model([inputLayer[i] for i in range(len(FEATURES))], layer2)
This produces in model.layers a list of Keras layers like this:
[Input1, Input2, Concatenate1, Dense1, Input3, Input4, Concatenate2, Dense2]
The only way to find out which layers feed into a Concatenate is to access Concatenate1.input. However, this returns a list of TensorFlow tensors, not Keras layers.
Is it possible to obtain a tree-structure of layers just using Keras layers?
The Keras list of input layers can be obtained by Concatenate1._inbound_nodes[0].inbound_layers
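Building on that attribute, here is a small sketch that prints the layer tree by walking backwards from a given layer of the model built above. It assumes a Keras version exposing the private _inbound_nodes attribute; in some versions inbound_layers is a single layer rather than a list, which the helper accounts for.
def print_layer_tree(layer, depth=0, seen=None):
    # Recursively print `layer` and the layers feeding into it.
    if seen is None:
        seen = set()
    print('  ' * depth + layer.name)
    if layer in seen:                        # avoid looping on shared layers
        return
    seen.add(layer)
    for node in layer._inbound_nodes:
        parents = node.inbound_layers
        if not isinstance(parents, (list, tuple)):
            parents = [parents]              # some Keras versions return a single layer
        for parent in parents:
            print_layer_tree(parent, depth + 1, seen)

print_layer_tree(model.layers[-1])           # start from the final Dense layer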

Convert tensor to numpy without a session

I'm using TensorFlow's Estimator library in Python. I want to train a student network using a pre-trained teacher. I'm facing the following issue.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_data},
    y=train_labels,
    batch_size=100,
    num_epochs=None,
    shuffle=True)

student_classifier.train(
    input_fn=train_input_fn,
    steps=20,
    hooks=None)
This code returns a generator object that is passed to the student classifier. Inside the generator, we have the inputs and labels (in batches of 100) as tensors. The problem is that I want to pass the same values to the teacher model and extract its softmax outputs. Unfortunately, the teacher's input function requires numpy arrays, as shown below:
student_classifier = tf.estimator.Estimator(
    model_fn=student_model_fn, model_dir="./models/mnist_student")

def student_model_fn(features, labels, mode):
    sess = tf.InteractiveSession()
    tf.train.start_queue_runners(sess)
    data = features['x'].eval()
    out = labels.eval()
    sess.close()
    input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
    eval_teacher_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": data},
        y=out,
        num_epochs=1,
        shuffle=False)
This requires x and y to be numpy arrays, so I converted them with the ugly hack of using a session to evaluate the tensors. Is there a better way of doing this?
P.S. I tried tf.estimator.Estimator.get_variable_value(), but it retrieves the model's weights, not its inputs and outputs.
Convert the tensor to a numpy array using tf.make_ndarray.
tf.make_ndarray() creates a numpy ndarray with the same shape and data as the tensor.
Sample working code:
import tensorflow as tf
a = tf.constant([[1,2,3],[4,5,6]])
proto_tensor = tf.make_tensor_proto(a)
tf.make_ndarray(proto_tensor)
output:
array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)
# output has shape (2, 3)
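As a small follow-up sketch (an assumption about the setup, not from the thread): this round trip only works when the tensor's value is already concrete, e.g. a constant or an eager tensor. A symbolic graph tensor such as features['x'] inside model_fn still has to be evaluated before it can be converted.
import tensorflow as tf

def to_numpy(tensor):
    # works for tensors whose value is known without running a session
    return tf.make_ndarray(tf.make_tensor_proto(tensor))

a = tf.constant([[1, 2, 3], [4, 5, 6]])
print(to_numpy(a))   # numpy array with shape (2, 3)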

Input tensors to a Model must be Keras tensors

Input tensors to a Model must be Keras tensors. Found:
Tensor("my_layer/Identity:0", shape=(?, 10, 1152, 16), dtype=float32)
(missing Keras metadata).
Hi, I get this error when trying to take one layer's intermediate variable and use it as the input to a parallel network, so that one layer's intermediate variable becomes the input of the other network.
def call(self, inputs, training=None):
    inputs_expand = K.expand_dims(inputs, 1)
    tensor_b = K.tile(inputs_expand, [1, 16, 1, 1])
    tensor_a = K.map_fn(lambda x: K.batch_dot(x, self.Weights, [2, 3]), elems=tensor_b)
    # I need this tensor_a
    # I tried many things but ended up putting it in a member variable.
    self.tensor_a = K.identity(tensor_a)
    ...
Outside, when trying to build the parallel model, I do this:
a_model = models.Model([my_layer.tensor_a],[my_layer.c])
I could not find any good solution to this problem. How can I turn the tensor into a Keras tensor?
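One possible direction (an assumption, sketched with tf.keras and stand-in computations rather than the poster's exact layer): return the intermediate tensor as a second output of call(). Tensors returned by a layer call carry Keras metadata, so an auxiliary model can be built from the original Input tensors to that output.
from tensorflow import keras
from tensorflow.keras import backend as K

class MyLayer(keras.layers.Layer):
    def call(self, inputs):
        tensor_b = K.tile(K.expand_dims(inputs, 1), [1, 16, 1, 1])
        tensor_a = K.identity(tensor_b)       # stand-in for the real computation
        main_out = K.sum(tensor_a, axis=1)    # stand-in for the layer's real output
        return [main_out, tensor_a]

inp = keras.Input(shape=(4, 8))
main_out, tensor_a = MyLayer()(inp)
aux_model = keras.Model(inp, tensor_a)        # accepted: tensor_a carries Keras metadata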

Model selection with Keras and Theano takes a very long time

I am performing nested cross-validation for model selection and performance estimation for a set of recurrent neural networks with different architectures and parameters, using Keras and Theano, which are set up to run on an AWS P2 instance that has a Tesla K80 GPU with CUDA and cuDNN installed and enabled.
To perform model selection, I compare 30 models sampled from the parameter space using
param_grid = {
    'nb_hidden_layers': [1, 2, 3],
    'dropout_frac': [0.15, 0.20],
    'output_activation': ['sigmoid', 'softmax'],
    'optimization': ['Adedelta', 'RMSprop', 'Adam'],
    'learning_rate': [0.001, 0.005, 0.010],
    'batch_size': [64, 100, 150, 200],
    'nb_epoch': [10, 15, 20],
    'perform_batchnormalization': [True, False]
}
params_list = list(ParameterSampler(param_grid, n_iter=30))
I then construct an RNN model using the function NeuralNetworkClassifier() defined below:
def NeuralNetworkClassifier(params, units_in_hidden_layer=[50, 75, 100, 125, 150]):
    nb_units_in_hidden_layers = np.random.choice(units_in_hidden_layer, size=params['nb_hidden_layers'], replace=False)
    layers = [8]  # number of features in every week
    layers.extend(nb_units_in_hidden_layers)
    layers.extend([1])  # node identifying quit/stay
    model = Sequential()

    # constructing all layers up to, but not including, the penultimate one
    layer_idx = -1  # this ensures proper generalization for nb_hidden_layers = 1 (for which the loop below will never run)
    for layer_idx in range(len(layers) - 3):
        model.add(LSTM(input_dim=layers[layer_idx], output_dim=layers[layer_idx + 1], init='he_uniform', return_sequences=True))  # all LSTM layers, up to and including the penultimate one, need return_sequences = True
        if params['perform_batchnormalization'] == True:
            model.add(BatchNormalization())
            model.add(Activation('relu'))
        model.add(Dropout(params['dropout_frac']))

    # constructing the penultimate layer
    model.add(LSTM(input_dim=layers[layer_idx + 1], output_dim=layers[(layer_idx + 1) + 1], init='he_uniform', return_sequences=False))  # the last LSTM layer needs return_sequences = False
    if params['perform_batchnormalization'] == True:
        model.add(BatchNormalization())
        model.add(Activation('relu'))
    model.add(Dropout(params['dropout_frac']))

    # constructing the final layer
    model.add(Dense(output_dim=layers[-1], init='he_normal'))
    model.add(Activation(params['output_activation']))

    if params['optimization'] == 'SGD':
        optim = SGD()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'RMSprop':
        optim = RMSprop()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'Adam':
        optim = Adam()
    elif params['optimization'] == 'Adedelta':
        optim = Adadelta()

    model.compile(loss='binary_crossentropy', optimizer=optim, metrics=['precision'])
    return model
which constructs an RNN whose number of hidden layers is given by the parameter 'nb_hidden_layers' in param_grid and whose number of hidden units in each layer is randomly sampled from the list [50, 75, 100, 125, 150]. At the end, this function compiles the model and returns it.
During the nested cross-validation (CV), the inner loop (which runs IN times) compares the performance of the 30 randomly selected models. After this step, I pick the best-performing model in the outer loop and estimate its performance on a hold-out dataset; this scheme is repeated OUT times. Therefore, I am compiling an RNN model OUT x IN x 30 times, and this takes an extremely long time; for example, when OUT=4 and IN=3, my method takes 6 to 7 hours to finish.
I see that the GPU is being used sporadically (the GPU usage never goes above 40%); however, most of the time, it is the CPU that is being used. My (uneducated) guess is that compile is being done on the CPU many, many times and takes the bulk of the computing time, whereas model fitting and prediction are done on the GPU and take a short time.
My questions:
Is there a way to remedy this situation?
Is compile actually done on the CPU?
How do people do nested CV to select the best RNN architecture?
Is it reasonable for me to perform this scheme on the production server? Do you suggest I do one big nested CV, which might take 24 hours, to select the best-performing model and then just use that one model afterwards on the production server?
Thank you all.
I can't answer all your questions, but I still hope this helps.
Compilation is done on the CPU because it mainly consists of symbolic graph operations and code generation. To make things worse, Theano graph optimization is implemented in pure Python, which adds overhead compared to a C/C++ implementation.
To improve Theano compilation time (at the cost of runtime performance):
Use less aggressive optimization
In /home/ec2-user/.theanorc add line:
optimizer = fast_compile
Or totally disable optimization with:
optimizer = None
Precompile some blocks
If there are common blocks shared among your models, you can precompile them with theano.OpFromGraph, as sketched below.
You can't do this in Keras alone, though.
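A minimal illustration of the idea (a hypothetical toy block, not tied to the poster's models):
import theano
import theano.tensor as T

# Wrap a small shared sub-graph into a single Op, so its internals are
# handled once instead of once per model that uses the block.
x, w, b = T.vectors('x', 'w', 'b')
affine_tanh = theano.OpFromGraph([x, w, b], [T.tanh(x * w + b)])

y = affine_tanh(x, w, b)             # reuse the precompiled block in a larger graph
f = theano.function([x, w, b], y)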
Switch framework
Keras also supports the TensorFlow backend. Compared to Theano, TensorFlow works more like a VM than a compiler. Typically TF runs slower than Theano but compiles much faster.
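For reference, a sketch of switching the backend for this (multi-backend) generation of Keras; the exact configuration depends on your Keras version:
# In ~/.keras/keras.json set:
#     { "backend": "tensorflow", ... }
# or choose the backend per run with an environment variable,
# which must be set before keras is imported:
import os
os.environ['KERAS_BACKEND'] = 'tensorflow'

import keras
print(keras.backend.backend())   # -> 'tensorflow'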
