Model selection with Keras and Theano takes a very long time

I am performing nested cross-validation for model selection and performance estimation for a set of recurrent neural networks with different architectures and parameters, using Keras and Theano. The models run on an AWS P2 instance with a Tesla K80 GPU and CUDA and cuDNN installed/enabled.
To perform model selection, I compare 30 models sampled from the parameter space using
param_grid = {
    'nb_hidden_layers': [1, 2, 3],
    'dropout_frac': [0.15, 0.20],
    'output_activation': ['sigmoid', 'softmax'],
    'optimization': ['Adadelta', 'RMSprop', 'Adam'],
    'learning_rate': [0.001, 0.005, 0.010],
    'batch_size': [64, 100, 150, 200],
    'nb_epoch': [10, 15, 20],
    'perform_batchnormalization': [True, False]
}
params_list = list(ParameterSampler(param_grid, n_iter = 30))
I then construct an RNN model using the function NeuralNetworkClassifier() defined below
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation, Dropout
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD, RMSprop, Adam, Adadelta

def NeuralNetworkClassifier(params, units_in_hidden_layer = [50, 75, 100, 125, 150]):
    nb_units_in_hidden_layers = np.random.choice(units_in_hidden_layer, size = params['nb_hidden_layers'], replace = False)
    layers = [8]  # number of features in every week
    layers.extend(nb_units_in_hidden_layers)
    layers.extend([1])  # node identifying quit/stay
    model = Sequential()
    # constructing all layers up to, but not including, the penultimate one
    layer_idx = -1  # this ensures proper generalization when nb_hidden_layers = 1 (for which the loop below will never run)
    for layer_idx in range(len(layers) - 3):
        # all LSTM layers up to, but not including, the last one need return_sequences = True
        model.add(LSTM(input_dim = layers[layer_idx], output_dim = layers[layer_idx + 1], init = 'he_uniform', return_sequences = True))
        if params['perform_batchnormalization'] == True:
            model.add(BatchNormalization())
            model.add(Activation('relu'))
        model.add(Dropout(params['dropout_frac']))
    # constructing the penultimate layer
    # the last LSTM layer needs return_sequences = False
    model.add(LSTM(input_dim = layers[layer_idx + 1], output_dim = layers[(layer_idx + 1) + 1], init = 'he_uniform', return_sequences = False))
    if params['perform_batchnormalization'] == True:
        model.add(BatchNormalization())
        model.add(Activation('relu'))
    model.add(Dropout(params['dropout_frac']))
    # constructing the final layer
    model.add(Dense(output_dim = layers[-1], init = 'he_normal'))
    model.add(Activation(params['output_activation']))
    if params['optimization'] == 'SGD':
        optim = SGD()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'RMSprop':
        optim = RMSprop()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'Adam':
        optim = Adam()
    elif params['optimization'] == 'Adadelta':
        optim = Adadelta()
    model.compile(loss = 'binary_crossentropy', optimizer = optim, metrics = ['precision'])
    return model
which constructs an RNN whose number of hidden layers is given by the parameter 'nb_hidden_layers' in param_grid and whose number of hidden units in each layer is randomly sampled from the list [50, 75, 100, 125, 150]. At the end, this function compiles the model and returns it.
During the nested cross-validation (CV), the inner loop (which runs IN times) compares the performance of the 30 randomly selected models. After this step, I pick the best-performing model in the outer loop and estimate its performance on a hold-out dataset; this scheme is repeated OUT times. Therefore, I am compiling an RNN model OUT x IN x 30 times, and this takes an extremely long time; for example, when OUT=4 and IN=3, my method takes between 6 and 7 hours to finish.
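For reference, a minimal sketch of that nested-CV loop structure (X and y stand for my data arrays; the fold handling and scoring here are only illustrative, not my actual script):
from sklearn.model_selection import KFold

outer_cv = KFold(n_splits=OUT, shuffle=True)
for outer_train, outer_test in outer_cv.split(X):
    inner_cv = KFold(n_splits=IN, shuffle=True)
    for inner_train, inner_val in inner_cv.split(X[outer_train]):
        X_tr, y_tr = X[outer_train][inner_train], y[outer_train][inner_train]
        X_val, y_val = X[outer_train][inner_val], y[outer_train][inner_val]
        for params in params_list:  # the 30 sampled configurations
            model = NeuralNetworkClassifier(params)  # built and compiled OUT x IN x 30 times in total
            model.fit(X_tr, y_tr, batch_size = params['batch_size'], nb_epoch = params['nb_epoch'], verbose = 0)
            val_score = model.evaluate(X_val, y_val, verbose = 0)
    # keep the best configuration, refit it on the whole outer-training fold,
    # and estimate its performance on the held-out outer-test fold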
I see that the GPU is being used sporadically (and the GPU usage never goes above 40%); however, most of the time, it is the CPU that is being used. My (uneducated) guess is that compile is being done on the CPU many, many times and takes the bulk of the computing time, whereas model fitting and predicting are done on the GPU and take a short time.
My questions:
Is there a way to remedy this situation?
Is compile actually done on the CPU?
How do people do nested CV to select the best RNN architecture?
Is it reasonable for me to perform this scheme on the production server? Do you suggest I do one big nested CV, that might take 24 hours, to select the best performing model and just use that one model afterwards on the production server?
Thank you all.

I can't answer all your questions, but I hope this helps.
Compilation is done on the CPU because it mainly consists of symbolic graph operations and code generation. To make things worse, Theano graph optimization uses pure Python code, which is an overhead compared to a C/C++ implementation.
To improve Theano compilation time (at the cost of runtime performance):
Use less aggressive optimization
In /home/ec2-user/.theanorc, under the [global] section, add the line:
optimizer = fast_compile
or disable optimization entirely with:
optimizer = None
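Equivalently (a minimal sketch, in case you prefer not to edit .theanorc), the same flag can be set through the THEANO_FLAGS environment variable before Theano is imported:
import os
os.environ['THEANO_FLAGS'] = 'optimizer=fast_compile'  # or 'optimizer=None'

import theano  # THEANO_FLAGS is read when theano is first imported
print(theano.config.optimizer)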
Precompile some blocks
If there are common blocks shared among your models, you can precompile them with theano.OpFromGraph.
You can't do this in Keras alone, though.
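A minimal sketch of the idea in plain Theano (the sub-graph wrapped here is only a placeholder for whatever block your models actually share):
import theano
import theano.tensor as T

x, y = T.vectors('x', 'y')
# wrap the shared sub-graph once; the resulting op can be reused across models
shared_block = theano.OpFromGraph([x, y], [T.nnet.sigmoid(x + y)])

a, b = T.vectors('a', 'b')
f = theano.function([a, b], shared_block(a, b))  # reuse the wrapped block here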
Switch framework
Keras supports the TensorFlow backend. Compared to Theano, TensorFlow works more like a VM than a compiler. Typically TF runs slower than Theano but compiles much faster.
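A minimal sketch of the switch for the multi-backend Keras of that era (the backend can also be set in ~/.keras/keras.json):
import os
os.environ['KERAS_BACKEND'] = 'tensorflow'  # must be set before keras is imported

from keras import backend as K
print(K.backend())  # should print 'tensorflow'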

Related

keras, LSTM - predict on inputs of different length?

I have fitted an LSTM that deals with inputs of different length:
model = Sequential()
model.add(LSTM(units=10, return_sequences=False, input_shape=(None, 5)))
model.add(Dense(units=1, activation='sigmoid'))
Having fitted the model, I want to test it on inputs of different size.
x_test.shape # = 100000
x_test[0].shape # = (1, 5)
x_test[1].shape # = (3, 5)
x_test[2].shape # = (8, 5)
Testing on a single instance j is not a problem (model.predict(x_test[j])), but looping over all of them is really slow.
Is there a way of speeding up the computation? model.predict(x_test) does not work.
Thank you!
The most common way to speed up model inference is to run inference on a GPU instead of the CPU (I'm assuming you are not already doing that). You can set up GPU support by following the official guide here. Unless you are explicitly asking Keras to run inference on the CPU, your code should work as is, without any changes. To confirm whether you are using the GPU, you can use this article.
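For instance, with a TensorFlow 2.x backend (an assumption, since the question does not state the version), you can check GPU visibility directly:
import tensorflow as tf

# lists the GPUs TensorFlow can see; an empty list means inference runs on the CPU
print(tf.config.list_physical_devices('GPU'))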
Hope the answer was helpful!
The best solution that I have found so far is grouping together data windows with the same length. For my problem, it's enough to significantly speed up the computation.
Hope this trick helps other people.
import numpy as np

def predict_custom(model, x):
    """x should be a list of np.arrays with a different number of rows but the same number of columns"""
    # dictionary with key = length of the window, value = indices of samples with that length
    dic = {}
    for i, sample in enumerate(x):
        if dic.get(sample.shape[0]):
            dic[sample.shape[0]].append(i)
        else:
            dic[sample.shape[0]] = [i]
    y_pred = np.full((len(x), 1), np.nan)
    # loop over the dictionary and predict together samples of the same length
    for length, indexes in dic.items():
        # select samples of the same length (conversion to np.array is used for subsetting "x" with "indexes")
        same_length = np.asarray(x, dtype=object)[indexes].tolist()
        # gather these samples in a 3D np.array
        x_3d = np.stack(same_length, axis=0)
        # use the dictionary values to insert the results in the corresponding rows of y_pred
        y_pred[indexes] = model.predict(x_3d)
    return y_pred
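Hypothetical usage, with x_test being the list of 2-D arrays from the question:
y_pred = predict_custom(model, x_test)
print(y_pred.shape)  # (len(x_test), 1), rows ordered as in x_test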

Will switching GPU device affect the gradient in PyTorch back propagation?

I use PyTorch. In my computation, I move some data and operator A to the GPU. In an intermediate step, I move the data and operator B to the CPU and continue the forward pass.
My question is:
My operator B is so memory-consuming that it cannot be used on the GPU. Will computing some parts on the GPU and the others on the CPU affect the backpropagation?
PyTorch keeps track of the location of tensors. If you use PyTorch's native commands .cpu() or .to('cpu'), you should be okay.
See, e.g., this model parallel tutorial - the computation is split between two different GPU devices.
If your model fits into the GPU memory, you might let PyTorch do the parallel distribution for you within the DataParallel (one process multiple threads) or DistributedDataParallel (multiple processes multiple threads, single or multiple nodes) frameworks.
The code below checks whether you have more than one GPU device (torch.cuda.device_count() > 1) and, if so, sets the DataParallel mode with model = nn.DataParallel(model):
model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
model.to(device)
DataParallel replicates the same model to all GPUs, and each GPU consumes a different partition of the input data. It can significantly accelerate the training process, but it does not work for use cases where the model is too large to fit into a single GPU.
To solve this problem, you might resort to a model parallel approach, which splits a single model onto different GPUs, rather than replicating the entire model on each GPU.
(e.g. a model m contains 10 layers: when using DataParallel, each GPU will have a replica of each of these 10 layers, whereas when using model parallel on two GPUs, each GPU could host 5 layers)
An example where .to('cuda:0') indicates where a layer should be placed:
import torch
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10).to('cuda:0')
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:1'))
backward() then automatically takes location into consideration.
model = ToyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 5).to('cuda:1')
loss_fn(outputs, labels).backward()
optimizer.step()
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
This snippet suggests that the gradient is preserved when computation goes through different devices.
def change_device():
    import torch.nn as nn
    a = torch.rand((4, 32))
    m1 = nn.Linear(32, 32)
    cpu = m1(a)
    gpu = cpu.to(0)
    m2 = nn.Linear(32, 32).to(0)
    out = m2(gpu)
    loss = out.sum()
    loss.backward()
    print(m1.weight.grad)
    # works like magic
    """
    tensor([[ 0.7746,  1.0342,  0.8706,  ...,  1.0993,  0.7975,  0.3915],
            [-0.5369, -0.7169, -0.6034,  ..., -0.7619, -0.5527, -0.2713],
            [ 0.3607,  0.4815,  0.4053,  ...,  0.5118,  0.3713,  0.1823],
            ...,
            [ 1.1200,  1.4955,  1.2588,  ...,  1.5895,  1.1531,  0.5660],
            [-0.1582, -0.2112, -0.1778,  ..., -0.2245, -0.1629, -0.0799],
            [-0.4531, -0.6050, -0.5092,  ..., -0.6430, -0.4665, -0.2290]])
    """
Modifying this snippet shows that the gradient is preserved when a tensor moves from GPU to CPU as well.
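A minimal sketch of that modification (my addition; it assumes a CUDA device is available and mirrors the layer sizes above):
import torch
import torch.nn as nn

def change_device_reversed():
    a = torch.rand((4, 32), device=0)   # input starts on the GPU
    m1 = nn.Linear(32, 32).to(0)        # first layer on the GPU
    gpu_out = m1(a)
    cpu_in = gpu_out.cpu()              # move the intermediate tensor to the CPU
    m2 = nn.Linear(32, 32)              # second layer stays on the CPU
    out = m2(cpu_in)
    loss = out.sum()
    loss.backward()
    print(m1.weight.grad)               # gradients still reach the GPU layer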

Memory bottleneck with autoregressive transformer decoding

I am trying to train a transformer model for sequence modeling. Below is a standalone example:
import torch
import torch.nn as nn
criterion = nn.MSELoss()
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=12)
memory = torch.rand(10, 32, 512)
y = torch.rand(20, 32, 512)
start_token = torch.ones((1,32,512))
tgt_input = torch.cat((start_token,y[:-1,:]),axis=0)
optimizer = torch.optim.Adam(transformer_decoder.parameters())
###################Teacher forced
while(True):
    optimizer.zero_grad()
    out = transformer_decoder(tgt_input, memory, nn.Transformer.generate_square_subsequent_mask(20, 20))
    loss = criterion(out, y)
    print("loss: ", loss.item())
    loss.backward()
    optimizer.step()
With a 12-layer decoder, the model works fine on a personal machine with 8 GB of memory. The model is autoregressive and works with shifted targets. Since we provide the targets above, I refer to this setting as "teacher forced".
However, at the inference stage we will not have targets to feed as above, and one would need to condition on targets generated on the fly. This setting is as follows:
###################Non Teacher forced
while(True):
    optimizer.zero_grad()
    predictions = torch.ones((1, 32, 512))
    for i in range(1, 21):
        predictions = torch.cat((predictions, transformer_decoder(tgt_input[:i], memory, nn.Transformer.generate_square_subsequent_mask(i, i))[-1].unsqueeze(0)), axis=0)
        print("i: ", i, "predictions.shape: ", predictions.shape)
    loss = criterion(predictions[1:], y)
    print("loss: ", loss.item())
    loss.backward()
    optimizer.step()
I wish to train the model with a hybrid strategy that mixes training with and without teacher forcing. However, the non-teacher-forced strategy causes an out-of-memory exception and does not work. For final inference (testing), it usually works under torch.no_grad(), but not in training. Can anyone explain exactly why this causes a memory bottleneck?
This is caused by the unrolling of the computational graph. In the teacher-forced model, gradients are not propagated back past the ground-truth inputs. In the non-teacher-forced model they do backpropagate through the generated steps, so the computational graph accumulates across decoding steps (similar to an unrolled RNN) and all of it has to be kept in memory until backward().
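As a minimal illustration of this (my own sketch, not the asker's code): the same generation loop run under torch.no_grad() stores no graph at all, which is why it works for inference; during training, the graph of every decoding step (one full decoder forward per step) must be retained until loss.backward(), and that is what exhausts memory.
with torch.no_grad():                    # nothing is recorded for autograd
    predictions = torch.ones((1, 32, 512))
    for i in range(1, 21):
        mask = nn.Transformer.generate_square_subsequent_mask(i, i)
        step = transformer_decoder(predictions[:i], memory, mask)[-1].unsqueeze(0)
        predictions = torch.cat((predictions, step), axis=0)
# memory stays flat here, but predictions carries no graph, so no backward() is possible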

Why does accuracy of the CNN drop when I remove a filter and its associated weights?

The model architecture is Conv2D with 32 filters -> Flatten -> Dense -> Compile -> Fit
I deleted the last filter from the first layer, and the corresponding weights from the fully connected layer of this model, using
w, b = model.layers[0].get_weights()
w = np.delete(w, [31], -1)   # index 31 is the last of the 32 filters
b = np.delete(b, [31], 0)
w_2, b_2 = model.layers[2].get_weights()
w_2 = w_2[:20956, :]
I use 20956 because the output of the first layer is 26 x 26 x 31, i.e., the 2-D image dimensions multiplied by the number of remaining channels.
I create a new model called model_1 using:
# Input stays the same
model_1 = Sequential()
# New modified conv layer
model_1.add(Conv2D(31, kernel_size=(3, 3),
                   activation='relu',
                   input_shape=input_shape,
                   kernel_initializer='he_normal'))
model_1.add(Flatten())
model_1.add(Dense(10, activation='softmax'))
model_1.layers[0].set_weights([w, b])
model_1.layers[2].set_weights([w_2, b_2])
model_1.compile(loss="categorical_crossentropy",
                optimizer="Adam",
                metrics=['accuracy'])
I can confirm that the weights are the same by doing model_1.layers[0].get_weights()[0] == model.layers[0].get_weights()[0][:,:,:,:31] and model_1.layers[2].get_weights()[0] == model.layers[2].get_weights()[0][:20956,:], both of which return True.
When I do
score = model_1.evaluate(x_test_reshape, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
score = model.evaluate(x_test_reshape, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
The accuracy drops from 98% to 10%, any ideas why?
What you are essentially doing is removing a channel from the last convolutional layer. Intuitively it may sound like this is not a big deal and that the remaining 31 channels will still make the network perform well. In reality, all convolution channels interact with each other in the dense layer that follows; since this interaction is now missing one of the channels of information it was optimized on, its accuracy will drop.
Another way to think of this is to view your network as a sequence of steps that takes an image as input and outputs a label with 98% accuracy. Removing a fraction (1/32) of the calculations in this function will change the outcome, and will likely give worse results, since the function was optimized with those calculations still present. You are removing a part of the function that is apparently crucial to reaching the high accuracy.
You can test this by training your new model with 31 channels for a short time. Since the new model only needs to re-learn the function of the deleted channel, it should quickly reach the high performance again.
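A minimal sketch of that test (x_train_reshape and y_train are assumed to be the training counterparts of the arrays passed to evaluate above):
# briefly fine-tune the pruned model so the dense layer can adapt to the missing channel
model_1.fit(x_train_reshape, y_train, batch_size=128, epochs=2, verbose=1)

score = model_1.evaluate(x_test_reshape, y_test)
print('Test accuracy after brief fine-tuning:', score[1])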

LSTM with Keras for mini-batch training and online testing

I would like to implement an LSTM in Keras for streaming time-series prediction -- i.e., running online, getting one data point at a time. This is explained well here, but as one would assume, the training time for an online LSTM can be prohibitively slow. I would like to train my network on mini-batches, and test (run prediction) online. What is the best way to do this in Keras?
For example, a mini-batch could be a sequence of 1000 data values ([33, 34, 42, 33, 32, 33, 36, ... 24, 23]) that occur at consecutive time steps. To train the network I've specified an array X of shape (900, 100, 1), where there are 900 sequences of length 100, and an array y of shape (900, 1). E.g.,
X[0] = [[33], [34], [42], [33], ...]]
X[1] = [[34], [42], [33], [32], ...]]
...
X[899] = [..., [24]]
y[899] = [23]
So for each sequence X[i], there is a corresponding y[i] that represents the next value in the time-series -- what we want to predict.
In test I want to predict the next data values 1000 to 1999. I do this by feeding an array of shape (1, 100, 1) for each step from 1000 to 1999, where the model tries to predict the value at the next step.
Is this the recommended approach and setup for my problem? Enabling statefulness may be the way to go for a purely online implementation, but in Keras this requires a consistent batch_input_shape in training and testing, which would not work for my intent of training on mini-batches and then testing online. Or is there a way I can do this?
UPDATE: Trying to implement the network as @nemo recommended
I ran my own dataset on an example network from a blog post "Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras", and then tried implementing the prediction phase as a stateful network.
The model building and training is the same for both:
# Create and fit the LSTM network
numberOfEpochs = 10
look_back = 30
model = Sequential()
model.add(LSTM(4, input_dim=1, input_length=look_back))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, nb_epoch=numberOfEpochs, batch_size=1, verbose=2)
# trainX.shape = (6883, 30, 1)
# trainY.shape = (6883,)
# testX.shape = (3375, 30, 1)
# testY.shape = (3375,)
Batch prediction is done with:
trainPredict = model.predict(trainX, batch_size=batch_size)
testPredict = model.predict(testX, batch_size=batch_size)
To try a stateful prediction phase, I ran the same model setup and training as before, but then the following:
w = model.get_weights()
batch_size = 1
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
trainPredictions, testPredictions = [], []
for trainSample in trainX:
    trainPredictions.append(model.predict(trainSample.reshape((1, look_back, 1)), batch_size=batch_size))
trainPredict = numpy.concatenate(trainPredictions).ravel()
for testSample in testX:
    testPredictions.append(model.predict(testSample.reshape((1, look_back, 1)), batch_size=batch_size))
testPredict = numpy.concatenate(testPredictions).ravel()
To inspect the results, the plots below show the actual (normalized) data in blue, the predictions on the training set in green, and the predictions on the test set in red.
The first figure is from using batch prediction, and the second from stateful. Any ideas what I'm doing incorrectly?
If I understand you correctly, you are asking whether you can enable statefulness after training. This should be possible, yes. For example:
net = Dense(1)(SimpleRNN(stateful=False)(input))
model = Model(input=input, output=net)
model.fit(...)
w = model.get_weights()
net = Dense(1)(SimpleRNN(stateful=True)(input))
model = Model(input=input, output=net)
model.set_weights(w)
After that you can predict in a stateful way.
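Applied to the architecture from the question (LSTM(4) + Dense(1)), a minimal sketch of that transfer would be (my addition; new_window stands for a window arriving online, testX[0] is only used as a stand-in):
w = model.get_weights()                      # weights from the mini-batch-trained model above

stateful_model = Sequential()
stateful_model.add(LSTM(4, batch_input_shape=(1, look_back, 1), stateful=True))
stateful_model.add(Dense(1))
stateful_model.compile(loss='mean_squared_error', optimizer='adam')
stateful_model.set_weights(w)                # copy the learned weights into the stateful copy

# online prediction: feed one window at a time, one sample per batch
new_window = testX[0]
prediction = stateful_model.predict(new_window.reshape((1, look_back, 1)), batch_size=1)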
