Edit: apparently DGL is working on it already: https://github.com/dmlc/dgl/pull/3641
I have several types of embeddings and each one needs its own linear projection. I can solve the problem with a for loop of the form:
emb_out = dict()
for ntype in ntypes:
    emb_out[ntype] = self.lin_layer[ntype](emb[ntype])
But ideally, I wanted to do some sort of scatter operation to run it in parallel, something like:
pytorch_scatter(lin_layers, embeddings, layer_map, reduce='matmul'), where the layer map tells which embedding should go through which layer. If I have 2 types of linear layers and batch_size = 5, then layer_map would be something like [1,0,1,1,0].
Would it be possible to vectorize the for loop in an efficient way, like in pytorch_scatter? Please check the minimal examples below.
import torch
import random
import numpy as np

seed = 42
torch.manual_seed(seed)
random.seed(seed)

def matmul_single_embtype(lin_layers, embeddings, layer_map):
    #run a single linear layer over all embeddings, irrespective of type
    output_embeddings = torch.matmul(lin_layers[0], embeddings.T).T
    return output_embeddings

def matmul_for_loop(lin_layers, embeddings, layer_map):
    #let each embedding type have its own projection, looping over embedding types
    output_embeddings = dict()
    for emb_type in np.unique(layer_map):
        output_embeddings[emb_type] = torch.matmul(lin_layers[emb_type], embeddings[layer_map == emb_type].T).T
    return output_embeddings

def matmul_scatter(lin_layers, embeddings, layer_map):
    #parallelize the for loop by building a block-diagonal matrix of the linear layers
    #this is very inefficient: it copies the layer for each embedding instead of broadcasting
    mapped_lin_layers = [lin_layers[i] for i in layer_map]
    mapped_lin_layers = torch.block_diag(*mapped_lin_layers) #batch_size*inp_size x batch_size*output_size
    embeddings_stacked = embeddings.view(-1, 1) #stack all embeddings to multiply by the block-diagonal matrix
    output_embeddings = torch.matmul(mapped_lin_layers, embeddings_stacked).view(embeddings.shape)
    return output_embeddings
"""
GENERATE DATA
lin_layers:
List of matrices of size n_layer x inp_size x output_size
embeddings:
Matrix of size batch_size x inp_size
layer_map:
Vector os size batch_size stating which embedding should go thorugh each layer
"""
emb_size = 32
batch_size = 500
emb_types = 20
layer_map = np.array([random.choice(list(range(emb_types))) for i in range(batch_size)]) #np.array so boolean masks like layer_map == i work
lin_layers = [torch.arange(emb_size*emb_size, dtype=torch.float32).view(emb_size,emb_size) for i in range(emb_types)]
embeddings = torch.arange(batch_size*emb_size, dtype=torch.float32).view(batch_size,emb_size)
grouped_emb = {i: embeddings[layer_map == i] for i in np.unique(layer_map)} #separate embeddings by embedding type
#Run experiments
%timeit matmul_scatter(lin_layers, embeddings, layer_map)
%timeit matmul_for_loop(lin_layers, embeddings, layer_map)
%timeit matmul_single_embtype(lin_layers, embeddings, layer_map)
>>>>>133 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>>>1.64 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>>>>31.4 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
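For reference, one vectorized alternative (a sketch of my own, not an existing pytorch_scatter API) is to stack the per-type weights into a single tensor, gather one weight matrix per sample, and use a batched matmul. It still materializes one weight matrix per sample, but avoids the block-diagonal blow-up and runs in a single kernel:
def matmul_bmm(lin_layers, embeddings, layer_map):
    #stack the per-type weights: emb_types x emb_size x emb_size
    weights = torch.stack(lin_layers)
    #gather one weight matrix per sample: batch_size x emb_size x emb_size
    per_sample_weights = weights[torch.as_tensor(layer_map, dtype=torch.long)]
    #batched matmul, i.e. W_i @ emb_i for each sample: batch_size x emb_size
    return torch.bmm(per_sample_weights, embeddings.unsqueeze(-1)).squeeze(-1)

%timeit matmul_bmm(lin_layers, embeddings, layer_map)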
Related stackoverflow question: how to vectorize the scatter-matmul operation
Related issue in pytorch: https://github.com/pytorch/pytorch/issues/31942
I have fitted an LSTM that deals with inputs of different length:
model = Sequential()
model.add(LSTM(units=10, return_sequences=False, input_shape=(None, 5)))
model.add(Dense(units=1, activation='sigmoid'))
Having fitted the model, I want to test it on inputs of different size.
x_test.shape # = 100000
x_test[0].shape # = (1, 5)
x_test[1].shape # = (3, 5)
x_test[2].shape # = (8, 5)
Testing on single instances j is not a problem (model.predict(x_test[j])), but looping over all of them is really slow.
Is there a way of speeding up the computation? model.predict(x_test) does not work.
Thank you!
The most common way to speed up model inference is to run inference on a GPU instead of the CPU (I'm assuming you are not already doing that). You can set up GPU support by following the official guide here. Unless you are explicitly asking Keras to run inference on the CPU, your code should work as is, without any changes. To confirm that you are actually using the GPU, you can follow this article.
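For example (assuming the TensorFlow backend), a quick way to check whether a GPU is visible at all:
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # an empty list means inference will run on the CPU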
Hope the answer was helpful!
The best solution that I have found so far is to group together data windows with the same length. For my problem, that is enough to significantly speed up the computation.
I hope this trick helps other people.
import numpy as np

def predict_custom(model, x):
    """x should be a list of np.arrays with different numbers of rows but the same number of columns"""
    # dictionary with key = length of the window, value = indices of samples with that length
    dic = {}
    for i, xi in enumerate(x):
        if dic.get(xi.shape[0]):
            dic[xi.shape[0]].append(i)
        else:
            dic[xi.shape[0]] = [i]
    y_pred = np.full((len(x), 1), np.nan)
    # loop over the dictionary and predict together samples of the same length
    for key, indexes in dic.items():
        # select samples of the same length (conversion to np.array is used for subsetting "x" using "indexes")
        x_same_length = np.asarray(x, dtype=object)[indexes].tolist()
        # gather these samples in a 3D np.array
        x_3d = np.stack(x_same_length, axis=0)
        # use the dictionary values to insert the results in the corresponding rows of y_pred
        y_pred[indexes] = model.predict(x_3d)
    return y_pred
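A minimal usage sketch, assuming model is the fitted LSTM above and x_test is the list of variable-length windows from the question:
y_pred = predict_custom(model, x_test)
print(y_pred.shape)  # (100000, 1), rows in the same order as x_test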
If we set steps_per_epoch (when using ImageDataGenerator) higher than the total number of possible batches (total_samples / batch_size), will the model revisit the same data points from the start, or will it ignore them?
Ex:
Flattened image shape which will go to Dense layer: (2000*1)
batch size: 20
Total no of batches possible: 100 (2000/20)
steps per epoch: 1000 (set explicitly)
As far as I know, steps_per_epoch is independent of the 'real' epoch (which would be number_of_inputs / batch_size). Let's use an example similar to yours, with 2000 data points and a batch_size of 20 (which means 2000/20 = 100 steps for one 'real' epoch):
If you set steps_per_epoch = 1000: Keras asks for a loop of 1000 batches, which basically means 10 'real' epochs (i.e. 10 full traversals of the data).
If you set steps_per_epoch = 50: Keras asks for a loop of 50 batches, and the remaining 50 batches of one 'real' epoch are visited in the next loop.
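To make the setup concrete, here is a minimal sketch (the directory, image size, and model below are placeholder assumptions of mine, not from the question):
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# hypothetical binary classifier over 64x64 RGB images
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 3)),
    tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    'data/train',            # hypothetical directory with 2000 images
    target_size=(64, 64),
    batch_size=20,
    class_mode='binary')

# one full pass over 2000 samples is 2000/20 = 100 batches; the generator loops
# indefinitely, so steps_per_epoch=1000 makes each Keras epoch traverse the data ~10 times
model.fit(train_gen, steps_per_epoch=1000, epochs=1)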
I am using TensorFlow 2.0 and Python 3.8, and I want to use a learning rate scheduler for which I have a function. I have to train a neural network for 160 epochs, with the learning rate decreased by a factor of 10 at epochs 80 and 120, starting from an initial learning rate of 0.01.
def scheduler(epoch, current_learning_rate):
    if epoch == 79 or epoch == 119:
        return current_learning_rate / 10
    else:
        return min(current_learning_rate, 0.001)
How can I use this learning rate scheduler function with tf.GradientTape()? I know how to use it with model.fit() as a callback:
callback = tf.keras.callbacks.LearningRateScheduler(scheduler)
How do I use it in a custom training loop with tf.GradientTape()?
Thanks!
The learning rate for different epochs can be set using the lr attribute of the TensorFlow Keras optimizer. The lr attribute still exists because TensorFlow 2 keeps backward compatibility for Keras (for more details, refer to the source code here).
Below is a small snippet showing how the learning rate can be varied across epochs. self._train_step is similar to the train_step function defined here.
def set_learning_rate(epoch):
    if epoch > 180:
        optimizer.lr = 0.5e-6
    elif epoch > 160:
        optimizer.lr = 1e-6
    elif epoch > 120:
        optimizer.lr = 1e-5
    elif epoch > 3:
        optimizer.lr = 1e-4

def train(epochs, train_data, val_data):
    prev_val_loss = float('inf')
    for epoch in range(epochs):
        set_learning_rate(epoch)
        for images, labels in train_data:
            self._train_step(images, labels)
        for images, labels in val_data:
            self._test_step(images, labels)
Another alternative would be to use tf.keras.optimizers.schedules:
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [80 * num_steps, 120 * num_steps, 160 * num_steps, 180 * num_steps],
    [1e-3, 1e-4, 1e-5, 1e-6, 5e-6])
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)
Note that here one can't provide the epochs directly; instead the boundaries have to be given in steps, where the number of steps per epoch (num_steps above) is the number of training samples divided by the batch size.
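For example (with assumed numbers, not taken from the question):
num_train_samples = 50000                      # hypothetical
batch_size = 128                               # hypothetical
num_steps = num_train_samples // batch_size    # ~390 optimizer steps per epoch
# so the boundary 80 * num_steps in PiecewiseConstantDecay corresponds to the end of epoch 80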
A learning rate schedule needs a step value that cannot be specified when using GradientTape followed by optimizer.apply_gradients().
So you should not pass the schedule directly as the learning_rate of the optimizer.
Instead, you can first call the schedule to get the value for the current step and then update the learning rate value in the optimizer:
optim = tf.keras.optimizers.SGD()
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(1e-2, 1000, 0.9)
for step in range(1000):
    lr = lr_schedule(step)
    optim.learning_rate = lr
    with tf.GradientTape() as tape:
        loss = ...  # compute the quantity to differentiate
    grads = tape.gradient(loss, model.trainable_variables)
    optim.apply_gradients(zip(grads, model.trainable_variables))
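Putting this together with the scheduler function from the question, a minimal end-to-end sketch could look like the following (the model, data, and loss here are placeholders of my own, not from the question):
import tensorflow as tf

# placeholder model, data and loss, only so the sketch is self-contained
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(20,))])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 20]),
     tf.random.uniform([256], maxval=10, dtype=tf.int32))).batch(32)

for epoch in range(160):
    # call the question's scheduler(epoch, current_learning_rate) once per epoch
    optimizer.learning_rate = scheduler(epoch, float(optimizer.learning_rate))
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            logits = model(x_batch, training=True)
            loss = loss_fn(y_batch, logits)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))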
I am new to PyTorch and am trying to implement a feed-forward neural network to classify the MNIST data set. I have some problems when trying to use cross-validation. My data has the following shapes:
x_train:
torch.Size([45000, 784]) and
y_train: torch.Size([45000])
I tried to use KFold from sklearn.
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10)
Here is the first part of my train method where I'm dividing the data into folds:
for train_index, test_index in kfold.split(x_train, y_train):
    x_train_fold = x_train[train_index]
    x_test_fold = x_test[test_index]
    y_train_fold = y_train[train_index]
    y_test_fold = y_test[test_index]
    print(x_train_fold.shape)
    for epoch in range(epochs):
        ...
The indices for the y_train_fold variable are right; it's simply
[ 0 1 2 ... 4497 4498 4499], but that's not the case for x_train_fold, which is [ 4500 4501 4502 ... 44997 44998 44999]. The same goes for the test folds.
For the first iteration I want the variable x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784]).
Any tips on how to get this right?
I think you're confused!
Ignore the second dimension for a while. When you have 45000 points and use 10-fold cross-validation, what's the size of each fold? 45000/10, i.e. 4500.
That means each of your folds will contain 4500 data points, and one of those folds will be used for testing while the remaining ones are used for training, i.e.
For testing: one fold => 4500 data points => size: 4500
For training: remaining folds => 45000 - 4500 data points => size: 40500
Thus, for the first iteration, the first 4500 data points (corresponding to the indices) will be used for testing and the rest for training.
Given that your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), this is how your code should look:
for train_index, test_index in kfold.split(x_train, y_train):
    print(train_index, test_index)

    x_train_fold = x_train[train_index]
    y_train_fold = y_train[train_index]
    x_test_fold = x_train[test_index]
    y_test_fold = y_train[test_index]

    print(x_train_fold.shape, y_train_fold.shape)
    print(x_test_fold.shape, y_test_fold.shape)
    break
[ 4500 4501 4502 ... 44997 44998 44999] [ 0 1 2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])
So, when you say
I want the variable x_train_fold to be the first 4500 picture... shape torch.Size([4500, 784]).
you're wrong. That size corresponds to x_test_fold. In the first iteration, based on 10 folds, x_train_fold will have 40500 points, so its size is supposed to be torch.Size([40500, 784]).
I think I have it right now, but I feel the code is a bit messy, with 3 nested loops. Is there a simpler way to do it, or is this approach okay?
Here's my code for the training with cross validation:
def train(network, epochs, save_Model = False):
    total_acc = 0
    for fold, (train_index, test_index) in enumerate(kfold.split(x_train, y_train)):
        ### Dividing data into folds
        x_train_fold = x_train[train_index]
        x_test_fold = x_train[test_index]
        y_train_fold = y_train[train_index]
        y_test_fold = y_train[test_index]

        train = torch.utils.data.TensorDataset(x_train_fold, y_train_fold)
        test = torch.utils.data.TensorDataset(x_test_fold, y_test_fold)
        train_loader = torch.utils.data.DataLoader(train, batch_size = batch_size, shuffle = False)
        test_loader = torch.utils.data.DataLoader(test, batch_size = batch_size, shuffle = False)

        for epoch in range(epochs):
            print('\nEpoch {} / {} \nFold number {} / {}'.format(epoch + 1, epochs, fold + 1, kfold.get_n_splits()))
            correct = 0
            network.train()
            for batch_index, (x_batch, y_batch) in enumerate(train_loader):
                optimizer.zero_grad()
                out = network(x_batch)
                loss = loss_f(out, y_batch)
                loss.backward()
                optimizer.step()
                pred = torch.max(out.data, dim=1)[1]
                correct += (pred == y_batch).sum()
                if (batch_index + 1) % 32 == 0:
                    print('[{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy:{:.3f}%'.format(
                        (batch_index + 1)*len(x_batch), len(train_loader.dataset),
                        100.*batch_index / len(train_loader), loss.data, float(correct*100) / float(batch_size*(batch_index+1))))
        total_acc += float(correct*100) / float(batch_size*(batch_index+1))
    total_acc = (total_acc / kfold.get_n_splits())
    print('\n\nTotal accuracy cross validation: {:.3f}%'.format(total_acc))
You messed up the indices. Your splits are of the form
x_train = x[train_index]
x_test = x[test_index]
y_train = y[train_index]
y_test = y[test_index]
but inside the fold loop you are effectively doing
x_fold = x_train[train_index]
y_fold = y_train[test_index]
It should be:
x_fold = x_train[train_index]
y_fold = y_train[train_index]
Though all the above answers provide a good example of how to split the dataset, I am curious about the way to implement the K-fold cross-validation itself. K-fold aims to estimate the skill of a machine learning model on unseen data: a limited sample is used to estimate how the model is expected to perform in general when making predictions on data not used during training (see the concept and explanation on Wikipedia: https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Therefore, it is necessary to re-initialize the parameters of the to-be-trained model at the beginning of each fold, as sketched below. Otherwise, your model will have seen every sample in the dataset after K folds and there is no such thing as validation (all samples are training samples).
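A minimal sketch of that idea (the Net class, optimizer choice, and the train_one_epoch/evaluate helpers are placeholders of mine, not code from the posts above):
for fold, (train_index, test_index) in enumerate(kfold.split(x_train, y_train)):
    # re-create the model and optimizer at the start of every fold,
    # so weights never carry over from one fold to the next
    network = Net()                                              # hypothetical model class
    optimizer = torch.optim.Adam(network.parameters(), lr=1e-3)
    x_train_fold, y_train_fold = x_train[train_index], y_train[train_index]
    x_test_fold, y_test_fold = x_train[test_index], y_train[test_index]
    for epoch in range(epochs):
        train_one_epoch(network, optimizer, x_train_fold, y_train_fold)  # hypothetical helper
    evaluate(network, x_test_fold, y_test_fold)                          # hypothetical helper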
When I use torch.nn.DataParallel() for data-parallel computation, I find that the parallel model's inference time for the same total batch size does not decrease significantly compared with the serial model, as shown below.
model = SNModel()
criterion = nn.CrossEntropyLoss()
if parallel_enable:
    model = nn.DataParallel(model, device_ids=gpu_ids)  # gpu_ids = [0,1,2,3], 4 GPUs in total
model.to(args.device)
criterion.to(args.device)
Calculating the model inference time over 100 iterations:
torch.cuda.synchronize()
st = time.time()
outputs, loss = model(images, path, labels, criterion, gpu_nums)
torch.cuda.synchronize()
total_tm += time.time() - st
With single gpu:
100 iter model time 23s (around)
With 4 gpus:
100 iter model time 103s (around)
That is, the 4-GPU run is actually 103 - 23*4 = 11 s slower than the equivalent serial work, so there is essentially no speedup.
Is there anything wrong? I hope you can help me! I have been troubled by this for a long time, and I know there is something wrong that I haven't found. The same thing has happened with other models, so I have always ended up avoiding parallelism.
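As a side note, a more controlled timing setup might look like the following (a sketch of my own; the warm-up and the list of pre-loaded batches are assumptions, and the model call signature is taken from the snippet above):
import time
import torch

def time_inference(model, batches, warmup=10):
    # warm-up iterations so CUDA context creation and cuDNN autotuning are excluded
    for images, path, labels in batches[:warmup]:
        model(images, path, labels, criterion, gpu_nums)
    torch.cuda.synchronize()
    start = time.time()
    for images, path, labels in batches:
        model(images, path, labels, criterion, gpu_nums)
    torch.cuda.synchronize()
    return time.time() - start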