How can I shuffle the labels of a dataset? - pytorch

I have downloaded the MNIST dataset, using the following command:
train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)
I now need to run some experiments on this dataset (MNIST), but with the labels of the training set shuffled. How can I shuffle/reassign them randomly? I have tried the following:
train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            target_transform=lambda y: torch.randint(0, 10, (1,)).item(),
                            download=True)
But I have noticed that the lambda in target_transform is evaluated every time a label is fetched, so the labels keep getting re-randomized during the training process, i.e. they change at every epoch. This way, I won't reach 100% training accuracy, which is what I am aiming for. How can I shuffle these labels in a way that is completely random, while making sure they won't change during the training process?
Thank you!!

In case your goal is to create a random mapping of labels, you need to define the mapping before defining the target transform, so that the transform stays constant. Something like the following should do the trick:
import random

label_mapping = list(range(10))
random.shuffle(label_mapping)

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            target_transform=lambda y: label_mapping[y],
                            download=True)
In order to get a new shuffle each epoch you would want to redefine the label mapping, training dataset, and dataloader each epoch.
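A minimal sketch of that per-epoch variant (model, num_epochs, and train_one_epoch are placeholders for your own training loop, not part of the original code):

import random
from torch.utils.data import DataLoader

for epoch in range(num_epochs):
    # draw a fresh permutation of the 10 classes for this epoch
    label_mapping = list(range(10))
    random.shuffle(label_mapping)

    train_dataset = dsets.MNIST(root='./data',
                                train=True,
                                transform=transforms.ToTensor(),
                                # default argument pins the current mapping to this dataset
                                target_transform=lambda y, m=label_mapping: m[y],
                                download=True)
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    train_one_epoch(model, train_loader)   # hypothetical training routine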
Update: To instead generate a random label which is independent of the true label but consistent for a given index, you probably need to either do some very careful seeding or reimplement some functionality of the dataset class.
For example, the latter case might look something like this:
import random

class RandomMNIST(dsets.MNIST):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.targets = [random.randint(0, 9) for _ in range(len(self.data))]
train_dataset = RandomMNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)
or equivalently
import random

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)
train_dataset.targets = [random.randint(0, 9) for _ in range(len(train_dataset))]
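A quick sanity check (using the train_dataset just defined) that the randomized labels now stay fixed, unlike with the randint target_transform from the question:

first_pass = [train_dataset[i][1] for i in range(100)]
second_pass = [train_dataset[i][1] for i in range(100)]
assert first_pass == second_pass   # same random labels on every pass / epoch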

Related

Model overfits after first epoch

I'm trying to use Hugging Face's bert-base-uncased model to train on emoji prediction on tweets, and it seems that after the first epoch, the model immediately starts to overfit. I have tried the following:
Increasing the training data (I increased this from 1x to 10x with no effect)
Changing the learning rate (no differences there)
Using different models from Hugging Face (the results were the same again)
Changing the batch size (went from 32, 72, 128, 256, 512, 1024)
Creating a model from scratch, but I ran into issues and decided to post here first to see if I was missing anything obvious.
At this point, I'm concerned that the individual tweets don't give enough information for the model to make a good guess, but wouldn't it be random in that case, rather than overfitting?
Also, training time seems to be ~4.5 hours on Colab's free GPUs; is there any way to speed that up? I tried their TPU, but it doesn't seem to be recognized.
This is what the data looks like
And this is my code below:
import pandas as pd
import json
import re
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
import torch
from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback
from sklearn.metrics import accuracy_score,precision_score, recall_score, f1_score
import numpy as np
# opening up the data and removing all symbols
df = pd.read_json('/content/drive/MyDrive/computed_results.json.bz2')
df['text_no_emoji'] = df['text_no_emoji'].apply(lambda text: re.sub(r'[^\w\s]', '', text))
# loading the tokenizer and the model from huggingface
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5).to('cuda')
# test train split
train, test = train_test_split(df[['text_no_emoji', 'emoji_codes']].sample(frac=1), test_size=0.2)
# defining a dataset class that generates the encoder and labels on the fly to minimize memory usage
class Dataset(torch.utils.data.Dataset):
    def __init__(self, input, labels=None):
        self.input = input
        self.labels = labels

    def __getitem__(self, pos):
        encoded = tokenizer(self.input[pos], truncation=True, max_length=15, padding='max_length')
        label = self.labels[pos]
        ret = {key: torch.tensor(val) for key, val in encoded.items()}
        ret['labels'] = torch.tensor(label)
        return ret

    def __len__(self):
        return len(self.labels)
# training and validation datasets are defined here
train_dataset = Dataset(train['text_no_emoji'].tolist(), train['emoji_codes'].tolist())
val_dataset = Dataset(train['text_no_emoji'].tolist(), test['emoji_codes'].tolist())
# defining the training arguments
args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="epoch",
    logging_steps=10,
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
    num_train_epochs=5,
    save_steps=3000,
    seed=0,
    load_best_model_at_end=True,
    weight_decay=0.2,
)
# defining the model trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
# Training the model
trainer.train()
Results: After this, the training generally stops pretty quickly due to the early stopper.
The dataset can be found here (39 Mb compressed)
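Incidentally, EarlyStoppingCallback is imported above but never attached to the Trainer. If the early stopping mentioned in the results is intended, the wiring would normally look roughly like this (a sketch; compute_metrics, metric_for_best_model, save_strategy, and the patience value are assumptions, not part of the original code):

def compute_metrics(eval_pred):
    # gives early stopping a validation metric to monitor
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}

args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="epoch",
    save_strategy="epoch",               # must match evaluation_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    num_train_epochs=5,
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)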

Visualize the output of Vgg16 model by TSNE plot?

I need to visualize the output of a VGG16 model which classifies 14 different classes.
I loaded the trained model and replaced the classifier layer with an Identity() layer, but the output doesn't come out categorized.
Here is the snippet:
The number of samples here is 1000 images.
epoch = 800
PATH = 'vgg16_epoch{}.pth'.format(epoch)
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']

class Identity(nn.Module):
    def __init__(self):
        super(Identity, self).__init__()

    def forward(self, x):
        return x

model.classifier._modules['6'] = Identity()
model.eval()

logits_list = numpy.empty((0, 4096))
targets = []

with torch.no_grad():
    for step, (t_image, target, classess, image_path) in enumerate(test_loader):
        t_image = t_image.cuda()
        target = target.cuda()
        target = target.data.cpu().numpy()
        targets.append(target)

        logits = model(t_image)
        print(logits.shape)
        logits = logits.data.cpu().numpy()
        print(logits.shape)
        logits_list = numpy.append(logits_list, logits, axis=0)
        print(logits_list.shape)

tsne = TSNE(n_components=2, verbose=1, perplexity=10, n_iter=1000)
tsne_results = tsne.fit_transform(logits_list)

target_ids = range(len(targets))
plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=target_ids, cmap=plt.cm.get_cmap("jet", 14))
plt.colorbar(ticks=range(14))
plt.legend()
plt.show()
Here is what this script produced; I am not sure why I get all the colors for each cluster!
The VGG16 outputs over 25k features to the classifier. I believe that's too many for t-SNE. It's a good idea to include a new nn.Linear layer to reduce this number, so that t-SNE can work better. In addition, I'd recommend two different ways to get the features from the model:
The best way to get them, regardless of the model, is by using the register_forward_hook method. You may find a notebook here with an example.
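For reference, a minimal hook-based sketch (assuming the model and test_loader from the question; hooking classifier[4], the second ReLU of VGG16's classifier, to get the 4096-d penultimate activations is an assumption about the exact layer and may need adjusting for your checkpoint):

features = []

def hook_fn(module, inputs, output):
    # keep a CPU copy of the activations produced during the forward pass
    features.append(output.detach().cpu())

handle = model.classifier[4].register_forward_hook(hook_fn)

with torch.no_grad():
    for step, (t_image, target, classess, image_path) in enumerate(test_loader):
        model(t_image.cuda())

handle.remove()
logits_list = torch.cat(features).numpy()   # shape [num_samples, 4096], ready for t-SNE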
If you don't want to use hooks, I'd suggest this one. After loading your model, you may use the following class to extract the features:
class FeatNet(nn.Module):
    def __init__(self, vgg):
        super(FeatNet, self).__init__()
        self.features = nn.Sequential(*list(vgg.children())[:-1])

    def forward(self, img):
        return self.features(img)
Now, you just need to instantiate FeatNet with your VGG model and call it on an image batch to get the features.
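Usage would look roughly like this (a sketch; note that dropping the classifier leaves spatial feature maps, so they still need flattening to the 25088-d vector before PCA or t-SNE):

featnet = FeatNet(model).cuda().eval()

with torch.no_grad():
    feats = featnet(t_image.cuda())           # [N, 512, 7, 7] for 224x224 inputs
    feats = feats.view(feats.size(0), -1)     # [N, 25088], ready for PCA / t-SNE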
To include the feature reducer, as I suggested before, you need to retrain your model doing something like:
class FeatNet(nn.Module):
    def __init__(self, vgg):
        super(FeatNet, self).__init__()
        self.features = nn.Sequential(*list(vgg.children())[:-1])
        self.feat_reducer = nn.Sequential(
            nn.Linear(25088, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU()
        )
        self.classifier = nn.Linear(1024, 14)

    def forward(self, img):
        x = self.features(img)
        x = x.view(x.size(0), -1)   # flatten [N, 512, 7, 7] -> [N, 25088] before the linear layer
        x_r = self.feat_reducer(x)
        return self.classifier(x_r)
Then, you can run your model returning x_r, that is, the reduced features. As I said, 25k features are too many for t-SNE. Another way to reduce this number is to use PCA instead of nn.Linear: you feed the 25k features to PCA and then train t-SNE on the PCA output. I prefer using nn.Linear, but you need to test both to see which gives you the better result.
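A sketch of that PCA route with scikit-learn (assuming logits_list holds the flattened features and targets the per-batch label arrays collected in the loop above; the choice of 50 PCA components is an assumption):

import numpy
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

reduced = PCA(n_components=50).fit_transform(logits_list)        # e.g. 25088 -> 50 dims
tsne_results = TSNE(n_components=2, perplexity=10, n_iter=1000).fit_transform(reduced)

labels = numpy.concatenate(targets)                              # one class label per sample
plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=labels, cmap=plt.cm.get_cmap("jet", 14))
plt.colorbar(ticks=range(14))
plt.show()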

How to save best model in Keras based on AUC metric?

I would like to save the best model in Keras based on AUC, and I have this code:
def MyMetric(yTrue, yPred):
    auc = tf.metrics.auc(yTrue, yPred)
    return auc

best_model = [ModelCheckpoint(filepath='best_model.h5', monitor='MyMetric', save_best_only=True)]

train_history = model.fit([train_x], [train_y],
                          batch_size=batch_size, epochs=epochs, validation_split=0.05,
                          callbacks=best_model, verbose=2)
So my model runs, but I get this warning:
RuntimeWarning: Can save best model only with MyMetric available, skipping.
'skipping.' % (self.monitor), RuntimeWarning)
It would be great if anyone could tell me whether this is the right way to do it, and if not, what I should do.
You have to pass the Metric you want to monitor to model.compile.
https://keras.io/metrics/#custom-metrics
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=[MyMetric])
Also, tf.metrics.auc returns a tuple containing the tensor and update_op. Keras expects the custom metric function to return only a tensor.
def MyMetric(yTrue, yPred):
    import tensorflow as tf
    auc = tf.metrics.auc(yTrue, yPred)
    return auc[0]
After this step, you will get errors about uninitialized values. Please see these threads:
https://github.com/keras-team/keras/issues/3230
How to compute Receiving Operating Characteristic (ROC) and AUC in keras?
You can define a custom metric that calls tensorflow to compute AUROC in the following way:
def as_keras_metric(method):
    import functools
    from keras import backend as K
    import tensorflow as tf

    @functools.wraps(method)
    def wrapper(self, args, **kwargs):
        """ Wrapper for turning tensorflow metrics into keras metrics """
        value, update_op = method(self, args, **kwargs)
        K.get_session().run(tf.local_variables_initializer())
        with tf.control_dependencies([update_op]):
            value = tf.identity(value)
        return value
    return wrapper

@as_keras_metric
def AUROC(y_true, y_pred, curve='ROC'):
    return tf.metrics.auc(y_true, y_pred, curve=curve)
You then need to compile your model with this metric:
model.compile(loss=train_loss, optimizer='adam', metrics=['accuracy',AUROC])
Finally: Checkpoint the model in the following way:
model_checkpoint = keras.callbacks.ModelCheckpoint(path_to_save_model, monitor='val_AUROC',
                                                   verbose=0, save_best_only=True,
                                                   save_weights_only=False, mode='auto', period=1)
Be careful though: I believe the validation AUROC is calculated batch-wise and averaged, so it might cause problems with checkpointing. A good idea is to verify, after training finishes, that the AUROC of the trained model's predictions (computed with sklearn.metrics) matches what TensorFlow reported while training and checkpointing.
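A possible sanity check along those lines (a sketch; x_val and y_val stand for your held-out validation arrays and are not from the code above):

from sklearn.metrics import roc_auc_score

y_scores = model.predict(x_val)
print("sklearn AUROC:", roc_auc_score(y_val, y_scores))
# compare this against the val_AUROC Keras reported for the checkpointed epoch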
Assuming you use TensorBoard, you have a historical record, in the form of tfevents files, of all your metric calculations for all your epochs; in that case a tf.keras.callbacks.Callback is what you want.
I use tf.keras.callbacks.ModelCheckpoint with save_freq='epoch' to save the weights for each epoch as an h5 or tf file.
To avoid filling the hard drive with model files, write a new Callback, or extend the ModelCheckpoint class's on_epoch_end implementation:
def on_epoch_end(self, epoch, logs=None):
    super(DropWorseModels, self).on_epoch_end(epoch, logs)
    if epoch < self._keep_best:
        return

    model_files = frozenset(
        filter(lambda filename: path.splitext(filename)[1] == SAVE_FORMAT_WITH_SEP,
               listdir(self._model_dir)))
    if len(model_files) < self._keep_best:
        return

    tf_events_logs = tuple(islice(log_parser(tfevents=path.join(self._log_dir,
                                                                self._split),
                                             tag=self.monitor),
                                  0,
                                  self._keep_best))
    keep_models = frozenset(map(self._filename.format,
                                map(itemgetter(0), tf_events_logs)))
    if len(keep_models) < self._keep_best:
        return

    it_consumes(map(lambda filename: remove(path.join(self._model_dir, filename)),
                    model_files - keep_models))
Appendix (imports and utility function implementations):
from itertools import islice
from operator import itemgetter
from os import path, listdir, remove
from collections import deque

import tensorflow as tf
from tensorflow.core.util import event_pb2


def log_parser(tfevents, tag):
    values = []
    for record in tf.data.TFRecordDataset(tfevents):
        event = event_pb2.Event.FromString(tf.get_static_value(record))
        if event.HasField('summary'):
            value = event.summary.value.pop(0)
            if value.tag == tag:
                values.append(value.simple_value)
    return tuple(sorted(enumerate(values), key=itemgetter(1), reverse=True))


it_consumes = lambda it, n=None: deque(it, maxlen=0) if n is None \
    else next(islice(it, n, n), None)

SAVE_FORMAT = 'h5'
SAVE_FORMAT_WITH_SEP = '{}{}'.format(path.extsep, SAVE_FORMAT)
For completeness, the rest of the class:
class DropWorseModels(tf.keras.callbacks.Callback):
    """
    Designed around making `save_best_only` work for arbitrary metrics
    and thresholds between metrics
    """

    def __init__(self, model_dir, monitor, log_dir, keep_best=2, split='validation'):
        """
        Args:
            model_dir: directory to save weights. Files will have format
                       '{model_dir}/{epoch:04d}.h5'.
            split: dataset split to analyse, e.g., one of 'train', 'test', 'validation'
            monitor: quantity to monitor.
            log_dir: the path of the directory where to save the log files to be
                     parsed by TensorBoard.
            keep_best: number of models to keep, sorted by monitor value
        """
        super(DropWorseModels, self).__init__()
        self._model_dir = model_dir
        self._split = split
        self._filename = 'model-{:04d}' + SAVE_FORMAT_WITH_SEP
        self._log_dir = log_dir
        self._keep_best = keep_best
        self.monitor = monitor
This has the added advantage of being able to save and delete multiple model files in a single Callback. You can easily extend it with different thresholding support, e.g. to keep all model files with an AUC within a threshold, or with TP, FP, TN, FN within thresholds.
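Hooking it into training would then look roughly like this (a sketch; the directories, the monitored tag name, and the ModelCheckpoint filename pattern are assumptions and need to match your own setup):

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='logs'),                     # writes the tfevents files parsed above
    tf.keras.callbacks.ModelCheckpoint('models/model-{epoch:04d}.h5',
                                       save_freq='epoch'),              # one weights file per epoch
    DropWorseModels(model_dir='models', monitor='epoch_auc',
                    log_dir='logs', keep_best=2),                       # prune everything but the best 2
]

model.fit(x_train, y_train, validation_split=0.05, epochs=20, callbacks=callbacks)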

batching huge data in tensorflow

I am trying to perform binary classification using the code/tutorial from
https://github.com/eisenjulian/nlp_estimator_tutorial/blob/master/nlp_estimators.py
print("Loading data...")
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size)
print(len(y_train), "train sequences")
print(len(y_test), "test sequences")
print("Pad sequences (samples x time)")
x_train = sequence.pad_sequences(x_train_variable,
maxlen=sentence_size,
padding='post',
value=0)
x_test = sequence.pad_sequences(x_test_variable,
maxlen=sentence_size,
padding='post',
value=0)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()


def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()
def cnn_model_fn(features, labels, mode, params):
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['embedding_initializer'])

    training = mode == tf.estimator.ModeKeys.TRAIN
    dropout_emb = tf.layers.dropout(inputs=input_layer,
                                    rate=0.2,
                                    training=training)

    conv = tf.layers.conv1d(
        inputs=dropout_emb,
        filters=32,
        kernel_size=3,
        padding="same",
        activation=tf.nn.relu)

    # Global Max Pooling
    pool = tf.reduce_max(input_tensor=conv, axis=1)

    hidden = tf.layers.dense(inputs=pool, units=250, activation=tf.nn.relu)
    dropout_hidden = tf.layers.dropout(inputs=hidden,
                                       rate=0.2,
                                       training=training)
    logits = tf.layers.dense(inputs=dropout_hidden, units=1)

    # This will be None when predicting
    if labels is not None:
        labels = tf.reshape(labels, [-1, 1])

    optimizer = tf.train.AdamOptimizer()

    def _train_op_fn(loss):
        return optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())

    return head.create_estimator_spec(
        features=features,
        labels=labels,
        mode=mode,
        logits=logits,
        train_op_fn=_train_op_fn)


cnn_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn,
                                        model_dir=os.path.join(model_dir, 'cnn'),
                                        params=params)

train_and_evaluate(cnn_classifier)
The example here loads data from IMDB movie reviews. I have my own dataset in the form of text, which is approximately 2 GB. Now in this example, the line
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size)
tries to load the whole dataset into memory. If I try to do the same, I run out of memory. How can I restructure this logic to read data in batches from my disk?
You want to change the dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train)) line. There are lots of ways of creating a dataset - from_tensor_slices is the easiest, but won't work on its own if you can't load the entire dataset to memory.
The best way depends on how you have the data stored, or how you want to store it/manipulate it. The simplest in my opinion with very little down-side (unless running on multiple GPUs) is to have the original dataset just give indices to data, and write a normal numpy function for loading the ith example.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(epoch_size))

def tf_map_fn(i):
    def np_map_fn(i):
        return load_ith_example(i)

    inp1, inp2 = tf.py_func(np_map_fn, (i,), Tout=(tf.float32, tf.float32), stateful=False)
    # other preprocessing/data augmentation goes here.

    # unbatched sizes
    inp1.set_shape(shape1)
    inp2.set_shape(shape2)
    return inp1, inp2

dataset = dataset.repeat().shuffle(epoch_size).map(tf_map_fn, 8)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)  # start loading data as GPU trains on previous batch
inp1, inp2 = dataset.make_one_shot_iterator().get_next()
Here I assume your outputs are float32 tensors (Tout=...). set_shape calls aren't strictly necessary, but if you know the shape it'll do better error checks.
So long as your preprocessing doesn't take longer than your network to run, this should run just as fast as any other method on a single GPU machine.
The other obvious way is to convert your data to tfrecords, but that'll take up more space on disk and is more of a pain to manage if you ask me.
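For reference, the TFRecord route would look roughly like this (a sketch using the TF1-style API that matches the code above; the filename and the feature names 'x' and 'y' are assumptions):

# writing: one serialized Example per (padded text, label) pair
writer = tf.python_io.TFRecordWriter('train.tfrecords')
for x, y in zip(x_train, y_train):
    example = tf.train.Example(features=tf.train.Features(feature={
        'x': tf.train.Feature(int64_list=tf.train.Int64List(value=x.tolist())),
        'y': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(y)])),
    }))
    writer.write(example.SerializeToString())
writer.close()

# reading: stream records from disk instead of holding everything in memory
def parse_fn(record):
    parsed = tf.parse_single_example(record, {
        'x': tf.FixedLenFeature([sentence_size], tf.int64),
        'y': tf.FixedLenFeature([1], tf.int64),
    })
    return parsed['x'], parsed['y']

dataset = tf.data.TFRecordDataset('train.tfrecords').map(parse_fn).shuffle(10000).batch(100)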

How to use pytorch DataLoader with a 3-D matrix for LSTM input?

I have a dataset stored as a 3-D (time_step × input_size × total_num) matrix in a .mat file. I want to use DataLoader to get an input dataset for an LSTM with batch_size 5. My code is as follows:
file_path = "…/database/frameLength100/notOverlap/a.mat"
mat_data = s.loadmat(file_path)
tensor_data = torch.from_numpy(mat_data['a'])  # Tensor
class CustomDataset(Dataset):
    def __init__(self, tensor_data):
        self.tensor_data = tensor_data

    def __getitem__(self, index):
        data = self.tensor_data[index]
        label = 1
        return data, label

    def __len__(self):
        return len(self.tensor_data)

custom_dataset = CustomDataset(tensor_data=tensor_data)
train_loader = DataLoader(dataset=custom_dataset, batch_size=5, shuffle=True)
I think the code is wrong, but I have no idea how to correct it. What confuses me is how I can make the DataLoader know which dimension is 'total_num', so that I get batches of size 5.
If I understand correctly, you want the batching to happen along the total_num dimension, i.e. dimension 2.
You could simply use that dimension to index your dataset, i.e. change __getitem__ to data = self.tensor_data[:, :, index], and accordingly, in __len__, return self.tensor_data.size(2) instead of len(self.tensor_data). Each sample will then have size [time_step, input_size], and the DataLoader stacks 5 of them along a new first dimension, so each batch has size [5, time_step, input_size].
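Concretely, a sketch of the adjusted dataset based on the class from the question (tensor_data is the tensor loaded from the .mat file above):

class CustomDataset(Dataset):
    def __init__(self, tensor_data):
        self.tensor_data = tensor_data

    def __getitem__(self, index):
        data = self.tensor_data[:, :, index]   # one sample: [time_step, input_size]
        label = 1
        return data, label

    def __len__(self):
        return self.tensor_data.size(2)        # number of samples = total_num

train_loader = DataLoader(dataset=CustomDataset(tensor_data), batch_size=5, shuffle=True)
# each batch of data then has shape [5, time_step, input_size]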

Resources