Run hyperparameter optimization on parallel GPUs using TensorFlow (multithreading)

I have a training function that trains a tf model end-to-end here (contrived for illustration only):
def opt_fx(params, gpu):
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu
    sess = tf.Session()
    # Run some training on a particular gpu...
    sess.run(...)
I want to run hyperparameter optimization across 20 trials using a model per gpu:
from threading import Thread
exp_trials = list(hyperparams.trials(num=20))
train_threads = []
for gpu_num, trial_params in zip(['0', '1', '2', '3'] * 5, exp_trials):
    t = Thread(target=opt_fx, args=(trial_params, gpu_num,))
    train_threads.append(t)
# Start the threads, and block on their completion.
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
This fails however... what's the right way to do this?

I'm not sure if this is the best approach, but what I ended up doing is defining a graph per device and training each one in a separate session. This can be parallelized. I tried to reuse the graph on separate devices, but that didn't work. Here's what my version looks like in code (a complete example):
import threading
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# Get the data
mnist = input_data.read_data_sets("data/mnist", one_hot=True)
train_x_all = mnist.train.images
train_y_all = mnist.train.labels
test_x = mnist.test.images
test_y = mnist.test.labels
# Define the graphs per device
devices = ['/gpu', '/cpu'] # just one GPU on this machine...
learning_rates = [0.01, 0.03]
jobs = []
for device, learning_rate in zip(devices, learning_rates):
    with tf.Graph().as_default() as graph:
        x = tf.placeholder(tf.float32, [None, 784], name='x')
        y = tf.placeholder(tf.float32, [None, 10], name='y')
        W = tf.Variable(tf.zeros([784, 10]))
        b = tf.Variable(tf.zeros([10]))
        pred = tf.nn.softmax(tf.matmul(x, W) + b)
        accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1)), tf.float32))
        cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=1), name='cost')
        optimize = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost, name='optimize')
    jobs.append(graph)
# Train a graph on a device
def train(device, graph):
    print("Start training on %s" % device)
    with tf.Session(graph=graph) as session:
        x = graph.get_tensor_by_name('x:0')
        y = graph.get_tensor_by_name('y:0')
        cost = graph.get_tensor_by_name('cost:0')
        optimize = graph.get_operation_by_name('optimize')
        session.run(tf.global_variables_initializer())
        batch_size = 500
        for epoch in range(5):
            total_batch = int(train_x_all.shape[0] / batch_size)
            for i in range(total_batch):
                batch_x = train_x_all[i * batch_size:(i + 1) * batch_size]
                batch_y = train_y_all[i * batch_size:(i + 1) * batch_size]
                _, c = session.run([optimize, cost], feed_dict={x: batch_x, y: batch_y})
                if i % 20 == 0:
                    print("Device %s: epoch #%d step=%d cost=%f" % (device, epoch, i, c))
# Start threads in parallel
train_threads = []
for i, graph in enumerate(jobs):
    train_threads.append(threading.Thread(target=train, args=(devices[i], graph)))
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
Note that the train function operates on the tensors and operations of the graph passed in as an argument, i.e. each cost and optimize is different.
This produces the following output, which shows that two models are trained in parallel:
Start training on /gpu
Start training on /cpu
Device /cpu: epoch #0 step=0 cost=2.302585
Device /cpu: epoch #0 step=20 cost=1.788247
Device /cpu: epoch #0 step=40 cost=1.400490
Device /cpu: epoch #0 step=60 cost=1.271820
Device /gpu: epoch #0 step=0 cost=2.302585
Device /cpu: epoch #0 step=80 cost=1.128214
Device /gpu: epoch #0 step=20 cost=2.105802
Device /cpu: epoch #0 step=100 cost=0.927004
Device /cpu: epoch #1 step=0 cost=0.905336
Device /gpu: epoch #0 step=40 cost=1.908744
Device /cpu: epoch #1 step=20 cost=0.865687
Device /gpu: epoch #0 step=60 cost=1.808407
Device /cpu: epoch #1 step=40 cost=0.754765
Device /gpu: epoch #0 step=80 cost=1.676024
Device /cpu: epoch #1 step=60 cost=0.794201
Device /gpu: epoch #0 step=100 cost=1.513800
Device /gpu: epoch #1 step=0 cost=1.451422
Device /cpu: epoch #1 step=80 cost=0.786958
Device /gpu: epoch #1 step=20 cost=1.415125
Device /cpu: epoch #1 step=100 cost=0.643715
Device /cpu: epoch #2 step=0 cost=0.674683
Device /gpu: epoch #1 step=40 cost=1.273473
Device /cpu: epoch #2 step=20 cost=0.658424
Device /gpu: epoch #1 step=60 cost=1.300150
Device /cpu: epoch #2 step=40 cost=0.593681
Device /gpu: epoch #1 step=80 cost=1.242193
Device /cpu: epoch #2 step=60 cost=0.640543
Device /gpu: epoch #1 step=100 cost=1.105950
Device /gpu: epoch #2 step=0 cost=1.089900
Device /cpu: epoch #2 step=80 cost=0.664947
Device /gpu: epoch #2 step=20 cost=1.088389
Device /cpu: epoch #2 step=100 cost=0.535446
Device /cpu: epoch #3 step=0 cost=0.580295
Device /gpu: epoch #2 step=40 cost=0.983053
Device /cpu: epoch #3 step=20 cost=0.566510
Device /gpu: epoch #2 step=60 cost=1.044966
Device /cpu: epoch #3 step=40 cost=0.518787
Device /gpu: epoch #2 step=80 cost=1.025607
Device /cpu: epoch #3 step=60 cost=0.562461
Device /gpu: epoch #2 step=100 cost=0.897545
Device /gpu: epoch #3 step=0 cost=0.907381
Device /cpu: epoch #3 step=80 cost=0.600475
Device /gpu: epoch #3 step=20 cost=0.911914
Device /cpu: epoch #3 step=100 cost=0.477412
Device /cpu: epoch #4 step=0 cost=0.527233
Device /gpu: epoch #3 step=40 cost=0.827964
Device /cpu: epoch #4 step=20 cost=0.513356
Device /gpu: epoch #3 step=60 cost=0.897128
Device /cpu: epoch #4 step=40 cost=0.474257
Device /gpu: epoch #3 step=80 cost=0.898960
Device /cpu: epoch #4 step=60 cost=0.514083
Device /gpu: epoch #3 step=100 cost=0.774140
Device /gpu: epoch #4 step=0 cost=0.799004
Device /cpu: epoch #4 step=80 cost=0.559898
Device /gpu: epoch #4 step=20 cost=0.802869
Device /cpu: epoch #4 step=100 cost=0.440813
Device /gpu: epoch #4 step=40 cost=0.732562
Device /gpu: epoch #4 step=60 cost=0.801020
Device /gpu: epoch #4 step=80 cost=0.815830
Device /gpu: epoch #4 step=100 cost=0.692840
You can try it yourself with standard MNIST data.
It's not ideal if there are many hyperparameters to tune, but you should be able to make an outer loop that iterates over possible hyperparameter tuples, assigns a particular graph to a device and runs them as shown above.
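For instance, a rough sketch of that outer loop might look like the following, where build_graph(device, params) is a hypothetical helper that wraps the per-device graph construction shown above (and would place its ops via tf.device(device)):
import threading

devices = ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
trials = [{'learning_rate': lr} for lr in (0.001, 0.003, 0.01, 0.03)]  # example hyperparameter tuples
jobs = []
for i, params in enumerate(trials):
    device = devices[i % len(devices)]              # round-robin assignment of trials to devices
    jobs.append((device, build_graph(device, params)))  # build_graph is assumed, not defined above
train_threads = [threading.Thread(target=train, args=(device, graph)) for device, graph in jobs]
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()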

For a problem like this I would typically use the multiprocessing library instead of threading: compared to the cost of training a network, the overhead of multiprocessing is small, and it avoids any GIL problems. I think that is the primary problem with your code: you are setting the "CUDA_VISIBLE_DEVICES" environment variable for each thread, but all the threads share the same environment because they live in the same process.
So what I normally do in TensorFlow 2.1 is pass a GPU id number to the worker process, which can then run the following code to set the visible GPU:
gpus = tf.config.experimental.list_physical_devices('GPU')
my_gpu = gpus[gpu_id]
tf.config.set_visible_devices(my_gpu, 'GPU')
TensorFlow in that process will now only run on that one GPU.
Sometimes the network you are training is small enough that you can actually run several at once on one GPU. To make sure several fit into GPU memory, you can set a memory limit for each worker you start:
tf.config.set_logical_device_configuration(
    my_gpu,
    [tf.config.LogicalDeviceConfiguration(memory_limit=6000)]
)
But if you set a memory limit, keep in mind that TensorFlow allocates some extra memory outside of that limit (for cuDNN and the like), so you need to leave a bit of buffer for each session you run. Usually I just do trial and error to see what fits, so I'm afraid I don't have better numbers.
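As a minimal sketch of that multiprocessing pattern (assuming a training routine like opt_fx from the question that builds and trains one model for a given set of hyperparameters, and the same hyperparams object):
import multiprocessing as mp

def worker(gpu_id, params):
    # Import TensorFlow inside the worker so the parent process never touches the GPU.
    import tensorflow as tf
    gpus = tf.config.experimental.list_physical_devices('GPU')
    tf.config.set_visible_devices(gpus[gpu_id], 'GPU')
    # ... build and train the model for `params` here, as opt_fx does in the question ...

if __name__ == '__main__':
    trials = list(hyperparams.trials(num=20))  # hyperparams as in the question
    num_gpus = 4
    for start in range(0, len(trials), num_gpus):
        batch = trials[start:start + num_gpus]
        procs = [mp.Process(target=worker, args=(gpu_id, params))
                 for gpu_id, params in enumerate(batch)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()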

Related

Tensorflow running out of GPU memory between two models

TensorFlow is running out of GPU memory between running two models; the batch size doesn't seem to make a difference. I've tried the clear_session() call shown in my example code below, as well as del model, gc.collect(), and tf.config.experimental.set_memory_growth(gpu, True).
A single model runs fine; it's only after one or two runs that the memory runs out.
import tensorflow as tf
import numpy as np
X_train = np.random.rand(1000000, 768)
X_test = np.random.rand(100, 768)
y_train = np.random.randint(0, 4, size=1000000)
y_test = np.random.randint(0, 4, size=100)
print(X_train.shape, y_train.shape)
y_train = tf.keras.utils.to_categorical(y_train, num_classes=5, dtype='float32')
y_test = tf.keras.utils.to_categorical(y_test, num_classes=5, dtype='float32')
for i in range(10):
    tf.keras.backend.clear_session()
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(768,), name='input'),
        tf.keras.layers.Dense(768, activation='gelu', name='dense1'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(96, activation='gelu', name='dense2'),
        tf.keras.layers.Dropout(0.05),
        tf.keras.layers.Dense(5, activation='softmax', name='output')
    ])
    opt = tf.keras.optimizers.Adam(learning_rate=0.0005)  # 0.0005 was best so far.
    es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)
    model.compile(optimizer=opt,
                  loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])
    model.fit(X_train, y_train, epochs=1, validation_split=0.1, callbacks=[es], batch_size=0)
    _, accuracy = model.evaluate(X_test, y_test, verbose=2)
Here's the beginning of the output:
2022-09-01 18:26:08.068402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6010 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2060 SUPER, pci bus id: 0000:2b:00.0, compute capability: 7.5
2022-09-01 18:26:09.076943: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 2764800000 exceeds 10% of free system memory.
28125/28125 [==============================] - 55s 2ms/step - loss: 1.3878 - categorical_accuracy: 0.2506 - val_loss: 1.3863 - val_categorical_accuracy: 0.2521
4/4 - 0s - loss: 1.3873 - categorical_accuracy: 0.2400 - 84ms/epoch - 21ms/step
2022-09-01 18:27:05.890074: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 2764800000 exceeds 10% of free system memory.
28106/28125 [============================>.] - ETA: 0s - loss: 1.3874 - categorical_accuracy: 0.24992022-09-01 18:28:07.971774: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.91MiB (rounded to 2000128)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Thanks for any help.
Edit: this is using TensorFlow 2.9.1 on Windows 10.
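For reference, the cleanup steps mentioned above are usually combined roughly like this (a sketch of what was tried, with build_and_train() standing in for the Sequential/fit code in the loop above; this is not a verified fix for this OOM):
import gc
import tensorflow as tf

# Memory growth has to be configured before the first model touches the GPU.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

for i in range(10):
    model = build_and_train()              # hypothetical helper wrapping the model/fit code above
    _, accuracy = model.evaluate(X_test, y_test, verbose=2)
    del model                              # drop the Python reference to the model
    tf.keras.backend.clear_session()       # reset Keras' global graph/session state
    gc.collect()                           # nudge Python to release host-side memory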

Nearly Constant training and validation accuracy

I'm new to PyTorch and my problem may be a little naive.
I'm training a pretrained VGG16 network on my dataset, which is about 33,000 images in 8 classes with labels [1, 2, ..., 8], and my classes are imbalanced. My problem is that during training, the validation and training accuracy are low and don't increase. Is there any problem in my code?
If not, what do you suggest to improve training?
import torch
import time
import torch.nn as nn
import numpy as np
from sklearn.model_selection import train_test_split
from torch.optim import Adam
import cv2
import torchvision.models as models
from classify_dataset import Classification_dataset
from torchvision import transforms
transform = transforms.Compose([transforms.Resize((224, 224)),
                                transforms.RandomHorizontalFlip(p=0.5),
                                transforms.RandomVerticalFlip(p=0.5),
                                transforms.RandomRotation(degrees=45),
                                transforms.ToTensor(),
                                transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
                                ])
dataset = Classification_dataset(root_dir=r'//home/arisa/Desktop/Hamid/IQA/Hamid_Dataset',
                                 csv_file=r'/home/arisa/Desktop/Hamid/IQA/new_label.csv', transform=transform)
target = dataset.labels - 1
train_indices, test_indices = train_test_split(np.arange(target.shape[0]), stratify=target)
test_dataset = torch.utils.data.Subset(dataset, indices=test_indices)
train_dataset = torch.utils.data.Subset(dataset, indices=train_indices)
class_sample_count = np.array([len(np.where(target[train_indices] == t)[0]) for t in np.unique(target)])
weight = 1. / class_sample_count
samples_weight = np.array([weight[t] for t in target[train_indices]])
samples_weight = torch.from_numpy(samples_weight)
samples_weight = samples_weight.double()
sampler = torch.utils.data.WeightedRandomSampler(samples_weight, len(samples_weight), replacement = True)
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=64,
                                           sampler=sampler)
test_loader = torch.utils.data.DataLoader(test_dataset,
                                          batch_size=64,
                                          shuffle=False)
model = models.vgg16(pretrained=True)  # assumed from the description above (pretrained VGG16); not shown in the original snippet
for param in model.parameters():
    param.requires_grad = False
num_ftrs = model.classifier[0].in_features
model.classifier = nn.Linear(num_ftrs,8)
optimizer = Adam(model.parameters(), lr = 0.0001 )
criterion = nn.CrossEntropyLoss()
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.01)
path = '/home/arisa/Desktop/Hamid/IQA/'
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)
def train_model(model, train_loader, valid_loader, optimizer, criterion, scheduler=None, num_epochs=10):
    min_valid_loss = np.inf
    model.train()
    start = time.time()
    TrainLoss = []
    model = model.to(device)
    for epoch in range(num_epochs):
        total = 0
        correct = 0
        train_loss = 0
        #lr_scheduler.step()
        print('Epoch {}/{}'.format(epoch + 1, num_epochs))
        print('-' * 10)
        train_loss = 0.0
        for x, y in train_loader:
            x = x.to(device)
            #print(y.shape)
            y = y.view(y.shape[0],).to(device)
            y = y.to(device)
            y -= 1
            out = model(x)
            loss = criterion(out, y)
            optimizer.zero_grad()
            loss.backward()
            TrainLoss.append(loss.item() * y.shape[0])
            train_loss += loss.item() * y.shape[0]
            _, predicted = torch.max(out.data, 1)
            total += y.size(0)
            correct += (predicted == y).sum().item()
            optimizer.step()
        lr_scheduler.step()
        accuracy = 100 * correct / total
        valid_loss = 0.0
        val_loss = []
        model.eval()
        val_correct = 0
        val_total = 0
        with torch.no_grad():
            for x_val, y_val in test_loader:
                x_val = x_val.to(device)
                y_val = y_val.view(y_val.shape[0],).to(device)
                y_val -= 1
                target = model(x_val)
                loss = criterion(target, y_val)
                valid_loss += loss.item() * y_val.shape[0]
                _, predicted = torch.max(target.data, 1)
                val_total += y_val.size(0)
                val_correct += (predicted == y_val).sum().item()
                val_loss.append(loss.item() * y_val.shape[0])
        val_acc = 100 * val_correct / val_total
        print(f'Epoch {epoch + 1} \t\t Training Loss: {train_loss / len(train_loader)} \t\t Validation Loss: {valid_loss / len(test_loader)} \t\t Train Acc:{accuracy} \t\t Validation Acc:{val_acc}')
        if min_valid_loss > (valid_loss / len(test_loader)):
            print(f'Validation Loss Decreased({min_valid_loss:.6f}--->{valid_loss / len(test_loader):.6f}) \t Saving The Model')
            min_valid_loss = valid_loss / len(test_loader)
            state = {'state_dict': model.state_dict(), 'optimizer': optimizer.state_dict(),}
            torch.save(state, '/home/arisa/Desktop/Hamid/IQA/checkpoint.t7')
    end = time.time()
    print('TRAIN TIME:')
    print('%.2gs' % (end - start))
train_model(model=model, train_loader=train_loader, optimizer=optimizer, criterion=criterion, valid_loader= test_loader,num_epochs=500 )
Thanks in advance.
Here is the result of 15 epochs:
Epoch 1/500
----------
Epoch 1 Training Loss: 205.63448420514916 Validation Loss: 233.89266112356475 Train Acc:39.36360386127994 Validation Acc:24.142040038131555
Epoch 2/500
----------
Epoch 2 Training Loss: 199.05699240435197 Validation Loss: 235.08799531243065 Train Acc:41.90998291820601 Validation Acc:24.27311725452812
Epoch 3/500
----------
Epoch 3 Training Loss: 199.15626737127448 Validation Loss: 236.00033430619672 Train Acc:41.1035633416756 Validation Acc:23.677311725452814
Epoch 4/500
----------
Epoch 4 Training Loss: 199.02581041173886 Validation Loss: 233.60767459869385 Train Acc:41.86628530568466 Validation Acc:24.606768350810295
Epoch 5/500
----------
Epoch 5 Training Loss: 198.61493769454472 Validation Loss: 233.7503859202067 Train Acc:41.53656695665991 Validation Acc:25.0
Epoch 6/500
----------
Epoch 6 Training Loss: 198.71323942956585 Validation Loss: 234.17176149830675 Train Acc:41.639852222619474 Validation Acc:25.369399428026693
Epoch 7/500
----------
Epoch 7 Training Loss: 199.9395153770592 Validation Loss: 234.1744423635078 Train Acc:40.98041552456998 Validation Acc:24.84509056244042
Epoch 8/500
----------
Epoch 8 Training Loss: 199.3533399020355 Validation Loss: 235.4645173188412 Train Acc:41.26643626107337 Validation Acc:24.165872259294567
Epoch 9/500
----------
Epoch 9 Training Loss: 199.6451746921249 Validation Loss: 233.33387595956975 Train Acc:40.96452548365312 Validation Acc:24.59485224022879
Epoch 10/500
----------
Epoch 10 Training Loss: 197.9305159737011 Validation Loss: 233.76405122063377 Train Acc:41.8782028363723 Validation Acc:24.6186844613918
Epoch 11/500
----------
Epoch 11 Training Loss: 199.33247244055502 Validation Loss: 234.41085289463854 Train Acc:41.59218209986891 Validation Acc:25.119161105815063
Epoch 12/500
----------
Epoch 12 Training Loss: 199.87399289874256 Validation Loss: 234.23621463775635 Train Acc:41.028085647320545 Validation Acc:24.49952335557674
Epoch 13/500
----------
Epoch 13 Training Loss: 198.85540591944292 Validation Loss: 234.33149099349976 Train Acc:41.206848607635166 Validation Acc:24.857006673021925
Epoch 14/500
----------
Epoch 14 Training Loss: 199.92641723337513 Validation Loss: 233.37722391070741 Train Acc:41.15520597465539 Validation Acc:24.988083889418494
Epoch 15/500
----------
Epoch 15 Training Loss: 197.82172771698328 Validation Loss: 234.4943131533536 Train Acc:41.69943987605768 Validation Acc:24.380362249761678
You froze your model with
for param in model.parameters():
    param.requires_grad = False
which basically says "do not calculate any gradient for any weight", which is equivalent to not updating the weights - hence no optimization.
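A quick way to check which parameters will actually receive gradient updates with that setup (a diagnostic sketch, not part of the original code):
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # only parameters created after the freeze, e.g. the replaced classifier, should appear
print(sum(not p.requires_grad for p in model.parameters()), 'frozen parameter tensors')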
My problem was in model.train(): that call should be inside the training loop, but in my case I put it outside the loop, so once model.eval() was reached the model stayed in evaluation mode.
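In other words, the mode switches belong inside the epoch loop, roughly like this (a sketch using the names from the code above):
for epoch in range(num_epochs):
    model.train()                      # back to training mode at the start of every epoch
    for x, y in train_loader:
        ...                            # training step as above
    model.eval()                       # evaluation mode only for the validation pass
    with torch.no_grad():
        for x_val, y_val in test_loader:
            ...                        # validation step as above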

How to extract cell state of LSTM model through model.fit()?

My LSTM model is like this, and I would like to get state_c
def _get_model(input_shape, latent_dim, num_classes):
    inputs = Input(shape=input_shape)
    lstm_lyr, state_h, state_c = LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = Dense(num_classes)(lstm_lyr)
    soft_lyr = Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_c])
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model

model = _get_model((n_steps_in, n_features), latent_dim, n_steps_out)
history = model.fit(X_train, Y_train)
But I cannot extract state_c from the history. How can I return it?
I am unsure what you mean by "how to get state_c", because your LSTM layer already returns state_c thanks to the flag return_state=True. I assume you are trying to train a multi-output model here. Currently you only provide a single target, but your model is compiled with multiple outputs.
Here is how you work with multi-output models.
import numpy as np
from tensorflow.keras import layers, Model, utils

def _get_model(input_shape, latent_dim, num_classes):
    inputs = layers.Input(shape=input_shape)
    lstm_lyr, state_h, state_c = layers.LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = layers.Dense(num_classes)(lstm_lyr)
    soft_lyr = layers.Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_c])  # <------- One input, 2 outputs
    model.compile(optimizer='adam', loss='mse')
    return model

# Dummy data
X = np.random.random((100, 15, 5))
y1 = np.random.random((100, 4))
y2 = np.random.random((100, 7))

model = _get_model((15, 5), 7, 4)
model.fit(X, [y1, y2], epochs=4)  # <--------- One input, 2 outputs
Epoch 1/4
4/4 [==============================] - 2s 6ms/step - loss: 0.6978 - activation_9_loss: 0.2388 - lstm_9_loss: 0.4591
Epoch 2/4
4/4 [==============================] - 0s 6ms/step - loss: 0.6615 - activation_9_loss: 0.2367 - lstm_9_loss: 0.4248
Epoch 3/4
4/4 [==============================] - 0s 7ms/step - loss: 0.6349 - activation_9_loss: 0.2392 - lstm_9_loss: 0.3957
Epoch 4/4
4/4 [==============================] - 0s 8ms/step - loss: 0.6053 - activation_9_loss: 0.2392 - lstm_9_loss: 0.3661
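Once the model is compiled this way, the cell state can also simply be read back at prediction time, since it is one of the model's declared outputs:
soft_out, cell_state = model.predict(X)
print(cell_state.shape)   # (100, 7): one latent_dim-sized state vector per sample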

Fine-Tuning DistilBertForSequenceClassification: Is not learning, why is loss not changing? Weights not updated?

I am relatively new to PyTorch and Huggingface-transformers and experimented with DistilBertForSequenceClassification on this Kaggle dataset.
from transformers import DistilBertForSequenceClassification
import torch.optim as optim
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup
n_epochs = 5 # or whatever
batch_size = 32 # or whatever
bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#bert_distil.classifier = nn.Sequential(nn.Linear(in_features=768, out_features=1), nn.Sigmoid())
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=0.1)
X_train = []
Y_train = []
for row in train_df.iterrows():
    seq = tokenizer.encode(preprocess_text(row[1]['text']), add_special_tokens=True, pad_to_max_length=True)
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]).unsqueeze(0))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)
running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    j = 0
    for i in range(0, len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices], Y_train[indices]
        batch_x.cuda()
        batch_y.cuda()
        outputs = bert_distil.forward(batch_x.cuda())
        loss = criterion(outputs[0], batch_y.squeeze().cuda())
        loss.requires_grad = True
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        j += 1
        if j == 20:
            #print(outputs[0])
            print('[%d, %5d] running loss: %.3f loss: %.3f ' %
                  (epoch + 1, i*1, running_loss / 20, loss.item()))
            running_loss = 0.0
            j = 0
[1, 608] running loss: 0.689 loss: 0.687
[1, 1248] running loss: 0.693 loss: 0.694
[1, 1888] running loss: 0.693 loss: 0.683
[1, 2528] running loss: 0.689 loss: 0.701
[1, 3168] running loss: 0.690 loss: 0.684
[1, 3808] running loss: 0.689 loss: 0.688
[1, 4448] running loss: 0.689 loss: 0.692 etc...
Regardless of what I tried, the loss never decreased (it sometimes even increased), nor did the predictions get better. It seems to me that I forgot something, so that the weights are actually not updated. Does someone have an idea?
What I tried:
Different loss functions: BCE, CrossEntropy, even MSE loss
One-hot encoding vs. a single neuron output
Different learning rates and optimizers
I even changed all the targets to only one single label, but even then the network didn't converge.
Looking at the running loss and the minibatch loss is easily misleading. You should look at the epoch loss, because the inputs to that loss are the same for every epoch.
Besides, there are some problems in your code; after fixing all of them the behavior is as expected: the loss slowly decreases after each epoch, and it can also overfit a small minibatch. Please look at the code; changes include: using model(x) instead of model.forward(x), calling cuda() only once, a smaller learning rate, etc.
Tuning and fine-tuning ML models is difficult work.
n_epochs = 5
batch_size = 1
bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=1e-3)
X_train = []
Y_train = []
for row in train_df.iterrows():
    seq = tokenizer.encode(row[1]['text'], add_special_tokens=True, pad_to_max_length=True)[:100]
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)
running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    for i in range(0, len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices].cuda(), Y_train[indices].cuda()
        outputs = bert_distil(batch_x)
        loss = criterion(outputs[0], batch_y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print('[%d] epoch loss: %.3f' %
          (epoch + 1, running_loss / len(X_train) * batch_size))
    running_loss = 0.0
Output:
[1] epoch loss: 0.695
[2] epoch loss: 0.690
[3] epoch loss: 0.687
[4] epoch loss: 0.685
[5] epoch loss: 0.684
I would highlight two possible reasons for your "stable" results:
I agree that the learning rate is surely too high, which prevents the model from making any significant updates.
But what is important to know is that, based on state-of-the-art papers, fine-tuning has a very marginal effect on the core NLP abilities of Transformers. For example, the paper says that fine-tuning only applies really small weight changes. Citing it: "Finetuning barely affects accuracy on NEL, COREF and REL indicating that those tasks are already sufficiently covered by pre-training". Several papers suggest that fine-tuning for classification tasks is basically a waste of time. Thus, considering that DistilBert is actually a student model of BERT, maybe you won't get better results. Try pre-training with your data first; generally, pre-training has a more significant impact.
I ran into a similar problem when I tried to use xxxForSequenceClassification to fine-tune my downstream task.
In the end, I changed xxxForSequenceClassification to xxxModel and added Dropout - FC - Softmax. Magically that solved it; the loss decreased as expected.
I'm still trying to find out why.
Hope it may help you.
FYI, transformers version: 3.5.0
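For reference, here is a minimal sketch of that kind of replacement head (my own illustration, assuming two labels and roughly that transformers version; the softmax is left to nn.CrossEntropyLoss, which applies it internally):
import torch.nn as nn
from transformers import DistilBertModel

class DistilBertWithHead(nn.Module):
    def __init__(self, num_labels=2, dropout=0.2):
        super().__init__()
        self.backbone = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(self.backbone.config.dim, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask)[0]  # (batch, seq_len, dim)
        cls_repr = hidden[:, 0]                  # representation at the [CLS] position
        return self.fc(self.dropout(cls_repr))  # raw logits for nn.CrossEntropyLoss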
Maybe the poor performance is due to gradients being applied to the BERT backbone. Validate it like so:
print([p.requires_grad for p in bert_distil.distilbert.parameters()])
As an alternative solution, try freezing the weights of your trained model:
for param in bert_distil.distilbert.parameters():
    param.requires_grad = False
As you are trying to optimize the weights of a trained model during fine-tuning on your data, you face issues described, among other sources, in the ULMFiT paper (https://arxiv.org/abs/1801.06146).

What does it mean to have a negative cost for my training set?

I'm trying to train my model, and my cost output decreases each epoch until it reaches values close to zero, then it goes negative.
I'm wondering: what is the meaning of a negative cost output?
Cost after epoch 0: 3499.608553
Cost after epoch 1: 2859.823284
Cost after epoch 2: 1912.205967
Cost after epoch 3: 1041.337282
Cost after epoch 4: 385.100483
Cost after epoch 5: 19.694999
Cost after epoch 6: 0.293331
Cost after epoch 7: 0.244265
Cost after epoch 8: 0.198684
Cost after epoch 9: 0.156083
Cost after epoch 10: 0.117224
Cost after epoch 11: 0.080965
Cost after epoch 12: 0.047376
Cost after epoch 13: 0.016184
Cost after epoch 14: -0.012692
Cost after epoch 15: -0.039486
Cost after epoch 16: -0.064414
Cost after epoch 17: -0.087688
Cost after epoch 18: -0.109426
Cost after epoch 19: -0.129873
Cost after epoch 20: -0.149069
Cost after epoch 21: -0.169113
Cost after epoch 22: -0.184217
Cost after epoch 23: -0.200351
Cost after epoch 24: -0.215847
Cost after epoch 25: -0.230574
Cost after epoch 26: -0.245604
Cost after epoch 27: -0.259469
Cost after epoch 28: -0.272469
Cost after epoch 29: -0.284447
I'm training using TensorFlow. It's a simple neural network with 2 hidden layers, learning_rate=0.0001, number_of_epochs=30, mini-batch size=50, and a train-test ratio of 69/29; the dataset has 101,434 training examples.
The cost is computed using the cross-entropy equation:
tf.nn.sigmoid_cross_entropy_with_logits(logits=Z3, labels=Y)
It means the labels are not in the format that the cost function expects.
Each label passed to sigmoid_cross_entropy_with_logits should be 0 or 1 (for binary classification) or a vector of 0s and 1s (for more than 2 classes). Otherwise, it won't work as expected.
For n classes, the output layer should have n units, and the labels should be encoded accordingly before being passed to sigmoid_cross_entropy_with_logits:
Y = tf.one_hot(Y, n)
This assumes that Y is a list or one-dimensional array of labels ranging from 0 to n-1.
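To see why out-of-range labels drive the cost negative: sigmoid_cross_entropy_with_logits computes max(x, 0) - x * z + log(1 + exp(-|x|)) per example, so a label z greater than 1 combined with a confidently positive logit x yields a negative loss. A tiny check (TF 2 eager syntax, for illustration only):
import tensorflow as tf

logits = tf.constant([5.0])
bad_label = tf.constant([3.0])    # a raw class index passed in by mistake
good_label = tf.constant([1.0])   # the 0/1 encoding the function expects
print(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=bad_label).numpy())   # about -9.99
print(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=good_label).numpy())  # about 0.0067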
