I am using this data, by the way:
import numpy as np
import torch.nn.functional as F

input = np.array([
[[313, 1], #HCL
[323, 1],
[333, 1],
[343, 1]],
[[313, 10e-3], #Ortho
[323, 10e-3],
[333, 10e-3],
[343, 10e-3]],
[[313, 10e-3], #Para
[323, 10e-3],
[333, 10e-3],
[343, 10e-3]]
], dtype='float32')

target = np.array([[[14.76, 16.42, 18.08, 23.41]],
                   [[5.87, 11.14, 13.20, 25.72]],
                   [[2.73, 4.42, 8.04, 13.68]]], dtype='float32')

loss_fn = F.mse_loss
loss = loss_fn(model(input), target)
Output:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: UserWarning: Using a target size (torch.Size([3, 1, 4])) that is different to the input size (torch.Size([3, 4, 4])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
def fit(num_epochs, model, loss_fn, opt):
for epoch in range(num_epochs):
for xb, yb in train_dl:
pred = model(xb)
loss = loss_fn(preds, yb)
loss.backward()
opt.step()
opt.zero_grad()
if (epoch+1) % 10 == 0:
print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, loss.item()))
fit(100, model, loss_fn, opt)
Output:
Epoch [10/100], Loss: 25.1053
Epoch [20/100], Loss: 25.1050
Epoch [30/100], Loss: 25.1047
Epoch [40/100], Loss: 25.1043
Epoch [50/100], Loss: 25.1040
Epoch [60/100], Loss: 25.1036
Epoch [70/100], Loss: 25.1033
Epoch [80/100], Loss: 25.1030
Epoch [90/100], Loss: 25.1026
Epoch [100/100], Loss: 25.1023
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:5: UserWarning: Using a target size (torch.Size([3, 1, 4])) that is different to the input size (torch.Size([3, 4, 4])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
"""
I'm wondering whether the shape mismatch is what causes the error in the fit function. I would also like some general tips on how to set up these arrays so I don't run into this, maybe a general rule of thumb, or things I can do when I do run into this problem, because I don't want to be stuck once it happens.
EDIT: I fixed the error in the fit function, but the warning persists. I am not sure what to do, because this gives very wrong results.
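If it helps pinpoint it, my understanding of the warning can be reproduced with dummy tensors (just an illustration, not my actual model): a [3, 1, 4] target broadcast against a [3, 4, 4] output gives a [3, 4, 4] difference, so every prediction row gets compared with the single target row.
import torch

pred = torch.randn(3, 4, 4)    # the shape the warning reports for model(input)
target = torch.randn(3, 1, 4)  # the shape of my target
print((pred - target).shape)   # torch.Size([3, 4, 4]) -- broadcast over the middle dimension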
I'm running a multiclass classification problem using the ResNet model below:
import tensorflow as tf

resnet = tf.keras.applications.ResNet50(
include_top=False ,
weights='imagenet' ,
input_shape=(96, 96, 3) ,
pooling="avg"
)
for layer in resnet.layers:
layer.trainable = True
model_resnet = tf.keras.Sequential()
model_resnet.add(resnet)
model_resnet.add(tf.keras.layers.Flatten())
model_resnet.add(tf.keras.layers.Dense(8, activation='softmax',name='output') )
model_resnet.compile( loss="sparse_categorical_crossentropy" , optimizer=tf.keras.optimizers.Adam(learning_rate=0.001) ,metrics=['accuracy'])
I also used a train and a test generator as below:
train_generator=img_gen.flow_from_dataframe(dataframe=train_dataset,x_col="file_loc",y_col='expr',target_size=(96, 96),batch_size=91,class_mode="raw")
test_generator=img_gen.flow_from_dataframe(dataframe=test_dataset,x_col="file_loc",target_size=(96, 96),batch_size=93,y_col=None,shuffle=False,class_mode=None)
When I run the code below, I get the expected results and everything works fine:
model_resnet.fit_generator(train_generator,
steps_per_epoch=STEP_SIZE_TRAIN_resnet,
epochs=20
)
I wanted to compute the validation accuracy for every epoch, so I wrote something like this:
model_path = f"/content/weights" + "{val_accuracy:.4f}.hdf5"
checkpoint = tf.keras.callbacks.ModelCheckpoint(
model_path,
monitor='val_accuracy',
save_best_only=True,
mode='max',
verbose=1
)
history = model_resnet.fit_generator(
train_generator,
epochs=5,
steps_per_epoch=STEP_SIZE_TRAIN_resnet,
validation_data=test_generator,
validation_steps=STEP_SIZE_TEST_resnet,
max_queue_size=1,
shuffle=True,
callbacks=[checkpoint],
verbose=1
)
The problem is that for every epoch the validation loss and validation accuracy remain zero, even though the training loss and accuracy change. I ran this code for over 20 epochs and it doesn't change at all. I can't find what I am doing wrong, since without the validation part it works perfectly. Does anyone have any idea?
Epoch 1: val_accuracy improved from -inf to 0.00000, saving model to /content/weights0.0000.hdf5
500/500 [==============================] - 30s 60ms/step - loss: 1.0213 - accuracy: 0.6546 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/5
500/500 [==============================] - ETA: 0s - loss: 0.9644 - accuracy: 0.6672
Epoch 2: val_accuracy did not improve from 0.00000
500/500 [==============================] - 29s 58ms/step - loss: 0.9644 - accuracy: 0.6672 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Edit: I didn't specify the test labels of the test dataset because I used to compute the accuracy score as below:
y_pred = model_resnet.predict(test_generator)
y_pred_max = np.argmax(y_pred, axis=1)
y_true = test_dataset["expr"].to_numpy()
print("accuracy",accuracy_score(y_true, y_pred_max))
I changed the test_generator as below:
test_generator=img_gen.flow_from_dataframe(dataframe=test_dataset,x_col="file_loc",target_size=(96, 96),batch_size=93,y_col='expr',shuffle=False,class_mode=None)
but nothing changed; it still results in zero.
As @Dr.Snoopy said, the problems were that I didn't specify the test labels in the test generator (they are required to compute accuracy) and that I used different class modes in the two generators; the correct one is "raw" in both.
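For completeness, this is roughly what the corrected test generator looks like after that advice (same img_gen, dataframe and columns as above; the key changes are y_col and class_mode):
test_generator = img_gen.flow_from_dataframe(
    dataframe=test_dataset,
    x_col="file_loc",
    y_col="expr",          # test labels are needed so Keras can compute val_loss / val_accuracy
    target_size=(96, 96),
    batch_size=93,
    shuffle=False,
    class_mode="raw"       # must match the train generator, which also uses "raw"
)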
I am trying to implement a neural network approximating the logical XOR function; however, the network only converges when using a batch size of 1.
I don't understand why: when I use gradient accumulation with multiple mini-batches of size 1, the convergence is very smooth, but mini-batches of size 2 or more don't work at all.
This issue arises whatever the learning rate, and I have the same issue with another (more complex) problem besides XOR.
I attach my code for reference:
import numpy as np
import torch.nn as nn
import torch
import torch.optim as optim
import copy
#very simple network
class Net(nn.Module):
def __init__(self):
super().__init__()
self.fc = nn.Linear(2,3,True)
self.fc1 = nn.Linear(3,1, True)
def forward(self, x):
x = torch.sigmoid(self.fc(x))
x = self.fc1(x)
return x
def data(n): # return n sets of random XOR inputs and output
inputs = np.random.randint(0,2,2*n)
inputs = np.reshape(inputs,(-1,2))
outputs = np.logical_xor(inputs[:,0], inputs[:,1])
return torch.tensor(inputs, dtype = torch.float32),torch.tensor(outputs, dtype = torch.float32)
N = 4
net = Net() # first network, is updated with minibatches of size N
net1 = copy.deepcopy(net) # second network, updated with N minibatches of size 1
inputs = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype = torch.float32)
labels = torch.tensor([0,1,1,0], dtype = torch.float32)
optimizer = optim.SGD(net.parameters(), lr=0.01)
optimizer1 = optim.SGD(net1.parameters(), lr=0.01)
running_loss = 0
running_loss1 = 0
for epoch in range(25000): # loop over the dataset multiple times
# get the inputs; data is a list of [inputs, labels]
input, labels = data(N)
# zero the parameter gradients
optimizer.zero_grad()
optimizer1.zero_grad()
# forward + backward + optimize
loss1_total = 0
for i in range(N):
outputs1 = net1(input[i])
loss1 = (outputs1-labels[i]).pow(2)/N # I divide by N to get the effective mean
loss1.backward()
loss1_total += loss1.item()
outputs = net(input)
loss = (outputs-labels).pow(2).mean()
loss.backward()
# optimization
optimizer.step()
optimizer1.step()
# print statistics
running_loss += loss.item()
running_loss1 += loss1_total
if epoch % 1000 == 999: # print every 1000 mini-batches
print(f'[{epoch + 1}, loss: {running_loss/1000 :.3f}, loss1: {running_loss1/1000 :.3f}')
running_loss1 = 0.0
running_loss = 0.0
print('Finished Training')
# examples of data and outputs for reference; the network trained with minibatches of size N always converges to the sub-optimal point (0.5, 0.5)
datatest = data(4)
outputs = net(datatest[0])
outputs1 = net1(datatest[0])
inputs = datatest[0]
labels = datatest[1]
print("input",inputs)
print("target",labels)
print("net output",outputs)
print("net output",outputs1)
[EDIT] Improved readability and updated the code
result :
[1000, loss: 0.259, loss1: 0.258
[2000, loss: 0.252, loss1: 0.251
[3000, loss: 0.251, loss1: 0.250
[4000, loss: 0.252, loss1: 0.250
[5000, loss: 0.251, loss1: 0.249
[6000, loss: 0.251, loss1: 0.247
[7000, loss: 0.252, loss1: 0.246
[8000, loss: 0.251, loss1: 0.244
[9000, loss: 0.252, loss1: 0.241
[10000, loss: 0.251, loss1: 0.236
[11000, loss: 0.252, loss1: 0.230
[12000, loss: 0.252, loss1: 0.221
[13000, loss: 0.250, loss1: 0.208
[14000, loss: 0.251, loss1: 0.193
[15000, loss: 0.251, loss1: 0.175
[16000, loss: 0.251, loss1: 0.152
[17000, loss: 0.252, loss1: 0.127
[18000, loss: 0.251, loss1: 0.099
[19000, loss: 0.251, loss1: 0.071
[20000, loss: 0.251, loss1: 0.048
[21000, loss: 0.251, loss1: 0.029
[22000, loss: 0.251, loss1: 0.016
[23000, loss: 0.250, loss1: 0.008
[24000, loss: 0.251, loss1: 0.004
[25000, loss: 0.251, loss1: 0.002
Finished Training
input tensor([[1., 0.],
[0., 0.],
[0., 0.],
[0., 0.]])
target tensor([1., 0., 0., 0.])
net output tensor([[0.4686],
[0.4472],
[0.4472],
[0.4472]], grad_fn=<AddmmBackward0>)
net1 output tensor([[0.9665],
[0.0193],
[0.0193],
[0.0193]], grad_fn=<AddmmBackward0>)
Please, could you explain to me why this strange phenomenon appears? I searched for a long time on the net, without success.
Excuse me if my question is not well formatted; it is the first time I have asked a question on Stack Overflow.
EDIT :
Comparing the accumulated gradients from size-1 minibatches with the gradients from minibatches of size N, I found that the computed gradients are mostly the same; only small (but noticeable) differences appear, probably due to rounding errors, so my implementation looks fine at first sight. I still don't get where this strong convergence property of size-1 minibatches comes from.
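For reference, this is roughly how I compared them (a sketch only, checked on the first iteration, before any optimizer step, while net and net1 still share identical weights from the deepcopy):
# run the N calls to loss1.backward() and the single loss.backward() first,
# then compare the accumulated gradients parameter by parameter
for p, p1 in zip(net.parameters(), net1.parameters()):
    print((p.grad - p1.grad).abs().max())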
The problem lies in the way you define labels / compute the loss in
loss = (outputs-labels).pow(2).mean()
We have labels.shape == [4] but outputs.shape == [4, 1]. Due to broadcasting, the difference becomes
(outputs - labels).shape = [4, 4]
which means we compute all pairwise differences between outputs and labels (and then square them and average), so effectively no meaningful supervision takes place.
The quick way to fix this here would be to add a dummy dimension:
loss = (outputs-labels[:, None]).pow(2).mean()
but the clean way would be to do it correctly right from the start, that is, to define your labels such that labels.shape == [_, 1]:
labels = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)
(and similarly in your data() function).
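A quick way to see the difference is to print the shapes (using the tensors from the question):
outputs = torch.randn(4, 1)               # stand-in for net(inputs)
labels = torch.tensor([0., 1., 1., 0.])
print((outputs - labels).shape)           # torch.Size([4, 4]) -- all pairwise differences
print((outputs - labels[:, None]).shape)  # torch.Size([4, 1]) -- elementwise, as intended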
It seems there is a minor issue with the dimensions of labels and outputs.
This:
labels = torch.tensor([0,1,1,0], dtype = torch.float32)
Needs to become this:
labels = torch.tensor([[0],[1],[1],[0]], dtype = torch.float32)
Otherwise, the mismatch between the model output and the labels messes up the loss in the minibatch example.
This can be fixed in data(n), if you add an extra dimension to outputs:
outputs = np.logical_xor(inputs[:,0], inputs[:,1]).reshape((n, 1))
After fixing that, there is still a small floating-point precision issue. The gradient-accumulation method divides each term by N and then sums, while the minibatch method first sums and then divides. Mathematically they are the same, but in practice there will be a small drift between them in the long run.
Check this example:
x = np.array([0.00649802, 0.24420964, 0.05081264,])
(x/3).sum() - x.mean()
# -1.3877787807814457e-17
I'm training a binary classification model on a series of images.
The model is derived from resnet18 in torchvision, and I changed the last FC layer to nn.Linear(512, 1).
The loss function is BCELoss
However, the model doesn't show any sign of converging even after 5000 iterations.
I suspect I might be doing something wrong in the training stage, but I can't find where the bug is.
Here's my code:
Model:
## Model
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.models as models
resnet18 = models.resnet18(pretrained= True)
resnet18.fc = nn.Linear(512, 1)
Parameters, loss, optimizers:
## parameter
epochs = 200
learning_rate = 0.1
momen = 0.9
batch = 8
criterion = nn.BCELoss()
resnet18.to(device)
opt = optim.SGD(resnet18.parameters(), lr = learning_rate, momentum = momen)
Dataloaders:
# Generators
training_set = Dataset(X_train)
training_generator = torch.utils.data.DataLoader(training_set, batch_size= batch, shuffle=True)
validation_set = Dataset(X_test)
validation_generator = torch.utils.data.DataLoader(validation_set, batch_size=1, shuffle=False)
Training:
# training
history = []
for t in range(epochs):
for i, data in enumerate(training_generator, 0):
inputs, labels = data
# check if input size == batch size #
if inputs.shape[0] < batch:
break
# print("labels", labels, labels.dtype)
# move data to GPU #
inputs, labels = inputs.to(device), labels.to(device)
opt.zero_grad()
# Prediction #
y_pred = resnet18(inputs).view(batch,)
y_pred = (y_pred > 0).float().requires_grad_()
# print("y_pred", y_pred, y_pred.dtype)
# Calculating loss #
loss = criterion(y_pred, labels.view(batch,))
loss.backward()
opt.step()
if i % 10 == 0:
history.append(loss.item())
print("Epoch: {}, iter: {}, loss: {}".format(t, i, loss.item())
torch.save(resnet18, 'trained_resnet18.pt')
Edit:
The loss values are like this:
Epoch: 3, iter: 310, loss: 0.0
Epoch: 3, iter: 320, loss: 37.5
Epoch: 3, iter: 330, loss: 37.5
Epoch: 3, iter: 340, loss: 0.0
Epoch: 3, iter: 350, loss: 37.5
Epoch: 3, iter: 360, loss: 50.0
Epoch: 3, iter: 370, loss: 37.5
Epoch: 3, iter: 380, loss: 25.0
Epoch: 3, iter: 390, loss: 12.5
I believe the error lies in the following line:
y_pred = (y_pred > 0).float().requires_grad_()
You are trying to binarize the model prediction in a way that breaks the gradient flow (the threshold comparison is not differentiable); I suggest doing the following instead:
y_pred = torch.sigmoid(y_pred)
And pass this to the loss function.
Explanation
The output of the model can be any value, but we want to normalize those values so that they lie in the [0, 1] range. This is exactly what the sigmoid function does. Once the values are in [0, 1], comparing them with the binary labels makes sense: values closer to 1 correspond to the label "1", and the opposite for values closer to 0.
You can refer to the following link: https://www.youtube.com/watch?v=WsFasV46KgQ
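For reference, a minimal sketch of how the prediction/loss lines from the question would look with that change (keeping the rest of the training loop as is; the .float() cast is only needed if the labels are not already floats):
logits = resnet18(inputs).view(-1)    # raw scores from the final nn.Linear(512, 1)
y_pred = torch.sigmoid(logits)        # squash into [0, 1] so BCELoss is well defined
loss = criterion(y_pred, labels.view(-1).float())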
I am relatively new to PyTorch and Huggingface Transformers and experimented with DistilBertForSequenceClassification on this Kaggle dataset.
import torch
import torch.optim as optim
import torch.nn as nn
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from transformers import get_linear_schedule_with_warmup
n_epochs = 5 # or whatever
batch_size = 32 # or whatever
bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#bert_distil.classifier = nn.Sequential(nn.Linear(in_features=768, out_features=1), nn.Sigmoid())
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=0.1)
X_train = []
Y_train = []
for row in train_df.iterrows():
seq = tokenizer.encode(preprocess_text(row[1]['text']), add_special_tokens=True, pad_to_max_length=True)
X_train.append(torch.tensor(seq).unsqueeze(0))
Y_train.append(torch.tensor([row[1]['target']]).unsqueeze(0))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)
running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
permutation = torch.randperm(len(X_train))
j = 0
for i in range(0,len(X_train), batch_size):
optimizer.zero_grad()
indices = permutation[i:i+batch_size]
batch_x, batch_y = X_train[indices], Y_train[indices]
batch_x.cuda()
batch_y.cuda()
outputs = bert_distil.forward(batch_x.cuda())
loss = criterion(outputs[0],batch_y.squeeze().cuda())
loss.requires_grad = True
loss.backward()
optimizer.step()
running_loss += loss.item()
j+=1
if j == 20:
#print(outputs[0])
print('[%d, %5d] running loss: %.3f loss: %.3f ' %
(epoch + 1, i*1, running_loss / 20, loss.item()))
running_loss = 0.0
j = 0
[1, 608] running loss: 0.689 loss: 0.687
[1, 1248] running loss: 0.693 loss: 0.694
[1, 1888] running loss: 0.693 loss: 0.683
[1, 2528] running loss: 0.689 loss: 0.701
[1, 3168] running loss: 0.690 loss: 0.684
[1, 3808] running loss: 0.689 loss: 0.688
[1, 4448] running loss: 0.689 loss: 0.692 etc...
Regardless of what I tried, the loss never decreased (or it even increased), nor did the predictions get better. It seems to me that I forgot something, so that the weights are actually not updated. Does anyone have an idea?
What I tried:
- Different loss functions (BCE, CrossEntropy, even MSE loss)
- One-hot encoding vs. a single neuron output
- Different learning rates and optimizers
I even changed all the targets to only one single label, but even then the network didn't converge.
Looking at the running loss and the minibatch loss is easily misleading. You should look at the epoch loss, because the inputs are the same for every epoch, so those numbers are comparable.
Besides, there are some problems in your code; after fixing all of them the behavior is as expected: the loss slowly decreases after each epoch, and the model can also overfit a small minibatch. Please look at the code below; the changes include using model(x) instead of model.forward(x), calling cuda() only once, a smaller learning rate, etc.
Tuning and fine-tuning ML models is difficult work.
n_epochs = 5
batch_size = 1
bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=1e-3)
X_train = []
Y_train = []
for row in train_df.iterrows():
seq = tokenizer.encode(row[1]['text'], add_special_tokens=True, pad_to_max_length=True)[:100]
X_train.append(torch.tensor(seq).unsqueeze(0))
Y_train.append(torch.tensor([row[1]['target']]))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)
running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
permutation = torch.randperm(len(X_train))
for i in range(0,len(X_train), batch_size):
optimizer.zero_grad()
indices = permutation[i:i+batch_size]
batch_x, batch_y = X_train[indices].cuda(), Y_train[indices].cuda()
outputs = bert_distil(batch_x)
loss = criterion(outputs[0], batch_y)
loss.backward()
optimizer.step()
running_loss += loss.item()
print('[%d] epoch loss: %.3f' %
(epoch + 1, running_loss / len(X_train) * batch_size))
running_loss = 0.0
Output:
[1] epoch loss: 0.695
[2] epoch loss: 0.690
[3] epoch loss: 0.687
[4] epoch loss: 0.685
[5] epoch loss: 0.684
I would highlight two possible reasons for your "stable" results:
I agree that the learning rate is surely too high, which prevents the model from making any significant updates.
But it is important to know that, based on state-of-the-art papers, fine-tuning has only a very marginal effect on the core NLP abilities of Transformers. For example, the paper says that fine-tuning only applies really small weight changes, citing it: "Finetuning barely affects accuracy on NEL, COREF and REL indicating that those tasks are already sufficiently covered by pre-training". Several papers suggest that fine-tuning for classification tasks is basically a waste of time. Thus, considering that DistilBERT is actually a student model of BERT, maybe you won't get better results. Try pre-training with your data first; generally, pre-training has a more significant impact.
I ran into a similar problem when I tried to use xxxForSequenceClassification to fine-tune on my downstream task.
In the end, I changed xxxForSequenceClassification to xxxModel and added my own Dropout - FC - Softmax head (sketched below). Magically, it was solved: the loss decreased as expected.
I'm still trying to find out why.
Hope it may help you.
FYI, transformers version: 3.5.0
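Roughly, the replacement head I mean looks like this (a sketch with DistilBERT, not my exact code; with nn.CrossEntropyLoss the explicit Softmax is usually left out, because the loss applies log-softmax internally):
import torch.nn as nn
from transformers import DistilBertModel

class DistilBertWithHead(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.backbone = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.dropout = nn.Dropout(0.2)
        self.fc = nn.Linear(768, n_classes)      # DistilBERT's hidden size is 768

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask)[0]  # (batch, seq_len, 768)
        cls = hidden[:, 0]                       # embedding at the first ([CLS]) position
        return self.fc(self.dropout(cls))        # raw logits, to be fed to CrossEntropyLoss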
Maybe the poor performance is due to gradients being applied to the BERT backbone. Validate it like so:
print([p.requires_grad for p in bert_distil.distilbert.parameters()])
As an alternative solution, try freezing the weights of your trained model:
for param in bert_distil.distilbert.parameters():
param.requires_grad = False
Since you are trying to optimize the weights of an already-trained model while fine-tuning it on your data, you face the issues described, among other sources, in the ULMFiT paper (https://arxiv.org/abs/1801.06146).
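One concrete way to act on that, as a sketch of the discriminative-learning-rate idea from ULMFiT rather than a drop-in fix (distilbert, pre_classifier and classifier are the submodules of DistilBertForSequenceClassification):
optimizer = optim.Adam([
    {"params": bert_distil.distilbert.parameters(), "lr": 2e-5},       # pretrained backbone: tiny lr
    {"params": bert_distil.pre_classifier.parameters(), "lr": 1e-3},   # freshly initialised head: larger lr
    {"params": bert_distil.classifier.parameters(), "lr": 1e-3},
])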
from keras import backend as k

def myloss(y_true, y_pred):
    # weight the squared errors per output column: the last three outputs count 50x more
    b = k.constant([1, 1, 1, 50, 50, 50], shape=[6, 1])
    return k.mean(k.sqrt(k.dot(k.square(y_pred - y_true), b)))
This is our loss function and we got this result.
2800/2799 [==============================] - 245s - loss: 204.2003 - soft_acc: 0.5136 - val_loss: 64.3844 - val_soft_acc: 0.4648
We tried changing the learning rate and the optimiser, but the loss didn't improve.
We referred to this link:
Keras Extremely High Loss
epoch 1/200 ===========================] - 254s - loss: 4.0631 - rmse: 5.1670 - val_loss: 4.6882 - val_rmse: 4.7807
and added a logarithmic error, which gave the loss values above. How can we reduce the loss further?
I tried normalising the data. It worked for me.
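A minimal sketch of what I mean (X_train and y_train here stand in for the arrays in your pipeline, which are not shown in the question):
from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)   # y_train shaped (n_samples, 6)

X_train_norm = x_scaler.transform(X_train)
y_train_norm = y_scaler.transform(y_train)
# train on the normalised arrays, transform the validation data with the same
# scalers, and use y_scaler.inverse_transform() on predictions to get back the
# original units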