I am a bit confused and hope someone can help me.
I am currently experimenting with supervised learning, and I think I have a basic misunderstanding about the input and output of LSTMs.
Say I have a sequence of 10 observations,
and I split it into train = 1,2,3,4,5,6,7,8
and test = 9,10.
Then I transform it into a supervised problem like:
Xtrain= [(1,2)(2,3)(3,4)(4,5)(5,6)]
Ytrain= [(3,4)(4,5)(5,6)(6,7)(7,8)]
And
Xtest= [(7,8)]
So the model is made to predict the next two observations from the previous two.
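For concreteness, this is roughly how I build those windows (make_windows is just an illustrative helper I'm sketching here, not library code):

import numpy as np

def make_windows(seq, n_in=2, n_out=2):
    # slide over the sequence, taking n_in inputs and the next n_out values as the target
    X, y = [], []
    for i in range(len(seq) - n_in - n_out + 1):
        X.append(seq[i:i + n_in])
        y.append(seq[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(y)

train = list(range(1, 9))             # 1..8
Xtrain, Ytrain = make_windows(train)  # Xtrain = [(1,2)...(5,6)], Ytrain = [(3,4)...(7,8)]
Xtest = np.array([[7, 8]])            # to be evaluated against the held-out (9,10)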
prediction <- predict(Xtest)
Is this an illegal train/test split? Am I correct that I can then evaluate the prediction output from Xtest against the actual test set containing [(9,10)]?
Or should I stop training at Xtrain = [(4,5)] and Ytrain = [(6,7)] to leave some space between training and testing, since the last observations from Ytrain in my example are also used for the prediction?
I'm trying to solve a multilabel classification task with 10 classes and a relatively balanced training set consisting of ~25K samples and an evaluation set consisting of ~5K samples.
I'm using the Hugging Face implementation:
model = transformers.BertForSequenceClassification.from_pretrained(...
and obtain quite nice results (ROC AUC = 0.98).
However, I'm witnessing some odd behavior which I can't seem to make sense of.
I add the following lines of code:
for param in model.bert.parameters():
    param.requires_grad = False
while making sure that the other layers of the model are learned, that is:
[param[0] for param in model.named_parameters() if param[1].requires_grad == True]
gives
['classifier.weight', 'classifier.bias']
Training the model when configured like so yields some embarrassingly poor results (ROC AUC = 0.59).
I was working under the assumption that an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers. So where did I go wrong?
In my experience, you are going wrong in your assumption that
an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers.
I have seen similar behavior when trying to use BERT's output layer as a word embedding with little-to-no fine-tuning, which also gave very poor results; and this makes sense, since in the simplest form of output layer you effectively have only 768 * num_classes connections. Compared to the millions of parameters in BERT, that gives you an almost negligible amount of capacity to adapt the model to your task. However, I also want to cautiously point out the risk of overfitting when training your full model, although I'm sure you are aware of that.
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers. The one case in which it can be helpful to freeze at least some layers is the embedding component, depending on the model's vocabulary size (~30k for BERT-base).
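If you do want to freeze just the embedding component, a minimal sketch (assuming bert-base-uncased and the 10-class setup from the question) could look like this:

import transformers

model = transformers.BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=10)

# keep the ~30k-row embedding table fixed, fine-tune everything else
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# sanity check: encoder layers and classifier should still be trainable
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(len(trainable))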
I think the following will help in demystifying the odd behavior I reported here earlier –
First, as it turned out, when freezing the BERT layers (and using an out-of-the-box pre-trained BERT model without any fine-tuning), the number of training epochs required for the classification layer is far greater than that needed when allowing all layers to be learned.
For example,
Without freezing the BERT layers, I’ve reached:
ROC AUC = 0.98, train loss = 0.0988, validation loss = 0.0501 # end of epoch 1
ROC AUC = 0.99, train loss = 0.0484, validation loss = 0.0433 # end of epoch 2
Overfitting, train loss = 0.0270, validation loss = 0.0423 # end of epoch 3
Whereas, when freezing the BERT layers, I’ve reached:
ROC AUC = 0.77, train loss = 0.2509, validation loss = 0.2491 # end of epoch 10
ROC AUC = 0.89, train loss = 0.1743, validation loss = 0.1722 # end of epoch 100
ROC AUC = 0.93, train loss = 0.1452, validation loss = 0.1363 # end of epoch 1000
The (probable) conclusion that arises from these results is that working with an out-of-the-box pre-trained BERT model as a feature extractor (that is, freezing its layers) while learning only the classification layer suffers from underfitting.
This is demonstrated in two ways:
First, after running 1000 epochs, the model still hasn’t finished learning (the training loss is still higher than the validation loss).
Second, after running 1000 epochs, the loss values are still higher than the values achieved with the non-frozen version as early as the first epoch.
To sum it up, @dennlinger, I think I completely agree with you on this:
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers.
I have some questions about fine-tuning a causal language model using transformers and PyTorch.
My main goal is to fine-tune XLNet. However, I found that most of the posts online target text classification, like this post. I was wondering: is there any way to fine-tune the model without using run_language_model.py from transformers' GitHub?
Here is a piece of my code trying to fine-tune XLNet:
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased", do_lower_case=True)
LOSS = torch.nn.CrossEntropyLoss()
batch_texts = ["this is sentence 1", "i have another sentence like this", "the final sentence"]
encodings = tokenizer.batch_encode_plus(batch_texts, add_special_tokens=True, padding=True,
                                        return_tensors="pt", return_attention_mask=True)
outputs = model(encodings["input_ids"], attention_mask=encodings["attention_mask"])
loss = LOSS(outputs[0], target_ids)
loss.backward()
# ignoring the rest of the code...
I got stuck at the last two lines. First, when using this LM model, it seems I don't have any labels, as supervised learning usually requires; second, since the language model is trained to minimize a loss (cross-entropy here), I need target_ids to compute the loss and perplexity against the input_ids.
Here are my follow-up questions:
How should I deal with these labels during model fitting?
Should I set something like target_ids=encodings["input_ids"].copy() to compute cross-entropy loss and perplexity?
If not, how should I set these target_ids?
From the perplexity page in transformers' documentation, how should I adapt its method for input text of non-fixed length?
I saw another page in the documentation saying that padding the text is required for causal language modeling. However, the link in 3) shows no sign of padding the text. Which one should I follow?
Any suggestions and advice will be appreciated!
When fine-tuning a model with a language-model head, the labels are the next tokens themselves (you predict the next words). Hugging Face's library makes a lot of things very easy by hiding most of the complexity of the process inside its methods, which is very nice when you want to do something standard. But if you want to do something special, or if you want to learn and understand the details, I suggest going down a level and implementing the training loop directly in PyTorch; coding the low-level stuff is the best way to learn.
For this case, here is a draft to get you started; the training loop is far from complete, but it has to be adapted to each specific case anyway, so I hope these few lines help you start...
import torch
import transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
# our input:
s = tokenizer.encode('In winter, the weather is', return_tensors='pt')
# we want to fine-tune to force a fake output as follows:
ss = tokenizer.encode('warm and hot', return_tensors='pt')
# forward pass:
outputs = model(s)
# check that the output logits are given for every input token:
print(outputs.logits.size())
# we're going to train on the token that follows the last input one,
# so we extract just the last logit:
lasty = outputs.logits[0, -1].view(1, -1)
# prepare backprop:
lossfct = torch.nn.CrossEntropyLoss()
optimizer = transformers.AdamW(model.parameters(), lr=5e-5)
# just take the first next token (you should repeat this for the next ones)
labels = ss[0][0].view(1)
loss = lossfct(lasty, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
# fine-tuning done: you may check that the answer is already different:
y = model.generate(s)
sy = tokenizer.decode(y[0])
print(sy)
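As a side note, if you just want standard causal-LM fine-tuning over whole sequences (your first two questions), the *LMHeadModel classes can compute the loss for you when you pass labels; for GPT-2, the labels are simply the input ids, and the model shifts them internally. A minimal sketch, assuming a recent transformers version whose output exposes .loss (XLNet's head has extra permutation-related arguments, so check its docs before copying this):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

ids = tokenizer.encode('In winter, the weather is cold and wet.', return_tensors='pt')
outputs = model(ids, labels=ids)        # the model shifts the labels and computes cross-entropy
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
perplexity = torch.exp(outputs.loss)    # rough perplexity for this single batch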
I am trying to make a face recognition model with PyTorch. My model performs well, with both training and validation loss close to 0. The problem is that when I test it using the same input, it gives me different results each time (two example screenshots were attached; in each, the left side is the true label and the right one is the prediction output). I set a random seed and ran model.eval() before feeding the input to the model.
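For reference, this is roughly what I do before inference (a sketch; the Linear layer and random tensor are just stand-ins for my actual network and input):

import random
import numpy as np
import torch

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

model = torch.nn.Linear(128, 10)       # stand-in for my face recognition network
input_tensor = torch.randn(1, 128)     # stand-in for a preprocessed face image

model.eval()                  # disable dropout, use running batch-norm statistics
with torch.no_grad():         # no gradient tracking during inference
    prediction = model(input_tensor)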
Does anyone know how to solve this?
Please advise
I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training-data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is of the same order of magnitude as the predicted values and, more importantly, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the training RMSE is lower, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure the neg_mean_squared_error should be the negative of the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I change my GradientBoostingRegressor max_depth from 3 to 2, which I would expect to reduce overfitting and thus improve the CV error, I find that the opposite is the case.
I'm keen to use cross-validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not sure the results will be understandable or useful.
Can someone explain how I should be using cross-validation to get meaningful results?
I think there is a conceptual problem here.
If you want to compute the error of a prediction, you should not use the training data. As the name says, that data is used only for training; to evaluate accuracy scores you have to use data that the model has never seen.
As for cross-validation, it's an approach for evaluating the model on several different training/testing splits. The process is as follows: you divide your data into n groups and iterate, changing which group is used for testing. If you have n groups you will do n iterations, and each time the training and testing sets will be different.
Basically, what you should do is something like this (a code sketch follows below):
Train the model using months 0 to 30 (for example).
Evaluate the predictions made with months 31 to 35 as input.
If the input has to be the same length, divide your features in half (that should be about 17 months each).
I hope I understood correctly; otherwise, please comment.
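A sketch of how that could be set up with scikit-learn; the feature/target arrays here are stand-ins for the data from the question, TimeSeriesSplit keeps each test fold after its training folds in time, and the last lines show how to turn the negated MSE back into an RMSE:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_validate

features = np.arange(100, dtype=float).reshape(-1, 1)   # stand-in for the real feature matrix
targets = np.arange(100, dtype=float) * 2.0              # stand-in for the real targets

predictor = GradientBoostingRegressor(max_depth=3)
cv = TimeSeriesSplit(n_splits=5)   # ordered folds, no shuffling of the time series
results = cross_validate(predictor, features, targets, cv=cv,
                         scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-results["test_score"])   # undo the negation, then take the root
print(rmse_per_fold.mean())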
I am facing a weird issue. I use NaiveBayesClassifier from nltk.classify to categorize text, and my problem is that it reports an incredible accuracy of 0.9966. I am sure this cannot be real, yet I see no errors in my code. My input is huge: 40,000 sentences used for training and 80,000 used for testing.
I am building a set of train features made up of all the negative/positive/neutral labeled training text
trainFeats = negFeats + posFeats + neutralFeats
and a set of test features made up of all the negative/positive/neutral labeled training text
testFeats = negFeats + posFeats + neutralFeats
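For reference, each feature set is built roughly like this (a sketch; word_feats and the small sentence lists are illustrative stand-ins, not the code shown above):

# stand-ins for my real labeled corpora
neg_sentences = ["this is terrible", "awful service"]
pos_sentences = ["this is great", "loved it"]
neutral_sentences = ["it is a thing", "nothing special"]

def word_feats(sentence):
    # simple bag-of-words features over whitespace tokens
    return {word: True for word in sentence.lower().split()}

negFeats = [(word_feats(s), 'neg') for s in neg_sentences]
posFeats = [(word_feats(s), 'pos') for s in pos_sentences]
neutralFeats = [(word_feats(s), 'neutral') for s in neutral_sentences]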
Afterwards I train the classifier on the trainFeats
classifier = NaiveBayesClassifier.train(trainFeats)
and test it on all the testFeats
print 'accuracy:', nltk.classify.util.accuracy(classifier, testFeats)
Is this a normal result? Should I take it at face value? It behaves strikingly well. Thanks!