fine tune causal language model using transformers and pytorch - python-3.x

I have some questions about fine-tuning a causal language model using transformers and PyTorch.
My main goal is to fine-tune XLNet. However, most of the posts I found online target text classification, like this post. Is there any way to fine-tune the model without using the run_language_model.py script from transformers' GitHub?
Here is a piece of my code trying to fine-tune XLNet:
import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer

model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased", do_lower_case=True)
LOSS = torch.nn.CrossEntropyLoss()
batch_texts = ["this is sentence 1", "i have another sentence like this", "the final sentence"]
encodings = tokenizer(batch_texts, add_special_tokens=True, padding=True,
                      return_tensors="pt", return_attention_mask=True)
outputs = model(encodings["input_ids"], attention_mask=encodings["attention_mask"])
loss = LOSS(outputs[0], target_ids)  # target_ids is what I am unsure about (see below)
loss.backward()
# ignoring the rest of the code...
I got stuck at the last two lines. First, when using this LM model, it seems I don't have explicit labels the way supervised learning usually does; second, since the language model is trained to minimize the loss (cross-entropy here), I need target_ids to compute the loss and perplexity against input_ids.
Here are my follow-up questions:
1) How should I deal with these labels during model fitting?
2) Should I set something like target_ids = encodings["input_ids"].copy() to compute the cross-entropy loss and perplexity? If not, how should I set target_ids?
3) From the perplexity page of the transformers documentation, how should I adapt its method to input text of non-fixed length?
4) I saw another post in the documentation saying that causal language modeling requires padding the text. However, the link in 3) shows no sign of padding. Which one should I follow?
Any suggestions and advice will be appreciated!

When fine-tuning a model with a language-model head, the labels are the next tokens themselves (you predict the next words). Hugging Face's library makes a lot of things very easy by hiding most of the complexity of the process inside its methods, which is very nice when you want to do something standard. But if you want to do something special, or if you want to learn and understand the details, I suggest implementing the training loop directly in PyTorch; coding the low-level stuff is the best way to learn.
For this case, here is a draft to get you started; the training loop is far from complete, but it has to be adapted to each specific case anyway, so I hope these few lines help you start...
import torch
import transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
# our input:
s = tokenizer.encode('In winter, the weather is', return_tensors='pt')
# we want to fine-tune to force a fake output as follows:
ss = tokenizer.encode('warm and hot', return_tensors='pt')
# forward pass:
outputs = model(s)
# check that the output logits are given for every input token:
print(outputs.logits.size())
# we're gonna train on the token that follows the last input one
# so we extract just the last logit:
lasty = outputs.logits[0, -1].view(1, -1)
# prepare backprop:
lossfct = torch.nn.CrossEntropyLoss()
optimizer = transformers.AdamW(model.parameters(), lr=5e-5)
# just take the first next token (you should repeat this for the next ones)
labels = ss[0][0].view(1)
loss = lossfct(lasty, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
# fine-tuning done: you may check that the answer is already different:
y = model.generate(s)
sy = tokenizer.decode(y[0])
print(sy)
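Regarding the target_ids question: with a GPT-2-style LM head you can also train on whole sequences by simply passing the input ids as labels; the model shifts them internally and returns the cross-entropy loss for you. Below is a minimal sketch of that variant (sticking with distilgpt2; the padding handling and hyperparameters are just illustrative, and XLNet's permutation objective would need a different setup):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default

texts = ["this is sentence 1", "i have another sentence like this", "the final sentence"]
enc = tokenizer(texts, padding=True, return_tensors='pt')

# labels are just the inputs; padding positions are ignored by the loss (-100)
labels = enc['input_ids'].clone()
labels[enc['attention_mask'] == 0] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
outputs = model(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'], labels=labels)
loss = outputs.loss              # cross-entropy over next-token predictions
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(torch.exp(loss).item())    # rough perplexity of this batch

The same idea answers question 2) above: target_ids is essentially input_ids, with the one-token shift handled inside the model.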

Related

Question about Training and Testing in supervised learning

I am a bit confused and hope someone can help me.
I am currently experimenting with supervised learning, and I think I have a basic misunderstanding about the input and output of LSTMs.
Say I have a sequence of 10 observations, which I split into train = 1,2,3,4,5,6,7,8 and test = 9,10.
I then transform it into a supervised problem like:
Xtrain = [(1,2)(2,3)(3,4)(4,5)(5,6)]
Ytrain = [(3,4)(4,5)(5,6)(6,7)(7,8)]
and
Xtest = [(7,8)]
so the model is made to predict the next two observations from the previous two.
prediction <- predict(Xtest)
Is this a legitimate train/test split? Am I correct that I can then evaluate the prediction output from Xtest against the actual test set containing [(9,10)]?
Or should I stop training at Xtrain = [(4,5)] and Ytrain = [(6,7)] to leave some space between training and testing, since the last observations of the training targets in my example are used for the prediction?
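For concreteness, the windowing described in the question could be written like this (a minimal sketch of the transformation only, not a judgment on the split itself):

import numpy as np

series = np.arange(1, 11)            # observations 1..10
train, test = series[:8], series[8:]

def to_supervised(seq, n_in=2, n_out=2):
    # each sample: n_in consecutive observations -> the n_out observations that follow
    X, y = [], []
    for i in range(len(seq) - n_in - n_out + 1):
        X.append(seq[i:i + n_in])
        y.append(seq[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(y)

Xtrain, Ytrain = to_supervised(train)   # Xtrain[0] = [1, 2], Ytrain[0] = [3, 4]
Xtest = np.array([train[-2:]])          # [[7, 8]]: the prediction is compared to test = [9, 10]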

Which huggingface model is the best for sentence as input and a word from that sentence as the output?

What would be the best huggingface model to fine-tune for this type of task:
Example input 1:
If there's one person you don't want to interrupt in the middle of a sentence it's a judge.
Example output 1:
sentence
Example input 2:
A good baker will rise to the occasion, it's the yeast he can do.
Example output 2:
yeast
Architecture
This looks like a Question Answering type of task, where the input is a sentence and the output is a span from the input sentence.
In transformers this corresponds to the AutoModelForQuestionAnswering class.
See the illustration of the span-prediction setup in the original BERT paper (not reproduced here).
The only difference in your case is that the input will be composed of the "question" only.
In other words, you won't have a Question, a [SEP] token, and a Paragraph, as shown in that figure.
Without knowing too much about your task, you might want to model this as a Token Classification type of task instead.
Here, your output would be labelled as some positive tag and the rest of the words labelled as some other negative tag.
If this makes more sense for you, have a look at the AutoModelForTokenClassification class; a rough sketch follows.
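As an illustration of that alternative (the model name, the two-label scheme, and the dummy labels are assumptions, not something prescribed by your task):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)

text = "A good baker will rise to the occasion, it's the yeast he can do."
inputs = tokenizer(text, return_tensors="pt")

# dummy labels: 0 everywhere; in practice you would set 1 on the subword positions of the
# target word (use the tokenizer's word_ids() to align word-level labels with subwords)
labels = torch.zeros_like(inputs["input_ids"])
outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits.shape)   # logits: (1, seq_len, 2)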
I will base the rest of my discussion on question-answering, but these concepts can be easily adapted.
Model
Since it seems that you're dealing with English sentences, you can probably use a pre-trained model such as bert-base-uncased.
Depending on the data distribution, your choice of language model can change.
Not sure exactly what your task is, but unless there's a fine-tuned model already available that does it (you can try searching the HuggingFace model hub), you're going to have to fine-tune your own model.
To do so you need to have a dataset composed of sentences labelled with start & end indices corresponding to the answer span.
See the documentation for more information on how to train.
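For illustration, one fine-tuning step could look roughly like this (no batching or data pipeline, bert-base-uncased is only a stand-in for your eventual model, the learning rate is arbitrary, and locating the answer span via character offsets assumes a fast tokenizer):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

text = "A good baker will rise to the occasion, it's the yeast he can do."
answer = "yeast"

enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
offsets = enc.pop("offset_mapping")[0].tolist()

# map the answer's character span to token indices
char_start = text.index(answer)
char_end = char_start + len(answer)
answer_tokens = [i for i, (s, e) in enumerate(offsets) if s < char_end and e > char_start]
start_positions = torch.tensor([answer_tokens[0]])
end_positions = torch.tensor([answer_tokens[-1]])

model.train()
outputs = model(**enc, start_positions=start_positions, end_positions=end_positions)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()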
Evaluation
Once you have a fine-tuned model you just need to run your test sentences through the model to extract answers.
The following code, adapted from the HuggingFace documentation, does that:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

name = "your-fine-tuned-model"  # path or hub id of the fine-tuned QA model
model = AutoModelForQuestionAnswering.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

text = "A good baker will rise to the occasion, it's the yeast he can do."
inputs = tokenizer(text, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

outputs = model(**inputs)
start_scores = outputs.start_logits
end_scores = outputs.end_logits

start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores) + 1

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[start_index:end_index])
)  # "yeast", hopefully!

How to get the probability of a particular token(word) in a sentence given the context

I'm trying to calculate the probability, or any kind of score, for words in a sentence using NLP. I've tried this with a GPT-2 model using the Huggingface Transformers library, but I couldn't get satisfactory results: because of the model's unidirectional nature, it didn't seem to predict within context for me. So I was wondering whether there is a way to calculate this using BERT, since it's bidirectional.
I found this related post the other day, but none of its answers were useful for me either.
Hope I will be able to receive ideas or a solution for this. Any help is appreciated. Thank you.
BERT is trained as a masked language model, i.e., it is trained to predict tokens that were replaced by a [MASK] token.
import torch
from transformers import AutoTokenizer, BertForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = BertForMaskedLM.from_pretrained("bert-base-cased")

input_idx = tok.encode(f"The {tok.mask_token} were the best rock band ever.")
logits = bert(torch.tensor([input_idx]))[0]
prediction = logits[0].argmax(dim=1)
print(tok.convert_ids_to_tokens(prediction[2].numpy().tolist()))
It prints token no. 11581 which is:
Beatles
To get a normalized probability distribution over BERT's vocabulary, you can normalize the logits with the softmax function, i.e., F.softmax(logits, dim=-1) (assuming the standard import torch.nn.functional as F).
The tricky thing is that words might be split into multiple subwords. You can simulate that by adding multiple [MASK] tokens, but then you have the problem of how to compare the scores of predictions of different lengths reliably. I would probably average the probabilities, but maybe there is a better way.
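Putting those two notes together, a small sketch for reading off the probability of one specific word at the masked position (this assumes the word is a single token in BERT's vocabulary, as "Beatles" happens to be):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, BertForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = BertForMaskedLM.from_pretrained("bert-base-cased")

input_idx = tok.encode(f"The {tok.mask_token} were the best rock band ever.")
mask_pos = input_idx.index(tok.mask_token_id)

with torch.no_grad():
    logits = bert(torch.tensor([input_idx]))[0]

probs = F.softmax(logits[0, mask_pos], dim=-1)    # distribution over the vocabulary
word_id = tok.convert_tokens_to_ids("Beatles")    # single-token word
print(probs[word_id].item())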

How to make OneClassSVM model more accurate? (Scikit-learn)

I have been attempting to classify an author using multiple texts written by this author, which I would then use to find similarities in other texts to identify that author in the test group.
I have been successful with some of the predictions; however, I am still getting results where the model fails to identify the author.
I pre-processed the texts beforehand with stemming, tokenizing, stop-word removal, punctuation removal, etc., in an attempt to make it more accurate.
I am unfamiliar with how exactly the OneClassSVM parameters work. What parameters could I use to best suit my problem, and how could I make my model more accurate in its predictions?
Here is what I have so far:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

vectorizer = TfidfVectorizer()
author_corpus = self.pre_process(author_corpus)
test_corpus = self.pre_process(test_corpus)
train = author_corpus
test = test_corpus
train_vectors = vectorizer.fit_transform(train)
test_vectors = vectorizer.transform(test)
model = OneClassSVM(kernel='linear', gamma='auto', nu=0.01)
model.fit(train_vectors)
test_predictions = model.predict(test_vectors)
print(test_predictions[:10])
print(model.score_samples(test_vectors)[:10])
You can use an SVM, but deep learning is really well-suited for this. I did a Kaggle competition on classifying documents where it worked amazingly well.
If you don't think you have a big enough dataset, you might want to take an existing text-classifier model, re-train just its last layer on your author, and then fine-tune the rest of the model.
I've heard positive things about Andrew Ng's deep learning class on Coursera. I learned all I know about AI from the Microsoft Professional Certification in AI on edX.
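If you go the transfer-learning route suggested above, here is a minimal sketch of "retrain only the classification head, then unfreeze" with a transformers text classifier. The model name, label scheme, and learning rates are assumptions, and it turns the problem into binary author-vs-not classification, so you would need some negative examples:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)   # 1 = written by the author, 0 = not

# phase 1: freeze the pre-trained encoder, train only the classification head
for param in model.base_model.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

texts = ["an excerpt by the target author", "an excerpt by someone else"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# phase 2 (later): unfreeze everything and fine-tune the whole model at a lower lr
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)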

Keras: Attention Mechanism For Text Summarization

I'm trying to implement an attention mechanism to produce abstractive text summarization using Keras, drawing heavily on this GitHub thread, which contains a lot of informative discussion about the implementation. I'm struggling to understand certain very basic bits of the code and what I need to modify to get the output I want. I know that attention computes a context vector as a weighted sum over the hidden states of all previous timesteps, and that is what we are trying to do below.
Data:
I use the BBC news dataset, which consists of news text and headlines for various categories such as Politics, Entertainment, and Sports.
Parameters:
n_embeddings = 64
vocab_size = len(vocabulary) + 1
max_len = 200
rnn_size = 64
Code:
from keras.models import Model
from keras.layers import (Input, Embedding, LSTM, Dense, Activation, Flatten,
                          RepeatVector, Permute, Lambda, multiply)
from keras import backend as K

_input = Input(shape=(max_len,), dtype='int32')
embedding = Embedding(input_dim=vocab_size, output_dim=n_embeddings, input_length=max_len)(_input)
activations = LSTM(rnn_size, return_sequences=True)(embedding)
# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(rnn_size)(attention)
attention = Permute([2, 1])(attention)
# apply the attention
sent_representation = multiply([activations, attention])  # element-wise product (the old merge(..., mode='mul'))
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)
probabilities = Dense(max_len, activation='softmax')(sent_representation)
model = Model(inputs=_input, outputs=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])
print(model.summary())
My Questions:
The linked thread is trying to use attention for classification, whereas I want to generate a text sequence (summary), so how should I use the output probabilities and decode them to generate the summary?
What is RepeatVector used for here? Is it for getting the activation and attention probability of each word at timestep T?
I didn't find much explanation of what the Permute layer does.
What is Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation) for?
What does model.fit() look like? I have created padded sequences of fixed length for X and y.
I would really appreciate any help you could provide. Thanks a lot in advance.
I have found that most research tends to use TensorFlow for this task, as it is much easier to implement a seq2seq model in TensorFlow than with a Keras model.
This blog goes into the details of how to build a text summarizer in an extremely optimized manner; it explains dongjun-Lee's implementation, one of the most efficient and easiest to follow.
The code can be found here, implemented on Google Colab.
If you like the idea, you can check out the whole blog series.
Hope this helps.
