I am experimenting with using transformer embeddings in sentence classification tasks without fine-tuning them. I have used BERT embeddings and those experiments gave me very good results. Now I want to use GPT-2 embeddings (without fine-tuning), so I have a few questions:
Can I use GPT-2 embeddings like that (given that GPT-2 is trained left-to-right)?
Are there any examples of GPT-2 being used in classification tasks rather than generation tasks?
If I can use GPT-2 embeddings, how should I do it?
I basically solved the problem; here I used embeddings extracted from GPT-2.
So yes, we can use the final token of the GPT-2 embedding sequence as the class token. Because the self-attention is left-to-right, the final token is the only one that has attended to the whole sequence, so it can represent the sequential information.
Please check the following GitHub issue for an implementation that uses GPT-2 embeddings. github issue
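For reference, here is a minimal sketch of how such an embedding could be extracted with the Hugging Face transformers library; it is an illustration under my own assumptions (a recent transformers API and the base gpt2 checkpoint), not the code from the linked issue.
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Take the hidden state of the LAST token as a fixed sentence embedding:
# GPT-2 attends left-to-right, so only the last token has seen the whole input.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("This movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

sentence_embedding = outputs.last_hidden_state[:, -1, :]   # (1, 768) for gpt2
# sentence_embedding can now be fed to any downstream classifier.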
I also conducted experiments comparing GPT-2 embeddings with RoBERTa embeddings; RoBERTa gave the better results, not GPT-2.
I have trained word embeddings using FastText's train_unsupervised.
Is there a way to autotune the hyperparameters for this? The documentation covers autotuning for supervised training, but I am not sure how supervised training can be done for embeddings.
You can use the supervised mode for embeddings if you have target labels to predict per input text. But then the embeddings will be optimized for that classification purpose, rather than for the more general usefulness people usually expect from unsupervised training.
Because such metaparameter optimization ("autotune") only makes sense if testing the results against a goal with clear right/wrong answers, it likely only works for the supervised mode, as shown by the docs.
If you're using the (normal, unsupervised) word-vectors for some other downstream task of your own, and you can create a repeatable evaluation for that task, you should write your own code to perform a search for the best metaparameters.
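As a rough sketch (not an official FastText feature), such a search could look like the following; corpus.txt, the parameter grid, and evaluate_on_my_task are placeholders you would replace with your own data, ranges, and downstream evaluation.
import itertools
import fasttext

def evaluate_on_my_task(model):
    # Placeholder: return a score (higher = better) for this model on YOUR
    # downstream task, e.g. using model.get_word_vector(...) as features.
    return 0.0

param_grid = {
    "model": ["skipgram", "cbow"],
    "dim": [100, 300],
    "epoch": [5, 10],
    "lr": [0.05, 0.1],
}

best_score, best_params = float("-inf"), None
keys = list(param_grid)
for values in itertools.product(*(param_grid[k] for k in keys)):
    params = dict(zip(keys, values))
    model = fasttext.train_unsupervised("corpus.txt", **params)
    score = evaluate_on_my_task(model)
    if score > best_score:
        best_score, best_params = score, params

print("best:", best_params, best_score)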
I am using a model consisting of an embedding layer and an LSTM to perform sequence labelling, in pytorch + torchtext. I have already tokenised the sentences.
If I use self-trained or other pre-trained word embedding vectors, this is straightforward.
But if I use the Huggingface transformers BertTokenizer.from_pretrained and BertModel.from_pretrained there is a '[CLS]' and '[SEP]' token added to the beginning and end of the sentence, respectively. So the output of the model becomes a sequence that is two elements longer than the label/target sequence.
What I am unsure of is:
Are these two tags needed for the BertModel to embed each token of a sentence "correctly"?
If they are needed, can I take them out after the BERT embedding layer, before the input to the LSTM, so that the lengths are correct in the output?
Yes, BertModel needs them: without those special symbols added, the output representations would be different. However, in my experience, if you fine-tune BertModel on the labeling task without the [CLS] and [SEP] tokens added, you may not see a significant difference. If you use BertModel to extract fixed word features, then you should add those special symbols.
Yes, you can take out the embeddings of those special symbols before the LSTM. In fact, this is the usual approach for sequence labeling or tagging tasks.
I suggest taking a look at some sequence labeling or tagging examples using BERT to become confident about your modeling decisions. You can find an NER tagging example using Hugging Face transformers here.
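To illustrate the slicing, here is a minimal sketch with the Hugging Face transformers API (assuming a recent version, and assuming the tokenizer does not split your tokens into further sub-words; if it does, the alignment needs extra handling):
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

tokens = ["the", "cat", "sat"]                       # already-tokenised sentence
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state        # (1, len + 2, 768)

# Drop position 0 ([CLS]) and the last position ([SEP]) so the sequence
# length matches the label sequence before feeding the LSTM.
word_states = hidden[:, 1:-1, :]                     # (1, len, 768)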
I would like to do some supervised binary classification tasks with sentences, and have been using spaCy because of its ease of use. I used spaCy to convert the text into vectors, and then fed the vectors to a machine learning model (e.g. XGBoost) to perform the classification. However, the results have not been very satisfactory.
In spaCy, it is easy to load a model (e.g. BERT / RoBERTa / XLNet) to convert words / sentences to nlp objects. Directly calling the vector of the object, however, defaults to an average of the token vectors.
Here are two questions:
1) Can we do better than simply getting the average of token vectors, like having context/order-aware sentence vectors using spaCy? For example, can we extract the sentence embedding from the previous layer of the BERT transformer instead of the final token vectors in spaCy?
2) Would it be better to directly use spaCy to train the downstream binary classification task? For example, here discusses how to add a text classifier to a spaCy model. Or is it generally better to apply more powerful machine learning models like XGBoost?
Thanks in advance!
I found this discussed on the page below; maybe it helps.
"Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much."
https://github.com/huggingface/transformers/issues/1950
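Building on that, here is a rough sketch (my own illustration, not part of spaCy's API) of how the last-layer [CLS] hidden state could be pulled directly from a Hugging Face model and handed to a classifier such as XGBoost; the model name and the cls_vector helper are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def cls_vector(text):
    # Return the last-layer hidden state of the [CLS] token (position 0)
    # as a context-aware sentence vector.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden[:, 0, :].squeeze(0).numpy()

X = [cls_vector(s) for s in ["first sentence", "another sentence"]]
# X can then be fed to XGBoost or any other downstream classifier.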
I would like to fine-tune BERT for a specific domain on unlabeled data and use the output layer to check the similarity between texts. How can I do it? Do I need to first fine-tune on a classification task (or question answering, etc.) and then get the embeddings? Or can I just take a pre-trained BERT model without a task and fine-tune it with my own data?
There is no need to fine-tune for classification, especially if you do not have any supervised classification dataset.
You should continue training BERT the same unsupervised way it was originally trained, i.e., continue "pre-training" with the masked-language-model objective and next-sentence prediction. Hugging Face's implementation contains the class BertForPreTraining for this.
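As a simplified sketch of such continued pre-training, the following uses only the masked-language-model objective via BertForMaskedLM (the full BertForPreTraining class additionally expects next-sentence-prediction labels); domain_corpus.txt, the output directory, and the training settings are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One unlabeled domain sentence/paragraph per line in domain_corpus.txt.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# The collator randomly masks 15% of the tokens, as in BERT pre-training.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
model.save_pretrained("bert-domain")   # domain-adapted weights for embeddings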
Is Google's pretrained word2vec model CBOW or skip-gram?
We load the pretrained model with:
import gensim.models.keyedvectors as word2vec

model = word2vec.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)
How can we specifically load a pretrained CBOW or skip-gram model?
The GoogleNews word-vectors were trained by Google, using a proprietary corpus, but Google never explicitly described all the training parameters used. (They're not encoded in the file.)
It's been asked a number of times on the Google Group devoted to the word2vec-toolkit code, without a definitive answer. For example, there's a response from word2vec author Mikolov that he doesn't remember the training parameters. Elsewhere, another poster thinks one of the word2vec papers implies skip-gram was used – but as that passage doesn't precisely match other aspects (like vocabulary-size) of the released GoogleNews vectors, I wouldn't be completely confident of that.
As Google hasn't been clear, and in any case hasn't released alternate versions based on different training modes, if you want to run any tests or make any conclusions about the different modes, you'll have to use other vector-sets, or train your own vectors in varying ways.
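For example, training your own vectors in both modes with gensim is straightforward; this is a toy sketch assuming gensim 4.x (the sentences, min_count, and the other parameters are placeholders to adjust for a real corpus):
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]   # toy corpus

# sg=0 selects CBOW, sg=1 selects skip-gram.
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
sg_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# The resulting .wv KeyedVectors can be queried the same way as the GoogleNews set.
print(cbow_model.wv.most_similar("cat", topn=2))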
Late to the party, but Mikolov describes the hyperparameters here. The Google News pretrained vectors were trained using CBOW. I believe that's the only option for you to load; there is no pretrained skip-gram version available.