I'm doing sequence classification with Keras, using an RNN and embeddings. My sequences are a bit unusual: they mix ordinary words with special symbols. The words have fixed, pre-trained embeddings, but the special-symbol embeddings have to be updated during training.
In an Embedding layer during learning, how can I keep some embeddings fixed while updating others? Is there a way to mask those indices which shouldn't be modified? Or is this a case for a custom Embedding layer?
I do not believe this is achievable with the existing Embedding layer. To work around it, I would create a custom layer that builds two embedding layers internally and only marks the embedding matrix of one of them as trainable.
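A minimal sketch of that idea with tf.keras, using two stock Embedding layers and an index mask instead of a single custom layer (the sizes, the index layout and the random stand-in matrix are hypothetical):

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Lambda
from tensorflow.keras.initializers import Constant

# Hypothetical layout: indices [0, n_fixed) are words with pre-trained vectors,
# indices [n_fixed, n_fixed + n_special) are the special symbols to be learned.
n_fixed, n_special, dim = 10000, 8, 300
pretrained = np.random.rand(n_fixed, dim).astype("float32")  # stand-in for real word vectors

vocab_size = n_fixed + n_special
ids = Input(shape=(None,), dtype="int32")

# Frozen table: pre-trained rows for words, zero rows for the special symbols.
fixed_table = np.vstack([pretrained, np.zeros((n_special, dim), dtype="float32")])
fixed_emb = Embedding(vocab_size, dim,
                      embeddings_initializer=Constant(fixed_table),
                      trainable=False)(ids)

# Trainable table: the mask below zeroes its output for ordinary words,
# so only the special-symbol rows ever receive a gradient.
learned_emb = Embedding(vocab_size, dim, trainable=True)(ids)

is_special = Lambda(lambda t: tf.cast(t >= n_fixed, "float32")[..., None])(ids)
embeddings = Lambda(lambda x: x[0] * (1.0 - x[2]) + x[1] * x[2])(
    [fixed_emb, learned_emb, is_special])
# `embeddings` can now be fed to the RNN, e.g. LSTM(64)(embeddings)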
Related
I am using a model consisting of an embedding layer and an LSTM to perform sequence labelling, in pytorch + torchtext. I have already tokenised the sentences.
If I use self-trained or other pre-trained word embedding vectors, this is straightforward.
But if I use the Huggingface transformers BertTokenizer.from_pretrained and BertModel.from_pretrained, a '[CLS]' and a '[SEP]' token are added to the beginning and end of the sentence, respectively. So the output of the model becomes a sequence that is two elements longer than the label/target sequence.
What I am unsure of is:
Are these two tags needed for the BertModel to embed each token of a sentence "correctly"?
If they are needed, can I take them out after the BERT embedding layer, before the input to the LSTM, so that the lengths are correct in the output?
Yes, BertModel needs them, since without those special symbols added the output representations would be different. However, in my experience, if you fine-tune BertModel on the labeling task without the [CLS] and [SEP] tokens added, you may not see a significant difference. If you use BertModel to extract fixed word features, you should add those special symbols.
Yes, you can take out the embeddings of those special symbols. In fact, this is the usual approach for sequence labeling or tagging tasks (see the sketch below).
I suggest taking a look at some sequence labeling or tagging examples using BERT to become confident about your modeling decisions. You can find an NER tagging example using Huggingface transformers here.
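A minimal sketch of the second point with a hypothetical toy input (it assumes each of your tokens maps to a single wordpiece; if a token is split into several pieces, the lengths shift again and need separate handling):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

tokens = ["my", "pre", "tokenised", "sentence"]          # one label per token
ids = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"])
input_ids = torch.tensor([ids])                          # shape (1, len(tokens) + 2)

with torch.no_grad():
    hidden = bert(input_ids)[0]                          # (1, len(tokens) + 2, 768)

token_reprs = hidden[:, 1:-1, :]                         # drop [CLS] and [SEP]
# token_reprs now lines up with the label sequence and can feed the LSTM tagger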
Are the text embeddings also fine-tuned when fine-tuning for a classification task? Or up to which layer are the encodings fine-tuned (second-to-last layer)?
If you are using the original BERT repository published by Google, all layers are trainable, i.e. there is no freezing at all. You can check that by printing tf.trainable_variables().
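A minimal sketch of that check, assuming the TF 1.x graph from the original repository has already been built (e.g. via modeling.BertModel plus your classification head):

import tensorflow as tf  # TF 1.x API, as used by google-research/bert

for var in tf.trainable_variables():
    print(var.name, var.shape)  # every BERT layer shows up, i.e. nothing is frozen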
I'm trying different word embedding methods in order to pick the approach that works best for me. I tried word2vec and FastText, and now I would like to try GloVe. In both word2vec and FastText there are two versions: skip-gram (predict context from word) and CBOW (predict word from context). But in the GloVe Python package, there is no parameter that lets you choose between skip-gram and CBOW.
Given that GloVe does not work the same way as word2vec, I'm wondering: does it make sense to talk about skip-gram and CBOW when using the GloVe method?
Thanks in advance
Not really; skip-gram and CBOW are simply the names of the two Word2vec models. They are shallow neural networks that generate word embeddings by predicting a context from a word (and vice versa), and then treating the output of the hidden layer as the vector representation. GloVe uses a different approach, making use of the global statistics of a corpus by training on a co-occurrence matrix rather than on local context windows.
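To make the contrast concrete, here is a minimal gensim sketch with a hypothetical toy corpus: the sg flag is the skip-gram/CBOW switch in Word2vec, and GloVe has no equivalent flag because it factorises a global co-occurrence matrix instead of predicting local contexts.

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]
# sg=1 -> skip-gram, sg=0 -> CBOW (vector_size is called size in gensim < 4.0)
skipgram = Word2Vec(sentences, sg=1, vector_size=50, window=5, min_count=1)
cbow = Word2Vec(sentences, sg=0, vector_size=50, window=5, min_count=1)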
When I read the paper "Convolutional Neural Networks for Sentence Classification" by Yoon Kim (New York University), I noticed that it implements the "CNN-non-static" model: a model with pre-trained vectors from word2vec, in which all words (including the unknown ones, which are randomly initialized) and the pre-trained vectors are fine-tuned for each task.
What I do not understand is how the pre-trained vectors are fine-tuned for each task. As far as I know, the input vectors, which are converted from strings using the pre-trained word2vec.bin, are just like an image matrix, which cannot change while training the CNN. So if they can change, how? Please help me out; thanks a lot in advance!
The word embeddings are weights of the neural network, and can therefore be updated during backpropagation.
E.g. http://sebastianruder.com/word-embeddings-1/ :
Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer.
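For example, in Keras the "non-static" setup boils down to initialising an Embedding layer with the word2vec matrix and leaving it trainable, so backpropagation keeps updating it together with the rest of the network. A minimal sketch with hypothetical sizes and a random stand-in for the pre-trained matrix:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.initializers import Constant

vocab_size, dim = 20000, 300
w2v_matrix = np.random.rand(vocab_size, dim)  # stand-in for vectors loaded from word2vec.bin

model = Sequential([
    Embedding(vocab_size, dim,
              embeddings_initializer=Constant(w2v_matrix),
              trainable=True),   # trainable=False would give the "static" variant
    Conv1D(100, 5, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")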
How do you write a simple sequence copy task in Keras using the LSTM architecture without an Embedding layer? I already have the word vectors.
If you say that you have word vectors, I guess you have a dictionary to map a word to its vector representation (calculated from word2vec, GloVe...).
Using this dictionary, you replace all words in your sequence by their corresponding vectors. You will also need to make all your sequences the same length, since LSTMs need all the input sequences to be of constant length. So you will need to determine a max_length value and trim all sequences that are longer and pad all sequences that are shorter with zeros (see Keras pad_sequences function).
Then you can pass the vectorized sequences directly to the LSTM layer of your neural network. Since the LSTM layer is the first layer of the network, you will need to define the input shape, which in your case is (max_length, embedding_dim).
This way you skip the Embedding layer and use your own precomputed word vectors instead.
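A minimal sketch of the whole pipeline, with hypothetical vectors and sizes (the dictionary lookups stand in for your real word2vec/GloVe vectors):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

embedding_dim, max_length = 100, 20
word_vectors = {"hello": np.random.rand(embedding_dim),
                "world": np.random.rand(embedding_dim)}  # stand-in for your lookup table

def vectorise(sentence):
    return [word_vectors[w] for w in sentence if w in word_vectors]

sequences = [vectorise(["hello", "world"]), vectorise(["hello"])]
x = pad_sequences(sequences, maxlen=max_length, dtype="float32",
                  padding="post", truncating="post")  # shape (2, max_length, embedding_dim)

model = Sequential([
    LSTM(64, input_shape=(max_length, embedding_dim)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")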
I had the same problem. After searching the Keras documentation, in the "Stacked LSTM for sequence classification" section, I found the following code, which might be useful:
from keras.models import Sequential
from keras.layers import LSTM
# Set NumberOfLSTM (units), YourSequenceLength and YourWord2VecLength for your data.
model = Sequential()
model.add(LSTM(NumberOfLSTM, return_sequences=True,
               input_shape=(YourSequenceLength, YourWord2VecLength)))