I am trying to compute a "summary embedding vector" using the Hugging Face pre-trained BART model. After I feed in a text to summarise, can I take the last hidden state of the decoder from the pre-trained BART model and normalise that vector to compute a "summary embedding vector"? Or should I take the last hidden state of the encoder for my input and normalise that instead? From my understanding, the encoder is useful for extracting information about the input, so taking the last hidden state of the encoder would give me a representation of the text I wish to summarise. Does anyone have any suggestions? Thanks a lot!
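For concreteness, the encoder-side option I have in mind would look roughly like this (a sketch only; the checkpoint name and the mean pooling over tokens are just assumptions on my part, not necessarily the right choice):

from transformers import BartTokenizer, BartModel
import torch
import torch.nn.functional as F

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartModel.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer("Text to summarise ...", return_tensors="pt", truncation=True)
with torch.no_grad():
    encoder_outputs = model.get_encoder()(**inputs)   # run the encoder only

hidden = encoder_outputs.last_hidden_state            # (1, seq_len, hidden_size)
mask = inputs["attention_mask"].unsqueeze(-1)         # ignore padding positions
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1) # mean pool over tokens
summary_embedding = F.normalize(pooled, dim=-1)       # L2-normalised "summary embedding"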
I am looking to perform a binary classification on each frame of a video. I have ground truth labels and the feature vectors for each frame of several videos, but am not sure how I could approach this as a time series problem to train an RNN.
How could I use the predictions on previous frames in a video to help inform future predictions? I think this could be accomplished with a one-to-one RNN, though I can't find any relevant examples.
Your task is quite similar to the named-entity recognition (NER) task in NLP. In named-entity recognition you have a sequence of words and, for each word, you want to determine whether it is a named entity (say, a person or company name) or not.
A many-to-many RNN architecture should be suitable for this task.
Please check for example the following article:
https://towardsdatascience.com/named-entity-recognition-ner-using-keras-bidirectional-lstm-28cd3f301f54
Instead of words you have video frame features, so you would need to replace the first Embedding layer with a Dense layer. Also, in the above article N classes are used as labels rather than just a binary label. So, as a starting point, you can try something like:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, TimeDistributed

model = Sequential()
# Dense projection of the per-frame feature vectors (replaces the Embedding layer used for words)
model.add(TimeDistributed(Dense(16, activation="relu")))
model.add(LSTM(32, return_sequences=True))
# one sigmoid output per frame for the binary label
model.add(TimeDistributed(Dense(1, activation="sigmoid")))
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
The model would take a sequence of video frame features as input and a sequence of ground truth labels as output for each video.
I have collected a dataset of paragraph-summary pairs, where the summary may or may not correspond to the paragraph it is paired with. I also have labels for whether a summary corresponds to the paragraph or not (1 if it is a corresponding pair, and 0 if it is not).
I would like to use the pretrained Pegasus_large model in Huggingface (off-the-shelf) and train it on this downstream classification task.
Since Pegasus does not have any CLS token, I was thinking of possible ways of doing this.
I want to concatenate the paragraph and summary together, pass it through the pretrained Pegasus encoder only, and then pool over the final hidden layer outputs of the encoder. If I use the Huggingface PegasusModel (the one without a summary-generation head), it expects me to provide decoder_input_ids, which I assume are the true tokens (labels) when Pegasus is trained as a seq2seq model for summary generation. However, since I am not training my model to generate summaries and would like the encoder representation only, I am not sure what to put as my decoder_input_ids.
My questions are: 1. Am I right in assuming the decoder_input_ids are only used for training the model for sequence generation, and 2. How should I get the last hidden layer outputs without having any decoder_input_ids?
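One option I am considering is to call the encoder on its own via get_encoder(), so that no decoder_input_ids are needed at all (a rough sketch; paragraph and summary are placeholder strings, and the mean pooling is just one choice):

from transformers import PegasusTokenizer, PegasusModel
import torch

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusModel.from_pretrained("google/pegasus-large")
encoder = model.get_encoder()

text = paragraph + " " + summary   # concatenate the pair into one input
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    encoder_outputs = encoder(**inputs)   # encoder only, no decoder_input_ids

hidden = encoder_outputs.last_hidden_state             # (1, seq_len, hidden_size)
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pool for a classification head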
I have a dataset of utterances and corresponding sentiment labels. I want to use an embedding of the sentiment label as an additional input to BERT (to simplify things, you can say that I want to initialize the embeddings for some tokens in my BERT model). There are 6-7 unique labels. I planned to use static embeddings like GloVe to map each label to an embedding, but this will not be compatible with BERT, which expects the input embedding to be of size 768. How can I generate static embeddings of my labels?
You can try SBERT (sentence-transformers) to generate embeddings of a given dimension for both your sentences and your labels.
Here is the library - https://www.sbert.net/
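For example (a rough sketch; the model name and label strings below are placeholders), a 768-dimensional model such as all-mpnet-base-v2 would keep the label embeddings compatible with BERT's hidden size:

from sentence_transformers import SentenceTransformer

# all-mpnet-base-v2 outputs 768-dimensional vectors, matching BERT's hidden size
model = SentenceTransformer("all-mpnet-base-v2")

labels = ["joy", "anger", "sadness", "fear", "surprise", "neutral"]  # placeholder label names
label_embeddings = model.encode(labels)   # numpy array of shape (len(labels), 768)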
Hi everyone! I was reading about BERT and wanted to do text classification with its word embeddings. I came across this line of code:
pooled_output, sequence_output = self.bert_layer([input_word_ids, input_mask, segment_ids])
and then:
clf_output = sequence_output[:, 0, :]
out = Dense(1, activation='sigmoid')(clf_output)
But I can't understand the use of pooled output. Doesn't sequence output contain all the information, including the word embedding of [CLS]? If so, why do we have pooled output?
Thanks in advance!
Sequence output is the sequence of hidden-states (embeddings) at the output of the last layer of the BERT model. It includes the embedding of the [CLS] token. Hence, for the sentence "You are on Stackoverflow", it gives 5 embeddings: one embedding for each of the four words (assuming the word "Stackoverflow" was tokenized into a single token) along with the embedding of the [CLS] token.
Pooled output is the embedding of the [CLS] token (from Sequence output), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. For further details, please refer to the BERT original paper.
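If it helps to see the two side by side, here is a small sketch using the Hugging Face transformers API (the question's snippet uses a TF Hub bert_layer, but the outputs correspond):

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("You are on Stackoverflow", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

sequence_output = outputs.last_hidden_state   # (1, num_tokens, 768), one vector per token incl. [CLS] and [SEP]
pooled_output = outputs.pooler_output         # (1, 768): the [CLS] vector passed through Linear + Tanh
cls_embedding = sequence_output[:, 0, :]      # raw [CLS] embedding, before the pooler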
If you are given a sequence such as "You are on StackOverflow", the sequence_output will give a 768-dimensional embedding for each of its tokens. The pooled_output, in contrast, gives you a single 768-dimensional embedding for the whole sequence, derived from the [CLS] token as described above.
As a number of other answers have pointed out, sequence_output is token-level with 2 dimensions - the first dimension corresponds to the number of tokens in the input text.
pooled_output is one-dimensional and seems to be some sort of a higher-order context embedding for the input text.
I initially felt they should contain practically the same information (or that sequence_output should contain more given the additional n_token dimension), but right now I'm training a semantic similarity model based on Bert and am seeing definitively better results using both sequence_output and pooled_output in the model, compared to using just sequence_output.
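Roughly, "using both" can look like the following sketch (Keras functional API, reusing the pooled_output and sequence_output tensors from a bert_layer call like the one in the question; the mean pooling and layer sizes are placeholders, not the exact model I am training):

from tensorflow.keras.layers import GlobalAveragePooling1D, Concatenate, Dense

# sequence_output: (batch, num_tokens, 768); pooled_output: (batch, 768)
seq_summary = GlobalAveragePooling1D()(sequence_output)   # mean over the token dimension
combined = Concatenate()([pooled_output, seq_summary])    # (batch, 1536)
out = Dense(1, activation='sigmoid')(combined)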
When I read the paper "Convolutional Neural Networks for Sentence Classification" by Yoon Kim (New York University), I noticed that it implements the "CNN-non-static" model: a model with pre-trained vectors from word2vec for all words, including unknown ones that are randomly initialized, where the pre-trained vectors are fine-tuned for each task.
So I just do not understand how the pre-trained vectors are fine-tuned for each task. As far as I know, the input vectors, which are converted from strings using the pre-trained word2vec.bin, are just like an image matrix, which cannot change while training a CNN. So, if they can, how? Please help me out, thanks a lot in advance!
The word embeddings are weights of the neural network, and can therefore be updated during backpropagation.
E.g. http://sebastianruder.com/word-embeddings-1/ :
Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer.
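Concretely, in Keras this amounts to initialising a trainable Embedding layer from the pre-trained vectors (a sketch; embedding_matrix is a placeholder for the weights actually loaded from word2vec.bin, and the sizes are assumed):

import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

vocab_size, embedding_dim = 20000, 300                    # assumed sizes
embedding_matrix = np.zeros((vocab_size, embedding_dim))  # fill with vectors loaded from word2vec.bin

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),  # start from the pre-trained vectors
    trainable=True,  # "non-static": backpropagation fine-tunes these weights
)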